Citation
Knowledge discovery in flow cytometry data

Material Information

Title:
Knowledge discovery in flow cytometry data
Creator:
Siebert, Janet
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
x, 60 leaves : ; 28 cm

Subjects

Subjects / Keywords:
Flow cytometry -- Data processing ( lcsh )
Immunology -- Data processing ( lcsh )
Computational biology ( lcsh )
Data mining ( lcsh )
Computational biology ( fast )
Data mining ( fast )
Flow cytometry -- Data processing ( fast )
Immunology -- Data processing ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 58-60).
Thesis:
Computer science
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Janet Siebert.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
61133860 ( OCLC )
ocm61133860
Classification:
LD1190.E52 2004m S53 ( lcc )

KNOWLEDGE DISCOVERY IN FLOW CYTOMETRY DATA
by
Janet Siebert
B.A., University of Montana, 1981
M.Ed., Vanderbilt University, 1984
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2004


This thesis for the Master of Science
degree by
Janet Siebert
has been approved by
Krzysztof Cios
M. Karen Newell
Date


Siebert, Janet (M.S., Computer Science)
Knowledge Discovery in Flow Cytometry Data
Thesis directed by Professor Krzysztof Cios
ABSTRACT
Flow cytometry data is stored in a published but esoteric format. Existing
analysis tools are proprietary, and limited in available functionality. Additionally,
they are designed to process a small number of samples at a time. Immunological
research requires analysis of multiple samples on multiple individuals. The selection
of samples for a particular type of analysis ideally is based on experiment metadata,
identifying the individual, characteristics of that individual, and characteristics of the
sample. This metadata is not available to existing tools. This work attempts to
overcome these shortcomings in the existing analytical environments. It offers four
distinct contributions:
1. Open architecture: Data is stored in a relational database, and can be accessed
and analyzed by many tools and techniques.
2. Addition of the experiment's metadata: Each event from each sample is
associated with relevant information from the experiment, such as individual,
individual's strain, individual's diet type, stain or sample type, and sample
processing technique. This metadata is easily accessible within the database,
thereby supporting analysis of samples by any logical grouping.
3. Ability to derive new metrics from core data: Much of the analysis in this
work is based on Normalized Fluorescence, in which the measurement for
fluorescence is divided by the measurement for relative cell size.
4. Support for characteristic-based analysis: The environment created in this
work allows analysis by characteristic, such as stain, diet, or sample
processing technique. Data can be grouped by experiment or project, instead
of simply by sample.
While the last two items are the most significant new approaches presented by
this work at this point in time, the first two items may prove the most significant
contributions over the long run. The combination of the open environment and the
inclusion of the experiment's metadata creates a rich analytical environment in which
biologists can perform types of analysis that they simply could not have performed
before. This possibility may lead to powerful new analytical techniques, and to
significant biological findings and discoveries.


ACKNOWLEDGEMENT
My great thanks to my husband, Wes Munsil, for his moral support, wisdom, and
technical guidance. I also thank my advisor, Dr. Krys Cios, for his guidance and
enthusiasm. My gratitude also goes to Ashleigh Reding, Dr. Suzi Schweitzer, Jeff
Rogers, and Dr. Karen Newell (all of the Institute of Bioenergetics, University of
Colorado at Colorado Springs) for their patience in teaching me about flow
cytometry, and for generating the data which inspired this work.


CONTENTS
Figures
Tables
Chapter
1. Introduction
2. The Fundamentals of Flow Cytometry
2.1 Instrumentation and Measurement
2.2 A Brief History of Flow Cytometry
2.3 Experimental Design
2.4 The Analytical Process
3. Building the Rich Analytical Environment
3.1 Data Flow
3.2 FCS Parsing
3.3 Preparing the Flow Sample Key
3.4 Connecting the Flow Data to the Sample Key
3.5 Evolution of the Data Model
3.5.1 The Entity-Relationship Model
3.5.2 The Dimensional Model
3.5.3 Data Volume
4. Analytical Techniques in the Rich Analytical Environment
4.1 Statistical Analyses
4.1.1 Summary Statistics. Entire Experiment
4.1.2 Suspect Samples
4.1.3 Study of a Particular Individual
4.1.4 M5114 Fluorescence, Normal Process and CPCF Process
4.1.5 Fluorescence by Diet Type, Isotype and M5114 Stains
4.1.6 Fluorescence by Experiment and Diet Type. M5114
4.1.7 Simple Histograms Using Text
4.1.8 Normalized Fluorescence by Forward Scatter Range
4.2 Graphical Analyses
4.3 Summary
5. Conclusions and Future Work
References


FIGURES
Figure
2-1 The Flow Column
2-2 Hydrodynamic Focusing
2-3 Forward and Side Scatter
2-4 Representative Scattergram
2-5 Representative Histogram
3-1 First Generation Data Flow Diagram
3-2 Second Generation Data Flow Diagram
3-3 Entity Relationship Model
3-4 Dimensional Model
3-5 The Event Table
4-1 Normalized FL2. Individual BL-6
4-2 M5114 Samples. Experiment 1
4-3 M5114 Samples. Experiment 1. Strain C
4-4 Forward Scatter. M5114 Samples


TABLES
Table
2-1 Sample Event Data
3-1 An Extract from a Param File
3-2 An Extract from an Event File, First Generation
3-3 An Extract from an Event File, Second Generation
3-4 Flow Sample Key
3-5 Sample Key Data. Reformatted for Database
3-6 Values of SRC Parameter
3-7 Histogram-generating Query. Original Model
3-8 Histogram-generating Query. Dimensional Model
3-9 Selected Data Set Statistics
4-1 Result Set (Extract): Summary Statistics. Entire Experiment
4-2 Result Set: Suspect Samples
4-3 Result Set: Study of Individual BH-3
4-4 Result Set (Extract): M5114 Samples, Normal and CPCF Processing
4-5 Result Set: Fluorescence by Diet Type, Normal Processing
4-6 Result Set: Fluorescence by Diet Type, CPCF Processing
4-7 Result Set: Fluorescence by Experiment by Diet Type, Normal Processing
4-8 Result Set: Fluorescence by Experiment by Diet Type, CPCF Processing
4-9 Result Set (Extract): Textual Histogram
4-10 Result Set: Fluorescence by Forward Scatter Range
4-11 SQL Statement for Graphical Result


1. Introduction
Flow cytometry is a common technique used by research biologists and
immunologists. A flow cytometer measures several attributes of individual cells
which are suspended in a solution. Flow cytometry processing generally collects
data on thousands of cells per sample. This technique is used to study cell behavior,
and investigate treatments for diseases such as cancer, HIV, and sickle cell anemia.
The data collected by the flow cytometer is written to a published but esoteric
format. Generally, biologists use proprietary software to access the data and perform
a fixed set of analyses. Unfortunately, these techniques are limited and limiting.
This work provides mechanisms and methods to dramatically improve the efficiency
and range of the data analysis techniques.
Research biologists at the University of Colorado Institute for Bioenergetics,
located at the University of Colorado at Colorado Springs, provided the experimental
data and guidance which inspired this work. Flow cytometry is integral to their
research into cellular metabolism and cellular communication. Chapter 2, The
Fundamentals of Flow Cytometry, presents background information, and provides an
overview of one project conducted by the researchers at the Institute for
Bioenergetics. It then describes the current state of the art in flow cytometry
analysis.
Chapter 3, Building the Rich Analytical Environment, explains the
methodology of moving the data from a limited environment to one in which a
myriad of analytical techniques are possible. We combine data created by the flow
cytometer and data sources maintained by the biologists in a relational database.
Additionally, we convert a sample key maintained by the biologists from a word
processor document into a database table. This sample key correlates the flow
cytometer run number to the particular individual and the particular preparation of
the sample. The chapter begins with a discussion of data flow for creating this
environment, and continues with details on processing the data generated by the flow
cytometer. The chapter also discusses the evolution of the data model for the rich
analytical environment.
Chapter 4, Analytical Techniques in the Rich Analytical Environment,
expands upon the types of analysis that can be performed with a large body of data
readily accessible in an open environment. Existing analytical processes limit the
analyst to considering one sample at a time. With all of the samples from an
experiment loaded into a relational database, the analyst can consider such groupings
as "all of the samples on this particular individual" or "all of the samples of this
particular sample type." Analyses can be either statistical or graphical.
Chapter 5, Conclusions and Future Work, summarizes the discussions, and
offers suggestions for future work in the rich analytical environment. The
possibilities for future work include generalizing the parsing algorithms, automating
the identification of cell groups, incorporating additional information into the
environment, identifying additional analytical algorithms, and automating the
execution of the most valuable algorithms.


2. The Fundamentals of Flow Cytometry
This chapter presents the fundamentals of flow cytometry, a brief history of
the instrument, the experiment which generated the data considered in this study, and
the standard methods of analyzing flow cytometry data. The first section provides an
overview of the instrument and the data it collects. The second section highlights
selected aspects of the development of the instrument. The third section discusses
experimental design, methods, and materials. The final section considers the current
state-of-the-art techniques for analyzing the resulting data.
2.1 Instrumentation and Measurement
Articles and media segments about genetics and proteomics research abound
in the popular press (Kelleher, 2004; Harris, 2004). The casual reader thereby might
assume that all bioscientists are immersed in such research, and that the cell is no
longer an important research focus. However, the cell is the fundamental building
block of every organism. "In the hierarchy of biological order, cells hold a special
place, for they alone have the capacity to make themselves autonomously, and to
multiply by division" (Harold, 2001, p. 17). Flow cytometry provides a mechanism
for studying cell characteristics and behaviors.
One of the strengths of the flow cytometer is its ability to record multiple
independent and quantitative measurements on a large number of cells (Parks, 1996).
A flow cytometer takes in a sample of cells or cell particles suspended in solution,
sending them in a single file past a laser beam, as shown in Figure 2-1. The
alignment of the cells is achieved by hydrodynamic focusing, as shown in Figure 2-2
(BD Biosciences, 1999). The focusing of the solution into a thin channel forces the
cells to align in single file within the stream.
Figure 2-1. The Flow Column
Figure 2-2. Hydrodynamic Focusing
The particular instrument used in this study was a Coulter Epics XL-MCL.
The four data points measured by this instrument for these studies are:
Forward Scatter (FS)
Side Scatter (SS)
Red Fluorescence (FL1)
Green Fluorescence (FL2)
Figure 2-3 highlights the high-level physics behind the measurement of
forward scatter and side scatter. Forward scatter represents cell size. Side scatter
represents internal structure of the cell, or granularity. Among cells with different
structures, light scatter provides a rough indicator of cell size. Among cells with
similar structures, forward scatter and side scatter increase monotonically with cell
size. Dead cells and cellular debris tend to have higher side scatter than live cells.
Taken together, forward scatter and side scatter can help identify and thereby
exclude dead cells and debris (Parks, 1996).
Figure 2-3. Forward and Side Scatter
Fluorescence detectors measure the presence of cells or molecules that have
been dyed during the pre-processing of the sample. Each particle on which data is
recorded is called an event. Data collected by the flow cytometer during the
processing of a sample is written to an output file.
The flow cytometer analyzes a sample until a certain number of acceptable
events have been processed. An acceptable event is one in which the data falls into a
pre-configured range, based on typical size and granularity for a particular
population. Biologists can label this range "live." Other ranges are sometimes
labeled "dead" or "junk." Cells that fall into the "live" range have FS and SS
measurements such that they are probably intact cells. Those that fall into the "dead"
range have FS and SS measurements such that they probably have been
compromised during the processing, and are no longer whole. Sample runs are often
configured to process 5,000 live events. This processing generally takes 15 to 120
seconds, depending on how many cells are in the sample. Depending on the sample,
anywhere from approximately 6,000 to 30,000 particles must be processed to find
5,000 particles in the acceptable range.
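The stopping rule described above can be sketched in a few lines of Python. This is
an illustration only, not the instrument's own logic: the rectangular FS/SS range, its
bounds, and the function name collect_live_events are assumptions introduced here,
and a real "live" region may have a different shape.

```python
def collect_live_events(events, fs_range, ss_range, target):
    """Scan (fs, ss) pairs until `target` of them fall inside the live range.

    Returns the accepted events and the total number of particles examined.
    The rectangular range is a simplifying assumption for this sketch.
    """
    fs_low, fs_high = fs_range
    ss_low, ss_high = ss_range
    live = []
    processed = 0
    for fs, ss in events:
        processed += 1
        if fs_low <= fs <= fs_high and ss_low <= ss <= ss_high:
            live.append((fs, ss))
            if len(live) == target:
                break  # enough acceptable events; stop the run
    return live, processed


# Hypothetical FS/SS pairs and hypothetical range bounds.
PARTICLES = [(190, 274), (34, 43), (266, 206), (85, 72), (245, 265)]
LIVE, PROCESSED = collect_live_events(PARTICLES, (100, 300), (100, 300), target=2)
```

With a target of 5,000, the ratio of particles examined to events accepted would
correspond to the 6,000 to 30,000 figure quoted above.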
The data is written in a standard format, such as FCS 2.0 or FCS 3.0, where
FCS stands for Flow Cytometry Standard. The formats are specified by the Data File
Standards Committee of the International Society for Analytical Cytology (ISAC).
The file format includes a header section, an ASCII text section specifying
parameters of the data run, and a section recording the data from the events. The
data section is often written in a binary encoding. Thus, the file must be processed
by a utility program for the event data to be translated to a human-readable form.
Table 2-1 shows sample event data, after processing.
Table 2-1. Sample Event Data
FS    SS    FL1 LOG   FL2 LOG   FS LOG   SS LOG
190   274   0         0         836      877
266   206   0         0         874      846
245   265   0         0         865      873
34    43    172       0         645      672
84    206   0         0         746      846
85    72    0         0         747      729
86    124   113       0         748      789
247   252   0         0         865      868
229   206   73        0         857      846
2.2 A Brief History of Flow Cytometry
Two pioneers in the fields of flow cytometry, fluorescence activated cell
sorting (FACS), and immunology are Leonard and Leonore Herzenberg. Their
recent autobiography, published in the Annual Review of Immunology (2004),
provides insights into the development of the flow cytometer. The development
effort started in the late 1960s:
As I became more deeply involved in immunology, I became
increasingly aware of the need to characterize and isolate the
different kinds of lymphocytes that were beginning to be
visualized with fluorescent-labeled antibodies under the
microscope... So I started asking around to see whether anyone had
solved the problem. (Herzenberg and Herzenberg, 2004)
A group of researchers at Los Alamos had developed a machine to count and
sort cell-sized particles based on particle volume. Their goal was to be able to count
debris particles obtained from the lungs of mice and rats, set aloft in balloons after
atomic tests. Consequently, they had no need to measure fluorescence. However,
Herzenberg was able to convince the researchers to share their engineering drawings
and schematics with him. He returned to Stanford and assembled a team to build the
first device. The first cell-sorting paper was published in Science in 1969
(Herzenberg and Herzenberg, 2004).
Data was originally collected by photographing histograms displayed on
oscilloscope screens. Improvements in subsequent generations of the machine
included the addition of logarithmic amplifiers, thereby allowing the full range of
data to be displayed on a single data plot; the use of computers in data collection;
and software for data computation and display. The original computer used in the
architecture was the PDP-8, followed by the PDP-11. In 1983, the first dual-laser
instrument was put into routine service. This allowed simultaneous measurements
of multiple fluorescences. In 1998, a three-laser instrument was developed, allowing
up to eleven distinct fluorescence emissions to be measured (Herzenberg and
Herzenberg, 2004).
2.3 Experimental Design
Researchers at the Institute for Bioenergetics are engaged in a project which
considers the
link between lipid availability, lysosomal/endosomal acidity, and
the expression of cell surface Major Histocompatibility (MHC)
class II molecules in human macrophage cell lines. This work is
significant because it provides evidence for a mechanistic link
between exogenous lipid availability, inflammation and the
immune response.
Because lysosomes are established sites for fatty acid
accumulation and MHC class II molecules must traffic through the
acidic lysosomal/endosomal compartment, we reasoned that fatty
acid availability might have a direct impact on the immune
response through fatty acid-dependent alteration in the expression
of MHC class II on the macrophage cell surface. (Schweitzer et al.,
2004)
Putting this work into context for the non-biologist, intracellular vesicles,
such as the lysosome or lipid raft, may support the transport of MHC class II
molecules to the surface of the cell. These molecules aid in resistance to certain
diseases. The data considered in this paper attempts to verify the lipid raft
hypothesis by showing that mice raised on a high-fat diet have more surface
expression of MHC class II than those raised on a low-fat diet. The presence of
MHC class II can be detected through cytometric analysis.
Expanding upon in vitro experiments in human cell lines, the researchers
designed in vivo experiments on mice. The project under consideration in this paper
started with 112 mice. Half of the mice were fed a high-fat diet (5% coconut oil and
5% safflower oil), half a low-fat diet (5% safflower oil). After approximately 16
weeks, the mice were killed. The spleens were removed, and a suspension of
splenocytes was prepared. The suspension contained approximately 1,000,000 cells
per milliliter of solution.
Sub-samples of the splenocyte suspension were then dyed with substances
designed to fluoresce in the flow cytometer. Lysosomal acidity was measured by the
fluorescence of LysoSensor stain. MHC class II expression was measured by the
fluorescence of a phycoerythrin-conjugated rat anti-mouse I-A/I-E. In experimental
data, samples stained with this substance were labeled M5114. Additionally, some
samples were stained with an isotype (IgA/IgE) which binds to all non-specific
matter, thereby acting as a control. Samples were processed through the flow
cytometer, yielding the event data as discussed above.
The experimental data includes up to 13 samples or data sets on each
individual: 1 unstained, 1 isotype stained, 3 M5114 stained, and 3 LysoSensor
stained (8 samples), plus an additional set of 5 samples treated with
CytoPerm/CytoFix processing. This process, also used by Kumar et al. (2002),
permeates the cell membrane, allowing intracellular staining. This treatment is
labeled CPCF in the experimental data. The CPCF treatment was performed on one
no-stain, one isotype, and three M5114 samples, per individual.
The project was divided into four experiments. The first experiment included
only females, while the second experiment included only males. The third and
fourth experiments included both males and females. Individuals in the third
experiment were wounded at an age of 16 weeks. Individuals in the fourth
experiment were wounded at an age of 12 weeks. The wounding provided some data
for research on diabetics and wound healing. Additionally, the wounding could
activate the immune system of the wounded individuals. This would be
demonstrated by increased MHC class II expression.
The project included four strains of mice: Balb/c, C57/Black 6, UPC2
Knockout, and P6129. The P6129 strain is the parent line of the UPC2 Knockout
mice. These strains are represented in the data as B, C, U, and P individuals,
respectively.
2.4 The Analytical Process
CellQuest software, available from BD Biosciences, is the primary tool that
the researchers at the Institute for Bioenergetics use to analyze the flow cytometry
data. This software presents the data graphically, and provides summary statistics.
The software also lets users manually define subsets or clusters of data. The
graphical or statistical analysis can then be performed on those clusters.
The two main graphical presentations are the scattergram and the histogram.
The scattergram is a dot plot showing event values for two different parameters, as
shown in Figure 2-4. The dot plot is the most common and useful data
representation, particularly during real-time data collection (Parks and Bigos, 1996).
The histogram shows the value of one parameter on the horizontal axis, and the
count of events with that value on the vertical axis, as shown in Figure 2-5. The
simple one-dimensional histogram, plotting cell frequency as a function of signal
level, has been a venerable analysis tool since the beginnings of flow cytometry
(Parks and Bigos, 1996, p. 50.6). Scattergrams and histograms can be found in
Desbarates (1999), Huber (2001), and Lee (2004).
Figure 2-4. Representative Scattergram
Figure 2-5. Representative Histogram
Gating is a mechanism for graphically or numerically identifying a subset of
the data. The analyst can then view scattergrams or histograms on that subset of data
only. Additionally, certain summary statistics are available on the gates or regions.
These include number of events, percentage of gated events, percentage of total
events, arithmetic mean, and geometric mean.
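To make the gate statistics concrete, the following Python sketch computes several of
them for a one-dimensional numeric gate. It is not CellQuest's implementation: the
function name gate_statistics and the gate bounds are assumptions made here, and only
the percentage of total events is shown, because the percentage of gated events
depends on a parent gate that this sketch does not model. The input values are the
FS LOG column of Table 2-1.

```python
import statistics


def gate_statistics(values, low, high):
    """Summarize the events whose value lies in the closed interval [low, high].

    The gated values must be positive for the geometric mean to be defined.
    """
    gated = [value for value in values if low <= value <= high]
    return {
        "events": len(gated),
        "percent_of_total": 100.0 * len(gated) / len(values),
        "arithmetic_mean": statistics.mean(gated),
        "geometric_mean": statistics.geometric_mean(gated),
    }


# FS LOG values from Table 2-1; the 800-900 gate is a hypothetical choice.
FS_LOG = [836, 874, 865, 645, 746, 747, 748, 865, 857]
STATS = gate_statistics(FS_LOG, 800, 900)
```

The geometric mean is never larger than the arithmetic mean, and the gap between
them grows with the spread of the gated values.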


3. Building the Rich Analytical Environment
Recall that the event data is written to a file in an esoteric format. Biologists
are accustomed to analyzing this data with proprietary software such as CellQuest.
CellQuest represents the data graphically, and via summary statistics. Consequently,
biologists are not necessarily aware of the possibilities for analyzing the data were it
in a more accessible format.
The current analytical environment of flow cytometry analysis is limited,
closed, and sample-centric. To bring more powerful analytical tools, techniques, and
mindset to the problem space, a different environment is required. First, a
mechanism must exist to combine event data from multiple samples. Second, the
analyst needs to be able to perform a variety of statistical and graphical analyses.
These analytical approaches should be constrained only by the fundamental nature of
the data, and by the analyst's imagination. The approaches should not be artificially
constrained by a particular tool. Third, the analytical environment should be open,
allowing the data to be accessed and processed by a wide variety of means. Fourth,
the analytical environment should support rapid hypothesis testing.
This rich analytical environment (RAE) is created by parsing the FCS data
and loading the resulting data into a relational database. Additionally, the parsed
data is associated with information about the individual, the sample type, and the
experiment. Once the data is in a standard relational database, a variety of tools and
techniques can be employed. Viable tools for analyzing the data include SQL;
programming languages such as Java, Perl, and database stored procedures; and data
analysis and graphing programs. Such programs can have a mathematical or
scientific focus, such as MATLAB (www.mathworks.com) and Origin
(www.originlab.com). Alternatively, they can have a business intelligence focus,
such as Business Objects (www.businessobjects.com) and Cognos
(www.cognos.com). Many powerful tools for analyzing data in relational databases
are available. The research biologist can leverage these tools, once the data is
exposed. This chapter discusses the process of building the RAE.
3.1 Data Flow
Figure 3-1 shows the data flow for the first generation of the rich analytical
environment. The Java program written for this work, FCSParser, processes the
native flow cytometry file into two files, events and parameters. The resulting files are
then loaded into a relational database. The sample key, maintained by the biologists
to correlate samples with individual mice and sample type, is also loaded into the
database.
Figure 3-1. First Generation Data Flow Diagram
Figure 3-2 shows the data flow of the second generation of the RAE. In this
case, the event data is combined with information about the sample with which it is
associated. This metadata includes individual, diet type, strain, and sample type.
The resulting environment is easier to use, because fewer database joins are required.
Additionally, the reduction in joins can improve query performance. The modified
components of this solution are FCSParser', events.dat', and sample key'.
The following section provides more details on preparing data for the database.
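The difference between the two generations can be illustrated with a small in-memory
SQLite database driven from Python. Everything in this sketch is hypothetical: the
table names (sample_key, event_v1, event_v2), the column names, the sample tags, the
diet codes, and the measurement values were invented for illustration and are not
the schema or data of this work. The point is only that the second-generation query
needs no join to group events by diet type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- First generation: events and sample key live in separate tables.
    CREATE TABLE sample_key (tag TEXT PRIMARY KEY, individual TEXT,
                             diet TEXT, strain TEXT, sample_type TEXT);
    CREATE TABLE event_v1 (tag TEXT, experiment INTEGER,
                           fs INTEGER, fl2_log INTEGER);
    -- Second generation: each event row carries its sample metadata.
    CREATE TABLE event_v2 (tag TEXT, experiment INTEGER, individual TEXT,
                           diet TEXT, strain TEXT, sample_type TEXT,
                           fs INTEGER, fl2_log INTEGER);
""")
conn.executemany("INSERT INTO sample_key VALUES (?, ?, ?, ?, ?)",
                 [("S1", "BH-3", "H", "B", "M5114"),
                  ("S2", "BL-6", "L", "B", "M5114")])
conn.executemany("INSERT INTO event_v1 VALUES (?, ?, ?, ?)",
                 [("S1", 1, 200, 400), ("S1", 1, 210, 420), ("S2", 1, 190, 300)])
# Build the denormalized table once, at load time.
conn.execute("""
    INSERT INTO event_v2
    SELECT e.tag, e.experiment, k.individual, k.diet, k.strain,
           k.sample_type, e.fs, e.fl2_log
    FROM event_v1 e JOIN sample_key k ON k.tag = e.tag
""")

FIRST_GEN = conn.execute("""
    SELECT k.diet, AVG(e.fl2_log)
    FROM event_v1 e JOIN sample_key k ON k.tag = e.tag
    WHERE k.sample_type = 'M5114'
    GROUP BY k.diet ORDER BY k.diet
""").fetchall()

SECOND_GEN = conn.execute("""
    SELECT diet, AVG(fl2_log)
    FROM event_v2
    WHERE sample_type = 'M5114'
    GROUP BY diet ORDER BY diet
""").fetchall()
```

Both queries return the same rows. The second pays for its simplicity with repeated
metadata on every event row, which is the usual trade-off of a denormalized design.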
Figure 3-2. Second Generation Data Flow Diagram
3.2 FCS Parsing
The FCS files, collected and written by the flow cytometer, are written in a
published format. The formats are specified by the Data File Standards Committee
of the International Society for Analytical Cytology (ISAC). The file format
includes a header section, a section recording parameters of the data run, and a
section recording the events. There is also an optional analysis section.
The header section is of a fixed format, noting the version of the data
standard (FCS2.0 or FCS3.0), and providing byte offsets to the remaining data.
Among the values in the header section are the byte offset to the beginning of the
text section, the byte offset to the end of the text section, the byte offset to the
beginning of the data section, and the byte offset to the end of data section. For
example, the standard states that the offset to the first byte of the text section is found
in positions 10-17, and the offset to the last byte of the text section is found in
positions 18-25. A representative header from the files considered in this work is:
FCS2.0 128 971 2048 90223 96512 96537 90240 96425
The file format is FCS2.0. The text section ranges from bytes 128-971, the data
section from 2048-90223, the analysis section from 96512-96537, and a user-defined
other section from 90240-96425. These last two sections are not processed in this
work.
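As a minimal sketch of reading these offsets, the Python fragment below slices a
fixed-width header into fields. It is not the FCSParser code written for this work.
The text states only that the text-section offsets occupy positions 10-17 and 18-25;
the sketch assumes that the data and analysis offsets follow in the same 8-character
pattern, and that the version string is followed by four spaces. The example header
is rebuilt from the values quoted above rather than read from a file, and the
function names are hypothetical.

```python
FIELD_NAMES = ["text_start", "text_end", "data_start", "data_end",
               "analysis_start", "analysis_end"]


def parse_fcs_header(header: bytes) -> dict:
    """Split a fixed-width FCS header into its version string and byte offsets."""
    parsed = {"version": header[0:6].decode("ascii")}
    for index, name in enumerate(FIELD_NAMES):
        start = 10 + 8 * index          # fields are assumed to be 8 characters wide
        field = header[start:start + 8].decode("ascii").strip()
        parsed[name] = int(field) if field else 0   # a blank field is read as 0
    return parsed


def build_example_header() -> bytes:
    """Rebuild a header from the offsets quoted in the text (layout assumed)."""
    offsets = [128, 971, 2048, 90223, 96512, 96537]
    fields = "".join(f"{value:>8d}" for value in offsets)
    return ("FCS2.0" + " " * 4 + fields).encode("ascii")


PARSED = parse_fcs_header(build_example_header())
DATA_LENGTH = PARSED["data_end"] - PARSED["data_start"] + 1
```

For the representative header, the data section works out to 88,176 bytes.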
The text section consists of ASCII name-value pairs. The first character of
the text section is the delimiter which separates the names and values. Names or
keywords which are required in the specification are prefaced by the $ character.
Additional keywords can be included in the text section, but they are not prefaced by
the $ character. Among the required keywords are $DATATYPE, $PAR, and
$TOT. The values associated with these keywords represent the type of data in the
data segment (ASCII text, binary integer, binary floating point), the number of
parameters recorded in an event (where the parameters are such values as FS, SS,
FL1LOG and FL2LOG), and the total number of events in the data set, respectively.
A subset of the text section from one of the files considered in this work
follows:
!$FIL!G0036390.LMD!$SYS!DOS 6.22!$INST!UCCS Flow Center ,Science Bldg. Rm.163
The file name is G0036390.LMD. It was created by a computer running DOS 6.22,
at the institution named UCCS Flow Center,Science Bldg. Rm. 163.
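The delimiter convention can be sketched in Python as follows. This is an
illustration rather than the parser used in this work: the function name
parse_text_segment is hypothetical, and the sketch assumes that the delimiter
character never appears inside a keyword or a value.

```python
def parse_text_segment(text: str) -> dict:
    """Turn a delimited FCS text segment into a keyword-to-value dictionary.

    The first character of the segment is taken as the delimiter.
    """
    delimiter = text[0]
    tokens = text[1:].split(delimiter)
    if tokens and tokens[-1] == "":
        tokens.pop()                      # ignore a trailing delimiter
    # Tokens alternate: keyword, value, keyword, value, ...
    return dict(zip(tokens[0::2], tokens[1::2]))


# The first two pairs of the extract shown above.
PAIRS = parse_text_segment("!$FIL!G0036390.LMD!$SYS!DOS 6.22!")
```

Because the delimiter is read from the data itself, the same function handles files
that use a different delimiter character.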
The data section consists of the raw data recorded by the flow cytometer.
Several of the keywords in the text section describe the data section. These
keywords include $MODE, $DATATYPE, $BYTEORD, and $PnB. $MODE
specifies whether the data is in list or histogram mode. The data considered in this
work was list mode, which means that data was recorded for each cell or event.
Additionally, in this work, the data was written as binary integers with a byte order
of 1,2, as specified by the $DATATYPE and $BYTEORD parameters. The byte
order specifies the order in which the binary data bytes are written to compose a data
word: 1 refers to the least significant byte, 2 to the most significant byte.
Recording the byte order allows data which is written under one operating system to
be analyzed under another operating system. $PnB specifies the number of bits in
each parameter, where n is the number of the parameter (i.e., 1, 2, 3).
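A list-mode data section of this kind can be decoded with Python's struct module, as
sketched below. The sketch is not the parsing code written for this work. It assumes
that every parameter occupies 16 bits, which the two-element byte order suggests but
which the text does not state, and the function name decode_events is hypothetical.
A byte order of 1,2 places the least significant byte first, which struct calls
little-endian.

```python
import struct


def decode_events(data: bytes, n_params: int, byte_order: str = "1,2") -> list:
    """Decode list-mode events stored as consecutive 16-bit unsigned integers.

    Each event is `n_params` values long. The 16-bit width is an assumption.
    """
    prefix = "<" if byte_order == "1,2" else ">"   # "<" = least significant byte first
    event_format = prefix + "H" * n_params
    return [list(values) for values in struct.iter_unpack(event_format, data)]


# Two six-parameter events packed by hand for the example.
RAW = struct.pack("<6H", 75, 164, 387, 0, 734, 821) + \
      struct.pack("<6H", 197, 203, 451, 0, 841, 844)
EVENTS = decode_events(RAW, n_params=6)
```

Under the same assumption, a data section of 88,176 bytes would hold 7,348
six-parameter events of 12 bytes each.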
In this work, a Java program was written to read an individual file and write
two resulting files, params<tag>.dat and events<tag>.dat. The format of the data
section was determined by manual inspection of the appropriate parameters of the
text section. The parsing code was written with this knowledge. As such, it is not a
fully general parsing solution. The value of <tag> was determined by extracting the
first part of the input file name. Sample input file names, as generated by the flow
cytometer, are:
G0036500.LMD
G0036501.LMD
G0036502.LMD
G0036503.LMD
G0036504.LMD
Consequently, given an input file of G0036500.LMD, output files would be
paramsG0036500.dat and eventsG0036500.dat. Shell scripts were used to run the
parsing program on all of the files within a particular directory.
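The naming convention can be expressed in a few lines of Python. This is only a sketch of the convention described above; the function name output_names is hypothetical, and the actual work used a Java program driven by shell scripts.

```python
from pathlib import Path


def output_names(input_name):
    """Derive the param and event file names from an input file name."""
    tag = Path(input_name).stem  # "G0036500.LMD" -> "G0036500"
    return f"params{tag}.dat", f"events{tag}.dat"


param_file, event_file = output_names("G0036500.LMD")
```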
In the first generation of the parsing code, all rows of both output files were
prepended with the tag and the experiment number. The tag identifies the particular
sample. The experiment number allows identification and efficient processing of a
subset of the data. Table 3-1 and Table 3-2 show sections of data from each of these
two output files.
Table 3-1. An Extract from a Param File
G0036076|1|FIL|G0036076.LMD
G0036076|1|SYS|DOS 6.22
G0036076|1|INST|UCCS Flow Center,Science Bldg. Rm.163
G0036076|1|CYT|XL AB27219
G0036076|1|DATE|10-Jul-03
G0036076|1|BTIM|16:59:34
G0036076|1|SRC|ss new keep 1 ns
G0036076|1|SMNO|G0036076
G0036076|1|OP|SLV
G0036076|1|COM|SYSTEM II Version 3.0
G0036076|1|TESTNAME|Lisa spleen FL1
G0036076|1|TESTFILE|G0000157.PRO
G0036076|1|BYTEORD|1,2
G0036076|1|DATATYPE|I
Table 3-2. An Extract from an Event File, First Generation
G0036076|1|75|164|387|0|734|821
G0036076|1|197|203|451|0|841|844
G0036076|1|173|129|432|0|826|794
G0036076|1|181|116|383|0|831|782
G0036076|1|181|161|344|0|831|818
G0036076|1|201|212|399|0|842|848
G0036076|1|193|221|464|0|838|853
As the data model evolved, the information added to each row of data in the
event file grew. In the second generation of the environment, each row in the event
table includes key information from the sample key. Representative data is shown in
Table 3-3.
Table 3-3. An Extract from an Event File, Second Generation
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|75|164|387|0|734|821
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|197|203|451|0|841|844
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|173|129|432|0|826|794
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|181|116|383|0|831|782
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|181|161|344|0|831|818
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|201|212|399|0|842|848
3.3 Preparing the Flow Sample Key
For the experiments under discussion, the Flow Sample Key data was stored
in a Microsoft Word Document. An extract of that data is shown in Table 3-4.
Table 3-4. Flow Sample Key
Sample # Description Sample # Description
101. BL-7 No Stain 126. BH-12 no stain
102. BL-7 isotype 127. BH-12 isotype
103. BL-7 M5114 #1 128. BH-12 M5114 #1
104. BL-7 M5114 #2 129. BH-12 M5114 #2
105. BL-7 M5114 #3 130. BH-12 M5114 #3
106. BL-8 No stain 131. CL-8 no stain
107. BL-8 isotype 132. CL-8 isotype
108, BL-8 M5114 #1 133. CL-8 M5114 #1
109. BL-8 M5114 #2 134. CL-8 M5114 #2
Turning this data into a database-friendly format required several steps. The first
step was a parsing step, whereby the descriptions were split apart into Individual,
Sample Type, Replicate, Strain, Diet, and Process components. The
Individual values represent the particular mice. The Individual values shown in this
set are BL-7, BL-8, BH-12, and CL-8. The Sample Type values indicate how the
sample has been stained. The values shown in this set include No Stain, Isotype, and
M5114. Replicate values are 0, 1, 2, and 3. The first letter of the Individual
designation also represents the mouse strain. The second letter represents diet, either
high fat (H) or low fat (L). Finally, some of the samples were processed with a
CytoPerm/CytoFix (CPCF) treatment. The two process values are NORMAL and
CPCF.
Inspection of the sample key also shows that the input must be cleansed or
standardized. For example, "No stain" and "no stain" were transformed to "No Stain",
and "isotype" to "Isotype". Additionally, the Sample # value "108," was changed to "108."
Finally, a value for Experiment was prepended to the data set. Resulting data,
ready for loading into the database, is shown in Table 3-5.
Table 3-5. Sample Key Data, Reformatted for Database
Experiment Sample Number Individual Sample Type Replicate Strain Diet Process
1 14 BL-6 Lyso 1 B L NORMAL
1 15 BL-6 Lyso 2 B L NORMAL
1 16 BL-6 Lyso 3 B L NORMAL
1 17 BH-4 No Stain 0 B H NORMAL
1 18 BH-4 Lyso 1 B H NORMAL
1 19 BH-4 Lyso 2 B H NORMAL
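The parsing and cleansing steps described in this section can be sketched in Python as shown below. This is an illustrative reconstruction, not the procedure actually used to prepare the data. The function name parse_sample_key_entry and the CANONICAL_TYPES lookup are hypothetical, and the process value is passed in as an argument because the text does not say where it is recorded in the sample key.

```python
import re

# Canonical spellings for the cleansing step (lower-case key -> stored value).
CANONICAL_TYPES = {"no stain": "No Stain", "isotype": "Isotype"}


def parse_sample_key_entry(experiment, sample_no, description, process="NORMAL"):
    """Turn one Flow Sample Key entry into a row shaped like Table 3-5."""
    match = re.fullmatch(r"(\S+)\s+(.*?)\s*(?:#(\d+))?", description.strip())
    if match is None:
        raise ValueError(f"unrecognized description: {description!r}")
    individual, raw_type, replicate = match.groups()
    sample_type = CANONICAL_TYPES.get(raw_type.lower(), raw_type)
    return (
        experiment,
        int(sample_no.rstrip(".,")),  # "108," and "108." both become 108
        individual,
        sample_type,
        int(replicate) if replicate else 0,  # no "#n" suffix means replicate 0
        individual[0],  # strain: first letter of the individual
        individual[1],  # diet: second letter, H or L
        process,
    )


row = parse_sample_key_entry(1, "108,", "BL-8 M5114 #1")
```

The returned tuple follows the column order of Table 3-5: experiment, sample number, individual, sample type, replicate, strain, diet, and process.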
In the future, the biologists may want to record their sample keys in a more
database-friendly format. One possibility is an Excel spreadsheet with column
headers corresponding to the database fields (Experiment, Sample_No, Individual,
Sample Type). Additionally, they might want to include validation rules on the
columns to keep the data clean.
3.4 Connecting the Flow Data to the Sample Key
The data collected by the flow cytometer needs to be associated with the data
from the Flow Sample Key. The parameter SRC contains the necessary link.
Sample SRC values can be found in Table 3-6. The non-numeric part of the value is
filtered out. The remaining numeric portion is written to the param file with the
name SAMPLE_NO. Note that under certain circumstances, there are multiple runs
per sample. In the second generation of the RAE, this sample number was used to
retrieve information from the sample key.
Table 3-6. Values of SRC Parameter
ss 95
ss 96
ss 97 ns
ss 97 ns re
ss 97 ns re re
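The filtering step applied to these SRC values can be sketched in Python as follows. The function name sample_no_from_src is hypothetical, and the sketch assumes that the only digits in an SRC value are those of the sample number, which holds for the values listed in Table 3-6.

```python
def sample_no_from_src(src):
    """Keep only the digits of an SRC value and return them as an integer."""
    digits = "".join(ch for ch in src if ch.isdigit())
    # Return None when the operator entered no sample number at all.
    return int(digits) if digits else None


sample_numbers = [
    sample_no_from_src(value)
    for value in ["ss 95", "ss 96", "ss 97 ns", "ss 97 ns re", "ss 97 ns re re"]
]
```

The last three values all map to sample number 97, which reflects the multiple runs per sample noted above.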
Parks and Bigos (1996) note that
the designers of most flow data systems have viewed
documentation as a poor stepchild to the glamour of graphical
display. Often, all they provide is a pedestrian editor for keyword
values to be stored with the data file. ... These editors tend to be
oriented toward single samples and not taking sample grouping
and experiment structure into account. Once the annotations are
entered, there may be no facility for browsing or searching the
keywords to group or retrieve relevant files when doing later
analysis. (p. 50.1)
The system used by the researchers at the Institute for Bioenergetics for
correlating flow files to their sample key is tedious. The SRC values discussed
above are entered into the flow cytometer at sample run time. The values are printed
automatically on a hardcopy analysis summary which is generated at run time.
These printouts are stored in a binder. When a researcher wants to find the flow file
associated with the sample, BL-6 Lyso #1, she first finds that entry in the sample
key, and mentally notes the associated key number of that sample (e.g. 14). She then
turns pages in her binder, scanning the printouts until she finds the number 14
embedded in the SRC field (A. Reding, personal communication, March 30, 2004).
3.5 Evolution of the Data Model
The data model for the Rich Analytical Environment underwent several
iterations before reaching its current form. The initial version was normalized, with
tables for parameters, events, and the sample key. This version had several
shortcomings. First, the volume of data vis-a-vis machine resources created a need
for performance tuning. Second, multiple joins were required to connect the event
data with the sample key data. Both of these factors led to the consideration of a star
schema or dimensional model. However, since most of the dimensions are single-
attribute dimensions, the collapse of all requisite values into a single table made
sense. Each of these iterations of the data model is discussed in this section.
3.5.1 The Entity-Relationship Model
The entity-relationship model, as shown in Figure 3-3, accurately represents
the data coming from two data sources. The first data source is the FCS files. Recall
that there are two types of data that are relevant to the analyst. These are the
individual events that are measured in a particular sample, and the parameters that
describe each sample. The event data, translated from binary into ASCII, comprises
each row of the event table. Additionally, each event is prepended with the
experiment number and the tag or file name.
event: tag, experiment, fs, ss, fl1log, fl2log, fslog, sslog
param: tag, experiment, param_name, param_value
sample_key: experiment, sample_no, individual, sample_type
param is joined to sample_key where param_name='SAMPLE_NO' and param_value=sample_no
Figure 3-3. Entity Relationship Model
There were two options for modeling the parameter table. One option was to
create a table having one field for every parameter, 68 fields in this work. The other
option was to create a more general format in which each name-value pair had one
row. Since implementers of the FCS format can include their own name-value pairs
in the text section of the file, this second option is more flexible in the long run.
Additionally, this option supports the creation of additional name-value pairs that
support the analysis. For example, the sample number that corresponds to the
sample key is embedded in the SRC field. This field was parsed to extract the
numeric part of the data. This number, along with the name SAMPLE_NO, was
added to the param table. Again, each row includes the experiment and the tag.
The SAMPLE_KEY table is simply a database representation of the Flow
Sample Key, which the biologists maintain in a word processor. INDIVIDUAL, SAMPLE_TYPE, and
REPLICATE are split apart into separate fields. DIET and STRAIN are derived
from the appropriate character of the individual value. Each row also includes the
experiment number.
3.5.2 The Dimensional Model
This environment is a reporting environment as opposed to a transactional
environment. Thus, techniques for performance and usability in reporting
environments are worth considering. One prominent technique is the dimensional
model, which is also known as the star schema. A dimensional model for the RAE is
shown in Figure 3-4. Visually represented, a fact is at the center of the model, with
dimensions or attributes surrounding that fact.
Figure 3-4. Dimensional Model
In this application, the event is the fact or the measurement at the center of
the star. The dimensions include INDIVIDUAL, TAG, EXPERIMENT, SAMPLE,
SAMPLE_TYPE, REPLICATE, PROCESS, DIET, and STRAIN. Most of these
dimensions include only one attribute. As such, they are degenerate dimensions.
Kimball, Reeves, Ross and Thornthwaite (1998) recommend collapsing the
degenerate dimensions into the fact table, without a join to anything. INDIVIDUAL
is the only dimension that has multiple attributes. These attributes could include sex,
weight, age, food consumption, and various other attributes recorded by the
biologists. The resulting event table is shown in Figure 3-5.
event
tag
experiment
individual
strain
diet
sample
sample_type
replicate
process
fs
ss
fl1log
fl2log
fslog
sslog
Figure 3-5. The Event Table
Because most queries are performed on the event table and the event table
only, this structure increases the ease with which the biologists can query the data.
Table 3-7 shows a histogram-generating query from the original implementation.
Table 3-8 shows the logically equivalent query from the second generation
implementation. Indexing on key values like INDIVIDUAL, DIET,
EXPERIMENT, SAMPLE_TYPE, and REPLICATE will enhance performance.
Table 3-7. Histogram-generating Query, Original Model
select e.tag, sample_type, round(fs/25), count(*)
from event e,
param p,
sample_key s
where e.tag=p.tag
and param_name='SAMPLE_NO'
and param_value=to_char(sample_no)
and individual='BH-6'
and s.experiment=1
and p.experiment=1
group by e.tag, sample_type, round(fs/25)
having round(fs/25)<20
Table 3-8. Histogram-generating Query, Dimensional Model
select tag, sample_type, round(fs/25), count(*)
from event
where individual='BH-6'
group by tag, sample_type, round(fs/25)
having round(fs/25)<20
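The single-table form of the query can be tried out on a small scale with Python's built-in sqlite3 module, as sketched below. This is an illustration only: the work described here used a different database, the rows and the tags T1 and T2 are invented, and fs is divided by 25.0 rather than 25 because SQLite would otherwise perform integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table event (tag text, individual text, sample_type text, fs integer)"
)
# Invented rows: four events for BH-6 and one for another individual.
rows = [
    ("T1", "BH-6", "M5114", 75),
    ("T1", "BH-6", "M5114", 80),
    ("T1", "BH-6", "M5114", 181),
    ("T1", "BH-6", "M5114", 900),  # bin 36, removed by the having clause
    ("T2", "BL-3", "M5114", 75),   # removed by the where clause
]
conn.executemany("insert into event values (?, ?, ?, ?)", rows)

histogram = conn.execute(
    """
    select tag, sample_type, round(fs / 25.0) as bin, count(*)
    from event
    where individual = 'BH-6'
    group by tag, sample_type, round(fs / 25.0)
    having round(fs / 25.0) < 20
    order by bin
    """
).fetchall()
conn.close()
```

No join is needed, because the individual and the sample type are stored on every event row.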
3.5.3 Data Volume
The volume of data created by cell biology research is significant. On
average, each individual generated 154,000 rows of event data. Compare this to a
business application, such as cable billing. Suppose that each customer has 3
products on his or her account. Twelve months' worth of billing would create 36
billing line items. One would need to bill nearly 4300 customers for 1 year to create
as much data as one specimen generates! Selected statistics on the data set are
shown in Table 3-9.
Table 3-9. Selected Data Set Statistics
# of Files # of Events # Distinct Individuals Avg Events per Sample Avg Events Per Individual
Exp1 418 3,889,684 25 9,305 155,587
Exp2 360 2,791,145 24 7,753 116,298
Exp3 279 2,799,232 19 10,033 147,328
Exp4 362 3,119,628 22 8,618 141,484
TOTAL 1419 12,599,689 90
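The arithmetic behind the cable billing comparison can be checked with a few lines of Python. The variable names are illustrative; the figures are those quoted in the text.

```python
events_per_individual = 154_000      # average rows of event data per individual
line_items_per_customer_year = 3 * 12  # 3 products billed for 12 months

customers_needed = events_per_individual / line_items_per_customer_year
```

The result is roughly 4,278 customers, which the text rounds to nearly 4300.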
4. Analytical Techniques in the Rich Analytical Environment
Once data from a flow cytometry experiment, which consists of multiple
samples on multiple individuals, has been parsed and loaded into the RAE, a variety
of analytical techniques can be employed. These techniques can be statistical or
graphical. Additionally, many of the techniques should be thought of as standard
techniques, performed each time an experiment is conducted. These standard
techniques are fundamentally analysis patterns, enabling the analyst to know much
more about his or her data, and to reach this knowledge very quickly. Furthermore,
the RAE supports analyses in which multiple samples are analyzed with the same
technique at the same time. The RAE also supports the derivation of new
measurements from the base measurements collected by the flow cytometer. This
chapter discusses some of the possible analytical techniques.
The RAE contains data on multiple samples. There are two broad groupings
of these samples, individual and sample type. In the data analyzed in this work,
there are four different sample types. The types are No Stain, Isotype, Lyso, and
M5114. Additionally, for certain individuals, the No Stain, Isotype, and M5114
samples are also processed with the CytoPerm/CytoFix technique, which perforates
the cell membrane, allowing the stain to adhere to the interior of the cell. Thus, one
approach to segment-based analysis is to perform standard analytical techniques on
all of the samples for a particular individual. This approach demonstrates the
similarities and the differences in the different sample types for that particular
individual. The other approach is to analyze all of the samples of a particular type
(e.g. M5114) for all individuals, or a particular subset of individuals. This approach
demonstrates similarities and differences in that particular sample type across a
population of individuals.
Additionally, new measurements can be derived from the data collected by
the flow cytometer. Recall that the flow cytometer measures at least four
parameters: forward scatter, side scatter, and two types of fluorescence. Recall also
that the fluorescence is created by staining the cell with compounds which will
adhere to certain molecule types, on or in the cell. The fluorescence is generated
by these stains. All other things being equal, the larger cells have more surface area
for the stains to adhere to, and consequently, more fluorescence. Thus, the analyst
might want to normalize the fluorescence as a function of cell size. Since forward
scatter is an approximate measurement of cell size, one viable normalization function
is FL2/FS. Many of the analysis techniques discussed in the following section refer
to this normalized fluorescence.
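A minimal Python sketch of this normalization follows. The function name normalized_fl2 and the event values are invented for illustration, and the guard against a forward scatter of zero is an assumption added here, not something described in the text. The sketch also shows that averaging the per-event ratios is not the same as dividing the average fluorescence by the average forward scatter.

```python
from statistics import mean


def normalized_fl2(fl2log, fs):
    """Fluorescence per unit of forward scatter, or None when fs is not positive."""
    return fl2log / fs if fs > 0 else None


# Invented (fl2log, fs) pairs for two events.
events = [(100, 50), (300, 300)]

ratios = [normalized_fl2(fl2log, fs) for fl2log, fs in events]
mean_of_ratios = mean(ratios)  # what avg(fl2log/fs) computes: (2.0 + 1.0) / 2
ratio_of_means = mean([fl2log for fl2log, _ in events]) / mean(
    [fs for _, fs in events]
)  # avg(fl2log) / avg(fs): 200 / 175
```

Because small cells with modest fluorescence can carry large ratios, the two summaries can differ noticeably.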
4.1 Statistical Analyses
This section presents a variety of statistical analyses on the data collected by
the project. The discussion of each type of analysis includes a brief overview of the
purpose of the analysis, the SQL statement which retrieves data, all or part of the
result set, and a brief comment about the findings.
4.1.1 Summary Statistics, Entire Experiment
A good starting point for analysis is high level statistics on an entire
experiment. These statistics provide an overall sense of results, and relative
consistency (or lack thereof) across samples. An appropriate SQL statement is
shown below.
SQL Statement
select tag, individual, sample_type, count(*),
avg(fs), stddev(fs), avg(ss), stddev(ss), avg(fl2log)
from event
where experiment=2
and process='NORMAL'
group by tag, individual, sample_type
Table 4-1. Result Set (Extract): Summary Statistics, Entire Experiment
ROW TAG INDIVIDUAL SAMPLE_TYPE COUNT(*) AVG(FS) STDDEV(FS) AVG(SS) STDDEV(SS) AVG(FL2LOG)
1 G0036563 BL-9 No Stain 7681 220 113 309 211 3.5
2 G0036564 BL-9 No Stain 7558 216 108 299 204 3.6
3 G0036565 BL-9 No Stain 7663 219 112 307 212 3.4
4 G0036566 BL-9 Lyso 7484 198 92 267 185 1.5
5 G0036567 BL-9 Lyso 7108 212 100 281 188 0.6
6 G0036568 BL-9 Lyso 7220 212 101 283 193 0.6
7 G0036569 BL-10 No Stain 7102 224 103 305 216 0.1
8 G0036572 BL-10 No Stain 7880 242 125 357 240 2.5
9 G0036573 BL-10 No Stain 7749 244 120 359 239 2.9
10 G0036574 BL-10 No Stain 7838 238 118 351 236 2.8
11 G0036575 BL-10 Lyso 7680 219 102 310 220 0.4
12 G0036576 BL-10 Lyso 8153 212 102 308 217 0.7
13 G0036577 BL-10 Lyso 7721 216 102 305 212 0.6
14 G0036629 CH-8 No Stain 76511 135 133 292 238 0.1
15 G0036630 CH-8 Lyso 62706 140 135 287 234 1.2
16 G0036631 CH-8 Lyso 63176 136 128 277 226 1.3
17 G0036632 CH-8 Lyso 62629 139 130 286 231 0.9
18 G0036687 BL-9 No Stain 6078 235 107 178 148 148.7
19 G0036688 BL-9 Isotype 6332 222 110 170 146 135.7
20 G0036689 BL-9 M5114 6472 245 135 194 167 584.9
21 G0036690 BL-9 M5114 6378 240 119 189 157 566.7
22 G0036691 BL-10 No Stain 6291 238 118 183 156 567.2
23 G0036692 BL-10 No Stain 6761 224 108 192 155 105.8
24 G0036693 BL-10 Isotype 6773 228 116 199 169 130.5
25 G0036694 BL-10 M5114 7102 224 117 204 173 491.0
26 G0036695 BL-10 M5114 6964 226 119 203 174 524.3
27 G0036696 BL-10 M5114 6920 232 124 207 176 503.9
28 G0036747 CH-8 No Stain 26788 157 137 219 209 115.8
29 G0036748 CH-8 Isotype 26650 151 128 208 201 129.2
30 G0036749 CH-8 M5114 26203 151 123 208 196 409.0
31 G0036750 CH-8 M5114 27246 156 139 216 210 590.6
32 G0036751 CH-8 M5114 31303 156 149 220 219 583.9
The results show:
in general, a fairly consistent number of events in each sample;
certain suspect samples, as indicated by a large number of events and/or reruns of
the sample (e.g. CH-8);
relatively consistent measurements across replicates for a given individual and a
given sample type (e.g. rows 1-3, 11-12); and
that no stain "families" have almost no FL2 (rows 1-17), while stain families
show some FL2 (rows 18-32), even on No Stain samples.
4.1.2 Suspect Samples
Another technique highlights suspect samples or individuals by identifying
the samples that have a large number of events, say more than 20,000. Additionally,
the query can identify how many suspect samples are associated with a particular
individual.
SQL Statement
select experiment, individual, count(*)
from
(
select experiment, tag, individual, count(*)
from event
group by experiment, tag, individual
having count(*) > 20000
)
group by experiment, individual
Table 4-2. Result Set: Suspect Samples
EXPERIMENT INDIVIDUAL COUNT(*)
1 BL-3 3
1 BL-4 6
1 BL-5 2
1 PH-4 6
1 UL-4 1
2 CH-8 9
3 CH-2 7
3 PH-3 5
3 PL-1 10
3 UL-1 2
Note that compared to Experiments 1 and 3, Experiment 2 has a small number of
suspect individuals. Depending on the sample type of the suspect samples, and on
the particular analysis performed, the suspect samples/suspect individuals could
skew comparisons made between Experiment 1 and Experiment 2.
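The two-level grouping performed by the suspect sample query can be mirrored in Python as sketched below. The function name suspect_samples and the input rows are invented, and a threshold of 2 is used in the example only so that a handful of rows suffices.

```python
from collections import Counter


def suspect_samples(events, threshold=20000):
    """Count, per (experiment, individual), the samples with too many events.

    events is an iterable of (experiment, tag, individual) tuples, one per event.
    """
    # Inner query: number of events in each sample (one tag per sample).
    events_per_sample = Counter(events)
    # Outer query: number of samples over the threshold for each individual.
    suspects = Counter()
    for (experiment, tag, individual), n_events in events_per_sample.items():
        if n_events > threshold:
            suspects[(experiment, individual)] += 1
    return dict(suspects)


rows = (
    [(2, "A", "CH-8")] * 3
    + [(2, "B", "CH-8")] * 3
    + [(2, "C", "BL-9")] * 2
)
result = suspect_samples(rows, threshold=2)
```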
4.1.3 Study of a Particular Individual
A variation of the analysis presented in Section 4.1.1 is summary statistics,
constrained to a particular individual.
SQL Statement
select tag, sample_type, replicate, count(*), avg(fs), avg(fl2log), avg(fl2log/fs)
from event
where individual='BH-3'
and process='NORMAL'
group by tag, sample_type, replicate
Table 4-3. Result Set: Study of Individual BH-3
TAG SAMPLE_TYPE REPLICATE COUNT(*) AVG(FS) AVG(FL2LOG) AVG(FL2LOG/FS)
G0037080 No Stain 0 7352 198 2 0.0
G0037081 No Stain 0 7409 195 3 0.0
G0037082 Lyso 1 7968 189 3 0.0
G0037083 Lyso 2 8791 179 4 0.0
G0037084 Lyso 3 8332 185 4 0.0
G0037184 No Stain 0 6691 216 45 0.3
G0037185 Isotype 0 7127 214 59 0.4
G0037186 M5114 1 7097 211 535 3.3
G0037187 M5114 2 7002 212 544 3.3
G0037188 M5114 3 7207 212 534 3.2
The results show:
a relatively consistent number of events in each sample;
increased fluorescence on Isotype and M5114 samples; and
relative consistency across replicates of the same sample type.
4.1.4 M5114 Fluorescence, Normal Process and CPCF Process
Another possible analysis is to compare M5114 fluorescence in normally
processed samples to that in CPCF processed samples.
SQL Statement
select tag, individual, sample_type, replicate, process, avg(fl2log), avg(fl2log/fs)
from event
where experiment=1
and sample_type='M5114'
and fs>20
group by tag, individual, sample_type, replicate, process
order by individual, sample_type
Table 4-4. Result Set (Extract): M5114 Samples, Normal and CPCF Processing
TAG INDIVIDUAL SAMPLE_TYPE REPLICATE PROCESS AVG(FL2LOG) AVG(FL2LOG/FS)
G0036266 BH-5 M5114 1 NORMAL 597 3.4
G0036267 BH-5 M5114 2 NORMAL 600 3.3
G0036268 BH-5 M5114 3 NORMAL 594 3.2
G0036404 BH-5 M5114 1 CPCF 442 3.5
G0036405 BH-5 M5114 2 CPCF 424 3.5
G0036406 BH-5 M5114 3 CPCF 562 4.3
G0036407 BH-5 M5114 3 CPCF 570 4.4
G0036271 BH-6 M5114 1 NORMAL 599 3.7
G0036272 BH-6 M5114 2 NORMAL 625 3.6
G0036273 BH-6 M5114 3 NORMAL 595 3.5
G0036410 BH-6 M5114 1 CPCF 568 4.7
G0036411 BH-6 M5114 2 CPCF 576 4.8
G0036412 BH-6 M5114 3 CPCF 585 4.9
G0036241 BL-3 M5114 1 NORMAL 617 3.8
G0036242 BL-3 M5114 2 NORMAL 623 4.0
G0036243 BL-3 M5114 3 NORMAL 639 3.9
G0036377 BL-3 M5114 1 CPCF 608 5.1
G0036378 BL-3 M5114 2 CPCF 611 5.0
G0036379 BL-3 M5114 3 CPCF 616 5.1
The expected finding of more FL2 on CPCF samples is not consistently shown.
However, if FL2 is normalized (FL2LOG/FS), a higher value is consistently shown
on the CPCF samples. This result highlights the value of the normalization
technique.
4.1.5 Fluorescence by Diet Type, Isotype and M5114 Stains
The types of statistics computed on individual samples can also be computed
on the entire project. This technique lumps the data into large buckets, not broken
out by sample.
SQL Statement: NORMAL Process
select diet, sample_type, avg(fl2log/fs), stddev(fl2log/fs), count(*)
from event
where
sample_type in ('M5114','Isotype')
and process='NORMAL'
group by diet, sample_type
Table 4-5. Result Set: Fluorescence by Diet Type, Normal Processing
DIET SAMPLE_TYPE AVG(FL2LOG/FS) STDDEV(FL2LOG/FS) COUNT(*)
H Isotype .7 1.1 237376
H M5114 3.4 2.8 717209
L Isotype .6 1.0 275953
L M5114 3.3 2.7 780993
The results show slightly more normalized fluorescence for the high fat samples, for
both isotype and M5114 stains, indicating support for the biologists' hypothesis
regarding high-lipid environments being conducive to the presentation of MHC
Class II molecules.
Table 4-6. Result Set: Fluorescence by Diet Type, CPCF Processing
DIET SAMPLE_TYPE AVG(FL2LOG/FS) STDDEV(FL2LOG/FS) COUNT(*)
H Isotype 1.0 1.2 203028
H M5114 4.5 3.5 624292
L Isotype 1.1 1.3 268040
L M5114 4.6 3.5 729606
The results show slightly more normalized fluorescence for the low fat samples, both
isotype and M5114 stains. The CPCF processing alters the structure of the cell,
allowing staining of MHC class II molecules both inside of and on the surface of the
cell. Thus this result is consistent with experimental design.
4.1.6 Fluorescence by Experiment and Diet Type, M5114
The analysis above is easily expanded by adding experiment as one of the
grouping variables. The select statement includes only M5114 samples, and
computes average normalized FL2. Result sets for both NORMAL and CPCF
processes are presented.
SQL Statement: NORMAL Process
select experiment, diet, avg(fl2log/fs), count(*)
from event
where
sample_type='M5114'
and process='NORMAL'
group by experiment, diet
Table 4-7. Result Set: Fluorescence by Experiment by Diet Type, Normal Processing
EXPERIMENT DIET AVG(FL2LOG/FS) COUNT(*)
1 H 3.4 215554
1 L 3.3 262408
2 H 3.6 279322
2 L 3.3 252318
3 H 3.1 222333
3 L 3.4 266267
The findings are as follows:
Experiment 2, Males, shows greater difference in normalized FL2 between high
fat and low fat diets than does Experiment 1, Females.
Experiment 2, Males, shows higher normalized FL2.
Experiment 3, Wounded Mice, shows greater normalized FL2 in low fat diets
than high fat diets; and high fat diets show the lowest normalized FL2 of any of
these result sets.
Table 4-8. Result Set: Fluorescence by Experiment by Diet Type, CPCF Processing
EXPERIMENT DIET AVG(FL2LOG/FS) COUNT(*)
1 H 4.3 225751
1 L 4.4 256878
2 H 4.4 189169
2 L 4.5 246198
3 H 4.9 209372
3 L 4.9 226530
The results show that:
in no cases does the high fat diet show more normalized FL2 than the low fat
diet; and
Experiment 3, Wounded Mice, shows higher normalized FL2 than do the other
two experiments.
Inter-experimental comparisons may be biased by differing machine calibration on
different days. Consequently, these results warrant further investigation to ascertain
validity.
4.1.7 Simple Histograms Using Text
Constructing simple histograms using text is helpful when the user does not
have easy access to a good graphical tool, and wants to do some quick graphical
exploration. The technique can be used to show similarity of histogram curves
across replicates, or for a quick identification of a local minimum.
Additionally, this technique can provide graphical results of a large number
of samples quickly. One of the challenges of graphical presentation on a two
dimensional surface, such as a piece of paper, is how to distinguish between many
samples. Only a certain number of samples can fit onto the graph before the
presentation becomes too cluttered to extract useful information. Thus, this
technique, while not showing multiple samples in the same data space, does provide
a quick visualization of a large number of samples.
SQL Statement
select tag, individual, trunc(fl2log/50) t,
substr(lpad('o',round(count(*)/20,0),'o'),1,40) graph
from event
where fs>100
and sample_type='M5114'
and process='NORMAL'
and replicate=1
group by tag, individual, trunc(fl2log/50)
Table 4-9. Result Set (Extract): Textual Histogram
G0037249|UH-2|.0|ooooooooooooooooooooooooo
G0037249|UH-2|1.0|oooooo
G0037249|UH-2|2.0|oooooo
G0037249|UH-2|3.0|ooooooooooo
G0037249|UH-2|4.0|oooooooooooooo
G0037249|UH-2|5.0|oooooooooooooooo
G0037249|UH-2|6.0|ooooooooooooooooo
G0037249|UH-2|7.0|ooooooooooooo
G0037249|UH-2|8.0|oooooooooo
G0037249|UH-2|9.0|oooooooo
G0037249|UH-2|10.0|ooooooo
G0037249|UH-2|11.0|oooooooooo
G0037249|UH-2|12.0|ooooooooooooo
G0037249|UH-2|13.0|oooooooooooooooooo
G0037249|UH-2|14.0|ooooooooooooooooooooooo
G0037249|UH-2|15.0|oooooooooooooooooooooooooooo
G0037249|UH-2|16.0|oooooooooooooooooooooooooooo
G0037249|UH-2|17.0|ooooooooooooooooooooooo
G0037249|UH-2|18.0|ooooooooooooo
G0037249|UH-2|19.0|ooooooo
G0037249|UH-2|20.0|oooo
Note the two peaks of fluorescence, and the local minimum at trunc(fl2log/50)=10.
This bin corresponds to fluorescence readings from 500 to 549.
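The same kind of textual histogram can be produced outside the database. The Python sketch below follows the logic of the SQL statement (bins of width 50, one mark per 20 events, at most 40 marks), but the function name text_histogram and the input values are invented for illustration.

```python
from collections import Counter


def text_histogram(values, bin_width=50, events_per_mark=20, max_marks=40):
    """Return one line per occupied bin, drawn with 'o' marks."""
    counts = Counter(int(value // bin_width) for value in values)
    lines = []
    for bin_index in sorted(counts):
        # Round half up, then cap the bar length.
        marks = min(int(counts[bin_index] / events_per_mark + 0.5), max_marks)
        lines.append(f"{bin_index:>3}|{'o' * marks}")
    return lines


# Invented readings: 40 events in bin 0, 100 in bin 1, and 20 in bin 2.
readings = [10] * 40 + [60] * 100 + [120] * 20
lines = text_histogram(readings)
```

Printing the returned lines one per row gives a display comparable to Table 4-9.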
4.1.8 Normalized Fluorescence by Forward Scatter
Range
The following technique was inspired by a quartile-based analysis
presented by Covas (2004). Covas presents forward scatter quartiles, and reports
fluorescence by each quartile. The technique discussed below divides forward
scatter into four equal ranges (0 to 255, 256 to 511, 512 to 767, 768 to 1023) and
computes the average normalized fluorescence in each of these ranges.
SQL Statement
select tag, individual, sample_type,
avg(decode(trunc(fs/256),0,fl2log/fs,null)) Q1,
avg(decode(trunc(fs/256),1,fl2log/fs,null)) Q2,
avg(decode(trunc(fs/256),2,fl2log/fs,null)) Q3,
avg(decode(trunc(fs/256),3,fl2log/fs,null)) Q4
from event
where experiment=2
and process='NORMAL'
and sample_type in ('M5114', 'Isotype')
group by tag, individual, sample_type
Table 4-10. Result Set: Fluorescence by Forward Scatter Range
TAG INDIVIDUAL SAMPLE_TYPE Q1 Q2 Q3 Q4
G0036688 BL-9 Isotype 1.0 0.4 0.5 0.5
G0036689 BL-9 M5114 4.1 1.7 1.4 1.0
G0036690 BL-9 M5114 3.9 1.7 1.3 1.0
G0036693 BL-10 Isotype 0.9 0.4 0.4 0.5
G0036694 BL-10 M5114 3.5 1.4 1.1 0.9
G0036695 BL-10 M5114 3.7 1.5 1.3 1.0
G0036696 BL-10 M5114 3.6 1.5 1.3 1.0
G0036708 BH-7 Isotype 0.9 0.4 0.4 0.5
G0036709 BH-7 M5114 3.8 1.6 1.2 0.9
G0036710 BH-7 M5114 4.1 1.7 1.3 1.0
G0036711 BH-7 M5114 4.1 1.7 1.3 1.0
G0036713 BH-8 Isotype 0.8 0.4 0.4 0.5
G0036714 BH-8 M5114 3.5 1.5 1.1 0.9
G0036715 BH-8 M5114 3.8 1.6 1.2 0.9
G0036716 BH-8 M5114 2.1 0.9 0.8 0.6
These results show that the Isotype samples fluoresce primarily in the lower ranges
of FS (<256), while the M5114 samples also fluoresce in FS ranges from 256 to
1023.
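The per-range averaging performed by the decode expressions can be sketched in Python as follows. The function name and the event values are invented, and skipping events whose forward scatter is not positive is an assumption made here to avoid division by zero.

```python
def mean_normalized_fl2_by_fs_range(events, range_width=256, n_ranges=4):
    """Average fl2log/fs separately for each forward scatter range.

    events is an iterable of (fl2log, fs) pairs. A range with no events
    yields None, as an average over no rows does in SQL.
    """
    totals = [0.0] * n_ranges
    counts = [0] * n_ranges
    for fl2log, fs in events:
        index = int(fs // range_width)  # 0 for 0-255, 1 for 256-511, ...
        if fs <= 0 or index >= n_ranges:
            continue
        totals[index] += fl2log / fs
        counts[index] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


# Invented (fl2log, fs) pairs.
sample_events = [(100, 100), (200, 100), (300, 300), (900, 900)]
q1, q2, q3, q4 = mean_normalized_fl2_by_fs_range(sample_events)
```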
4.2 Graphical Analyses
Recall that the mainstream graphical treatments of the data collected from a
single sample are two-dimensional scattergrams and histograms as discussed in
Chapter 2. The RAE allows the analyst to consider multiple samples
simultaneously. The two-dimensional scattergram might be able to support the
simultaneous presentation of two or three samples, with different samples indicated
by different colors. However, the more natural grouping of "all samples, this
individual" usually contains 15 samples; the grouping "all individuals, this sample
type" usually contains 20-25 samples. Consequently, a chart containing one
histogram for each sample is a better way to display more information in a compact
space.
In this work, these charts were created with a Java program incorporating
JFreeChart (www.jfree.org) libraries. One feature of this solution is that a SQL
statement submitted to the database returned a result set which was then displayed
graphically. No intermediate steps were required. Successive modifications of the
SQL statement led to reasonably clean visualizations. Modifications included such
steps as curve smoothing by creating wider bins for the histogram counts (e.g.
round(fl2log*4/fs)/4, count(*)), and constraining the result sets to a range of relevant
data activity. Table 4-11 shows a representative SQL statement. The expression
round(fl2log*4/fs)/4 rounds normalized FL2 to the nearest 0.25, which creates a
smoother curve. The constraint, having
round(fl2log*4/fs)/4 between .5 and 10, limits the result set and thus the display to
an appropriate range.
Table 4-11. SQL Statement for Graphical Result
select tag||sample_type||replicate, round(fl2log*4/fs)/4, count(*)
from event
where individual='BH-6'
group by tag||sample_type||replicate, round(fl2log*4/fs)/4
having round(fl2log*4/fs)/4 between .5 and 10
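The binning expression in Table 4-11 can be illustrated with a short Python sketch. The function names quarter_bin and binned_counts and the event values are invented, and math.floor(x + 0.5) is used so that halves round up for positive values, since Python's built-in round resolves ties differently.

```python
import math
from collections import Counter


def quarter_bin(fl2log, fs):
    """Round normalized FL2 (fl2log / fs) to the nearest 0.25."""
    return math.floor(fl2log * 4 / fs + 0.5) / 4


def binned_counts(events, low=0.5, high=10):
    """Count events per bin, keeping only bins between low and high."""
    counts = Counter(quarter_bin(fl2log, fs) for fl2log, fs in events)
    return {b: n for b, n in sorted(counts.items()) if low <= b <= high}


# Invented (fl2log, fs) pairs; the third falls in the 0.25 bin and is dropped.
sample_events = [(535, 211), (544, 212), (45, 216), (110, 200)]
curve = binned_counts(sample_events)
```

Each key of the returned dictionary is a point on the horizontal axis of the histogram, and each value is the height of the curve at that point.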
Figure 4-1 shows a family of histograms of Normalized FL2 for all samples
for a particular individual.
[Figure: family of histograms, one curve per sample; x-axis: Normalized FL2; the legend labels each curve by tag, sample type, and replicate]
Figure 4-1. Normalized FL2, Individual BH-6
This technique shows the similarity or dissimilarity of the replicate samples. It also
shows the similarities and the dissimilarities of different sample types. Note that the
real-time display includes both color and interactivity. Data values are displayed
when the mouse pointer hovers on a tick mark. Consequently, it is more informative
than the screenshot of Figure 4-1.
Figure 4-2 shows histograms of Normalized FL2 for all M5114 samples for
the individuals in Experiment 1. However, there are too many samples on this chart
for it to be useful. Figure 4-3 shows the same information, only constrained to all
individuals of the strain C.
[Figure: chart titled "Normalized FL2 Histograms: Experiment 1"; x-axis: Normalized FL2; the legend labels each curve by individual and tag]
Figure 4-2. M5114 Samples, Experiment 1
50


j^J*6iolo - > r, V:
Normalized FL2 Histograms: Experiment 1
S*r.ilni>d
Norrmtued FL2
[ CH.1CP036296 01-500036281 CH-S00036301 CL-7Q0036291 C1600036286 T Cl-400036278 CH.600036306 [
Figure 4-3. MSI 14 Samples, Experiment 1, Strain C
Figure 4-4 contains histograms of forward scatter for all M5114 samples of a
particular individual. Again, it shows the similarity of the replicate samples. It also
shows that different processes have a fundamentally different distribution of forward
scatter. Upon first thought, one would not expect the sample type to influence the
size of the cells. However, the CPCF process perforates the cell membrane, thereby
significantly altering the forward scatter.

[Figure: screenshot of the Biologists' Rich Analytical Environment showing forward scatter histograms]
Figure 4-4. Forward Scatter, M5114 Samples
4.3 Summary
In summary, the Rich Analytical Environment supports both statistical and
graphical analysis techniques. These techniques provide more information more
quickly than does the traditional flow cytometry analytical process. This speed of
response brings to mind another type of flow. In his book Flow: The Psychology of
Optimal Experience, Csikszentmihalyi (1990) suggests that as human beings, our
optimal experiences come when we are challenged. He calls these experiences
"flow" and argues that flow is just as real as being hungry, just as concrete as
walking into a wall. Two of the conditions required for experiencing flow are
becoming immersed in an activity with a sense of immediacy and timely feedback
(p. 65). The data analysis task is fundamentally more satisfying when hypotheses can
be posed, and relevant data can be retrieved, within seconds or minutes, as opposed
to hours or days. Additionally, the speed of retrieval supports rapid hypothesis
refinement and extension.

5. Conclusions and Future Work
The work presented in this thesis demonstrates that a rich analytical
environment can be created from flow cytometry data. This analytical environment
allows the analyst to perform types of analysis that were not previously possible, and
to gather more knowledge more quickly. The essential features of the environment
are:
• an open architecture in which multiple analytical tools can be leveraged;
• the combination of flow cytometry data with experimental metadata, such as
strain, diet, and sample type;
• the ability to derive new metrics from core data, such as the calculation of
normalized fluorescence;
• the ability to perform analyses across experiments, as opposed to on one or two
samples at a time; and
• the ability to group multiple samples together to perform characteristic-based
analysis, where such characteristics include diet, strain, and process.
Furthermore, because the environment is so rich from an analytical perspective, the
possibilities for future work are significant.

Future work falls into five main categories: generalization of the parsing
algorithms, automated identification of cell groups or phenotypes, incorporation of
additional information into the RAE, additional analytical algorithms, and process
automation.
The code used to parse the FCS files in this work was written specifically for
the data collected by this particular instrument. It is not a general implementation of
a parsing algorithm for the FCS standard. In particular, the number and names of the
data values collected are assumed known, as is the byte order of the data
section. A more general implementation would extract that information from the
text section, and adapt the parsing of the data section to those values.
Future work might also find the optional analysis section of the file to be valuable.
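As an illustration of the more general approach, the sketch below reads the parameter count, parameter names, and byte order from the text section instead of assuming them. It is a minimal sketch in Python, not the parser written for this work. It relies on the FCS header layout (a six-character version string, four spaces, then right-justified eight-character ASCII offsets) and on the standard keywords $PAR, $PnN, and $BYTEORD. The function names are hypothetical, and delimiters that are escaped by doubling inside a value are deliberately not handled.

```python
def read_fcs_text(buf: bytes) -> dict:
    """Return the keyword/value pairs from the TEXT segment of an FCS file image."""
    # HEADER: bytes 10-17 and 18-25 hold the first and last byte offsets of TEXT.
    text_start = int(buf[10:18].decode("ascii"))
    text_end = int(buf[18:26].decode("ascii"))
    text = buf[text_start:text_end + 1].decode("ascii")
    delim = text[0]  # the first character of TEXT is the field delimiter
    fields = text[1:].rstrip(delim).split(delim)
    # Fields alternate keyword, value. Doubled (escaped) delimiters are ignored here.
    return {key.upper(): value for key, value in zip(fields[0::2], fields[1::2])}

def describe_parameters(keywords: dict):
    """Return (parameter names, struct-style byte-order prefix) for the data section."""
    count = int(keywords["$PAR"])
    names = [keywords[f"$P{i}N"] for i in range(1, count + 1)]
    # $BYTEORD is "1,2,3,4" for little-endian data and "4,3,2,1" for big-endian data.
    byte_order = "<" if keywords["$BYTEORD"].startswith("1") else ">"
    return names, byte_order
```

A parser built this way could size and decode each event record from the returned names and byte order, so the same code would serve files from instruments that record a different set of parameters.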
Another aspect of generalization is extending the solution to support multiple
or different relational databases. The parsing algorithm writes the data to flat
files, but this solution loaded those files with Oracle's bulk loading utility,
SQL*Loader. Support for
additional relational databases would require support for the bulk loading utilities
associated with those databases. Additionally, the table and index creation syntax
might need slight modifications for other databases.
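To make the point concrete, the following minimal sketch loads a flat file into a different database engine. It assumes Python's standard csv and sqlite3 modules, a comma-delimited file, and a five-column layout (tag, fs, ss, fl1log, fl2log). The layout and the function name load_events are assumptions for illustration, not the format actually produced by the parser in this work.

```python
import csv
import sqlite3

EVENT_COLUMNS = ("tag", "fs", "ss", "fl1log", "fl2log")  # assumed flat-file layout

def load_events(flat_file, connection):
    """Bulk-insert comma-delimited event rows into an event table.

    flat_file may be an open text file or any iterable of lines.
    Returns the number of rows loaded.
    """
    connection.execute(
        "create table if not exists event "
        "(tag text, fs real, ss real, fl1log real, fl2log real)"
    )
    rows = [tuple(row) for row in csv.reader(flat_file) if row]  # skip blank lines
    placeholders = ", ".join("?" for _ in EVENT_COLUMNS)
    with connection:  # commit on success, roll back on error
        connection.executemany(f"insert into event values ({placeholders})", rows)
    return len(rows)
```

Only the create table statement and the loading call are specific to the target database; the flat files themselves need not change.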
One of the challenges facing the biologist analyzing flow cytometry data is
identifying the cell clusters, gates, or phenotypes. In traditional techniques, that
identification is done manually. Automated clustering algorithms might prove to be
of great value. Assuming that a viable algorithm could be found, the event-level data
could be updated to include cluster name or number. Then, analysis could be
performed on a particular cluster. That analysis could be either statistical or
graphical, as discussed above. Additionally, that analysis could run across the
standard cross-sections (all samples, this individual; or all individuals, this sample
type) while constraining on a particular cluster, such as T cells or mature B cells.
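The idea of labeling each event with a cluster number can be sketched with a deliberately simple algorithm. The example below uses plain k-means on two measurements per event (for instance, forward scatter and side scatter). This is an illustrative stand-in chosen for brevity, not an algorithm evaluated in this work, and whether it is viable for real flow cytometry data is exactly the open question raised above. The function name and the choice of starting centers are assumptions.

```python
def kmeans(points, centers, iterations=20):
    """Plain k-means on 2-D points, starting from the given centers.

    Returns (labels, centers), where labels[i] is the cluster number of points[i].
    """
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(len(centers)),
                key=lambda k: (p[0] - centers[k][0]) ** 2 + (p[1] - centers[k][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers
```

The returned labels could be written back to the event-level data as a cluster number, after which the statistical and graphical queries could constrain on that column.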
Another extension of the environment is to include additional information
about the individuals. This information includes, but is not limited to, sex, weight,
age, and weight of selected organs. Then, future analysis could incorporate these
values. The analyst could research the effect of age or weight on the flow cytometry
results.
The types of analysis discussed in Chapter 4, "Analytical Techniques in the
Rich Analytical Environment," are but a starting point for the types of analysis that
can be performed once the event data is stored in an open and accessible
environment. Recall from the histograms presented above that, in certain
cases, one of the replicate samples did not follow the expected curve. Visual
inspection suggests that the sample is an outlier, but a statistical goodness-of-fit
measure might be more appropriate. Numerous other statistically intensive
analytical techniques can be easily supported in this environment.
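One candidate goodness-of-fit measure is the Pearson chi-square statistic, computed between the binned counts of a suspect replicate and the pooled counts of the other replicates. The sketch below is a minimal illustration in Python. The choice of statistic, the function name, and the decision to skip bins that are empty in the reference are all assumptions, and no threshold for declaring an outlier is proposed here.

```python
def chi_square_vs_reference(observed, reference):
    """Pearson chi-square statistic of observed bin counts against a reference shape.

    observed:  event counts per histogram bin for the suspect replicate.
    reference: counts per bin for the pooled comparison replicates (same bins).
    The reference is rescaled to the observed total before comparison.
    """
    total_observed = sum(observed)
    total_reference = sum(reference)
    statistic = 0.0
    for obs, ref in zip(observed, reference):
        expected = total_observed * ref / total_reference
        if expected > 0:  # bins empty in the reference are skipped
            statistic += (obs - expected) ** 2 / expected
    return statistic
```

A replicate whose histogram has the same shape as the reference scores zero, and larger values indicate a curve that departs further from the other replicates.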
A final area for further work is process automation. As particular types of
analyses show themselves to be valuable, the processes to perform these analyses can
be automated. For example, the families of histograms could be automatically
generated.
In conclusion, the RAE empowers both the biologist and the analyst. Many
types of analysis are possible when all of the data from an experiment is made
available in an open environment. As biologists become more familiar with what
they can accomplish with this environment, certain processes will become standard.
Other processes will emerge as innovative and exciting.

REFERENCES
BD Biosciences. (1999). Introduction to Flow Cytometry: A Learning Guide.
[Manual Part Number: 11-11032-00].
Covas, D.T., De Lucena Angulo, I., Vianna Bonini Palma, P., & Zago, M.A. (2004).
Effects of hydroxyurea on the membrane of erythrocytes and platelets in
sickle cell anemia. Haematologica, 89(3), 237-280. Retrieved March 25,
2004 from http://www.haematologica.org/journal/2004/3/pdf7890273.pdf
Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. New
York: Harper & Row.
Desbarats, J., Wade, T., Wade, W. F., & Newell, M. K. (1999). Dichotomy between
naive and memory CD4+ T cell responses to Fas engagement. Proceedings
of the National Academy of Sciences, 96(14), 8104-8109. Retrieved March
25, 2004 from
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=10393955
Harold, F. (2001). The Way of the Cell: Molecules, Organisms and the Order of
Life. Oxford: Oxford University Press.
Harris, R. (2004, March 31). Human diseases mirrored in rat genome. All Things
Considered [Radio Broadcast]. Washington, DC: National Public Radio.
Retrieved on April 3, 2004 from
http://www.npr.org/features/feature.php?wfId=1804676
Herzenberg, L. A. & Herzenberg, L. A. (2004). Genetics, FACS, immunology, and
redox: a tale of two lives intertwined. Annual Review of Immunology, 22,
1-31. Retrieved on March 31, 2004 from
http://herzenberg.stanford.edu/Publications/Reprints/LAH500.pdf
Huber, S. A., Sakkinen, P., David, C., Newell, M. K., & Tracy, R. P. (2001). T helper-cell
phenotype regulates atherosclerosis in mice under conditions of mild
hypercholesterolemia. Circulation, 103, 2610-2616. Retrieved March 25,
2004 from http://circ.ahajournals.org/cgi/reprint/103/21/2610.pdf
International Society for Analytical Cytology (ISAC). Data File Standard for Flow
Cytometry, Version FCS3.0. Retrieved March 25, 2004 from
http://www.isac-net.org/
Kelleher, K. (2004, April). The drug pipeline flows again. Business 2.0, 5(3), 50-51.
Kimball, R., Reeves, R., Ross, M., & Thornthwaite, W. (1998). The Data
Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing,
and Deploying Data Warehouses. New York: John Wiley & Sons, Inc.
Kumar, L., Pivniouk, V., de la Fuente, M., Laouini, D., & Geha, R. (2002).
Differential role of SLP-76 domains in T cell development and function.
Proceedings of the National Academy of Sciences, 99(2), 884-889.
Retrieved April 17, 2004 from http://www.pnas.org/cgi/reprint/99/2/884.pdf
Lee, J., Shin, J., Kim, E., Kang, H., Yim, I., Kim, J., et al. (2004).
Immunomodulatory and antitumor effects in vivo by the cytoplasmic fraction
of Lactobacillus casei and Bifidobacterium longum. Journal of Veterinary
Science, 5(1), 41-48. Retrieved March 25, 2004 from
http://www.vetsci.org/2004/pdf/41.pdf
Murphy, R. (1996). Flow Cytometry and Sorting. Retrieved on October 4, 2003
from
http://flowcyt.cyto.purdue.edu/flowcyt/educate/theory/fluiopti/sld001.htm
Parks, D. R. (1996). Flow cytometry instrumentation and measurement. In L.
Herzenberg, L. Herzenberg, C. Blackwell, & D. Weir (Eds.), The Handbook
of Experimental Immunology (pp. 47.1-47.12). Boston: Blackwell Science.
Retrieved on March 31, 2004 from
http://herzenberg.stanford.edu/Publications/Reprints/LAH413.pdf
Parks, D. R. & Bigos, M. (1996). Collection, display and analysis of flow cytometry
data. In L. Herzenberg, L. Herzenberg, C. Blackwell, & D. Weir (Eds.), The
Handbook of Experimental Immunology (pp. 50.1-50.11). Boston: Blackwell
Science. Retrieved on March 31, 2004 from
http://herzenberg.stanford.edu/Publications/Reprints/LAH414.pdf
Schweitzer, S. C., Reding, A. M., Ford, C. A., Villalobos-Menuey, E., Huber, S. A., &
Newell, M. K. (2004). Exogeneous Fatty Acids Affect the Expression of
MHC Class II in Macrophage Cell Lines. Unpublished manuscript,
University of Colorado at Colorado Springs.


PAGE 20

debris particles obtained from the lungs of mice and rats, set aloft in balloons after atomic tests. Consequently, they had no need to measure fluorescence. However, Herzenberg was able to convince the researchers to share their engineering drawings and schematics with him. He returned to Stanford and assembled a team to build the first device. The first cell-sorting paper was published in Science in 1969 (Herzenberg and Herzenberg, 2004) Data was originally collected by photographing histograms displayed on oscilloscope screens. Impro v ements in subseq u ent generations of the machine included the addition of logarithmic amplifiers thereby all owing the full range of data to be displayed on a single data plot; the use of computers in data collection; and software for data computation and display. The original computer used in the architecture was the PDP-8, followed by the PDP-11. In 1983, the first dual laser instrument was put into routine service. This allowed simultaneous measurements on multiple fluorescences. In 1998, a three laser instrument was developed, allowing up to eleven distinct fluorescence emissions (Herzenberg and Herzenberg, 2004). 2.3 Experimental Design Researchers at the Institute for Bioenergetics are engaged in a project which considers the link between lipid availability, lysosomallendosomal acidity, and the expression of cell surface Major Histocompatibility (MHC) class II molecules in human macrophage cell lines. This work is significant because it provides evidence for a mechanistic link 10

PAGE 21

between exogenous lipid availability, inflammation and the immune response. Because lysosomes are established sites for fatty acid accumulation and MHC class II molecules must traffic through the acidic lysosomal/endosomal compartment, we reasoned that fatty acid availability might have a direct impact on the immune response through fatty acid-dependent alteration in the expression ofMHC class II on the macrophage cell surface. (Schweitzer et al., 2004) Putting this work into context for the non-biologist, intracellular vesicles, such as the lysosome or lipid raft, may support the transport ofMHC class II molecules to the surface of the cell. These molecules aid in resistance to certain diseases. The data considered in this paper attempts to verify the lipid raft hypothesis by showing that mice raised on a high-fat diet have more surface expression ofMHC class II than those raised on a low-fa t diet. The presence of MHC class II can be detected through cytometric analysis. Expanding upon in vitro experiments in human cell lines, the researchers designed in vivo experiments on mice. The project under consideration in this paper started with 112 mice. Half of the mice were fed a high fat diet (5% coconut oil and 5% safflower oil), half a low fat diet ( 5% safflower oil). After approximately 16 weeks, the mice were killed. The spleens were removed, and a suspension of splenocytes was prepared. The suspension contained approximately 1,000,000 cells per milliliter of solution. 11


Sub-samples of the splenocyte suspension were then dyed with substances designed to fluoresce in the flow cytometer. Lysosomal acidity was measured by the fluorescence of LysoSensor stain. MHC class II expression was measured by the fluorescence of a phycoerythrin conjugated rat anti-mouse I-A/I-E. In experimental data, samples stained with this substance were labeled "M5114." Additionally, some samples were stained with an isotype (IgA/IgE) which binds to all non-specific matter, thereby acting as a control. Samples were processed through the flow cytometer, yielding the event data as discussed above.

The experimental data includes up to 13 samples or data sets on each individual: 1 unstained, 1 isotype stained, 3 M5114 stained, and 3 LysoSensor stained, with an additional set of samples treated with CytoPerm/CytoFix processing. This process, also used by Kumar et al. (2002), permeabilizes the cell membrane, allowing intracellular staining. This treatment is labeled "CPCF" in the experimental data. The CPCF treatment was performed on one no-stain, one isotype, and three M5114 samples, per individual.

The project was divided into four experiments. The first experiment included only females, while the second experiment included only males. The third and fourth experiments included both males and females. Individuals in the third experiment were wounded at an age of 16 weeks. Individuals in the fourth experiment were wounded at an age of 12 weeks. The wounding provided some data for research on diabetics and wound healing. Additionally, the wounding could


activate the immune system of the wounded individuals. This would be demonstrated by increased MHC class II expression.

The project included four strains of mice: Balb/c, C57/Black 6, UPC2 Knockout, and P6129. The P6129 strain is the parent line of the UPC2 Knockout mice. These strains are represented in the data as B, C, U, and P individuals, respectively.

2.4 The Analytical Process

CellQuest software, available from BD Biosciences, is the primary tool that the researchers at the Institute for Bioenergetics use to analyze the flow cytometry data. This software presents the data graphically, and provides summary statistics. The software also lets users manually define subsets or clusters of data. The graphical or statistical analysis can then be performed on those clusters.

The two main graphical presentations are the scattergram and the histogram. The scattergram is a dot plot showing event values for two different parameters, as shown in Figure 2-4. The dot plot is the most common and useful data representation, particularly during real-time data collection (Parks and Bigos, 1996). The histogram shows the value of one parameter on the horizontal axis, and the count of events with that value on the vertical axis, as shown in Figure 2-5. "The simple one-dimensional histogram, plotting cell frequency as a function of signal level, has been a venerable analysis tool since the beginnings of flow cytometry"


(Parks and Bigos, 1996, p. 50.6). Scattergrams and histograms can be found in Desbarates (1999), Huber (2001), and Lee (2004).

Figure 2-4. Representative Scattergram (axes: FS, SS)

Figure 2-5. Representative Histogram (horizontal axis: FS)


Gating is a mechanism for graphically or numerically identifying a subset of the data. The analyst can then view scattergrams or histograms on that subset of data only. Additionally, certain summary statistics are available on the gates or regions. These include number of events, percentage of gated events, percentage of total events, arithmetic mean, and geometric mean.
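As a hedged illustration of gating, the sketch below applies a rectangular gate to a handful of invented events and computes the summary statistics listed above. The event values, the gate bounds, and the function names are assumptions made for illustration; this is not how CellQuest is implemented.

```python
import math

def rectangular_gate(events, fs_range, ss_range):
    """Return the events whose FS and SS values fall inside both inclusive ranges."""
    fs_low, fs_high = fs_range
    ss_low, ss_high = ss_range
    return [e for e in events
            if fs_low <= e["fs"] <= fs_high and ss_low <= e["ss"] <= ss_high]

def gate_statistics(gated, parent_count, total_count, parameter):
    """Summarize one parameter over the gated events (assumes a non-empty gate)."""
    values = [e[parameter] for e in gated]
    positive = [v for v in values if v > 0]  # the geometric mean needs values > 0
    return {
        "events": len(values),
        "pct_gated": 100.0 * len(values) / parent_count,
        "pct_total": 100.0 * len(values) / total_count,
        "mean": sum(values) / len(values),
        "geo_mean": math.exp(sum(math.log(v) for v in positive) / len(positive)),
    }

# Invented events for illustration only.
events = [
    {"fs": 75, "ss": 164, "fl2": 10},
    {"fs": 197, "ss": 203, "fl2": 40},
    {"fs": 173, "ss": 129, "fl2": 20},
    {"fs": 20, "ss": 900, "fl2": 5},
]
gated = rectangular_gate(events, fs_range=(50, 250), ss_range=(100, 250))
stats = gate_statistics(gated, parent_count=len(events),
                        total_count=len(events), parameter="fl2")
```

Here the parent population is the whole sample, so the two percentages coincide; for a gate nested inside another gate, parent_count would be the size of the enclosing gate.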


3. Building the Rich Analytical Environment

Recall that the event data is written to a file in an esoteric format. Biologists are accustomed to analyzing this data with proprietary software such as CellQuest. CellQuest represents the data graphically, and via summary statistics. Consequently, biologists are not necessarily aware of the possibilities for analyzing the data were it in a more accessible format. The current analytical environment of flow cytometry analysis is limited, closed, and sample-centric.

To bring more powerful analytical tools, techniques, and mindset to the problem space, a different environment is required. First, a mechanism must exist to combine event data from multiple samples. Second, the analyst needs to be able to perform a variety of statistical and graphical analyses. These analytical approaches should be constrained only by the fundamental nature of the data, and by the analyst's imagination. The approaches should not be artificially constrained by a particular tool. Third, the analytical environment should be open, allowing the data to be accessed and processed in a wide variety of means. Fourth, the analytical environment should support rapid hypothesis testing.

This rich analytical environment (RAE) is created by parsing the FCS data and loading the resulting data into a relational database. Additionally, the parsed


data is associated with information about the individual, the sample type, and the experiment.

Once the data is in a standard relational database, a variety of tools and techniques can be employed. Viable tools for analyzing the data include SQL; programming languages such as Java, Perl, and database stored procedures; and data analysis and graphing programs. Such programs can have a mathematical or scientific focus, such as MATLAB (www.mathworks.com) and Origin (www.originlab.com). Alternatively, they can have a business intelligence focus, such as Business Objects (www.businessobjects.com) and Cognos (www.cognos.com). Many powerful tools for analyzing data in relational databases are available. The research biologist can leverage these tools, once the data is exposed. This chapter discusses the process of building the RAE.

3.1 Data Flow

Figure 3-1 shows the data flow for the first generation of the rich analytical environment. The Java program written for this work, FCSParser, processes the native flow cytometry file into two files, events and parameters. The resulting files are then loaded into a relational database. The sample key, maintained by the biologists to correlate samples with individual mice and sample type, is also loaded into the database.


Figure 3-1. First Generation Data Flow Diagram (labels shown: FCS file (LMD), params.dat, events.dat, sample key)

Figure 3-2 shows the data flow of the second generation of the RAE. In this case, the event data is combined with information about the sample with which it is associated. This metadata includes individual, diet type, strain, and sample type. The resulting environment is easier to use, because fewer database joins are required. Additionally, the reduction in joins can improve query performance. The modified components of this solution are FCSParser', events.dat' and sample key'. The following section provides more details on preparing data for the database.


Figure 3-2. Second Generation Data Flow Diagram (labels shown: FCS File (LMD), params.dat, events.dat', sample key')

3.2 FCS Parsing

The FCS files, collected and written by the flow cytometer, are written in a published format. The formats are specified by the Data File Standards Committee of the International Society for Analytical Cytology (ISAC). The file format includes a header section, a section recording parameters of the data run, and a section recording the events. There is also an optional analysis section.

The header section is of a fixed format, noting the version of the data standard (FCS2.0 or FCS3.0), and providing byte offsets to the remaining data. Among the values in the header section are the byte offset to the beginning of the text section, the byte offset to the end of the text section, the byte offset to the beginning of the data section, and the byte offset to the end of data section. For example, the standard states that the offset to the first byte of the text section is found


in positions 10-17, and the offset to the last byte of the text section is found in positions 18-25. A representative header from the files considered in this work is:

FCS2.0 128 971 2048 90223 96512 96537 90240 96425

The file format is FCS2.0. The text section ranges from bytes 128-971, the data section from 2048-90223, the analysis section from 96512-96537, and a user-defined "other" section from 90240-96425. These last two sections are not processed in this work.

The text section consists of ASCII name-value pairs. The first character of the text section is the delimiter which separates the names and values. Names or keywords which are required in the specification are prefaced by the "$" character. Additional keywords can be included in the text section, but they are not prefaced by the "$" character. Among the required keywords are $DATATYPE, $PAR, and $TOT. The values associated with these keywords represent the type of data in the data segment (ASCII text, binary integer, binary floating point), the number of parameters recorded in an event (where the parameters are such values as FS, SS, FL1LOG and FL2LOG), and the total number of events in the data set, respectively. A subset of the text section from one of the files considered in this work follows:

!$FIL!G0036390.LMD!$SYS!DOS 6.22!$INST!UCCS Flow Center ,Science Bldg. Rm.163

The file name is G0036390.LMD. It was created by a computer running DOS 6.22, at the institution named "UCCS Flow Center ,Science Bldg. Rm. 163."
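The header and text-section layout described above can be sketched in a few lines. This is a minimal illustration rather than the FCSParser code, which is not reproduced here: the function names are hypothetical, the header is rebuilt from the representative offsets quoted above, and the sketch assumes that the delimiter never appears inside a name or value.

```python
def parse_header(header: bytes):
    """Read the version string and the text/data byte offsets from an FCS header."""
    version = header[0:6].decode("ascii")
    fields = {
        "text_start": header[10:18],   # positions 10-17
        "text_end": header[18:26],     # positions 18-25
        "data_start": header[26:34],
        "data_end": header[34:42],
    }
    offsets = {name: int(raw.decode("ascii").strip()) for name, raw in fields.items()}
    return version, offsets

def parse_text_segment(text: str):
    """Split a text section into a dict; its first character is the delimiter."""
    delimiter = text[0]
    tokens = text[1:].split(delimiter)
    if tokens and tokens[-1] == "":
        tokens.pop()                   # drop the empty token after a trailing delimiter
    return dict(zip(tokens[0::2], tokens[1::2]))

# Header rebuilt for illustration: version, four spaces, then right-justified
# 8-character offsets taken from the representative header in the text.
representative_offsets = [128, 971, 2048, 90223, 96512, 96537]
header = (b"FCS2.0" + b" " * 4
          + "".join(f"{n:>8}" for n in representative_offsets).encode("ascii"))
version, offsets = parse_header(header)

text = "!$FIL!G0036390.LMD!$SYS!DOS 6.22!$INST!UCCS Flow Center ,Science Bldg. Rm.163!"
keywords = parse_text_segment(text)
```

A general-purpose parser would also need to handle delimiters that are escaped inside values, which this sketch deliberately ignores.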


The data section consists of the raw data recorded by the flow cytometer. Several of the keywords in the text section describe the data section. These keywords include $MODE, $DATATYPE, $BYTEORD, and $PnB. $MODE specifies whether the data is in list or histogram mode. The data considered in this work was list mode, which means that data was recorded for each cell or event. Additionally, in this work, the data was written as binary integers with a byte order of "1,2", as specified by the $DATATYPE and $BYTEORD parameters. The byte order specifies the order in which the binary data bytes are written to compose a data word: "1" refers to the least significant byte, "2" to the most significant byte. Recording the byte order allows data which is written under one operating system to be analyzed under another operating system. $PnB specifies the number of bits in each parameter, where n is the number of the parameter (i.e. 1, 2, 3).

In this work, a Java program was written to read an individual file and write two resulting files, params.dat and events.dat. The format of the data section was determined by manual inspection of the appropriate parameters of the text section. The parsing code was written with this knowledge. As such, it is not a fully general parsing solution. The value of the tag was determined by extracting the first part of the input file name. Sample input file names, as generated by the flow cytometer, are:

G0036500.LMD
G0036501.LMD


G0036502.LMD
G0036503.LMD
G0036504.LMD

Consequently, given an input file of G0036500.LMD, output files would be paramsG0036500.dat and eventsG0036500.dat. Shell scripts were used to run the parsing program on all of the files within a particular directory.

In the first generation of the parsing code, all rows of both output files were prepended with the tag and the experiment number. The tag identifies the particular sample. The experiment number allows identification and efficient processing of a subset of the data. Table 3-1 and Table 3-2 show sections of data from each of these two output files.

Table 3-1. An Extract from a Param File

G0036076|1|FIL|G0036076.LMD
G0036076|1|SYS|DOS 6.22
G0036076|1|INST|UCCS Flow Center ,Science Bldg. Rm.163
G0036076|1|CYT|XL AB27219
G0036076|1|DATE|10-Jul-03
G0036076|1|BTIM|16:59:34
G0036076|1|SRC|ss new keep 1 ns
G0036076|1|SMNO|G0036076
G0036076|1|OP|SLV
G0036076|1|COM|SYSTEM II Version 3.0
G0036076|1|TESTNAME|Lisa spleen FL1
G0036076|1|TESTFILE|G0000157.PRO
G0036076|1|BYTEORD|1,2
G0036076|1|DATATYPE|I
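Decoding the list-mode data section and writing first-generation event rows can be sketched as follows. The sketch assumes 16-bit parameters, that a byte order of "1,2" means least significant byte first, and a "|" field separator; the function names are hypothetical, and the input bytes are packed by the example itself rather than read from a real file.

```python
import struct

def decode_events(data: bytes, n_params: int, bits: int = 16, byte_order: str = "1,2"):
    """Decode list-mode binary integers into one tuple of parameter values per event."""
    if bits != 16:
        raise ValueError("this sketch only handles 16-bit parameters")
    endian = "<" if byte_order == "1,2" else ">"     # "1,2": least significant byte first
    event_format = f"{endian}{n_params}H"            # n_params unsigned 16-bit integers
    event_size = struct.calcsize(event_format)
    usable = len(data) - len(data) % event_size      # ignore any trailing partial event
    return [struct.unpack_from(event_format, data, offset)
            for offset in range(0, usable, event_size)]

def event_rows(tag, experiment, events):
    """Prepend the tag and experiment number to each event, one delimited row per event."""
    return ["|".join(str(v) for v in (tag, experiment, *event)) for event in events]

# Two events packed by hand so the example runs without a flow cytometry file.
raw = (struct.pack("<6H", 75, 164, 387, 0, 734, 821)
       + struct.pack("<6H", 197, 203, 451, 0, 841, 844))
rows = event_rows("G0036076", 1, decode_events(raw, n_params=6))
```

In a real file, n_params and bits would come from the $PAR and $PnB keywords, and the byte range from the header offsets.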


Table 3-2. An Extract from an Event File, First Generation

G0036076|1|75|164|387|0|734|821
G0036076|1|197|203|451|0|841|844
G0036076|1|173|129|432|0|826|794
G0036076|1|181|116|383|0|831|782
G0036076|1|181|161|344|0|831|818
G0036076|1|201|212|399|0|842|848
G0036076|1|193|221|464|0|838|853

As the data model evolved, the information added to each row of data in the event file grew. In the second generation of the environment, each row in the event table includes key information from the sample key. Representative data is shown in Table 3-3.

Table 3-3. An Extract from an Event File, Second Generation

G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|75|164|387|0|734|821
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|197|203|451|0|841|844
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|173|129|432|0|826|794
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|181|116|383|0|831|782
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|181|161|344|0|831|818
G0036076|1|1|BL-3|No Stain|0|B|L|NORMAL|201|212|399|0|842|848

3.3 Preparing the Flow Sample Key

For the experiments under discussion, the Flow Sample Key data was stored in a Microsoft Word document. An extract of that data is shown in Table 3-4.


Table 3-4. Flow Sample Key

Sample#  Description        Sample#  Description
101.     BL-7 No Stain      126      BH-12 no stain
102      BL-7 isotype       127.     BH-12 isotype
103      BL-7 M5114 #1      128.     BH-12 M5114 #1
104      BL-7 M5114 #2      129      BH-12 M5114 #2
105.     BL-7 M5114 #3      130.     BH-12M5114#3
106      BL-8 No stain      131.     CL-8 no stain
107.     BL-8 isotype       132      CL-8 isotype
108,     BL-8 M5114 #1      133.     CL 8 M5114 #1
109.     BL-8 M5114 #2      134.     CL-8M5114#2

Turning this data into a database-friendly format required several steps. The first step was a parsing step, whereby the descriptions were split apart into "Individual," "Sample Type," "Replicate," "Strain," "Diet," and "Process" components. The Individual values represent the particular mice. The Individual values shown in this set are BL-7, BL-8, BH-12, and CL-8. The Sample Type values indicate how the sample has been stained. The values shown in this set include No Stain, Isotype, and M5114. Replicate values are 0, 1, 2, and 3. The first letter of the Individual designation also represents the mouse strain. The second letter represents diet, either high fat (H) or low fat (L). Finally, some of the samples were processed with a


CytoPerm/CytoFix (CPCF) treatment. The two process values are NORMAL and CPCF.

Inspection of the sample key also shows that the input must be cleansed or standardized. For example, No stain and no stain were transformed to No Stain, and isotype to Isotype. Additionally, the Sample# value 108, was changed to 108. Finally, a value for "Experiment" was prepended to the data set. Resulting data, ready for loading into the database, is shown in Table 3-5.

Table 3-5. Sample Key Data, Reformatted for Database

Sample_No  Individual  Sample_Type  Replicate  Strain  Diet  Process
14         BL-6        Lyso         1          B       L     NORMAL
15         BL-6        Lyso         2          B       L     NORMAL
16         BL-6        Lyso         3          B       L     NORMAL
17         BH-4        No Stain     0          B       H     NORMAL
18         BH-4        Lyso         1          B       H     NORMAL
19         BH-4        Lyso         2          B       H     NORMAL

(The values of the leading Experiment column are not legible in this copy.)

In the future, the biologists may want to record their sample keys in a more database-friendly format. One possibility is an Excel spreadsheet with column headers corresponding to the database fields (Experiment, Sample_No, Individual, Sample_Type). Additionally, they might want to include validation rules on the columns to keep the data clean.
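The parsing and cleansing steps of Section 3.3 might be sketched as follows. The regular expression, the function name, and the choice to pass the process value in as an argument are assumptions made for illustration; the original key was cleaned by other means, and the way CPCF samples were marked in the Word document is not shown in the extract.

```python
import re

# Canonical spellings for the sample types named in the text.
SAMPLE_TYPES = {"no stain": "No Stain", "isotype": "Isotype",
                "m5114": "M5114", "lyso": "Lyso"}

DESCRIPTION = re.compile(
    r"^(?P<strain>[A-Z])(?P<diet>[HL])[-\s]?(?P<number>\d+)\s*"
    r"(?P<sample_type>no stain|isotype|m5114|lyso)\s*"
    r"(?:#\s*(?P<replicate>\d+))?$",
    re.IGNORECASE,
)

def parse_key_line(experiment, sample_no, description, process="NORMAL"):
    """Split one sample key entry into database-ready fields."""
    match = DESCRIPTION.match(description.strip())
    if match is None:
        raise ValueError(f"unrecognized description: {description!r}")
    strain = match["strain"].upper()
    diet = match["diet"].upper()
    return {
        "experiment": experiment,
        "sample_no": int(re.sub(r"\D", "", sample_no)),   # "108," becomes 108
        "individual": f"{strain}{diet}-{match['number']}",
        "sample_type": SAMPLE_TYPES[match["sample_type"].lower()],
        "replicate": int(match["replicate"] or 0),        # no "#n" means replicate 0
        "strain": strain,
        "diet": diet,
        "process": process,
    }

row_108 = parse_key_line(1, "108,", "BL-8 M5114 #1")
row_126 = parse_key_line(1, "126", "BH-12 no stain")
row_130 = parse_key_line(1, "130.", "BH-12M5114#3")
```

Entries that do not match the pattern raise an error instead of being guessed at, so unusual descriptions are surfaced for manual review.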


3.4 Connecting the Flow Data to the Sample Key

The data collected by the flow cytometer needs to be associated with the data from the Flow Sample Key. The parameter SRC contains the necessary link. Sample SRC values can be found in Table 3-6. The non-numeric part of the value is filtered out. The remaining numeric portion is written to the param file with the name SAMPLE_NO. Note that under certain circumstances, there are multiple runs per sample. In the second generation of the RAE, this sample number was used to retrieve information from the sample key.

Table 3-6. Values of SRC Parameter

ss 95
ss 96
ss 97 ns
ss 97 ns re
ss 97 ns re re

Parks and Bigos (1996) note that

the designers of most flow data systems have viewed documentation as a poor stepchild to the glamour of graphical display. Often all they provide is a pedestrian editor for keyword values to be stored with the data file. ... These editors tend to be oriented toward single samples and not taking sample grouping and experiment structure into account. Once the annotations are entered, there may be no facility for browsing or searching the keywords to group or retrieve relevant files when doing later analysis. (p. 50.1)

The system used by the researchers at the Institute for Bioenergetics for correlating flow files to their sample key is tedious. The SRC values discussed


above are entered into the flow cytometer at sample run time. The values are printed automatically on a hardcopy analysis summary which is generated at run time. These printouts are stored in a binder. When a researcher wants to find the flow file associated with the sample, BL-6 Lyso #1, she first finds that entry in the sample key, and mentally notes the associated key number of that sample (e.g. 14). She then turns pages in her binder, scanning the printouts until she finds the number 14 embedded in the SRC field (A. Reding, personal communication, March 30, 2004).

3.5 Evolution of the Data Model

The data model for the Rich Analytical Environment underwent several iterations before reaching its current form. The initial version was normalized, with tables for parameters, events, and the sample key. This version had several shortcomings. First, the volume of data vis-a-vis machine resources created a need for performance tuning. Second, multiple joins were required to connect the event data with the sample key data. Both of these factors led to the consideration of a star schema or dimensional model. However, since most of the dimensions are single attribute dimensions, the collapse of all requisite values into a single table made sense. Each of these iterations of the data model is discussed in this section.

3.5.1 The Entity-Relationship Model

The entity-relationship model, as shown in Figure 3-3, accurately represents the data coming from two data sources. The first data source is the FCS files. Recall that there are two types of data that are relevant to the analyst. These are the


individual events that are measured in a particular sample, and the parameters that describe each sample. The event data, translated from binary into ASCII, comprises each row of the event table. Additionally, each event is prepended with the experiment number and the tag or file name.

Figure 3-3. Entity Relationship Model

event: tag, experiment, fs, ss, fl1log, fl2log, fslog, sslog
param: tag, experiment, param_name, param_value
sample_key: experiment, sample_no, individual, sample_type
param is linked to sample_key where param_name='SAMPLE_NO' and param_value=sample_no

There were two options for modeling the parameter table. One option was to create a table having one field for every parameter (68 fields in this work). The other option was to create a more general format in which each name-value pair had one row. Since implementers of the FCS format can include their own name-value pairs in the text section of the file, this second option is more flexible in the long run. Additionally, this option supports the creation of additional name-value pairs that support the analysis. For example, the sample number that corresponds to the


sample key is embedded in the SRC field. This field was parsed to extract the numeric part of the data. This number, along with the name "SAMPLE_NO," was added to the param table. Again, each row includes the experiment and the tag.

The SAMPLE_KEY table is simply a database representation of the Flow Sample Key, maintained in a word processor. INDIVIDUAL, SAMPLE_TYPE, and REPLICATE are split apart into separate fields. DIET and STRAIN are derived from the appropriate character of the individual value. Each row also includes the experiment number.

3.5.2 The Dimensional Model

This environment is a reporting environment as opposed to a transactional environment. Thus, techniques for performance and usability in reporting environments are worth considering. One prominent technique is the dimensional model, which is also known as the star schema. A dimensional model for the RAE is shown in Figure 3-4. Visually represented, a fact is at the center of the model, with dimensions or attributes surrounding that fact.
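The normalized entity-relationship model of Section 3.5.1, including the SAMPLE_NO pair derived from SRC, can be sketched with an in-memory SQLite database. SQLite stands in for the relational database used in this work, the column types are assumptions, and the helper name sample_no_from_src is hypothetical; the rows are taken from the extracts in Tables 3-1 and 3-3.

```python
import re
import sqlite3

def sample_no_from_src(src: str) -> int:
    """Filter out the non-numeric part of an SRC value, e.g. 'ss 97 ns re' -> 97."""
    return int(re.sub(r"\D", "", src))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table event (tag text, experiment integer, fs integer, ss integer,
                        fl1log integer, fl2log integer, fslog integer, sslog integer);
    create table param (tag text, experiment integer, param_name text, param_value text);
    create table sample_key (experiment integer, sample_no integer,
                             individual text, sample_type text);
""")

src = "ss new keep 1 ns"
conn.executemany("insert into param values (?, ?, ?, ?)", [
    ("G0036076", 1, "SRC", src),
    ("G0036076", 1, "SAMPLE_NO", str(sample_no_from_src(src))),  # derived pair
])
conn.execute("insert into sample_key values (1, 1, 'BL-3', 'No Stain')")
conn.executemany("insert into event values (?, ?, ?, ?, ?, ?, ?, ?)", [
    ("G0036076", 1, 75, 164, 387, 0, 734, 821),
    ("G0036076", 1, 197, 203, 451, 0, 841, 844),
])

# Two joins are needed to reach the sample key from the event data.
joined = conn.execute("""
    select e.tag, s.individual, s.sample_type, count(*)
    from event e
    join param p on p.tag = e.tag and p.experiment = e.experiment
    join sample_key s on s.experiment = p.experiment
                     and s.sample_no = cast(p.param_value as integer)
    where p.param_name = 'SAMPLE_NO'
    group by e.tag, s.individual, s.sample_type
""").fetchall()
```

The two joins in this query are the cost that motivates collapsing the sample key attributes into the event table.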


Figure 3-4. Dimensional Model (dimensions shown: tag, experiment, individual, sample (#), strain, diet, replicate, process)

In this application, the event is the fact or the measurement at the center of the star. The dimensions include INDIVIDUAL, TAG, EXPERIMENT, SAMPLE, SAMPLE_TYPE, REPLICATE, PROCESS, DIET, and STRAIN. Most of these dimensions include only one attribute. As such, they are degenerate dimensions. Kimball, Reeves, Ross and Thornthwaite (1998) recommend collapsing the degenerate dimensions into the fact table, without a join to anything. INDIVIDUAL is the only dimension that has multiple attributes. These attributes could include sex, weight, age, food consumption, and various other attributes recorded by the biologists. The resulting event table is shown in Figure 3-5.


Figure 3-5. The Event Table

event: tag, experiment, individual, strain, diet, sample, sample_type, replicate, process, fs, ss, fl1log, fl2log, fslog, sslog

Because most queries are performed on the event table and the event table only, this structure increases the ease with which the biologists can query the data. Table 3-7 shows a histogram-generating query from the original implementation. Table 3-8 shows the logically equivalent query from the second generation implementation. Indexing on key values like INDIVIDUAL, DIET, EXPERIMENT, SAMPLE_TYPE, and REPLICATE will enhance performance.
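The collapsed event table and a histogram-style query against it can be sketched with SQLite, which again stands in for the database used in this work. The column types and the index name are assumptions, and the rows come from the extract in Table 3-3. One detail differs from the thesis queries: SQLite performs integer division on integer columns, so the sketch divides by 25.0 to get the fractional bins the query intends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table event (
        tag text, experiment integer, individual text, strain text, diet text,
        sample integer, sample_type text, replicate integer, process text,
        fs integer, ss integer, fl1log integer, fl2log integer,
        fslog integer, sslog integer)
""")
# Hypothetical index on the columns most queries filter by.
conn.execute("create index event_individual on event (individual, sample_type)")

measurements = [
    (75, 164, 387, 0, 734, 821),
    (197, 203, 451, 0, 841, 844),
    (173, 129, 432, 0, 826, 794),
    (181, 116, 383, 0, 831, 782),
    (181, 161, 344, 0, 831, 818),
    (201, 212, 399, 0, 842, 848),
]
rows = [("G0036076", 1, "BL-3", "B", "L", 1, "No Stain", 0, "NORMAL", *m)
        for m in measurements]
conn.executemany("insert into event values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", rows)

# Histogram of forward scatter in bins of width 25; no joins are required.
histogram = conn.execute("""
    select tag, sample_type, round(fs / 25.0), count(*)
    from event
    where individual = ?
    group by tag, sample_type, round(fs / 25.0)
    having round(fs / 25.0) < 20
    order by round(fs / 25.0)
""", ("BL-3",)).fetchall()
```

Every attribute the analyst filters or groups by sits on the same row as the measurements, which is the usability gain the dimensional model is after.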


Table 3-7. Histogram-generating Query, Original Model

select e.tag, sample_type, round(fs/25), count(*)
from event e, param p, sample_key s
where e.tag=p.tag
  and param_name='SAMPLE_NO'
  and param_value=to_char(sample_no)
  and individual='BH-6'
  and s.experiment=1
  and p.experiment=1
group by e.tag, sample_type, round(fs/25)
having round(fs/25)<20

Table 3-8. Histogram-generating Query, Dimensional Model

select tag, sample_type, round(fs/25), count(*)
from event
where individual='BH-6'
group by tag, sample_type, round(fs/25)
having round(fs/25)<20

3.5.3 Data Volume

The volume of data created by cell biology research is significant. On average, each individual generated 154,000 rows of event data. Compare this to a business application, such as cable billing. Suppose that each customer has 3 products on his or her account. Twelve months' worth of billing would create 36 billing line items. One would need to bill nearly 4300 customers for 1 year to create


as much data as one specimen generates! Selected statistics on the data set are shown in Table 3-9.

Table 3-9. Selected Data Set Statistics

Experiment  Samples (event files)  Events      Individuals  Avg events per sample  Avg events per individual
Exp1        418                    3,889,684   25           9,305                  155,587
Exp2        360                    2,791,145   24           7,753                  116,298
Exp3        279                    2,799,232   19           10,033                 147,328
Exp4        362                    3,119,628   22           8,618                  141,484
TOTAL       1419                   12,599,689  90


4. Analytical Techniques in the Rich Analytical Environment

Once data from a flow cytometry experiment, which consists of multiple samples on multiple individuals, has been parsed and loaded into the RAE, a variety of analytical techniques can be employed. These techniques can be statistical or graphical. Additionally, many of the techniques should be thought of as standard techniques, performed each time an experiment is conducted. These standard techniques are fundamentally analysis patterns, enabling the analyst to know much more about his or her data, and to reach this knowledge very quickly. Furthermore, the RAE supports analyses in which multiple samples are analyzed with the same technique at the same time. The RAE also supports the derivation of new measurements from the base measurements collected by the flow cytometer. This chapter discusses some of the possible analytical techniques.

The RAE contains data on multiple samples. There are two broad groupings of these samples, individual and sample type. In the data analyzed in this work, there are four different sample types. The types are No Stain, Isotype, Lyso, and M5114. Additionally, for certain individuals, the No Stain, Isotype, and M5114 samples are also processed with the CytoPerm/CytoFix technique, which perforates the cell membrane, allowing the stain to adhere to the interior of the cell. Thus, one


approach to segment-based analysis is to perform standard analytical techniques on all of the samples for a particular individual. This approach demonstrates the similarities and the differences in the different sample types for that particular individual. The other approach is to analyze all of the samples of a particular type (e.g. M5114) for all individuals, or a particular subset of individuals. This approach demonstrates similarities and differences in that particular sample type across a population of individuals.

Additionally, new measurements can be derived from the data collected by the flow cytometer. Recall that the flow cytometer measures at least four parameters: forward scatter, side scatter, and two types of fluorescence. Recall also that the fluorescence is created by staining the cell with compounds which will adhere to certain molecule types, "on" or "in" the cell. The fluorescence is generated by these stains. All other things being equal, the larger cells have more surface area for the stains to adhere to, and consequently, more fluorescence. Thus, the analyst might want to normalize the fluorescence as a function of cell size. Since forward scatter is an approximate measurement of cell size, one viable normalization function is FL2/FS. Many of the analysis techniques discussed in the following section refer to this normalized fluorescence.
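The FL2/FS normalization can be sketched as a small function applied to each event. The function name, the event values, and the cutoff that skips events with very small forward scatter are assumptions made for illustration; the cutoff simply avoids dividing by zero or by near-zero values.

```python
def normalized_fluorescence(fl2, fs, min_fs=20):
    """Return FL2/FS, or None when FS is too small to give a meaningful ratio."""
    if fs <= min_fs:
        return None
    return fl2 / fs

# Invented (fs, fl2) pairs for illustration only.
events = [(200, 600), (100, 400), (10, 50)]
ratios = [normalized_fluorescence(fl2, fs) for fs, fl2 in events]
usable = [r for r in ratios if r is not None]
average_ratio = sum(usable) / len(usable)
```

Note that the average of per-event ratios computed here is generally not the same as the ratio of the average FL2 to the average FS.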


4.1 Statistical Analyses

This section presents a variety of statistical analyses on the data collected by the project. The discussion of each type of analysis includes a brief overview of the purpose of the analysis, the SQL statement which retrieves data, all or part of the result set, and a brief comment about the findings.

4.1.1 Summary Statistics, Entire Experiment

A good starting point for analysis is high level statistics on an entire experiment. These statistics provide an overall sense of results, and relative consistency (or lack thereof) across samples. An appropriate SQL statement is shown below.

SQL Statement

select tag, individual, sample_type, count(*),
       avg(fs), stddev(fs), avg(ss), stddev(ss), avg(fl2log)
from event
where experiment=2 and process='NORMAL'
group by tag, individual, sample_type


Table 4-1. Result Set (Extract): Summary Statistics, Entire Experiment

Row  TAG       INDIVIDUAL  SAMPLE_TYPE  COUNT(*)  AVG(FS)  STDDEV(FS)  AVG(SS)  STDDEV(SS)  AVG(FL2LOG)
1    G0036563  BL-9        No Stain     7681      220      113         309      211         3.5
2    G0036564  BL-9        No Stain     7558      216      108         299      204         3.6
3    G0036565  BL-9        No Stain     7663      219      112         307      212         3.4
4    G0036566  BL-9        Lyso         7484      198      92          267      185         1.5
5    G0036567  BL-9        Lyso         7108      212      100         281      188         0.6
6    G0036568  BL-9        Lyso         7220      212      101         283      193         0.6
7    G0036569  BL-10       No Stain     7102      224      103         305      216         0.1
8    G0036572  BL-10       No Stain     7880      242      125         357      240         2.5
9    G0036573  BL-10       No Stain     7749      244      120         359      239         2.9
10   G0036574  BL-10       No Stain     7838      238      118         351      236         2.8
11   G0036575  BL-10       Lyso         7680      219      102         310      220         0.4
12   G0036576  BL-10       Lyso         8153      212      102         308      217         0.7
13   G0036577  BL-10       Lyso         7721      216      102         305      212         0.6
14   G0036629  CH-8        No Stain     76511     135      133         292      238         0.1
15   G0036630  CH-8        Lyso         62706     140      135         287      234         1.2
16   G0036631  CH-8        Lyso         63176     136      128         277      226         1.3
17   G0036632  CH-8        Lyso         62629     139      130         286      231         0.9
18   G0036687  BL-9        No Stain     6078      235      107         178      148         148.7
19   G0036688  BL-9        Isotype      6332      222      110         170      146         135.7
20   G0036689  BL-9        M5114        6472      245      135         194      167         584.9
21   G0036690  BL-9        M5114        6378      240      119         189      157         566.7
22   G0036691  BL-10       No Stain     6291      238      118         183      156         567.2
23   G0036692  BL-10       No Stain     6761      224      108         192      155         105.8
24   G0036693  BL-10       Isotype      6773      228      116         199      169         130.5
25   G0036694  BL-10       M5114        7102      224      117         204      173         491.0
26   G0036695  BL-10       M5114        6964      226      119         203      174         524.3
27   G0036696  BL-10       M5114        6920      232      124         207      176         503.9
28   G0036747  CH-8        No Stain     26788     157      137         219      209         115.8
29   G0036748  CH-8        Isotype      26650     151      128         208      201         129.2
30   G0036749  CH-8        M5114        26203     151      123         208      196         409.0
31   G0036750  CH-8        M5114        27246     156      139         216      210         590.6
32   G0036751  CH-8        M5114        31303     156      149         220      219         583.9

The results show:


in general, a fairly consistent number of events in each sample; certain suspect samples, as indicated by a large number of events and/or reruns of the sample (e.g. CH-8); relatively consistent measurements across replicates for a given individual and a given sample type (e.g. rows 1-3, 11-12); and that no stain "families" have almost no FL2 (rows 1-17), while stain families show some FL2 (rows 18-32), even on No Stain samples.

4.1.2 Suspect Samples

Another technique highlights suspect samples or individuals by identifying the samples that have a large number of events, say more than 20,000. Additionally, the query can identify how many suspect samples are associated with a particular individual.

SQL Statement

select experiment, individual, count(*)
from (
  select experiment, tag, individual, count(*)
  from event
  group by experiment, tag, individual
  having count(*) > 20000
)
group by experiment, individual
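The same two-level grouping can be sketched outside the database. The function name and the toy events are assumptions made for illustration, and the threshold is lowered from 20,000 so that a few hand-written events are enough to trigger it.

```python
from collections import Counter

def suspect_samples(events, threshold=20000):
    """Count, per (experiment, individual), the samples with more than threshold events."""
    # Inner grouping: number of events per sample (experiment, tag, individual).
    per_sample = Counter((e["experiment"], e["tag"], e["individual"]) for e in events)
    # Outer grouping: number of oversized samples per (experiment, individual).
    suspects = Counter()
    for (experiment, _tag, individual), n_events in per_sample.items():
        if n_events > threshold:
            suspects[(experiment, individual)] += 1
    return dict(suspects)

# Toy data: two oversized CH-8 samples and one ordinary BL-9 sample.
events = (
    [{"experiment": 2, "tag": "A", "individual": "CH-8"}] * 3
    + [{"experiment": 2, "tag": "B", "individual": "CH-8"}] * 3
    + [{"experiment": 2, "tag": "C", "individual": "BL-9"}] * 1
)
result = suspect_samples(events, threshold=2)
```

With the full data set, the in-database query is the better choice, since it avoids moving millions of event rows to the client.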


Table 4-2. Result Set: Suspect Samples

EXPERIMENT  INDIVIDUAL  COUNT(*)
1           BL-3        3
1           BL-4        6
1           BL-5        2
1           PH-4        6
1           UL-4        1
2           CH-8        9
3           CH-2        7
3           PH-3        5
3           PL-1        10
3           UL-1        2

Note that compared to Experiments 1 and 3, Experiment 2 has a small number of suspect individuals. Depending on the sample type of the suspect samples, and on the particular analysis performed, the suspect samples/suspect individuals could skew comparisons made between Experiment 1 and Experiment 2.

4.1.3 Study of a Particular Individual

A variation of the analysis presented in Section 4.1.1 is summary statistics, constrained to a particular individual.

SQL Statement

select tag, sample_type, replicate, count(*),
       avg(fs), avg(fl2log), avg(fl2log/fs)
from event
where individual='BH-3' and process='NORMAL'
group by tag, sample_type, replicate


Table 4-3. Result Set: Study of Individual BH-3

TAG       SAMPLE_TYPE  REPLICATE  COUNT(*)  AVG(FS)  AVG(FL2LOG)  AVG(FL2LOG/FS)
G0037080  No Stain     0          7352      198      2            0.0
G0037081  No Stain     0          7409      195      3            0.0
G0037082  Lyso         1          7968      189      3            0.0
G0037083  Lyso         2          8791      179      4            0.0
G0037084  Lyso         3          8332      185      4            0.0
G0037184  No Stain     0          6691      216      45           0.3
G0037185  Isotype      0          7127      214      59           0.4
G0037186  M5114        1          7097      211      535          3.3
G0037187  M5114        2          7002      212      544          3.3
G0037188  M5114        3          7207      212      534          3.2

The results show: a relatively consistent number of events in each sample; increased fluorescence on Isotype and M5114 samples; and relative consistency across replicates of the same sample type.

4.1.4 M5114 Fluorescence, Normal Process and CPCF Process

Another possible analysis is to compare M5114 fluorescence in normally processed samples to that in CPCF processed samples.


SQL Statement

select tag, individual, sample_type, replicate, process,
       avg(fl2log), avg(fl2log/fs)
from event
where experiment=1 and sample_type='M5114' and fs>20
group by tag, individual, sample_type, replicate, process
order by individual, sample_type

Table 4-4. Result Set (Extract): M5114 Samples, Normal and CPCF Processing

TAG       INDIVIDUAL  SAMPLE_TYPE  REPLICATE  PROCESS  AVG(FL2LOG)  AVG(FL2LOG/FS)
G0036266  BH-5        M5114        1          NORMAL   597          3.4
G0036267  BH-5        M5114        2          NORMAL   600          3.3
G0036268  BH-5        M5114        3          NORMAL   594          3.2
G0036404  BH-5        M5114        1          CPCF     442          3.5
G0036405  BH-5        M5114        2          CPCF     424          3.5
G0036406  BH-5        M5114        3          CPCF     562          4.3
G0036407  BH-5        M5114        3          CPCF     570          4.4
G0036271  BH-6        M5114        1          NORMAL   599          3.7
G0036272  BH-6        M5114        2          NORMAL   625          3.6
G0036273  BH-6        M5114        3          NORMAL   595          3.5
G0036410  BH-6        M5114        1          CPCF     568          4.7
G0036411  BH-6        M5114        2          CPCF     576          4.8
G0036412  BH-6        M5114        3          CPCF     585          4.9
G0036241  BL-3        M5114        1          NORMAL   617          3.8
G0036242  BL-3        M5114        2          NORMAL   623          4.0
G0036243  BL-3        M5114        3          NORMAL   639          3.9
G0036377  BL-3        M5114        1          CPCF     608          5.1
G0036378  BL-3        M5114        2          CPCF     611          5.0
G0036379  BL-3        M5114        3          CPCF     616          5.1


The expected finding of more FL2 on CPCF samples is not consistently shown. However, if FL2 is normalized (FL2LOG/FS), a higher value is consistently shown on the CPCF samples. This result highlights the value of the normalization technique.

4.1.5 Fluorescence by Diet Type, Isotype and M5114 Stains

The types of statistics computed on individual samples can also be computed on the entire project. This technique lumps the data into large buckets, not broken out by sample.

SQL Statement - NORMAL Process

select diet, sample_type, avg(fl2log/fs), stddev(fl2log/fs), count(*)
from event
where sample_type in ('M5114','Isotype') and process='NORMAL'
group by diet, sample_type

Table 4-5. Result Set - Fluorescence by Diet Type, Normal Processing

DIET               H        H        L        L
SAMPLE_TYPE        Isotype  M5114    Isotype  M5114
AVG(FL2LOG/FS)     0.7      3.4      0.6      3.3
STDDEV(FL2LOG/FS)  1.1      2.8      1.0      2.7
COUNT(*)           237376   717209   275953   780993

The results show slightly more normalized fluorescence for the high fat samples, for both isotype and M5114 stains, indicating support for the biologists' hypothesis regarding high-lipid environments being conducive to the presentation of MHC Class II molecules.
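The grouped statistics above can be tried out on a small scale without an Oracle database. The following is a minimal sketch only: it assumes Python's built-in sqlite3 module as a stand-in for Oracle, a cut-down event table, and made-up event values. SQLite has no STDDEV aggregate, so the sketch registers a hypothetical SampleStddev helper that computes a sample standard deviation, which is what Oracle's STDDEV is understood to return.

```python
import sqlite3
import statistics


class SampleStddev:
    """Sample standard deviation, standing in for Oracle's STDDEV aggregate."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # A sample standard deviation needs at least two values.
        return statistics.stdev(self.values) if len(self.values) > 1 else None


def fluorescence_by_diet(conn, process):
    """Return (diet, sample_type, avg, stddev, count) rows for one process."""
    query = """
        select diet, sample_type,
               avg(fl2log * 1.0 / fs), stddev(fl2log * 1.0 / fs), count(*)
        from event
        where sample_type in ('M5114', 'Isotype') and process = ?
        group by diet, sample_type
        order by diet, sample_type
    """
    return conn.execute(query, (process,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stddev", 1, SampleStddev)
conn.execute(
    "create table event (diet text, sample_type text, process text,"
    " fs integer, fl2log integer)"
)
# Made-up events for illustration; these are not values from the thesis.
conn.executemany(
    "insert into event values (?, ?, ?, ?, ?)",
    [
        ("H", "M5114", "NORMAL", 200, 600),
        ("H", "M5114", "NORMAL", 100, 500),
        ("L", "M5114", "NORMAL", 200, 400),
        ("L", "M5114", "NORMAL", 100, 400),
        ("H", "Isotype", "NORMAL", 200, 100),
        ("H", "M5114", "CPCF", 100, 900),
    ],
)
rows = fluorescence_by_diet(conn, "NORMAL")
for row in rows:
    print(row)
```

The multiplication by 1.0 forces floating-point division in SQLite, which would otherwise truncate the ratio of two integer columns. Passing 'CPCF' instead of 'NORMAL' gives the companion result set for the other process.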


Table 4-6. Result Set - Fluorescence by Diet Type, CPCF Processing

DIET               H        H        L        L
SAMPLE_TYPE        Isotype  M5114    Isotype  M5114
AVG(FL2LOG/FS)     1.0      4.5      1.1      4.6
STDDEV(FL2LOG/FS)  1.2      3.5      1.3      3.5
COUNT(*)           203028   624292   268040   729606

The results show slightly more normalized fluorescence for the low fat samples, for both isotype and M5114 stains. The CPCF processing alters the structure of the cell, allowing staining of MHC class II molecules both inside of and on the surface of the cell. Thus this result is consistent with experimental design.

4.1.6 Fluorescence by Experiment and Diet Type, M5114

The analysis above is easily expanded by adding experiment as one of the grouping variables. The select statement includes only M5114 samples, and computes average normalized FL2. Result sets for both NORMAL and CPCF processes are presented.

SQL Statement - NORMAL Process

select experiment, diet, avg(fl2log/fs), count(*)
from event
where sample_type='M5114' and process='NORMAL'
group by experiment, diet


Table 4-7. Result Set - Fluorescence by Experiment by Diet Type, Normal Processing

EXPERIMENT  DIET  AVG(FL2LOG/FS)  COUNT(*)
1           H     3.4             215554
1           L     3.3             262408
2           H     3.6             279322
2           L     3.3             252318
3           H     3.1             222333
3           L     3.4             266267

The findings are as follows: Experiment 2, Males, shows a greater difference in normalized FL2 between high fat and low fat diets than does Experiment 1, Females; Experiment 2, Males, shows higher normalized FL2; and Experiment 3, Wounded Mice, shows greater normalized FL2 in low fat diets than in high fat diets, with its high fat diet showing the lowest normalized FL2 of any of these result sets.

Table 4-8. Result Set - Fluorescence by Experiment by Diet Type, CPCF Processing

EXPERIMENT  DIET  AVG(FL2LOG/FS)  COUNT(*)
1           H     4.3             225751
1           L     4.4             256878
2           H     4.4             189169
2           L     4.5             246198
3           H     4.9             209372
3           L     4.9             226530

The results show that: in no case does the high fat diet show more normalized FL2 than the low fat diet; and Experiment 3, Wounded Mice, shows higher normalized FL2 than do the other two experiments.

Inter-experimental comparisons may be biased by differing machine calibration on different days. Consequently, these results warrant further investigation to ascertain validity.

4.1.7 Simple Histograms Using Text

Constructing simple histograms using text is helpful when the user does not have easy access to a good graphical tool, and wants to do some quick graphical exploration. The technique can be used to show similarity of histogram curves across replicates, or for a quick identification of a local minimum. Additionally, this technique can provide graphical results for a large number of samples quickly. One of the challenges of graphical presentation on a two-dimensional surface, such as a piece of paper, is how to distinguish between many samples. Only a certain number of samples can fit onto the graph before the presentation becomes too cluttered to extract useful information. Thus, this technique, while not showing multiple samples in the same data space, does provide a quick visualization of a large number of samples.

SQL Statement

select tag, individual, trunc(fl2log/50) t,
       substr(lpad('o', round(count(*)/20,0), 'o'), 1, 40) graph
from event
where fs>100 and sample_type='M5114' and process='NORMAL' and replicate=1
group by tag, individual, trunc(fl2log/50)
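The statement above turns each bin count into a bar of repeated characters: one 'o' per 20 events, cut off at 40 characters. As a rough illustration outside the database, the same idea can be sketched in Python. The function name, its parameters, and the readings are made up for this example and are not part of the thesis code.

```python
from collections import Counter


def text_histogram(values, bin_width=50, events_per_char=20, max_chars=40):
    """Return (bin, bar) pairs with one 'o' per `events_per_char` events."""
    # Floor division matches trunc() here because channel values are
    # assumed to be non-negative.
    counts = Counter(int(v // bin_width) for v in values)
    rows = []
    for bin_index in sorted(counts):
        # Round half up, as Oracle's ROUND does for positive numbers.
        length = int(counts[bin_index] / events_per_char + 0.5)
        rows.append((bin_index, "o" * min(length, max_chars)))
    return rows


# Made-up FL2LOG readings: 100 events at 120, 30 at 510, 900 at 770.
readings = [120] * 100 + [510] * 30 + [770] * 900
result = text_histogram(readings)
for bin_index, bar in result:
    print(f"{bin_index:>3} |{bar}")
```

Changing bin_width trades resolution for smoothness, and events_per_char sets the vertical scale; both correspond to the constants 50 and 20 in the SQL statement.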


Table 4-9. Result Set (Extract) - Textual Histogram

TAG       INDIVIDUAL  T     GRAPH
G0037249  UH-2        0.0   ooooooooooooooooooooooooo
G0037249  UH-2        1.0   oooooo
G0037249  UH-2        2.0   oooooo
G0037249  UH-2        3.0   ooooooooooo
G0037249  UH-2        4.0   oooooooooooooo
G0037249  UH-2        5.0   oooooooooooooooo
G0037249  UH-2        6.0   ooooooooooooooooo
G0037249  UH-2        7.0   ooooooooooooo
G0037249  UH-2        8.0   oooooooooo
G0037249  UH-2        9.0   oooooooo
G0037249  UH-2        10.0  ooooooo
G0037249  UH-2        11.0  oooooooooo
G0037249  UH-2        12.0  ooooooooooooo
G0037249  UH-2        13.0  oooooooooooooooooo
G0037249  UH-2        14.0  ooooooooooooooooooooooo
G0037249  UH-2        15.0  oooooooooooooooooooooooooooo
G0037249  UH-2        16.0  oooooooooooooooooooooooooooo
G0037249  UH-2        17.0  ooooooooooooooooooooooo
G0037249  UH-2        18.0  ooooooooooooo
G0037249  UH-2        19.0  ooooooo
G0037249  UH-2        20.0  oooo

Note the two peaks of fluorescence, and the local minimum at trunc(fl2log/50)=10. This bin covers fluorescence readings from 500 to 549.

4.1.8 Normalized Fluorescence by Forward Scatter Range

The following technique was inspired by a "Quartile" based analysis presented by Covas et al. (2004). Covas presents forward scatter quartiles, and reports fluorescence by each quartile. The technique discussed below divides forward scatter into four equal ranges (0 to 255, 256 to 511, 512 to 767, 768 to 1023) and computes the average normalized fluorescence in each of these ranges.

SQL Statement

select tag, individual, sample_type,
       avg(decode(trunc(fs/256),0,fl2log/fs,null)) Q1,
       avg(decode(trunc(fs/256),1,fl2log/fs,null)) Q2,
       avg(decode(trunc(fs/256),2,fl2log/fs,null)) Q3,
       avg(decode(trunc(fs/256),3,fl2log/fs,null)) Q4
from event
where experiment=2 and process='NORMAL' and sample_type in ('M5114','Isotype')
group by tag, individual, sample_type

Table 4-10. Result Set - Fluorescence by Forward Scatter Range

TAG       INDIVIDUAL  SAMPLE_TYPE  Q1   Q2   Q3   Q4
G0036688  BL-9        Isotype      1.0  0.4  0.5  0.5
G0036689  BL-9        M5114        4.1  1.7  1.4  1.0
G0036690  BL-9        M5114        3.9  1.7  1.3  1.0
G0036693  BL-10       Isotype      0.9  0.4  0.4  0.5
G0036694  BL-10       M5114        3.5  1.4  1.1  0.9
G0036695  BL-10       M5114        3.7  1.5  1.3  1.0
G0036696  BL-10       M5114        3.6  1.5  1.3  1.0
G0036708  BH-7        Isotype      0.9  0.4  0.4  0.5
G0036709  BH-7        M5114        3.8  1.6  1.2  0.9
G0036710  BH-7        M5114        4.1  1.7  1.3  1.0
G0036711  BH-7        M5114        4.1  1.7  1.3  1.0
G0036713  BH-8        Isotype      0.8  0.4  0.4  0.5
G0036714  BH-8        M5114        3.5  1.5  1.1  0.9
G0036715  BH-8        M5114        3.8  1.6  1.2  0.9
G0036716  BH-8        M5114        2.1  0.9  0.8  0.6

These results show that the Isotype samples fluoresce primarily in the lower ranges of FS (<256), while the M5114 samples also fluoresce in FS ranges from 256 to 1023.
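The four-range query relies on conditional aggregation: each event contributes its normalized fluorescence to exactly one of the four columns and a null to the others, and avg() ignores nulls. The sketch below shows the same pattern under stated assumptions: Python's sqlite3 module stands in for Oracle, CASE replaces the Oracle-specific DECODE, the event table is cut down to three columns, and the tags and values are invented.

```python
import sqlite3

# SQLite truncates integer division, so fs / 256 plays the role of
# trunc(fs/256); multiplying by 1.0 keeps the normalized value fractional.
QUERY = """
    select tag,
           avg(case when fs / 256 = 0 then fl2log * 1.0 / fs end) as q1,
           avg(case when fs / 256 = 1 then fl2log * 1.0 / fs end) as q2,
           avg(case when fs / 256 = 2 then fl2log * 1.0 / fs end) as q3,
           avg(case when fs / 256 = 3 then fl2log * 1.0 / fs end) as q4
    from event
    group by tag
    order by tag
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table event (tag text, fs integer, fl2log integer)")
# Made-up events; T1 and T2 are hypothetical tags.
conn.executemany(
    "insert into event values (?, ?, ?)",
    [
        ("T1", 100, 400),  # FS 0-255, normalized 4.0
        ("T1", 200, 400),  # FS 0-255, normalized 2.0
        ("T1", 300, 600),  # FS 256-511, normalized 2.0
        ("T1", 800, 800),  # FS 768-1023, normalized 1.0
        ("T2", 600, 300),  # FS 512-767, normalized 0.5
    ],
)
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

A range with no events for a given tag comes back as a null (None in Python) rather than zero, which keeps empty ranges distinguishable from ranges with no fluorescence.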


4.2 Graphical Analyses

Recall that the mainstream graphical treatments of the data collected from a single sample are two-dimensional scattergrams and histograms, as discussed in Chapter 2. The RAE allows the analyst to consider multiple samples simultaneously. The two-dimensional scattergram might be able to support the simultaneous presentation of two or three samples, with different samples indicated by different colors. However, the more natural grouping of "all samples, this individual" usually contains 15 samples; the grouping "all individuals, this sample type" usually contains 20-25 samples. Consequently, a chart containing one histogram for each sample is a better way to display more information in a compact space.

In this work, these charts were created with a Java program incorporating JFreeChart (www.jfree.org) libraries. One feature of this solution is that a SQL statement submitted to the database returned a result set which was then displayed graphically. No intermediate steps were required. Successive modifications of the SQL statement led to reasonably clean visualizations. Modifications included such steps as curve smoothing by creating wider bins for the histogram counts (e.g. round(fl2log*4/fs)/4, count(*)) and constraining the result sets to a range of relevant data activity.

Table 4-11 shows a representative SQL statement. The expression round(fl2log*4/fs)/4 creates a smoother curve. The constraint, "having round(fl2log*4/fs)/4 between .5 and 10", limits the result set and thus the display to an appropriate range.

Table 4-11. SQL Statement for Graphical Result

select tag||sample_type||replicate, round(fl2log*4/fs)/4, count(*)
from event
where individual='BH-6'
group by tag||sample_type||replicate, round(fl2log*4/fs)/4
having round(fl2log*4/fs)/4 between .5 and 10

Figure 4-1 shows a family of histograms of Normalized FL2 for all samples for a particular individual.

[Figure 4-1: chart titled "Normalized FL2 Histograms: Individual BH-6"; x-axis: Normalized FL2; one histogram series per sample, labeled by tag, sample type, and replicate.]
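The binning expression round(fl2log*4/fs)/4 rounds normalized FL2 to the nearest quarter unit, so that each sample yields one (bin, count) series for the chart, and the HAVING clause discards bins outside the range of interest. A minimal Python sketch of that shaping step follows; the function names, the sample labels, and the event values are assumptions made for illustration and do not come from the thesis program.

```python
from collections import Counter
import math


def quarter_bin(fl2log, fs):
    """Normalized FL2 rounded to the nearest 0.25, as round(fl2log*4/fs)/4."""
    # floor(x + 0.5) rounds half up, matching Oracle ROUND for positive values.
    return math.floor(fl2log * 4 / fs + 0.5) / 4


def histogram_series(events, low=0.5, high=10.0):
    """Map each sample label to a sorted list of (bin, count) points."""
    counts = Counter()
    for label, fs, fl2log in events:
        bin_value = quarter_bin(fl2log, fs)
        if low <= bin_value <= high:  # same role as the HAVING clause
            counts[(label, bin_value)] += 1
    series = {}
    for (label, bin_value), count in sorted(counts.items()):
        series.setdefault(label, []).append((bin_value, count))
    return series


# Made-up events as (label, fs, fl2log); the labels are hypothetical.
events = [
    ("S1 M5114 1", 200, 700),
    ("S1 M5114 1", 200, 690),
    ("S1 M5114 1", 200, 660),
    ("S2 Isotype 0", 200, 50),
    ("S2 Isotype 0", 200, 120),
]
series = histogram_series(events)
print(series)
```

A larger multiplier (for example 10 in place of 4) would give narrower bins and a more jagged curve; a smaller one would smooth further at the cost of detail.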

This technique shows the similarity or dissimilarity of the replicate samples. It also shows the similarities and the dissimilarities of different sample types. Note that the real-time display includes both color and interactivity. Data values are displayed when the mouse pointer hovers on a tick mark. Consequently, it is more informative than the screenshot of Figure 4-1.

Figure 4-2 shows histograms of Normalized FL2 for all M5114 samples for the individuals in Experiment 1. However, there are too many samples on this chart for it to be useful. Figure 4-3 shows the same information, only constrained to all individuals of the strain "C."

[Figure 4-2: chart titled "Normalized FL2 Histograms: Experiment 1"; x-axis: Normalized FL2 (1.0 to 10.0); one histogram series per M5114 sample, labeled by individual and tag.]

[Figure: chart titled "Normalized FL2 Histograms: Experiment 1"; x-axis: Normalized FL2 (1.0 to 10.0); one histogram series per sample, labeled by individual and tag.]

Figure 4-3. M5114 Samples, Experiment 1, Strain C

Figure 4-4 contains histograms of forward scatter for all M5114 samples of a particular individual. Again, it shows the similarity of the replicate samples. It also shows that different processes have a fundamentally different distribution of forward scatter. Upon first thought, one would not expect the processing method to influence the size of the cells. However, the CPCF process perforates the cell membrane, thereby significantly altering the forward scatter.


[Figure: chart titled "Forward Scatter Histograms" for one individual; x-axis: FS; one histogram series per sample, labeled by tag and process (NORMAL or CPCF).]

Figure 4-4. Forward Scatter, M5114 Samples

4.3 Summary

In summary, the Rich Analytical Environment supports both statistical and graphical analysis techniques. These techniques provide more information more quickly than does the traditional flow cytometry analytical process.

This speed of response brings to mind another type of flow. In his book Flow: The Psychology of Optimal Experience, Csikszentmihalyi (1990) suggests that as human beings, our optimal experiences come when we are challenged. He calls these experiences "flow" and argues that flow is just as real as being hungry, just as concrete as walking into a wall. Two of the conditions required for experiencing flow are becoming immersed in an activity with a sense of immediacy, and timely feedback (p. 65). The data analysis task is fundamentally more satisfying when hypotheses can be posed, and relevant data can be retrieved, within seconds or minutes, as opposed to hours or days. Additionally, the speed of retrieval supports rapid hypothesis refinement and extension.


5. Conclusions and Future Work

The work presented in this thesis demonstrates that a rich analytical environment can be created from flow cytometry data. This analytical environment allows the analyst to perform types of analysis that were not previously possible, and to gather more knowledge more quickly. The essential features of the environment are: an open architecture in which multiple analytical tools can be leveraged; the combination of flow cytometry data with experimental metadata, such as strain, diet, and sample type; the ability to derive new metrics from core data, such as the calculation of normalized fluorescence; the ability to perform analyses across experiments as opposed to on one or two samples at a time; and the ability to group multiple samples together to perform characteristic-based analysis, where such characteristics include diet, strain, and process.

Furthermore, because the environment is so rich from an analytical perspective, the possibilities for future work are significant.


Future work falls into five main categories: generalization of parsing algorithms, automated identification of cell groups or phenotypes, incorporating additional information into the RAE, additional analytical algorithms, and process automation.

The code used to parse the FCS files in this work was written specifically for the data collected by this particular instrument. It is not a general implementation of a parsing algorithm for the FCS standard. In particular, the number and names of the data values collected are assumed known. Additionally, the byte order of the data section is assumed known. A more general implementation would extract that information from the text section, and adapt the parsing of the data section to those values. Future work might also find the optional analysis section of the file to be valuable.

Another aspect of generalization is extending the solution to support multiple or different relational databases. While the parsing algorithm writes the data to flat files, this solution used Oracle's bulk loading utility, SQLLOAD. Support for additional relational databases would require support for the bulk loading utilities associated with those databases. Additionally, the table and index creation syntax might need slight modifications for other databases.

One of the challenges that faces the biologist analyzing flow cytometry data is identifying the cell clusters, gates, or phenotypes. In traditional techniques, that identification is done manually. Automated clustering algorithms might prove to be of great value. Assuming that a viable algorithm could be found, the event-level data could be updated to include cluster name or number. Then, analysis could be performed on a particular cluster. That analysis could be either statistical or graphical, as discussed above. Additionally, that analysis could run across the standard cross-sections (all samples, this individual; or all individuals, this sample type) while constraining on a particular cluster, such as "T cells" or "mature B cells."

Another extension of the environment is to include additional information about the individuals. This information includes, but is not limited to, sex, weight, age, and weight of selected organs. Then, future analysis could incorporate these values. The analyst could research the effect of age or weight on the flow cytometry results.

The types of analysis discussed in Chapter 4, "Analytical Techniques in the Rich Analytical Environment," are but a starting point for the types of analysis that can be performed once the event data is stored in an open and accessible environment. Recall the various histograms presented above. Recall that in certain cases, one of the replicate samples would not follow the expected curve. Visual inspection suggests that the sample is an outlier, but a statistical goodness-of-fit measure might be more appropriate. Numerous other statistically intensive analytical techniques can be easily supported in this environment.

A final area for further work is process automation. As particular types of analyses show themselves to be valuable, the processes to perform these analyses can be automated. For example, the families of histograms could be automatically generated.

In conclusion, the RAE empowers both the biologist and the analyst. Many types of analysis are possible when all of the data from an experiment is made available in an open environment. As biologists become more familiar with what they can accomplish with this environment, certain processes will become standard. Other processes will emerge as innovative and exciting.
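The more general parser described under future work would read the number and names of the parameters, and the byte order, from the text section instead of assuming them. The sketch below illustrates that idea only and is not the parser used in this work. It assumes the FCS 3.0 layout in which the header holds the text segment's start and end offsets as ASCII numbers in bytes 10-17 and 18-25, the first character of the text segment is the delimiter, and the keywords $PAR, $PnN, and $BYTEORD carry the values of interest. It ignores escaped delimiters, and the sample bytes and parameter names are invented.

```python
def parse_fcs_text(raw):
    """Return (version, parameter names, struct byte-order prefix)."""
    version = raw[0:6].decode("ascii")
    text_start = int(raw[10:18].decode("ascii"))
    text_end = int(raw[18:26].decode("ascii"))  # offset of the last text byte
    text = raw[text_start:text_end + 1].decode("ascii")
    delimiter = text[0]
    # Simplifying assumption: no escaped (doubled) delimiters inside values.
    fields = text[1:].rstrip(delimiter).split(delimiter)
    keywords = dict(zip(fields[0::2], fields[1::2]))
    n_params = int(keywords["$PAR"])
    names = [keywords[f"$P{i}N"] for i in range(1, n_params + 1)]
    # "1,2,3,4" means least significant byte first (little endian).
    little_endian = keywords["$BYTEORD"].startswith("1,")
    return version, names, "<" if little_endian else ">"


# Build a tiny, made-up header and text segment for demonstration.
text_segment = "/$PAR/2/$BYTEORD/4,3,2,1/$P1N/FS/$P2N/FL2LOG/"
header = "FCS3.0    " + f"{58:>8}" + f"{58 + len(text_segment) - 1:>8}"
sample = (header.ljust(58) + text_segment).encode("ascii")

parsed = parse_fcs_text(sample)
print(parsed)
```

The returned prefix ("<" or ">") is in the form that Python's struct module accepts, so a follow-on step could use it, together with the parameter count, to unpack the data section.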


REFERENCES

BD Biosciences. (1999). Introduction to Flow Cytometry: A Learning Guide. [Manual Part Number: 11-11032-00].

Covas, D. T., De Lucena Angulo, I., Vianna Bonini Palma, P., & Zago, M. A. (2004). Effects of hydroxyurea on the membrane of erythrocytes and platelets in sickle cell anemia. Haematologica, 89(3), 237-280. Retrieved March 25, 2004 from http://www.haematologica.org/journal/2004/3/pdf/890273.pdf

Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. New York: Harper & Row.

Desbarats, J., Wade, T., Wade, W. F., & Newell, M. K. (1999). Dichotomy between naive and memory CD4+ T cell responses to Fas engagement. Proceedings of the National Academy of Sciences, 96(14), 8104-8109. Retrieved March 25, 2004 from http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=10393955

Harold, F. (2001). The Way of the Cell: Molecules, Organisms and the Order of Life. Oxford: Oxford University Press.

Harris, R. (2004, March 31). Human diseases mirrored in rat genome. All Things Considered [Radio Broadcast]. Washington, D.C.: National Public Radio. Retrieved on April 3, 2004 from http://www.npr.org/features/feature.php?wfId=1804676

Herzenberg, L. A. & Herzenberg, L. A. (2004). Genetics, FACS, immunology, and redox: a tale of two lives intertwined. Annual Review of Immunology, 22, 1-31. Retrieved on March 31, 2004 from http://herzenberg.stanford.edu/Publications/Reprints/LAH500.pdf


Huber, S. A., Sakkinen, P., David, C., Newell, M. K., & Tracy, R. P. (2001). T helper-cell phenotype regulates atherosclerosis in mice under conditions of mild hypercholesterolemia. Circulation, 103, 2610-2616. Retrieved March 25, 2004 from http://circ.ahajournals.org/cgi/reprint/103/21/2610.pdf

International Society for Analytical Cytology (ISAC). Data File Standard for Flow Cytometry, Version FCS3.0. Retrieved March 25, 2004 from http://www.isac-net.org/

Kelleher, K. (2004, April). The drug pipeline flows again. Business 2.0, 5(3), 50-51.

Kimball, R., Reeves, R., Ross, M., & Thornthwaite, W. (1998). The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. New York: John Wiley & Sons, Inc.

Kumar, L., Pivniouk, V., de la Fuente, M., Laouini, D., & Geha, R. (2002). Differential role of SLP-76 domains in T cell development and function. Proceedings of the National Academy of Sciences, 99(2), 884-889. Retrieved April 17, 2004 from http://www.pnas.org/cgi/reprint/99/2/884.pdf

Lee, J., Shin, J., Kim, E., Kang, H., Yim, I., Kim, J., et al. (2004). Immunomodulatory and antitumor effects in vivo by the cytoplasmic fraction of Lactobacillus casei and Bifidobacterium longum. Journal of Veterinary Science, 5(1), 41-48. Retrieved March 25, 2004 from http://www.vetsci.org/2004/pdf/41.pdf

Murphy, R. (1996). Flow Cytometry and Sorting. Retrieved on October 4, 2003 from http://flowcyt.cyto.purdue.edu/flowcyt/educate/theory/fluiopti/sld001.htm

Parks, D. R. (1996). Flow cytometry instrumentation and measurement. In L. Herzenberg, L. Herzenberg, C. Blackwell, & D. Weir (Eds.), The Handbook of Experimental Immunology (pp. 47.1-47.12). Boston: Blackwell Science. Retrieved on March 31, 2004 from http://herzenberg.stanford.edu/Publications/Reprints/LAH413.pdf


Parks, D. R. & Bigos, M. (1996). Collection, display and analysis of flow cytometry data. In L. Herzenberg, L. Herzenberg, C. Blackwell, & D. Weir (Eds.), The Handbook of Experimental Immunology (pp. 50.1-50.11). Boston: Blackwell Science. Retrieved on March 31, 2004 from http://herzenberg.stanford.edu/Publications/Reprints/LAH414.pdf

Schweitzer, S. C., Reding, A. M., Ford, C. A., Villalobs-Menuey, E., Huber, S. A., & Newell, M. K. (2004). Exogeneous Fatty Acids Affect the Expression of MHC Class II in Macrophage Cell Lines. Unpublished manuscript, University of Colorado at Colorado Springs.