Computer aided diagnosis of cystic fibrosis and pulmonary sarcoidosis using texture descriptors extracted from CT images

Material Information

Computer aided diagnosis of cystic fibrosis and pulmonary sarcoidosis using texture descriptors extracted from CT images
Rosskamm, Shoshana
Publication Date:
Physical Description:
ix, 81 leaves : ill. ; 28 cm.

Thesis/Dissertation Information

Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Mathematical and Statistical Sciences, CU Denver
Degree Disciplines:
Applied Mathematics
Committee Chair:
Lodwick, Weldon
Committee Co-Chair:
Cobb, Loren
Committee Members:
Gaspar, Laurie
Newman, Francis
Santorico, Stephanie


Subjects / Keywords:
Cystic fibrosis -- Diagnosis -- Computer programs ( lcsh )
Sarcoidosis -- Diagnosis -- Computer programs ( lcsh )
Lungs -- Tomography ( lcsh )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Thesis (M.A.)--University of Colorado Denver, 2010.
Includes bibliographical references (leaves 78-81).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Shoshana Rosskamm.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
671917362 ( OCLC )

Full Text
Shoshana Rosskamm
B.S., University of Colorado Denver, 2008
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Applied Mathematics

This thesis for the Master of Science
degree by
Shoshana Rosskamm
has been approved by
Weldon Lodwick
Francis Newman
Stephanie Santorico

Rosskamm, Shoshana (M.S., Applied Mathematics)
Computer-Aided Diagnosis of Cystic Fibrosis and Pulmonary Sarcoidosis Using
Texture Descriptors Extracted From CT Images
Thesis directed by Professor Weldon Lodwick
Pulmonary CT scans are analyzed by radiologists to differentially diagnose
pulmonary diseases. The objective of this study is to identify texture descrip-
tors that yield good discrimination between healthy parenchyma, cystic fibrosis
(CF), and sarcoidosis (SAR) for the purpose of constructing a computer aided
diagnostic tool. The Spatial Gray Level Dependence Method (SGLDM) and
Run Difference Method (RDM) for quantifying textural patterns are modified
to enhance their power, and a Radial Gray Level Distribution Method (RGLDM)
for extracting radial patterns is proposed. Then, classification models are con-
structed using forward logistic regression, linear discriminant analysis, support
vector machines, classification trees, and probabilistic neural networks. These
classification models then vote to classify CT images. Using this method, 94%
of healthy images, 96% of CF images, and 82% of SAR images are correctly
classified.

This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Weldon Lodwick

Figures ................................................................. vii
Tables.................................................................... ix
1. Introduction............................................................ 1
2. Materials and Methods................................................... 8
2.1 Image Acquisition..................................................... 8
2.2 Feature Extraction................................................... 10
2.2.1 Spatial Gray Level Dependence Method............................... 11
2.2.2 Run Difference Method ............................................. 19
2.2.3 Radial Gray Level Distribution Method ............................. 22
2.3 Feature Selection................................................... 29
2.4 Feature Classification............................................... 34
2.4.1 Logistic Regression................................................ 34
2.4.2 Linear Discriminant Analysis....................................... 39
2.4.3 Support Vector Machine............................................. 42
2.4.4 Classification Trees............................................... 48
2.4.5 Probabilistic Neural Network....................................... 50
3. Results................................................................ 53
4. Discussion and Conclusion.............................................. 58
4.1 Discussion........................................................... 58

4.2 Future Directions.................................................. 63
4.3 Conclusion......................................................... 65
A. Statistical Code (R)................................................ 66
B. PNN Code (Matlab) .................................................. 68
C. SGLDM Code (Matlab) ................................................ 71
D. RDM Code (Matlab)................................................... 74
E. RGLDM Code (Matlab)................................................. 76
References.............................................................. 78

1.1 CT of a control patient........................................................... 2
1.2 CT of a patient with CF. The circled region contains a mucus-filled airway. . 3
1.3 CT of a patient with CF. The circled region is an example of mosaic perfusion. 3
1.4 CT of a patient with SAR. The circled region illustrates the honeycombing texture. 5
1.5 CT of a patient with SAR. The circled region shows an example of the nodular-
ization associated with SAR.................................................... 5
2.1 Conversion of an image to 8 gray levels by binning and computing f^{0,1}, an
SGLDM matrix for θ = 0 and d = 1............................................. 13
2.2 Plots of the eight SGLDM features as a function of d, which ranges from 1 to
⌊S/2⌋ = 11....................................................................... 20
2.3 Conversion of an image to 8 gray levels by binning and computing P, an RDM
matrix for θ = 0................................................................. 22
2.4 The mucus-filled airway (circled) in this image is not as large as its containing
ROI.............................................................................. 24
2.5 We consider a circle of radius D = S/2, which we split into a = 5 annular
regions by the black circles. We sample all points within each annular region
that intersect with the red lines................................................ 25
2.6 This simple plot shows that a logistic curve with variance of homogeneity (x₃₆)
as a predictor may lead to good separation between healthy and diseased images. 35
2.7 An SVM classifier for a two-class problem in the case that the classes are separable. 43

2.8 An SVM classifier for a two-class problem in the case that the classes are not
separable......................................................................... 44
2.9 Classification tree partition of a two-dimensional feature space into seven regions. 49
2.10 Architecture of a probabilistic neural network for a two-class classification problem. 52

2.1 SGLDM Features 1-16.................................................... 16
2.2 SGLDM Features 17-32 .................................................. 17
2.3 SGLDM Features 33-48 .................................................. 18
2.4 RDM Features 49-53 .................................................... 23
2.5 RGLDM Basic Features 54-74 ........................................ 27
2.6 RGLDM Basic Features 75-88 ........................................ 28
2.7 RGLDM Difference Features 89-109 .................................. 30
2.8 RGLDM Difference Features 110-123.................................. 31
3.1 Classification accuracy for control, CF, and SAR images when com-
paring various classification algorithms and two feature selection cri-
teria............................................................. 54
3.2 Confusion matrix associated with the majority-rules voting of three
classification models.................................................. 54

1. Introduction
Radiologists use various qualitative and quantitative factors in diagnos-
ing pulmonary diseases such as cystic fibrosis (CF) and pulmonary sarcoido-
sis (SAR). CF, the most common lethal autosomal recessive disorder in Cau-
casians, leads to obstructive pulmonary disease and, eventually, pulmonary fail-
ure [11, 12]. The progression of CF is tracked using a variety of diagnostic
procedures, including sputum cultures, pulmonary function tests, and CT imag-
ing [12]. CT scoring systems have been proposed for analysis of anatomic details
in CT images. CF can be characterized in CT by a number of visual patterns
that often manifest in tandem, such as bronchiectasis, bronchial wall thickening,
mucus-plugging, mosaic perfusion, and emphysema [11, 12]. Two textural
patterns associated with CF that we focus on in this study are mucus-plugging
and bronchial wall thickening (Figure 1.2) and mosaic perfusion (Figure 1.3).
These patterns can be contrasted with the textures shown in Figure 1.1, which
displays healthy lung.
SAR is a multiorgan disease that manifests in the lung in 90% of patients
suffering from the disease [1, 15]. Sometimes SAR resolves itself spontaneously,
but many times it progresses to fibrosis, and although SAR can affect almost any
organ, morbidity and mortality are due primarily to pulmonary SAR. Like CF,
SAR can be diagnosed and its progress can be tracked by CT imaging. SAR
can be characterized in CT by visual patterns including bronchial distortion,
bronchovascular deformation, honeycombing, linear opacities, and nodules along

Figure 1.1: CT of a control patient.
bronchovascular bundles and in subpleural regions [1, 15]. Two textural patterns
associated with SAR that we focus on here are honeycombing (Figure 1.4) and
nodules (Figure 1.5).
Clearly, these textural patterns associated with CF and SAR are noticeable
in CT, and radiologists look for these patterns when diagnosing pulmonary CT
images. Nevertheless, there would be considerable benefit in a computer algo-
rithm to diagnose these (and other) pulmonary diseases. One advantage of such
a computer-aided diagnosis (CAD) system would be to enable the scans of pa-
tients in third-world regions that lack experienced radiologists to be analyzed
quickly and reliably.
Additionally, in recent years large pulmonary studies that involve the col-
lection of multiple CT scans from thousands of patients have been funded. Each
of these scans needs to be analyzed by a radiologist, and the analysis of these
scans can be prohibitively expensive and time consuming. A successful CAD

Figure 1.2: CT of a patient with CF. The circled region contains a mucus-filled airway.
Figure 1.3: CT of a patient with CF. The circled region is an example of mosaic perfusion.

system that can batch-process thousands of scans quickly would obviate this
expense and drastically reduce image analysis time in large-scale studies.
A third and very notable benefit of such a CAD system is to systematically
and quantitatively track progression of disease. In order for researchers or phar-
maceutical companies to test the benefit of a particular procedure or drug, it is
important to be able to objectively quantify disease progression or regression.
There can be much disagreement among radiologists who read the same pul-
monary CT images. For example, when 11 radiologists each read 131 scans with
diffuse parenchymal lung disease, observers agreed on the first choice diagnosis
for fewer than 50% of the cases [3]. Also, when two radiologists independently
determined emphysema severity scores, the correlation between readers ranged
from .59 to .87, and interobserver agreement about upper lobe-predominance
of emphysema was extremely poor [13]. This high interobserver variability sug-
gests that the tracking of disease progression based upon radiologists reads may
be highly unreliable. A successful CAD approach, however, should be able to
objectively and accurately quantify disease, thereby enabling reliable tracking
of disease progression or regression.
Because a CAD system to automatically analyze pulmonary CT images
would be such a great boon to medical scientists and pharmaceutical companies
alike, a number of studies have been done in an effort to automate the anal-
ysis of pulmonary CT scans [5, 14, 19, 20, 23, 24, 25, 27, 29, 30, 33]. Some
of these studies use very simple techniques, considering only density measure-
ments rather than textural patterns [14, 27]. Some studies directed towards a
CAD system for emphysema use pre-existing texture analysis methods such as

Figure 1.4: CT of a patient with SAR. The circled region illustrates the honeycombing texture.
Figure 1.5: CT of a patient with SAR. The circled region shows an example of the
nodularization associated with SAR.

the Spatial Gray-Level Dependence Method (SGLDM), Gray Level Run Length
Method (GLRLM), Gray-Level Difference Method (GLDM), Local Binary Pat-
terns (LBPs), and the Adaptive Multiple Feature Method (AMFM), which is
merely a combination of some of the aforementioned methods [19, 20, 25, 30, 33].
There are also a handful of studies that use texture descriptors extracted
from CT images to diagnose a variety of lung diseases or textural patterns in the
lung [5, 23, 24, 29]. Uppaluri et al. [29] employ the AMFM, a combination of
several pre-existing texture extraction methods, to classify tissue patterns into
six classes of texture: honeycombing, ground glass, bronchovascular abnormality,
nodular abnormality, emphysema, and normal lung. Chabat et al. [5] use
SGLDM features and introduce gray-level primitives in an effort to automate
discrimination between centrilobular emphysema, panlobular emphysema,
constrictive obliterative bronchiolitis, and normal lung. Sluimer et al. [24] use
the four moments of images filtered using the Gaussian, Laplacian, first-order
derivative, and second-order derivative filters to classify lung tissue into four
categories: linear and reticular abnormalities, nodular opacities, parenchymal
opacification, and cysts or emphysema. In a later study Sluimer et al. [23]
use the same features to classify lung tissue into six textural classes: normal,
hyperlucency, fibrosis, ground glass, solid, and focal.
Although many of these studies have shown that CAD systems for pul-
monary image analysis can be reliable, to our knowledge there are no reports in
the literature on using texture descriptors extracted from CT to identify CF and
SAR. Therefore, in this study we evaluate methods for feature extraction, selec-
tion, and classification in order to build a CAD model to discriminate between

some pathologic patterns associated with CF, SAR, and healthy parenchyma.
Toward this end, we extract texture descriptors using pre-existing feature ex-
traction methods and introduce other feature extraction techniques designed to
quantitatively capture textural patterns used by radiologists to diagnose CF and
SAR. We then compare two methods for feature selection and five popular classi-
fication algorithms: logistic regression (LR), linear discriminant analysis (LDA),
support vector machines (SVMs), classification trees (CLTs), and probabilistic
neural networks (PNNs).

2. Materials and Methods
2.1 Image Acquisition
Computed tomography (CT) images were obtained from University of Col-
orado Denver with a variety of scanners, including several Siemens, Philips, and
General Electric models at 120 kVp. Slice thickness ranged from 1 to 6.5 mm.
The CT scans were stored with a 512 x 512 pixel matrix in DICOM format. In
total, we use 14 healthy scans, 8 scans with CF, and 6 SAR scans.
In order to analyze the textures present within the CT images, we carefully
selected regions of interest (ROIs) of 35-pixel diameter within the parenchyma
that illustrate visual patterns associated with CF and SAR, as well as healthy
parenchyma (see Figures 1.1-1.5). Traditionally, square ROIs of a fixed size are
selected for texture analysis. In analyzing pulmonary images, however, the pe-
riphery of the lung is curvilinear. Hence, square ROIs selected near the lung
pleura may contain some pixels not contained within the parenchyma. Many
researchers overcome this difficulty by avoiding selection of ROIs near the periph-
ery of the lung. However, ultimately in a region-by-region classification scheme
we will desire to analyze regions near the periphery of the lung. Furthermore,
honeycombing patterns and nodules associated with SAR often manifest along
the pleura. Consequently, rather than selecting square ROIs, we select circular
ROIs. This allows more flexibility in selecting ROIs at the periphery of the lung
that do not contain pixels outside of the parenchyma.
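The circular-ROI selection just described can be sketched in a few lines. The sketch below is illustrative Python (the thesis's own code is the Matlab and R of the appendices); the function name and the masked-array convention are this sketch's assumptions.

```python
import numpy as np

def circular_roi(image, center, diameter=35):
    """Extract a circular ROI as a masked array: pixels outside the
    circle are masked, so that, unlike a square ROI near the pleura,
    no pixels outside the parenchyma enter the analysis."""
    r = diameter / 2.0
    cy, cx = center
    yy, xx = np.ogrid[:image.shape[0], :image.shape[1]]
    inside = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    return np.ma.masked_array(image, mask=~inside)
```

A square ROI would simply slice the array; the mask is what allows ROIs to hug the curvilinear lung boundary.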

From each healthy scan¹ we selected between 2 and 34 ROIs, and the median
number of ROIs selected per scan was 6.5. In total, we selected 136 ROIs
from the 14 healthy patients' scans. From each CF scan we selected between
4 and 73 ROIs, with a median of 20.5 ROIs selected per scan. In total, we
selected 212 ROIs illustrating textural patterns associated with CF. Of these
212 CF ROIs, approximately 30% are centered on mucus-filled airways, and the
remaining 70% depict mosaic attenuation. Some of these CF ROIs, however,
contain both mucus-filled airways and the mosaic pattern. From each SAR
scan we selected between 11 and 52 ROIs, and the median number of ROIs
selected per SAR scan was 28. In total, 169 SAR ROIs were selected, some
of which depict nodularization and others of which depict the honeycombing
pattern. About 55% of these ROIs contain nodules, and the remaining 45%
illustrate the honeycombing pattern. Nonetheless, it is likely that many of these
ROIs actually contain both the honeycombing texture and nodularities.
We split these ROIs into distinct training and test datasets by randomly
selecting 1/3 of the ROIs for testing and using the remaining 2/3 for training
purposes. In order to ensure fair representation of all textures in both the
testing and the training sets, we randomly select 1/3 of each class, rather than
of the full dataset, for testing. In total, 90 control ROIs, 41 CF ROIs centered
on mucus-filled airways, 100 CF ROIs illustrating the mosaic texture, 51 SAR
ROIs with honeycombing, and 61 SAR ROIs with nodules comprise the training
set, and the testing set is comprised of all remaining ROIs. In order to evaluate
statistical classification models fairly, the test dataset is used exclusively for
¹In general, the term scan will be used to refer to a patient's full CT scan, which is typically
composed of 50-150 slices.

evaluation purposes when all statistical model-building is complete.
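The per-class hold-out described above can be sketched as follows. This is illustrative Python; the data layout (a dict mapping texture class to its ROIs), the function name, and the fixed seed are assumptions of this sketch, not the thesis's protocol.

```python
import random

def stratified_split(rois_by_class, test_frac=1/3, seed=0):
    """Randomly hold out test_frac of EACH texture class, so every
    class is fairly represented in both the training and test sets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, rois in rois_by_class.items():
        rois = rois[:]              # do not mutate the caller's list
        rng.shuffle(rois)
        n_test = round(len(rois) * test_frac)
        test[label] = rois[:n_test]
        train[label] = rois[n_test:]
    return train, test
```

Splitting class by class, rather than sampling a third of the pooled dataset, is what guarantees the fair representation of every texture mentioned above.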
2.2 Feature Extraction
Radiologists are very successful at identifying the presence or absence of
CF and SAR in CT images. When a radiologist reads a CT scan to perform
a diagnosis, he/she looks for visual patterns that characterize disease. For ex-
ample, nodules surrounding bronchi or the lung periphery may be evidence of
pulmonary sarcoidosis, and if a radiologist finds such a pattern in CT he/she
may be more inclined to diagnose the patient as such. Since radiologists diag-
nose CF and SAR quite successfully, in training computer algorithms to perform
such diagnoses we wish to mimic the way that radiologists process textural pat-
terns to make their diagnoses. Thus, in order to construct a computer-aided
approach to pulmonary CT diagnosis, it is important to understand the steps
that a radiologist takes in performing a diagnosis.
When a human looks at an image, he/she is able to not only consider the
image as a whole but also to identify notable features in an image. A CT scan
is, essentially, a large pixel map containing millions of individual data points,
or pixel attenuation values. In order to make sense of this huge amount of
data, when a radiologist reads a CT scan he/she is trained to look for particular
features that are evidence of different diseases. Some examples of features that a
radiologist may look for in pulmonary CT are lung nodules, airways, or textural
patterns. When looking for nodules, the radiologists are essentially looking for
small areas of bright attenuation, and when looking for airways, radiologists
look for circular rings of bright attenuation surrounding circular regions with
attenuation values similar to the density of air. Likewise, a computer algorithm

can extract these features. We can mathematically define textural features, or
texture descriptors, that describe these visual patterns.
In order to mimic the way radiologists read CT scans, we perform feature
extraction. This is a step that well-trained radiologists perform almost instan-
taneously and therefore are hardly aware of. Nevertheless, when radiologists
look for and identify particular visual patterns that describe disease, they are
performing feature extraction. In this section we detail several methods for
defining important textural patterns in a language that a computer algorithm
understands so that these algorithms can extract features that radiologists look
for to perform their diagnoses. It is important to be aware that the purpose of
feature extraction is to enable an algorithm to, like a radiologist, make sense of
millions of pixels and extract only the important information from this veritable
wealth of noisy information.
2.2.1 Spatial Gray Level Dependence Method
Mosaic perfusion, honeycombing, and even nodules are patterns that are
typically textural in nature. Especially when disease is in an advanced stage,
the patterns present in CT cover large areas of the lung. In these cases one
can expect that an ROI may contain only one of these patterns and that a
circular ROI of diameter 35 pixels displaying a particular diseased pattern will
not contain areas of healthy parenchyma. Hence, in this case we may rightfully
assume that a given ROI displays only one texture, so we may extract textural
features evenly across an ROI.
The spatial gray level dependence method (SGLDM) [8, 9] seeks to extract
image texture features that describe the spatial distribution and spatial depen-

dence among gray levels in local areas throughout an image. This method,
which is based on the estimation of the second order joint conditional probabil-
ity density function (pdf), has been used in a variety of medical image processing
settings, as well as other applications, and is one of the most successful methods
for texture discrimination [17]. Therefore, we evaluate the effectiveness of the
SGLDM and extend the SGLDM to gain additional textural knowledge for use
with pulmonary CT images.
Traditionally, before an image is processed using the SGLDM, the gray lev-
els are binned so that the resulting image contains only a few gray levels. The
purpose of reducing the number of gray levels is to simultaneously enhance com-
putational efficiency and the power of this feature extraction approach. Typical
values for the number of gray levels G to use are 8, which is what we use here,
or 16. The SGLDM then estimates the second order conditional bivariate pdf
f(i, j | d, θ), where θ ∈ Θ, Θ = {0, π/4, π/2, 3π/4}, and d ∈ {1, ..., S − 1} defines how
large an area is considered. Here S is the length of a side of the image, or ROI,
being analyzed. Given θ and d, the ijth element in the G × G matrix f^{θ,d} is
the probability² that gray level i is a distance d away from gray level j in the
direction of θ.
Using the SGLDM, about two dozen co-occurrence features can be obtained
from each SGLDM matrix, including contrast, homogeneity, variance, and clus-
ter prominence, for example. Usually SGLDM matrices are computed for a
distance d = 1 and for θ ∈ {0, π/4, π/2, 3π/4}, and then averaged to create a single
²If the SGLDM matrix is normalized then the elements are probabilities. Nevertheless, the
SGLDM matrix need not be normalized, in which case the elements are scalar multiples of
the associated probabilities.

Figure 2.1: Conversion of an image to 8 gray levels by binning and computing f^{0,1}, an
SGLDM matrix for θ = 0 and d = 1.
SGLDM matrix.
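The binning and co-occurrence counting described above can be sketched as follows. This is illustrative Python (the thesis's implementation is the Matlab code of Appendix C); the uniform-width binning convention and function names are this sketch's choices.

```python
import numpy as np

def bin_gray_levels(image, G=8):
    """Bin an image into G gray levels 1..G, as done before the SGLDM
    (uniform-width bins over the image's intensity range)."""
    lo, hi = image.min(), image.max()
    binned = np.floor((image - lo) / (hi - lo + 1e-12) * G).astype(int) + 1
    return np.clip(binned, 1, G)

def sgldm(binned, d=1, G=8):
    """Unnormalized SGLDM matrix for theta = 0 (horizontal direction):
    element (i, j) counts how often gray level i lies a distance d to
    the left of gray level j."""
    f = np.zeros((G, G), dtype=int)
    left, right = binned[:, :-d], binned[:, d:]
    for i, j in zip(left.ravel(), right.ravel()):
        f[i - 1, j - 1] += 1
    return f
```

Dividing the resulting matrix by its total count gives the probability interpretation used in the text.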
It is straightforward to extend the traditional SGLDM to circular ROIs. In
this study, however, rather than considering only four values for θ, we propose to
gain more information by considering eight angles: Θ = {0, π/8, π/4, 3π/8, π/2, 5π/8, 3π/4, 7π/8}.
How to interpolate pixel values in order to consider these eight angles is not ob-
vious. We inscribe a square within the ROI and then rotate the inscribed square
by each θ ∈ Θ radians, defining new pixel values by bilinear interpolation.
We then compute SGLDM matrices from each of these inscribed
squares in a single direction, the horizontal direction (θ = 0), and we extract
co-occurrence features from the sum of these eight SGLDM matrices, which we
denote f^d.
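The rotate-and-resample step can be sketched as an inverse-mapped bilinear interpolation. This is illustrative Python, not the thesis's Matlab; clamping out-of-range samples to the image edge is an assumption of this sketch.

```python
import numpy as np

def rotate_bilinear(image, theta):
    """Rotate a square image about its center by theta radians,
    resampling with bilinear interpolation."""
    S = image.shape[0]
    c = (S - 1) / 2.0
    rows, cols = np.mgrid[0:S, 0:S]
    # inverse mapping: sample the source at the un-rotated coordinates
    y, x = rows - c, cols - c
    ys = c + y * np.cos(theta) - x * np.sin(theta)
    xs = c + y * np.sin(theta) + x * np.cos(theta)
    y0 = np.clip(np.floor(ys).astype(int), 0, S - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, S - 2)
    wy = np.clip(ys - y0, 0.0, 1.0)
    wx = np.clip(xs - x0, 0.0, 1.0)
    top = image[y0, x0] * (1 - wx) + image[y0, x0 + 1] * wx
    bot = image[y0 + 1, x0] * (1 - wx) + image[y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy
```

Cropping the central square of side S/√2 from the rotated array then yields the inscribed square from which the θ = 0 SGLDM is computed.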
Also, rather than only considering a distance d = 1, we propose to compute
SGLDM matrices for all distances d ∈ {1, ..., ⌊S/2⌋}. Hence, in total we compute
8 × ⌊S/2⌋ SGLDM matrices. Because we expect that our ROIs are rotation
invariant, we sum the 8 SGLDM matrices for each value of d, thereby ending up
with ⌊S/2⌋ SGLDM matrices, where

f^d_{ij} = Σ_{θ ∈ Θ} f^{θ,d}_{ij}.    (2.1)

We then extract the following eight co-occurrence features from each of these
⌊S/2⌋ SGLDMs:
1. Contrast, which measures local contrast in an image:

γ₁^d = Σ_{i=1}^{G} Σ_{j=1}^{G} (i − j)² f^d_{ij}    (2.2)

2. Correlation, which measures correlation of pixel pairs on gray levels:

γ₂^d = [ Σ_{i=1}^{G} Σ_{j=1}^{G} (i − μ_x^d)(j − μ_y^d) f^d_{ij} ] / (σ_x^d σ_y^d),    (2.3)

where

μ_x^d = Σ_{i=1}^{G} Σ_{j=1}^{G} i f^d_{ij},    (2.4)
μ_y^d = Σ_{i=1}^{G} Σ_{j=1}^{G} j f^d_{ij},    (2.5)
(σ_x^d)² = Σ_{i=1}^{G} Σ_{j=1}^{G} (i − μ_x^d)² f^d_{ij},    (2.6)
(σ_y^d)² = Σ_{i=1}^{G} Σ_{j=1}^{G} (j − μ_y^d)² f^d_{ij}.    (2.7)

3. Energy, which measures the occurrence of repeated pairs in an image:

γ₃^d = Σ_{i=1}^{G} Σ_{j=1}^{G} (f^d_{ij})²    (2.8)

4. Homogeneity, which measures an image's smoothness:

γ₄^d = Σ_{i=1}^{G} Σ_{j=1}^{G} f^d_{ij} / (1 + |i − j|)    (2.9)

5. Variance, which measures variation of the gray level distribution:

γ₅^d = Σ_{i=1}^{G} Σ_{j=1}^{G} (i − μ_x^d)² f^d_{ij}    (2.10)

6. Sum average, which measures the average gray level in an image:

γ₆^d = (1/2) Σ_{i=1}^{G} Σ_{j=1}^{G} (i + j) f^d_{ij}    (2.11)

7. Maximum probability, which determines the predominance of the most
predominant pixel pair:

γ₇^d = max_{i,j} f^d_{ij}    (2.12)

8. Cluster prominence, which measures grouping of pixels with similar gray
levels:

γ₈^d = Σ_{i=1}^{G} Σ_{j=1}^{G} (i + j − μ_x^d − μ_y^d)⁴ f^d_{ij}    (2.13)

We proceed by using equations (2.2)-(2.13) to define characteristics of the
SGLDM matrices. For t = 1, ..., 8, we define the characteristic

γ̄_t = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} γ_t^d,    (2.14)
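As a concrete illustration, four of the eight co-occurrence descriptors (contrast, energy, homogeneity, and maximum probability) can be computed from a normalized SGLDM matrix in a few lines. This is an illustrative Python sketch (the thesis's own implementation is the Matlab code of Appendix C); the function name is this sketch's choice.

```python
import numpy as np

def cooccurrence_features(f):
    """Contrast, energy, homogeneity, and maximum probability of an
    SGLDM matrix, following the definitions above. The matrix is
    normalized first, so the entries are probabilities."""
    f = f / f.sum()                     # normalize to probabilities
    G = f.shape[0]
    i, j = np.indices((G, G)) + 1       # gray levels 1..G
    return {
        "contrast": np.sum((i - j) ** 2 * f),
        "energy": np.sum(f ** 2),
        "homogeneity": np.sum(f / (1 + np.abs(i - j))),
        "max_probability": f.max(),
    }
```

A perfectly diagonal matrix (identical gray-level pairs only) gives zero contrast and maximal homogeneity, matching the intuition behind equations (2.2) and (2.9).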


Table 2.1: SGLDM Features 1-16

Contrast: x₁ = γ₁^1
Correlation: x₂ = γ₂^1
Energy: x₃ = γ₃^1
Homogeneity: x₄ = γ₄^1
Variance: x₅ = γ₅^1
Sum average: x₆ = γ₆^1
Maximum probability: x₇ = γ₇^1
Cluster prominence: x₈ = γ₈^1
Mean contrast: x₉ = γ̄₁
Mean correlation: x₁₀ = γ̄₂
Mean energy: x₁₁ = γ̄₃
Mean homogeneity: x₁₂ = γ̄₄
Mean variance: x₁₃ = γ̄₅
Mean sum average: x₁₄ = γ̄₆
Mean maximum probability: x₁₅ = γ̄₇
Mean cluster prominence: x₁₆ = γ̄₈

Table 2.2: SGLDM Features 17-32

Minimum contrast: x₁₇ = min_d γ₁^d
Minimum correlation: x₁₈ = min_d γ₂^d
Minimum energy: x₁₉ = min_d γ₃^d
Minimum homogeneity: x₂₀ = min_d γ₄^d
Minimum variance: x₂₁ = min_d γ₅^d
Minimum sum average: x₂₂ = min_d γ₆^d
Minimum maximum probability: x₂₃ = min_d γ₇^d
Minimum cluster prominence: x₂₄ = min_d γ₈^d
Maximum contrast: x₂₅ = max_d γ₁^d
Maximum correlation: x₂₆ = max_d γ₂^d
Maximum energy: x₂₇ = max_d γ₃^d
Maximum homogeneity: x₂₈ = max_d γ₄^d
Maximum variance: x₂₉ = max_d γ₅^d
Maximum sum average: x₃₀ = max_d γ₆^d
Maximum maximum probability: x₃₁ = max_d γ₇^d
Maximum cluster prominence: x₃₂ = max_d γ₈^d

Table 2.3: SGLDM Features 33-48

Variance of contrast: x₃₃ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₁^d − γ̄₁)²
Variance of correlation: x₃₄ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₂^d − γ̄₂)²
Variance of energy: x₃₅ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₃^d − γ̄₃)²
Variance of homogeneity: x₃₆ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₄^d − γ̄₄)²
Variance of variance: x₃₇ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₅^d − γ̄₅)²
Variance of sum average: x₃₈ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₆^d − γ̄₆)²
Variance of maximum probability: x₃₉ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₇^d − γ̄₇)²
Variance of cluster prominence: x₄₀ = (1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₈^d − γ̄₈)²
Skewness of contrast: x₄₁ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₁^d − γ̄₁)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₁^d − γ̄₁)²]^{3/2}
Skewness of correlation: x₄₂ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₂^d − γ̄₂)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₂^d − γ̄₂)²]^{3/2}
Skewness of energy: x₄₃ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₃^d − γ̄₃)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₃^d − γ̄₃)²]^{3/2}
Skewness of homogeneity: x₄₄ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₄^d − γ̄₄)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₄^d − γ̄₄)²]^{3/2}
Skewness of variance: x₄₅ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₅^d − γ̄₅)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₅^d − γ̄₅)²]^{3/2}
Skewness of sum average: x₄₆ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₆^d − γ̄₆)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₆^d − γ̄₆)²]^{3/2}
Skewness of maximum probability: x₄₇ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₇^d − γ̄₇)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₇^d − γ̄₇)²]^{3/2}
Skewness of cluster prominence: x₄₈ = [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₈^d − γ̄₈)³] / [(1/⌊S/2⌋) Σ_{d=1}^{⌊S/2⌋} (γ₈^d − γ̄₈)²]^{3/2}

which we use to compute the 48 SGLDM features listed in Tables 2.1-2.3.

In reality the SGLDM matrix f^1 (d = 1) contains the most useful information,
whereas the other SGLDM matrices f^d contain progressively less information
as d increases. Hence, we do not wish to keep all 8 co-occurrence features
from all ⌊S/2⌋ SGLDM matrices. Instead, we merely keep the raw
co-occurrence features from f^1. Nevertheless, the remaining SGLDM matrices
do contain important information, and by transforming the raw co-occurrence
features extracted from all ⌊S/2⌋ SGLDMs we gain considerable insight into the
distribution of these features when different values of d are considered (see Figure
2.2). For example, if we compute the homogeneity of f^1 we may find that homo-
geneity is high. Nonetheless, it can be of great benefit to know more
about the pdf of homogeneity sampled from the different f^d, d = 1, ..., ⌊S/2⌋.
Therefore, in addition to keeping the co-occurrence features from f^1, we propose
to also keep the mean, minimum, maximum, variance, mean absolute deviation,
and skewness of these co-occurrence features over all values of d (see Tables
2.1-2.3). These measures together yield a great deal of information about the dis-
tribution of these features as d varies.
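Summarizing one co-occurrence feature over d = 1, ..., ⌊S/2⌋ can be sketched as follows. This is illustrative Python, not the thesis's Matlab; the skewness shown is the standardized third moment, which is an assumption about the thesis's exact definition, and the mean absolute deviation is omitted from this sketch.

```python
import numpy as np

def feature_distribution_stats(gamma_d):
    """Summary statistics of one co-occurrence feature sampled over
    d = 1..floor(S/2), of the kind kept as features x9-x48."""
    g = np.asarray(gamma_d, dtype=float)
    mean = g.mean()
    var = np.mean((g - mean) ** 2)          # population variance
    skew = np.mean((g - mean) ** 3) / var ** 1.5 if var > 0 else 0.0
    return {"mean": mean, "min": g.min(), "max": g.max(),
            "variance": var, "skewness": skew}
```

A feature that decays smoothly with d (as homogeneity typically does in Figure 2.2) would show up here as a nonzero skewness even when its mean is unremarkable.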
2.2.2 Run Difference Method
The run difference method (RDM) [6], like the SGLDM, seeks to extract
texture features that describe the size and prominence of textural elements in
an image. The RDM is a generalized form of the gray level difference method
(GLDM) [17], which is based on the estimation of the pdf of gray level differences
in an image.

Figure 2.2: Plots of the eight SGLDM features as a function of d, which ranges from 1
to ⌊S/2⌋ = 11.
Once again, before an image is processed using the RDM, the gray levels
are binned so that the resulting image contains only eight gray levels. The
RDM then estimates the conditional bivariate pdf p(r, g_dif | θ), where θ ∈ Θ,
Θ = {0, π/4, π/2, 3π/4}, r ≥ 1 measures distance, and g_dif is the gray level
difference. Given θ, each element in the R × G matrix p^θ is the probability that
the absolute difference between the attenuation values of two pixels a distance
r apart from each other in the direction of θ is g_dif.
We propose to extend the RDM to the circular ROIs considered here the
same way we extended the SGLDM. Additionally, we propose to compute RDM
matrices for each of eight angles θ ∈ Θ, Θ = {0, π/8, π/4, 3π/8, π/2, 5π/8, 3π/4, 7π/8},
by rotating an inscribed square inside the ROI and using bilinear interpolation to
interpolate pixel values. We then sum the eight RDMs and normalize the re-
sulting RDM to get the matrix P,

P_{r,g_dif} = ( Σ_{θ ∈ Θ} p^θ_{r,g_dif} ) / ( Σ_r Σ_{g_dif} Σ_{θ ∈ Θ} p^θ_{r,g_dif} ),    (2.15)

from which we extract statistical measures of the distribution of gray level dif-
ferences.
Rather than extracting textural features directly from the matrix P, we
define three characteristic vectors that are used to define texture descriptors.
The distribution of gray level differences (DGD) vector is computed as follows:

DGD_{g_dif} = Σ_r P_{r,g_dif}.    (2.16)

The distribution of the average gray level difference given r is represented by
the DOD vector

DOD_r = Σ_{g_dif} g_dif P_{r,g_dif},    (2.17)

and the distribution of the average distance given g_dif is represented by the DAD
vector

DAD_{g_dif} = Σ_r r P_{r,g_dif}.    (2.18)
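Computed from a normalized RDM matrix P, the three characteristic vectors reduce to weighted row and column sums. This is an illustrative Python sketch; the row/column orientation of P (rows indexed by distance r, columns by gray level difference) is an assumption of this sketch.

```python
import numpy as np

def rdm_vectors(P):
    """DGD, DOD, and DAD vectors of a normalized RDM matrix P with
    rows r = 1..R and columns g_dif = 0..G-1, following the
    definitions above."""
    R, G = P.shape
    g = np.arange(G)            # gray level differences 0..G-1
    r = np.arange(1, R + 1)     # distances 1..R
    dgd = P.sum(axis=0)         # DGD: marginal over distance
    dod = P @ g                 # DOD_r: average difference at distance r
    dad = r @ P                 # DAD_{g_dif}: average distance at g_dif
    return dgd, dod, dad
```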
Five features that describe the distribution of gray level differences and are
defined from these characteristic vectors are:
1. Large difference emphasis, which measures the predominance of large gray
level differences;

2. Sharpness, which measures the contrast and definition in an image;
3. Second Moment of DGD, which measures the variation of gray level dif-
ferences;
4. Second Moment of DOD, which measures the variation of average gray
level differences; and
5. Long Distance Emphasis, which measures the prominence of large differ-
ences a long distance from each other.
The equations that define these texture descriptors are listed in Table 2.4.
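As a sketch (Python with NumPy) of how the characteristic vectors (2.16)-(2.18) and the five descriptors could be computed from a normalized matrix P: note that the exact weightings in x_49 and x_50 are difficult to recover from the scanned equations, so the log2(g_dif + 1) and g_dif² factors below are assumptions.

```python
import numpy as np

def rdm_features(P):
    """Texture descriptors from a normalized RDM matrix P of shape (R, G),
    where P[r, g] estimates Pr(distance index r, gray-level difference g)."""
    R, G = P.shape
    r = np.arange(1, R + 1)          # distances r = 1, ..., R
    g = np.arange(G)                 # gray-level differences g_dif = 0, ..., G-1

    dgd = P.sum(axis=0)              # DGD_gdif = sum_r P[r, gdif]        (2.16)
    dod = P @ g                      # DOD_r = sum_gdif gdif * P[r, gdif] (2.17)
    dad = r @ P                      # DAD_gdif = sum_r r * P[r, gdif]    (2.18)

    return {
        "large_difference_emphasis": np.sum(np.log2(g + 1) * dgd),  # x49 (assumed weighting)
        "sharpness": np.sum(g ** 2 * dgd),                          # x50 (assumed weighting)
        "second_moment_dgd": np.sum(dgd ** 2),                      # x51
        "second_moment_dod": np.sum(dod ** 2),                      # x52
        "long_distance_emphasis": np.sum(g * dad),                  # x53
    }
```

A uniform P (every gray-level difference equally likely at every distance) gives a small second moment of DGD, reflecting a texture with no dominant difference.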
Figure 2.3: Conversion of an image to 8 gray levels by binning and computing p_θ, an
RDM matrix for θ = 0.
2.2.3 Radial Gray Level Distribution Method
When a patient's disease is in an advanced stage, mosaic perfusion, honey-
combing, or nodularization may fill an entire ROI. However, when the disease

Table 2.4: RDM Features 49-53
Large difference emphasis: x_49 = Σ_{g_dif=0}^{G−1} log₂(g_dif + 1) · DGD_{g_dif}
Sharpness: x_50 = Σ_{g_dif=0}^{G−1} g_dif² · DGD_{g_dif}
Second moment of DGD: x_51 = Σ_{g_dif=0}^{G−1} DGD²_{g_dif}
Second moment of DOD: x_52 = Σ_{r=1}^{R} DOD²_r
Long distance emphasis: x_53 = Σ_{g_dif=0}^{G−1} g_dif · DAD_{g_dif}
has not yet become very severe, mosaic patterns, honeycombing, or nodularities
may be present in CT in smaller, more localized regions. When image areas
containing patterns of disease are smaller than the ROIs used to identify the
diseased patterns, healthy parenchyma may be contained within an ROI labeled
as diseased. Hence, from these ROIs it is important to extract textural features
that can recognize disease in spite of the possibility that a majority of the area
contained within the ROI is actually healthy lung.
Additionally, mucus-filled airways associated with CF may be quite small,
less than 10 pixels in diameter. Thus, it is inevitable that ROIs containing
airways also contain a large portion of healthy parenchyma. This, too, compels
us to extract textural features that can recognize mucus-filled airways even
when the ROIs containing them include a lot of healthy tissue. For example, in
Figure 2.4 the assumption made by the SGLDM and the RDM that the entire
ROI illustrates a particular textural pattern is quite evidently false.
Since the construction of the SGLDM and RDM relies on the assumption
that the texture contained within an ROI is uniform, features extracted from
the SGLDM or RDM are not reliable when a single ROI contains more than one

Figure 2.4: The mucus-filled airway (circled) in this image is not as large as its containing ROI.
textural pattern. Therefore, we introduce a new feature extraction method, the
Radial Gray Level Distribution Method (RGLDM). If we assume that in these
cases the diseased pattern is approximately centered in its containing ROI, we
can construct a set of radial features that describe the diseased pattern3. We
construct radial features as follows:
Given a maximum distance D ∈ {1, ..., ⌊S/2⌋}, we consider the circular area
defined by the equation (x − S/2)² + (y − S/2)² ≤ D², where (S/2, S/2) is the center
of the image. We divide this area into a annular areas by splitting the circle of
radius D into a annuli. When a = 1 we are left with the full circle, which is
not divided into annuli. As a increases and approaches D, the circle is divided
into more annuli, and if a = D then the circle is split into D annuli, each of
which is the thickness of a single pixel (see Figure 2.5). Let a_i ∈ {1, ..., D} be
the radius of the outer circle of the ith annulus, and define A_i = {(x, y) : a²_{i−1} <
3The assumption that the diseased pattern is approximately centered in its containing
ROI is indeed a reasonable assumption. This is because when a 2D CT image is processed via
overlapping region-by-region classification, any pattern that is not centered within a given
region will be centered within a neighboring region.

(x − S/2)² + (y − S/2)² ≤ a²_i}. In other words, each A_i is the set of points within
the ith annulus.
Figure 2.5: Division of the circular area of radius D into annuli. The sampled points are those that intersect with the red lines.
Because the RGLDM is designed for texture patterns that are rotationally
symmetric, we can sample points along predetermined directions instead of con-
sidering all points within each annular region. Rather than using every point
within each annulus, we prune each A_i by sampling sixteen points a distance d
from the central pixel, for d = 1, ..., D. These are the points that intersect with
the red lines in Figure 2.5. We then compute a series of statistics from the A_i:
1. Mean, which measures the average gray level:

   Ā_i = (1/|A_i|) Σ_{j=1}^{|A_i|} A_ij;

2. Variance, which measures variation of gray levels but emphasizes larger
deviations from the mean:

   p_i1 = (1/|A_i|) Σ_{j=1}^{|A_i|} (A_ij − Ā_i)²;

3. Standard deviation, which measures variation of gray levels:

   p_i2 = √p_i1;

4. Mean absolute deviation, another measure of gray level variation:

   p_i3 = (1/|A_i|) Σ_{j=1}^{|A_i|} |A_ij − Ā_i|;

5. Skewness, which measures the skewness of the gray level distribution:

   p_i4 = [(1/|A_i|) Σ_{j=1}^{|A_i|} (A_ij − Ā_i)³] / [(1/|A_i|) Σ_{j=1}^{|A_i|} (A_ij − Ā_i)²]^{3/2};

6. Kurtosis, which measures the peakedness of the gray level distribution,
thereby measuring whether variance results from infrequent extreme devi-
ations or frequent modestly sized deviations:

   p_i5 = [(1/|A_i|) Σ_{j=1}^{|A_i|} (A_ij − Ā_i)⁴] / [(1/|A_i|) Σ_{j=1}^{|A_i|} (A_ij − Ā_i)²]².

We then compute a series of textural features using p_i1, p_i2, p_i3, p_i4, and
p_i5. These basic RGLDM features, which are listed in Tables 2.5-2.6, quantify
properties of attenuation values within the A_i.
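The six per-annulus statistics can be sketched directly; the function below takes the sampled gray levels of one annulus and returns p_i1 through p_i5 (the mean Ā_i is used internally).

```python
import numpy as np

def annulus_stats(a):
    """Five gray-level statistics for one annulus, given its sampled values a."""
    a = np.asarray(a, dtype=float)
    mean = a.mean()                  # A-bar_i: mean gray level
    dev = a - mean
    var = np.mean(dev ** 2)          # p_i1: variance
    sd = np.sqrt(var)                # p_i2: standard deviation
    mad = np.mean(np.abs(dev))       # p_i3: mean absolute deviation
    # Guard against a perfectly flat annulus, where var = 0.
    skew = np.mean(dev ** 3) / var ** 1.5 if var > 0 else 0.0   # p_i4: skewness
    kurt = np.mean(dev ** 4) / var ** 2 if var > 0 else 0.0     # p_i5: kurtosis
    return var, sd, mad, skew, kurt
```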
In addition to these features we also compute additional texture descriptors.
Many ROIs containing mucus-filled airways, nodules, or other patterns contain
two or more patterns: one pattern in the center of the ROI and other patterns
surrounding this central area of the ROI. In order to identify and correctly
classify these ROIs, we extract several additional features that describe how p_i1,
p_i2, p_i3, p_i4, and p_i5 vary as the radial distance from the central pixel varies. We

Table 2.5: RGLDM Basic Features 54-74
Average variance: x_54 = (1/5) Σ_{i=1}^{5} p_i1
Minimum variance: x_55 = min_{i=1,...,5} p_i1
Maximum variance: x_56 = max_{i=1,...,5} p_i1
Range of variance: x_57 = max_{i=1,...,5} p_i1 − min_{i=1,...,5} p_i1
Median variance: x_58 = med_{i=1,...,5} p_i1
Standard deviation (SD) of variance: x_59 = [(1/5) Σ_{i=1}^{5} (p_i1 − x_54)²]^{1/2}
Mean absolute deviation (MAD) of variance: x_60 = (1/5) Σ_{i=1}^{5} |p_i1 − x_54|
Average SD: x_61 = (1/5) Σ_{i=1}^{5} p_i2
Minimum SD: x_62 = min_{i=1,...,5} p_i2
Maximum SD: x_63 = max_{i=1,...,5} p_i2
Range of SD: x_64 = max_{i=1,...,5} p_i2 − min_{i=1,...,5} p_i2
Median SD: x_65 = med_{i=1,...,5} p_i2
SD of SD: x_66 = [(1/5) Σ_{i=1}^{5} (p_i2 − x_61)²]^{1/2}
MAD of SD: x_67 = (1/5) Σ_{i=1}^{5} |p_i2 − x_61|
Average MAD: x_68 = (1/5) Σ_{i=1}^{5} p_i3
Minimum MAD: x_69 = min_{i=1,...,5} p_i3
Maximum MAD: x_70 = max_{i=1,...,5} p_i3
Range of MAD: x_71 = max_{i=1,...,5} p_i3 − min_{i=1,...,5} p_i3
Median MAD: x_72 = med_{i=1,...,5} p_i3
SD of MAD: x_73 = [(1/5) Σ_{i=1}^{5} (p_i3 − x_68)²]^{1/2}
MAD of MAD: x_74 = (1/5) Σ_{i=1}^{5} |p_i3 − x_68|

Table 2.6: RGLDM Basic Features 75-88
Average skewness: x_75 = (1/5) Σ_{i=1}^{5} p_i4
Minimum skewness: x_76 = min_{i=1,...,5} p_i4
Maximum skewness: x_77 = max_{i=1,...,5} p_i4
Range of skewness: x_78 = max_{i=1,...,5} p_i4 − min_{i=1,...,5} p_i4
Median skewness: x_79 = med_{i=1,...,5} p_i4
SD of skewness: x_80 = [(1/5) Σ_{i=1}^{5} (p_i4 − x_75)²]^{1/2}
MAD of skewness: x_81 = (1/5) Σ_{i=1}^{5} |p_i4 − x_75|
Average kurtosis: x_82 = (1/5) Σ_{i=1}^{5} p_i5
Minimum kurtosis: x_83 = min_{i=1,...,5} p_i5
Maximum kurtosis: x_84 = max_{i=1,...,5} p_i5
Range of kurtosis: x_85 = max_{i=1,...,5} p_i5 − min_{i=1,...,5} p_i5
Median kurtosis: x_86 = med_{i=1,...,5} p_i5
SD of kurtosis: x_87 = [(1/5) Σ_{i=1}^{5} (p_i5 − x_82)²]^{1/2}
MAD of kurtosis: x_88 = (1/5) Σ_{i=1}^{5} |p_i5 − x_82|

begin by defining Δ_k = {|p_ik − p_jk| : i = 1, ..., 5, j = 1, ..., 5, i ≠ j} for k = 1, ..., 5.
Δ_1 is the set of differences among the variances for the different annuli, Δ_2 is
the set of differences among standard deviations for the different annuli, and so
on. For δ = 1, ..., |Δ_k|, let Δ_kδ denote the δth element of Δ_k. We then use the
Δ_k to define the RGLDM difference features listed in Tables 2.7-2.8.
2.3 Feature Selection
We have 123 features in total, and many of these features are highly corre-
lated. Also, some statistical classification models, such as support vector clas-
sifiers, may suffer from the curse of dimensionality [10]. Thus, rather than
using all 123 features to perform classification, we identify only those features
that yield the best discriminatory power in classifying these images. We employ
a forward feature selection procedure to select useful feature variables from the
large feature space detailed in Section 2.2.
In forward feature selection, the classification model begins with no pre-
dictors and subsequently adds to the model features that improve classification
accuracy. When a feature is added to the model, its effect on the separation
of the C classes can be analyzed by one of many criteria, such as correlation,
distance, or entropy measures. Many commonly used feature selection criteria
are functions of the misclassification error, which is defined here as

E = (1/N) Σ_{i=1}^{N} I(y_i ≠ t_i),

where N is the number of observations, y_i is the predicted class for the ith
observation, and t_i is the target class for the ith observation.

Table 2.7: RGLDM Difference Features 89-109
Average difference of variance: x_89 = (1/|Δ_1|) Σ_{δ=1}^{|Δ_1|} Δ_1δ
Minimum difference of variance: x_90 = min_δ Δ_1δ
Maximum difference of variance: x_91 = max_δ Δ_1δ
Range of difference of variance: x_92 = max_δ Δ_1δ − min_δ Δ_1δ
Median difference of variance: x_93 = med_δ Δ_1δ
SD of difference of variance: x_94 = [(1/|Δ_1|) Σ_{δ=1}^{|Δ_1|} (Δ_1δ − x_89)²]^{1/2}
MAD of difference of variance: x_95 = (1/|Δ_1|) Σ_{δ=1}^{|Δ_1|} |Δ_1δ − x_89|
Average difference of SD: x_96 = (1/|Δ_2|) Σ_{δ=1}^{|Δ_2|} Δ_2δ
Minimum difference of SD: x_97 = min_δ Δ_2δ
Maximum difference of SD: x_98 = max_δ Δ_2δ
Range of difference of SD: x_99 = max_δ Δ_2δ − min_δ Δ_2δ
Median difference of SD: x_100 = med_δ Δ_2δ
SD of difference of SD: x_101 = [(1/|Δ_2|) Σ_{δ=1}^{|Δ_2|} (Δ_2δ − x_96)²]^{1/2}
MAD of difference of SD: x_102 = (1/|Δ_2|) Σ_{δ=1}^{|Δ_2|} |Δ_2δ − x_96|
Average difference of MAD: x_103 = (1/|Δ_3|) Σ_{δ=1}^{|Δ_3|} Δ_3δ
Minimum difference of MAD: x_104 = min_δ Δ_3δ
Maximum difference of MAD: x_105 = max_δ Δ_3δ
Range of difference of MAD: x_106 = max_δ Δ_3δ − min_δ Δ_3δ
Median difference of MAD: x_107 = med_δ Δ_3δ
SD of difference of MAD: x_108 = [(1/|Δ_3|) Σ_{δ=1}^{|Δ_3|} (Δ_3δ − x_103)²]^{1/2}
MAD of difference of MAD: x_109 = (1/|Δ_3|) Σ_{δ=1}^{|Δ_3|} |Δ_3δ − x_103|

Table 2.8: RGLDM Difference Features 110-123
Average difference of skewness: x_110 = (1/|Δ_4|) Σ_{δ=1}^{|Δ_4|} Δ_4δ
Minimum difference of skewness: x_111 = min_δ Δ_4δ
Maximum difference of skewness: x_112 = max_δ Δ_4δ
Range of difference of skewness: x_113 = max_δ Δ_4δ − min_δ Δ_4δ
Median difference of skewness: x_114 = med_δ Δ_4δ
SD of difference of skewness: x_115 = [(1/|Δ_4|) Σ_{δ=1}^{|Δ_4|} (Δ_4δ − x_110)²]^{1/2}
MAD of difference of skewness: x_116 = (1/|Δ_4|) Σ_{δ=1}^{|Δ_4|} |Δ_4δ − x_110|
Average difference of kurtosis: x_117 = (1/|Δ_5|) Σ_{δ=1}^{|Δ_5|} Δ_5δ
Minimum difference of kurtosis: x_118 = min_δ Δ_5δ
Maximum difference of kurtosis: x_119 = max_δ Δ_5δ
Range of difference of kurtosis: x_120 = max_δ Δ_5δ − min_δ Δ_5δ
Median difference of kurtosis: x_121 = med_δ Δ_5δ
SD of difference of kurtosis: x_122 = [(1/|Δ_5|) Σ_{δ=1}^{|Δ_5|} (Δ_5δ − x_117)²]^{1/2}
MAD of difference of kurtosis: x_123 = (1/|Δ_5|) Σ_{δ=1}^{|Δ_5|} |Δ_5δ − x_117|

Nevertheless, using the misclassification error as a criterion for evaluating a
classification model may not be ideal. For example, consider a two-class classifi-
cation problem with binary targets and 100 observations in each class. Suppose
that adding one feature to a classification model leads to correct classification of
75% of the observations from each class. Adding a different feature to the model
may lead to correct classification of 100% of the observations from one class and
50% of the observations from the second class. Both of these models are asso-
ciated with a misclassification error of .25. However, the second model achieves
perfect classification of one class and is probably preferable to the first. Other
criteria, such as maximizing the Jaccard Index or minimizing cross entropy or
deviance, pick up on this subtlety. The Jaccard Index is higher for the second
model than for the first, and the deviance is lower for the second model than
the first.
Therefore, in addition to selecting features that minimize the misclassifica-
tion error, we select features that maximize the Jaccard Index, which is defined as

J = (1/C) Σ_{c=1}^{C} x_cc / (Σ_{j=1}^{C} x_cj + Σ_{j=1}^{C} x_jc − x_cc).

Here C is the number of classes and x is the confusion matrix associated with
the classification model. The ijth element of x is the proportion of patterns
belonging to class i that were assigned by the classifier to class j. Rather than
considering only the overall classification accuracy, the Jaccard Index takes into
account both the sensitivity and the specificity of the model. J ranges from 0
to 1; as the rate of true positives increases and the rates of false positives and

false negatives decrease, J approaches 1. Thus, maximizing J corresponds to
maximizing sensitivity and specificity rates for all classes simultaneously. An
optimal classification model is associated with a value for J as close to 1 as possible.
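A sketch of the index computed from a confusion matrix x (counts or proportions both work, since the ratio is scale-invariant), assuming the standard per-class Jaccard score x_cc / (Σ_j x_cj + Σ_j x_jc − x_cc) averaged over classes:

```python
import numpy as np

def jaccard_index(x):
    """Average per-class Jaccard score from a C x C confusion matrix x,
    where x[i, j] counts class-i patterns assigned to class j."""
    x = np.asarray(x, dtype=float)
    C = x.shape[0]
    # Per class: true positives over (row sum + column sum - true positives).
    scores = [x[c, c] / (x[c, :].sum() + x[:, c].sum() - x[c, c])
              for c in range(C)]
    return sum(scores) / C
```

A perfect classifier (diagonal confusion matrix) yields J = 1; a classifier that splits every class evenly yields a much lower J, even when overall accuracy looks acceptable.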
In performing forward feature selection using J as the criterion to be maxi-
mized, the P features are entered into the model one at a time, and the feature
variable that causes the largest increase in J is included in the feature set. Once
the single feature that yields the model with the highest J has been identified
and added to the classification model, the search continues for a second feature
that leads to the highest increase in J. This process continues until including
additional features in the model reduces the model's stability and causes J to
decrease. At each step of the forward selection routine, we estimate J using
leave-one-out cross-validation. Similarly, we perform forward feature selection
using E as the criterion to be minimized to determine whether one of these two
criteria is preferable to the other.
Although using cross-validation when evaluating a model should lead to
an unbiased value for J and E, the model resulting from the forward selection
routine may lead to over-fitting. When constructing a statistical classification
model, we desire to avoid both over- and under-fitting the model. A com-
mon way of avoiding both over- and under-fitting is by penalizing the selection
criterion as the number of predictors increases. Rather than minimizing the
misclassification error we propose to minimize the penalized misclassification
error

E′ = (1/N) Σ_{i=1}^{N} I(y_i ≠ t_i) + k log p,  (2.27)

where p is the number of predictors that have been included in the model and k
is a tuning parameter that controls the tradeoff between over- and under-fitting.
Here we use the natural log of p rather than p itself to prevent under-fitting
of the model. Typical values for k are k ∈ [0, .05]. When k = 0, E′ reduces
to the un-penalized misclassification error E, and as k increases the number of
predictors is penalized more heavily. An optimal value for k is determined via
cross-validation.
Similarly, rather than maximizing the Jaccard Index we propose to maximize
the penalized Jaccard Index

J′ = (1/C) Σ_{c=1}^{C} x_cc / (Σ_{j=1}^{C} x_cj + Σ_{j=1}^{C} x_jc − x_cc) − k log p.  (2.28)
For simplicity of notation, for the remainder of this paper we let E = E'
and J = J', the penalized selection criteria. We perform forward selection with
both criteria to determine which leads to better classification models.
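The greedy loop can be sketched as follows; cv_jaccard is a hypothetical stand-in for a leave-one-out cross-validated estimate of J on a given feature subset, and the k·log p penalty mirrors (2.28):

```python
import math

def forward_select(all_features, cv_jaccard, k=0.01):
    """Greedy forward selection maximizing the penalized Jaccard index J'."""
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Score each candidate added to the current subset: J' = J - k * log p.
        scored = [(cv_jaccard(selected + [f]) - k * math.log(len(selected) + 1), f)
                  for f in candidates]
        score, feat = max(scored)
        if score <= best:            # stop when J' no longer increases
            break
        best, selected = score, selected + [feat]
    return selected
```

Swapping cv_jaccard for a negated cross-validated E (and the sign conventions accordingly) gives the E′-driven variant described above.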
2.4 Feature Classification
There is a wealth of statistical learning methods that can be applied to
this image classification problem. In this paper we consider a diverse set of
algorithms: logistic regression, linear discriminant analysis, support vector ma-
chines, classification trees, and probabilistic neural networks.
2.4.1 Logistic Regression
The logistic regression (LR) model [2, 10] is very popular in medicine be-
cause it can be used to model probabilities associated with the binomial and

Figure 2.6: This simple plot shows that a logistic curve with variance of homogeneity
(x_36) as a predictor may lead to good separation between healthy and diseased images.
multinomial distributions. Outcome variables like survival versus death or the
presence or absence of a particular condition are binomial-distributed, and if
a logistic, or S-shaped, decision boundary divides the data well, then logistic
regression is often the classification method of choice. Figure 2.6 illustrates,
with one example, the fact that an S-shaped decision boundary is somewhat
suited for this classification problem. Logistic regression models the posterior
probabilities of the C classes with linear functions of X and ensures that the
basic properties of probabilities are satisfied: that these probabilities remain in
the interval [0,1] and sum to 1.
If we let X_i denote the ith observation, a vector composed of a 1 to accommo-
date the intercept and the 123 features described in Section 2.2, then the logistic
regression model is specified in terms of C − 1 log-odds, or logit, transformations:

log[ P(y = 1 | X = X_i) / P(y = C | X = X_i) ] = β_1ᵀ X_i
log[ P(y = 2 | X = X_i) / P(y = C | X = X_i) ] = β_2ᵀ X_i
⋮
log[ P(y = C − 1 | X = X_i) / P(y = C | X = X_i) ] = β_{C−1}ᵀ X_i.

Here each β_c is a vector of p + 1 coefficients. Then, for c = 1, ..., C − 1,

P(y = c | X = X_i) = exp(β_cᵀ X_i) / (1 + Σ_{l=1}^{C−1} exp(β_lᵀ X_i)),

and

P(y = C | X = X_i) = 1 / (1 + Σ_{l=1}^{C−1} exp(β_lᵀ X_i)).
One can easily verify that these probabilities sum to 1.
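The check is immediate numerically; the coefficients and observation below are made-up values for C = 3 classes:

```python
import numpy as np

# Made-up coefficients for C = 3 classes (C - 1 = 2 coefficient vectors)
# and one observation X_i with the intercept term 1 prepended.
beta = np.array([[0.5, -1.2, 0.3],      # beta_1
                 [-0.4, 0.8, 1.1]])     # beta_2
x_i = np.array([1.0, 0.7, -0.2])        # [1, features...]

scores = np.exp(beta @ x_i)             # exp(beta_c^T X_i), c = 1, ..., C - 1
denom = 1.0 + scores.sum()
probs = np.append(scores / denom, 1.0 / denom)   # P(y = c | X = X_i), c = 1..C

assert np.isclose(probs.sum(), 1.0)     # probabilities sum to 1 by construction
```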
Usually logistic regression models are fit by maximum likelihood using the
conditional distribution of y given X, Pr(y = c | X = X_i). Assuming that the
observations come from the multinomial distribution, the log-likelihood for N
observations is

ℓ(β) = Σ_{i=1}^{N} log Pr(y = y_i | X = X_i, β).

Consider the two-class case where the response, y_i, is binary. The log-
likelihood can then be written as

ℓ(β) = Σ_{i=1}^{N} { y_i log Pr(y = 1 | X = X_i, β) + (1 − y_i) log(1 − Pr(y = 1 | X = X_i, β)) },

which simplifies to

ℓ(β) = Σ_{i=1}^{N} { y_i βᵀX_i − log(1 + exp(βᵀX_i)) }.

We set the derivatives of ℓ(β) to 0 in order to maximize the log-likelihood:

∂ℓ(β)/∂β = Σ_{i=1}^{N} X_i (y_i − Pr(y = 1 | X = X_i, β)) = 0.  (2.35)

Recall that β is a vector with p + 1 coefficients. Thus, equation (2.35) gives us
p + 1 equations that are nonlinear in β to be solved for β. We use the Newton-
Raphson algorithm to solve (2.35). To do so, we first compute the Hessian

∂²ℓ(β)/∂β∂βᵀ = − Σ_{i=1}^{N} X_i X_iᵀ Pr(y = 1 | X = X_i, β)(1 − Pr(y = 1 | X = X_i, β)).

Given initial values β^old (typically β^old = 0), the Newton-Raphson update
for β is

β^new = β^old − (∂²ℓ(β)/∂β∂βᵀ)^{−1} ∂ℓ(β)/∂β,

where these derivatives are evaluated at β^old. Although the convergence of this
algorithm is not guaranteed, the algorithm usually converges since the log-
likelihood is concave.
Before extending the Newton-Raphson method to the multiclass case we re-
express the algorithm in matrix notation for simplification. Let Y be the vector
of y_i values, X be the N × (p + 1) matrix of X_i values, P the vector of probabilities
with ith element Pr(y = 1 | X = X_i, β), and W an N × N diagonal matrix of weights with
ith diagonal element Pr(y = 1 | X = X_i, β)(1 − Pr(y = 1 | X = X_i, β)). Then

∂ℓ(β)/∂β = Xᵀ(Y − P)  and  ∂²ℓ(β)/∂β∂βᵀ = −XᵀWX.

The Newton-Raphson step is then

β^new = β^old + (XᵀWX)^{−1}Xᵀ(Y − P)
      = (XᵀWX)^{−1}XᵀW(Xβ^old + W^{−1}(Y − P))  (2.40)
      = (XᵀWX)^{−1}XᵀWZ,

where Z = Xβ^old + W^{−1}(Y − P). These equations are solved repeatedly for Z
and β until convergence occurs.
For the multiclass case where C > 2, the Newton-Raphson algorithm re-
quires a vector of C 1 responses and a nondiagonal weight matrix for each
observation. Alternatively, coordinate-descent methods may be more efficient
for log-likelihood maximization [10].
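A minimal sketch of the two-class Newton-Raphson (IRLS) iteration in the matrix form (2.40); X is assumed to carry an intercept column and y is binary:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Two-class logistic regression via Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Clip the linear predictor purely for numerical safety.
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        W = p * (1.0 - p)                        # diagonal of the weight matrix
        # beta_new = beta + (X^T W X)^{-1} X^T (y - p)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta
```

Fitted probabilities are then 1/(1 + exp(−Xβ)), and the iteration matches the weighted least squares reading of (2.40).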
Logistic regression is a popular tool for data analysis and classification when
the objective includes inference. Logistic regression models are easily inter-
pretable, and the impact that each feature has on the outcome can be deter-
mined by looking at the coefficients β_j. For the two-class problem, for example,
for j = 2, ..., p + 1, if β_j = 0 then there is no relationship between the outcome
and the jth predictor; if β_j > 0 then the log-odds of classifying to class 1 is an
increasing function of the jth predictor; and if β_j < 0 then the log-odds of clas-
sifying to class 1 is a decreasing function of the jth predictor. The magnitude
of each β_j represents the change in log-odds of classifying to class 1 for a unit
increase in the jth predictor. In the case of a multiclass problem, interpretation
of the coefficients β is similar. Because there are more decision boundaries and
there is a vector β for each decision boundary, there are merely more coefficients
to be interpreted on a class-by-class basis. We use the multinom function in the
nnet [22] package in R to perform LR classification (see Appendix A for code).
2.4.2 Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a classification method that trans-
forms the input variables using linear combinations and then finds linear deci-
sion boundaries in this transformed feature space. LDA, a classic but powerful
classification tool, was among the top three classifiers for 7 of the 22 datasets in
the STATLOG project4 [16]. Although LDA relies on the assumption that the
data be approximately Gaussian in distribution and that the class covariances
be approximately equal, LDA can perform well even when these assumptions
are not met [10].
Consider a multiclass problem with C classes, and let X_i represent the ith p-
dimensional observation vector belonging to class y_i. The posterior probability
of observing class c is given by Bayes Theorem:

Pr(y = c | X = X_i) = Pr(X = X_i | y = c) Pr(y = c) / Σ_{c′=1}^{C} Pr(X = X_i | y = c′) Pr(y = c′),  (2.41)
4Other classifiers that were considered include: quadratic discriminant analysis, k-nearest
neighbors, naive Bayes, classification tree method C4.5, backpropagation, and many more.

where the prior probability of observing class c is

Pr(y = c) = (1/N) Σ_{i=1}^{N} I(y_i = c).

Bayes Theorem (2.41) shows that knowing the prior probabilities gives
enough information to perform classification if the pdf of X is known. Sup-
pose that each class pdf is modeled as multivariate Gaussian:

f_c(X) = (2π)^{−p/2} |Σ_c|^{−1/2} exp(−½ (X − μ_c)ᵀ Σ_c^{−1} (X − μ_c)).

Linear discriminant analysis (in contrast to quadratic discriminant analysis
(QDA)) arises in the case when we assume that all classes share a common
covariance matrix, i.e., Σ_c = Σ ∀c. Although in the image classification problem
considered here the three classes may very well not share a common covariance
matrix, performing LDA instead of QDA typically does not lead to significant
differences, and LDA is usually conveniently substituted for QDA because the
number of parameters to estimate is so much smaller [10]. In the case when we
assume that there is a common covariance matrix, the normalization factors
cancel, simplifying computations considerably. Rather than considering the ratio
of the posterior probabilities of two classes we can consider the log-ratio, which
turns out to be linear in X:

log[ Pr(y = c | X = X_i) / Pr(y = c′ | X = X_i) ]
  = log[ Pr(X = X_i | y = c) / Pr(X = X_i | y = c′) ] + log[ Pr(y = c) / Pr(y = c′) ]
  = log[ Pr(y = c) / Pr(y = c′) ] − ½ (μ_c + μ_{c′})ᵀ Σ^{−1} (μ_c − μ_{c′}) + Xᵀ Σ^{−1} (μ_c − μ_{c′}),  (2.44)

where the first equality is due to Bayes Theorem (2.41). The fact that the
log-odds function is linear in X implies that the decision boundary between

classes c and c′, the set where Pr(y = c | X = X_i) = Pr(y = c′ | X = X_i), is a
hyperplane, so all decision boundaries are linear. From (2.44) we see that the
linear discriminant functions

δ_c(X) = Xᵀ Σ^{−1} μ_c − ½ μ_cᵀ Σ^{−1} μ_c + log Pr(y = c),  (2.45)

with ĉ(X) = argmax_c δ_c(X), produce the LDA decision rule.
In practice, though, we do not know the parameters of the Gaussian distri-
butions. We use training data to estimate the parameters:

P̂r(y = c) = N_c / N, where N_c is the number of training observations from
class c;

μ̂_c = (1/N_c) Σ_{y_i = c} X_i;

Σ̂ = Σ_{c=1}^{C} Σ_{y_i = c} (X_i − μ̂_c)(X_i − μ̂_c)ᵀ / (N − C).
The computations for LDA are simplified by diagonalizing Σ̂. We compute
the eigendecomposition Σ̂ = UDUᵀ, where U is p × p and orthonormal and D
is a diagonal matrix of the positive eigenvalues λ_1, ..., λ_p of Σ̂. The data is sphered via
X* = D^{−1/2}UᵀX so that the common covariance estimate of X* is the identity
matrix. Then (2.45) reduces to

δ_c(X*) = X*ᵀ μ̂*_c − ½ μ̂*_cᵀ μ̂*_c + log P̂r(y = c),  (2.46)

where μ̂*_c is the sphered class centroid, and an observation X_i is classified to the class c that maximizes δ_c(X*_i). We use
the lda function in the MASS [31] package in R to perform LDA classification
(see Appendix A for code).
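The plug-in discriminant rule (2.45) can be sketched without the sphering step (which only simplifies computation):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate LDA parameters: priors, class means, pooled covariance."""
    classes = np.unique(y)
    N, C = len(y), len(classes)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # Pooled within-class covariance with divisor N - C.
    S = sum(((X[y == c] - means[c]).T @ (X[y == c] - means[c])) for c in classes)
    return priors, means, S / (N - C)

def lda_predict(Xnew, priors, means, sigma):
    """Classify each row of Xnew by the maximal linear discriminant delta_c."""
    inv = np.linalg.inv(sigma)
    def delta(x, c):
        # delta_c(x) = x^T Sigma^{-1} mu_c - 0.5 mu_c^T Sigma^{-1} mu_c + log prior
        return x @ inv @ means[c] - 0.5 * means[c] @ inv @ means[c] + np.log(priors[c])
    return np.array([max(priors, key=lambda c: delta(x, c)) for x in Xnew])
```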

It is clear that, unlike in LR, in LDA the decision boundary is determined by
the covariance of the class distributions and the positions of the class centroids.
Thus, all points, not just the points near the decision boundary, play a role in
shaping the decision boundary.
2.4.3 Support Vector Machine
The support vector machine (SVM) [10] performs nonlinear classification by
constructing linear boundaries in a transformed feature space. For simplicity we
will first suppose that we have two linearly separable classes, which we would like
to separate with an optimal separating hyperplane. The training data consists
of N observations X_i ∈ ℝ^{p+1} belonging to the classes y_i ∈ {−1, 1}. Define a
hyperplane by

{X : f(X) = Xᵀβ = 0},  (2.47)

where ‖β‖ = 1. A classification rule based on this separating hyperplane is

c(X) = sign(Xᵀβ).  (2.48)

Since the classes are separable and y_i ∈ {−1, 1} we can find a function f(X)
such that f(X) = Xᵀβ with y_i f(X_i) > 0 ∀i. We can find the hyperplane that
creates the greatest margin between the training points for the two classes by
solving the optimization problem

max_{β : ‖β‖=1} Λ

subject to

Figure 2.7: An SVM classifier for a two-class problem in the case that the classes are separable.
y_i(X_iᵀβ) ≥ Λ, i = 1, ..., N,  (2.50)

where Λ is half of the separating margin. If we use the hyperplane that solves
this optimization problem we will end up with a margin 2Λ units wide separating
the two classes, the largest margin possible (see Figure 2.7). If we let Λ = 1/‖β‖,
this optimization problem can be rephrased as the minimization problem

min ‖β‖  (2.51)

subject to

y_i(X_iᵀβ) ≥ 1, i = 1, ..., N.  (2.52)
This is a quadratic optimization problem with linear inequality constraints.
In most cases, however, the classes are not perfectly separable; the classes
overlap in feature space. This overlap can be dealt with by defining slack vari-

Figure 2.8: An SVM classifier (Xᵀβ = 0) for a two-class problem in the case that the classes are not separable.
ables ξ = (ξ_1, ξ_2, ..., ξ_N) and then maximizing 2Λ, the margin between the two
classes. The constraint in (2.50) can then be reformulated as

y_i(X_iᵀβ) ≥ Λ(1 − ξ_i) ∀i,  (2.53)

where ξ_i ≥ 0 ∀i and Σ_i ξ_i ≤ ε. The value ξ_i in the constraint (2.53) is the
proportion of Λ by which the prediction f(X_i) = X_iᵀβ is on the wrong side
of the margin. In other words, the points that are on the wrong side of the
margin are a distance Λξ_i from the margin (see Figure 2.8). Misclassifications
occur when ξ_i > 1, so by bounding Σ_i ξ_i at a value ε we are bounding the total
number of misclassifications allowed during training. Like in logistic regression,
the data points well inside their class boundary have little impact on the decision
boundary, and only those data points near the decision boundary influence the
decision boundary.
Once again, if we define Λ = 1/‖β‖ the optimization problem is equivalent to

min ‖β‖  (2.54)

subject to

y_i(X_iᵀβ) ≥ 1 − ξ_i ∀i,  ξ_i ≥ 0,  Σ_i ξ_i ≤ ε.  (2.55)

This problem (2.55) is a convex optimization problem, and a quadratic pro-
gramming solution can be found using Lagrange multipliers. We can re-express
(2.55) as

min ½‖β‖² + δ Σ_{i=1}^{N} ξ_i  (2.56)

subject to

ξ_i ≥ 0,  y_i(X_iᵀβ) ≥ 1 − ξ_i ∀i.  (2.57)

Here the cost parameter δ replaces ε in equation (2.55). The case when the two
classes are perfectly separable corresponds to δ = ∞.
The Lagrange primal function is

L_P = ½‖β‖² + δ Σ_{i=1}^{N} ξ_i − Σ_{i=1}^{N} α_i [y_i(X_iᵀβ) − (1 − ξ_i)] − Σ_{i=1}^{N} μ_i ξ_i,  (2.58)

which we minimize with respect to β and ξ_i. By setting the respective derivatives
equal to zero we find that

β = Σ_{i=1}^{N} α_i y_i X_i,  (2.59)

0 = Σ_{i=1}^{N} α_i y_i,  (2.60)

α_i = δ − μ_i ∀i,  (2.61)
where α_i, μ_i, ξ_i ≥ 0 ∀i. Substituting (2.59)-(2.61) into (2.58) gives the La-
grangian dual objective function

L_D = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{i′=1}^{N} α_i α_{i′} y_i y_{i′} X_iᵀX_{i′}.  (2.62)

We maximize L_D subject to the constraints that 0 ≤ α_i ≤ δ and Σ_{i=1}^{N} α_i y_i =
0. The Karush-Kuhn-Tucker conditions lead to three more constraints:
α_i [y_i(X_iᵀβ) − (1 − ξ_i)] = 0,  (2.63)

μ_i ξ_i = 0,  (2.64)

y_i(X_iᵀβ) − (1 − ξ_i) ≥ 0,  (2.65)

∀i. Taken together, the constraints (2.59)-(2.65) uniquely characterize the solu-
tion to the primal and dual objective functions. Maximizing the dual function is
a simpler convex quadratic programming problem than minimizing the primal
function, and it can be solved with standard optimization techniques [10].
Once the solution vector β̂ has been found by solving this quadratic opti-
mization problem, the decision function can be written as

ĉ(X) = sign(Xᵀβ̂).  (2.66)

The optimal value for the tuning parameter, δ, is determined via cross-validation.
A more flexible form of the SVM classifier enlarges the feature space using
basis expansions such as polynomials or splines. Enlarging the feature space
typically leads to better classification and corresponds to nonlinear decision
boundaries in the original feature space. Once the basis functions h_m(X), m =
1, ..., M are selected, we fit the SVM classifier using input features h(X_i) =
(h_1(X_i), ..., h_M(X_i)), i = 1, ..., N, and produce the function f(X) = h(X)ᵀβ.
The decision function is still defined as c(X) = sign(f(X)).
For the transformed feature vectors h(X_i), the Lagrange dual function can
be reformulated as

L_D = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{i′=1}^{N} α_i α_{i′} y_i y_{i′} ⟨h(X_i), h(X_{i′})⟩.  (2.67)

From (2.59) we find that f(X) = h(X)ᵀβ can be rewritten as

f(X) = Σ_{i=1}^{N} α_i y_i ⟨h(X), h(X_i)⟩,  (2.68)

where α_i ∈ [0, δ] ∀i.
It is clear that both (2.67) and (2.68) involve h(X) only through inner
products. Thus, there is no need to specify the transformation h(X) at all, and
all we need to specify is a kernel function

K(X, X′) = ⟨h(X), h(X′)⟩.

There are three common choices for K:

Polynomial kernel: K(X, X′) = (1 + ⟨X, X′⟩)^deg
Radial basis kernel: K(X, X′) = exp(−γ‖X − X′‖²)
Sigmoid kernel: K(X, X′) = tanh(κ_1⟨X, X′⟩ + κ_2).

The radial basis kernel sometimes produces the closest decision boundary
to the Bayes optimal decision boundary, and it is the most commonly used kernel
[10]. This is the kernel that we use here. We use the svm function in the e1071
[7] package in R to construct SVM classifiers (see Appendix A for code).
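The thesis's experiments use R's e1071; as a language-neutral sketch, the radial basis kernel and the kernel decision function (2.68) can be written directly (the multipliers α_i are taken as given here, as if returned by a dual solver):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Radial basis kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def decision_function(Xnew, Xsv, y_sv, alpha, gamma=0.5):
    """f(X) = sum_i alpha_i y_i K(X, X_i), as in (2.68); classify by sign(f)."""
    return rbf_kernel(Xnew, Xsv, gamma) @ (alpha * y_sv)
```

Because only kernel evaluations appear, the transformation h(X) itself never needs to be formed, which is exactly the point of the kernel trick described above.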
2.4.4 Classification Trees
Classification trees (CLTs) [4, 10] partition the feature space into a set of
cells and then fit a very simple model in each cell. For example, Figure
2.9 shows a partition of the feature space in the case that the feature space is
two-dimensional. Here we consider only binary partitions, choosing a feature
and a split-point and then splitting the space into two regions. The feature and
split-point are chosen to achieve the best classification. Then one or both of
these two regions are split into two or more regions, and this process continues
until a stopping criterion is met. For example, in Figure 2.9 we first split at
X_2 = t_1. Then the region X_2 > t_1 is split into two regions at X_2 = t_2. Then the
region X_2 < t_2 is split at X_1 = t_3, and the region X_2 > t_2 is split at X_1 = t_4.

Figure 2.9: Classification tree partition of a two-dimensional feature space into seven regions, R_1, R_2, ..., R_7.
Finally, the region X_1 < t_4, X_2 > t_2 is split at X_2 = t_5. This results in a
partition with seven regions: R_1, R_2, ..., R_7.
CLTs are popular among medical scientists because of their interpretability
and the way that they mimic a doctor's thought process. For example, one
(fictitious) binary classification tree may be summed up in the statement, "If the
patient has a smoking history of at least 24 pack years, then if the patient suffers
from Chronic Obstructive Pulmonary Disease, then if the patient has end-stage
disease and is above age 63, then the patient is predicted not to survive for at
least 2 years." Just like a doctor might ask a patient a hierarchy of questions
and then base his/her diagnosis on the answers to all of these questions, a CLT
queries a hierarchy of features and classifies an observation based on all of these
features. Additionally, CLTs handle both nonlinearity and collinearity without
a complex model [4].
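The core operation, choosing the feature and split-point that best separate the classes, can be sketched as an exhaustive search; here misclassification error is the criterion (the text also allows the Jaccard Index), with each side of a candidate split predicting its majority class:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) binary split that
    minimizes misclassification error, with each side predicting its
    majority class. Class labels are assumed to be nonnegative integers."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # Count observations disagreeing with the majority label per side.
            err = sum(np.sum(side != np.bincount(side).argmax())
                      for side in (left, right) if len(side))
            if err < best[2]:
                best = (j, t, err)
    return best
```

Recursing the same search on each resulting region, until a stopping criterion is met, yields the full tree-growing procedure described above.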

Suppose that the target is a classification outcome taking on values 1, 2, ..., C
and that X_i denotes the ith p-dimensional observation. The CLT algorithm de-
termines the splitting variables and the split-points that are considered optimal
according to some criterion, such as minimum misclassification error or maxi-
mum Jaccard Index. Suppose that the CLT partitions the feature space into M
regions R_1, R_2, ..., R_M and that each region R_m contains r_m observations. Then

p̂_mc = (1/r_m) Σ_{X_i ∈ R_m} I(y_i = c)

is the proportion of observations in region R_m that belong to class c. Each
observation in region R_m is classified to class c(m) = argmax_c p̂_mc, the class
with the greatest representation in that region. The rpart function in the
rpart [28] package in R is used to construct CLTs (see Appendix A for code).
2.4.5 Probabilistic Neural Network
In recent years neural networks have gained popularity in classification of
patterns based on learning from examples. Whereas many successful neural
network algorithms are not grounded in statistics, as they use heuristic ap-
proaches to discover the underlying data structure, the probabilistic neural net-
work (PNN) [18, 26, 32] is grounded in statistical theory. In fact, the decision
boundary implemented by the PNN asymptotically approaches the Bayes op-
timal decision boundary, which minimizes expected risk, making the PNN
attractive to statisticians and probabilists.
Consider a multiclass problem with C classes, and let X_i denote the ith
p-dimensional observation vector normalized to unit length and belonging to
class y_i. Recall that Bayes Theorem says that

\Pr(y = c \mid X = X_i) = \frac{\Pr(X = X_i \mid y = c)\,\Pr(y = c)}{\sum_{d=1}^{C} \Pr(X = X_i \mid y = d)\,\Pr(y = d)}.  (2.71)

The Bayes decision rule classifies a vector X_i to class c if

\Pr(y = c \mid X = X_i) > \Pr(y = d \mid X = X_i), \quad d = 1, ..., C, \; d \neq c.  (2.72)

Since these probabilities are unknown and the denominator in (2.71) is just a normalization factor, Bayes' Theorem tells us that an equivalent Bayes decision rule classifies a vector X_i to class c if

\Pr(X = X_i \mid y = c)\,\Pr(y = c) > \Pr(X = X_i \mid y = d)\,\Pr(y = d), \quad d = 1, ..., C, \; d \neq c.  (2.73)

In order to use this decision rule, we need to fully specify \Pr(y = c) and \Pr(X = X_i \mid y = c). We define

\Pr(y = c) = \frac{1}{N} \sum_{i=1}^{N} I(y_i = c),  (2.74)

and we use Parzen's approximation [26, 32] of \Pr(X = X_i \mid y = c):

\Pr(X = X_i \mid y = c) = \frac{1}{N_c} \sum_{k=1}^{N_c} \exp\!\left(\frac{X_k^{T} X_i - 1}{\sigma^2}\right).  (2.75)

Here N_c is the number of training observations X_k belonging to class c and \sigma is a smoothing parameter. Following [21], we set \sigma equal to the absolute difference between the two smallest variances among the different features, capped at \sigma = 0.1.
Equations (2.73)-(2.75) together specify the Bayes decision rule of the PNN.
The architecture of the PNN is shown in Figure 2.10 for a two-class problem.

Figure 2.10: Architecture of a probabilistic neural network for a two-class classification problem (input layer, pattern layer, summation layer, decision layer).
The input layer is comprised of the different features in an observation to be
classified by the PNN; the pattern layer consists of the training patterns; the
summation layer represents equation (2.75); and the decision layer uses the
decision rule (2.73) to perform the classification. The PNN was implemented in
Matlab; see Appendix A for details.
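A minimal sketch of the PNN decision rule of equations (2.73)-(2.75), written in Python/NumPy for illustration (the thesis implementation is in Matlab; here the class priors are estimated from class frequencies and the smoothing parameter sigma is passed in directly rather than derived from feature variances):

```python
import numpy as np

def pnn_classify(X_train, y_train, x, sigma=0.1):
    """Classify x with the PNN decision rule of (2.73)-(2.75)."""
    # Normalize all vectors to unit length, as the derivation assumes.
    X_train = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    x = x / np.linalg.norm(x)
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)               # Pr(y = c), eq. (2.74)
        kernel = np.exp((Xc @ x - 1.0) / sigma**2)   # Parzen terms, eq. (2.75)
        scores.append(prior * kernel.mean())         # Pr(X|y=c) Pr(y=c)
    scores = np.array(scores)
    # Eq. (2.73): pick the class with the largest unnormalized posterior.
    return classes[int(np.argmax(scores))], scores / scores.sum()
```

The normalized scores returned alongside the label play the role of the PNN's summation-layer probabilities.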

3. Results
We perform forward feature selection using both E (2.27) and J (2.28) as
selection criteria to construct models using LR, LDA, SVMs, CLTs, and PNNs
for classification. The percentage of images from each class that are classified
correctly, as well as the values for E and J associated with each of these models,
are listed in Table 3.1.
Regardless of whether we use E or J, LR performs best, achieving the lowest misclassification error (E ≈ .10) and the highest Jaccard Index (J ≥ .81). Nevertheless, other models perform similarly to LR. The SVM model resulting from
maximizing J achieves the best classification of normal images, the CLT model
that results from minimizing E achieves the best classification of CF images,
and the logistic regression model resulting from minimizing E achieves the best
classification of SAR images. If we use these three models for classification and
then let them perform a majority rules vote, we can do better.
The confusion matrix resulting from this majority-rules voting scheme is
shown in Table 3.2. The misclassification error associated with this majority-rules model is E = .0925, and the associated Jaccard Index is J = .8270. Clearly,
this model achieves lower error and higher Jaccard Index than each individual
model in this case.
Sometimes even classification models that perform poorly overall can improve the classification accuracy of a voting ensemble [10]. Therefore, we consider using all twelve models,
rather than only the top three, in a majority-rules voting model. The confusion

Table 3.1: Classification accuracy for control, CF, and SAR images when
comparing various classification algorithms and two feature selection criteria.

Penalized Error E
Algorithm   Normal    CF       SAR      E       J
LR          89.1%    91.6%    87.5%    .1034   .8105
LDA         89.1%    90.1%    83.9%    .1207   .7854
SVM         91.3%    94.4%    82.1%    .1092   .8080
CLT         87.0%    95.8%    80.7%    .1149   .7838
PNN         71.7%    83.1%    73.2%    .2312   .6140

Penalized Jaccard Index J
Algorithm   Normal    CF       SAR      E       J
LR          93.5%    90.1%    85.7%    .1034   .8146
LDA         93.5%    94.4%    79.0%    .1092   .8035
SVM         95.7%    90.1%    78.6%    .1214   .7885
CLT         87.0%    95.8%    80.4%    .1156   .7820
PNN         73.9%    57.8%    87.5%    .2832   .5825
Table 3.2: Confusion matrix associated with the majority-rules voting of three
classification models.

                          Classified as
                     Normal     CF       SAR
True      Normal      93.5%    0.0%     6.5%
Class     CF           1.4%   95.8%     2.2%
          SAR         12.5%    5.4%    82.1%

matrix resulting from this majority-rules model is identical to the confusion
matrix associated with the majority-rules model that disenfranchises all but
three models (Table 3.2).
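The voting itself is simple; a sketch in Python (with ties broken in favor of the earliest-seen label, an assumption, since the text does not specify a tie-breaking rule):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models."""
    # Counter.most_common is stable, so equal counts keep insertion order.
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["CF", "CF", "SAR"]))  # CF
```

With twelve voters an odd tie among three classes is still possible, so a production scheme would need an explicit tie-breaking policy.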
The following are the predictors that entered into each model:

Models determined by minimizing E

Logistic Regression (k = .01):
* x36: Variance of homogeneity (SGLDM)
* x22: Minimum sum average (SGLDM)
* x51: Second moment of DGD (RDM)
* x2: Correlation (SGLDM)
* x55: Minimum variance (RGLDM)
* x72: Maximum kurtosis (RGLDM)

Linear Discriminant Analysis (k = .015):
* x48: Skewness of cluster prominence (SGLDM)
* x6: Sum average (SGLDM)
* x62: Average mean absolute deviation (RGLDM)

Support Vector Machine (k = .025):
* x62: Average mean absolute deviation (RGLDM)
* x14: Mean sum average (SGLDM)
* x5: Variance (SGLDM)
* x1: Contrast (SGLDM)

Classification Tree (k = .01):
* x64: Maximum mean absolute deviation (RGLDM)
* x22: Minimum sum average (SGLDM)
* x24: Minimum cluster prominence (SGLDM)
* x36: Variance of homogeneity (SGLDM)

Probabilistic Neural Network (k = .01):
* x46: Skewness of sum average (SGLDM)
* x83: Minimum difference of mean absolute deviation (RGLDM)
* x68: Maximum skewness (RGLDM)
* x121: Median difference of kurtosis (RGLDM)
* x73: Standard deviation of mean absolute deviation (RGLDM)
* x88: Mean absolute deviation of kurtosis (RGLDM)
* x19: Minimum energy (SGLDM)

Models determined by maximizing J

Logistic Regression (k = .01):
* x36: Variance of homogeneity (SGLDM)
* x22: Minimum sum average (SGLDM)
* x5: Variance (SGLDM)
* x54: Average variance (RGLDM)
* x1: Contrast (SGLDM)
* x61: Range of standard deviation (RGLDM)
* x66: Average skewness (RGLDM)
* x74: Average difference of variance (RGLDM)

Linear Discriminant Analysis (k = .015):
* x36: Variance of homogeneity (SGLDM)
* x14: Mean sum average (SGLDM)
* x60: Maximum standard deviation (SGLDM)

Support Vector Machine (k = .025):
* x62: Average mean absolute deviation (RGLDM)
* x14: Mean sum average (SGLDM)
* x5: Variance (SGLDM)
* x60: Maximum standard deviation (RGLDM)

Classification Tree (k = .01):
* x64: Maximum mean absolute deviation (RGLDM)
* x22: Minimum sum average (SGLDM)
* x24: Minimum cluster prominence (SGLDM)
* x36: Variance of homogeneity (SGLDM)

Probabilistic Neural Network (k = .01):
* x46: Skewness of sum average (SGLDM)
* x75: Minimum difference of variance (RGLDM)
* x: Standard deviation of difference of variance (RGLDM)
* x61: Range of standard deviation (RGLDM)
* x81: Range of difference of standard deviation (RGLDM)

4. Discussion and Conclusion
4.1 Discussion
LR appears to perform better than all other methods, as both the LR model that was found to minimize E and the LR model that was found to maximize J result in lower misclassification error and higher Jaccard Index than the other classification methods. Both LR models lead to about 90% classification accuracy. Nevertheless, LDA, SVM, and CLT perform almost as well as LR, achieving classification accuracies between 88% and 90%. Because LR and SVM are similar in theory, it is not surprising that they perform similarly. The present study, which split the data set into 2/3 for training and 1/3 for testing, is insufficient to determine that LR outperforms LDA, SVM, and CLT because these models do not perform much more poorly than LR. k-fold cross-validation on a larger dataset would better determine whether LR is truly better suited than the other methods for this classification problem.
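For reference, the k-fold scheme suggested here partitions the data into k folds and rotates which fold is held out for testing; a minimal index-level sketch (Python, with a simple striped assignment rather than the random shuffling a real study would use):

```python
def kfold_indices(n, k):
    """Striped k-fold split of indices 0..n-1 (no shuffling, for clarity)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]                                  # held-out fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

print(kfold_indices(6, 3)[0])  # ([1, 4, 2, 5], [0, 3])
```

Each model would be refit k times, once per split, and the k test errors averaged.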
It is curious, though, that whereas LR, LDA, SVM, and CLT perform simi-
larly, the PNN performs more poorly than the other methods, achieving classi-
fication accuracy of 72-77%. It is unclear at this point why the PNN performs
so much more poorly than the other methods.
Nevertheless, we find that constructing a majority-rules voting scheme us-
ing all models, even those (such as the PNN) that perform poorly, improves
classification for this dataset. When we allow all twelve models to vote on the
final classification of each image, 94% of normal images, 96% of CF images, and

82% of SAR images classify correctly. It is not surprising that there is confusion
between normal and SAR images (see Table 3.2). This is because in many cases
nodules associated with SAR can be quite subtle. As expected, there is little
confusion between CF and normal images because the textures associated with
CF tend to be far from normal. It is somewhat surprising that even 1.4% of CF
images classified incorrectly as normal. Some confusion is seen between SAR and CF images, a result of the similarity between the mosaic pattern associated with CF and early, subtle honeycombing that looks very much like the mosaic pattern.
It is interesting to note that although all twelve models use only 6-8 features, each of the twelve models uses a different subset of the feature space. In fact, the one feature that is used most frequently, variance of homogeneity, is used in only five of the twelve models. In total, 31 of the 123 features are used by at least one of the twelve models, and of these 31 features, 21 are used only once. Nevertheless, aside from the case of the PNN, all models perform similarly in spite of the fact that they may not use the same features as other models. This
is not surprising due to the high collinearity present in the feature space. In
fact, there may be many more subsets of the feature space that lead to similar
classification performance.
We point out that only 1 of these 31 features, second moment of DGD, is an RDM feature, and it is used only once. This suggests that run difference features may be unnecessary in this situation. Of the remaining 30 features, 13
are SGLDM features and 17 are RGLDM features. Every one of the twelve mod-
els uses a combination of SGLDM and RGLDM features, although the features

that are present at least three times (variance of homogeneity and minimum sum
average) are SGLDM features. This suggests that SGLDM features codify dif-
ferent textural patterns than RGLDM features and that both feature extraction
methods are useful in this classification endeavor.
Although it is understood that collinearity manifests greatly in this data
set, the fact that different models use different feature subsets raises a number
of questions. Why do different models use different features? Are some features
predisposed, due to an underlying geometry, to be used in particular classifi-
cation models? These questions can only be answered by a closer look at the
features that were included in each model, and the answers to these questions
would be important and insightful in feature extraction and model-building.
Some classification methods use the same features when maximizing J as
when minimizing E. For example, both CLT classifiers use only four features,
and these features do not change when maximizing J instead of minimizing
E. In fact, two of these four features are used in at least three of the twelve
models. Also, both SVM models use four features, and the first three features in the SVM models are included regardless of whether the model is determined by minimizing E or by maximizing J. The first two features used by the LR classifiers are also the same in both cases. In contrast, the LDA and PNN classifiers determined by minimizing E and by maximizing J do not share
any features with each other. One would expect that the criterion for feature
selection should not affect the resulting feature subset very much, and certainly
not in the early stages of the feature selection routine. The fact that CLT and
SVM use mostly the same feature subset regardless of feature selection criterion

suggests that these two classification methods may determine an optimal feature
subset more robustly than the other methods.
There are a number of factors to consider when determining what classifica-
tion model to use. In general, a simpler model that is relatively easy to interpret
is best. The LR and PNN classifiers use 5-8 predictors for classification, so they
are considerably more complex than the other methods. The LDA classifiers use
only 3 predictors, and the CLT and SVM classifiers use only 4 predictors, and
yet they lead to equivalent or better classification. A simpler model that uses
fewer predictors is always better than a more complex model that uses more
predictors but performs similarly to the simpler model, and the LDA, CLT, and
SVM classifiers are certainly the simpler models.
Nevertheless, it is important to also consider the interpretability of a classification model. Since LDA and SVM classifiers transform the input features, using linear combinations of the features or radial basis function expansions, respectively, they are more difficult to interpret than LR and CLT classifiers. The coefficients β returned by LR are easily interpretable and give insight into the type of effect that each predictor has on classification.
Also, the split-points returned by a CLT are sufficient to give insight into the
underlying structure of the data. In fact, one might argue that CLT classifiers
are easier to interpret than LR classifiers and that CLT classifiers more closely
imitate the thought process of a doctor in performing a diagnosis. For example,
one branch of a CLT for this problem states: "If minimum sum average is less than 3.39, then if average standard deviation is less than 41.8, then if minimum skewness is less than -0.44, the image is normal; but if minimum skewness is at least -0.44, the image contains evidence of SAR." The importance of minimum skewness in determining classification is self-evident, and little external
interpretation is necessary.
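That quoted branch translates directly into nested conditions, which is exactly what makes CLTs readable (Python sketch of just this one branch; the fall-through label is a placeholder, since the rest of the tree is not quoted):

```python
def classify_branch(min_sum_average, avg_std_dev, min_skewness):
    """One quoted branch of the CLT; thresholds are taken from the text."""
    if min_sum_average < 3.39:
        if avg_std_dev < 41.8:
            if min_skewness < -0.44:
                return "normal"
            return "SAR"
    return "other branch of the tree"  # not described in the quoted rule

print(classify_branch(3.0, 40.0, -0.5))  # normal
```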
In this study, there is a third consideration to take into account in selecting a
best model, in addition to simplicity and interpretation. Because this research
takes a first step towards an automated CAD scheme, it is important to be
able to measure the probability of correctly diagnosing an image. Ideally, we
want a CAD system to return the probabilities of correct diagnosis together
with the diagnosis itself. Some classification algorithms lend themselves more readily than others to producing a probability of correct diagnosis. For example, it is easy
to extract the probability of correct diagnosis from a PNN classifier because
the summation layer of the PNN returns these probabilities. Also, in an SVM
classifier it is relatively simple to calculate the distance, in feature space, of a
particular observation from the decision boundaries. Based upon the distances of
an observation from the decision boundaries, the probability of correct diagnosis
can be determined. In fact, the svm function in the e1071 [7] package in R has the option of returning these probabilities, and the same can be done for
LR. The fact that the SVM and LR classifiers return the probability of correct
diagnosis may make SVMs and LR classifiers more desirable options in a CAD
system. Then, the user of the CAD system can define a threshold, such as .9,
so that all images with probability of correct diagnosis lower than the threshold
can be looked at by radiologists for a second opinion.
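The proposed second-opinion rule can be sketched as follows (Python for illustration; the probabilities are assumed to come from whichever classifier the CAD system uses, and .9 is the example threshold from the text):

```python
def triage(class_probs, threshold=0.9):
    """Accept the automated diagnosis only if its probability clears the threshold."""
    best = max(class_probs, key=class_probs.get)  # most probable diagnosis
    if class_probs[best] < threshold:
        return best, "refer to radiologist"
    return best, "accept automated diagnosis"

print(triage({"normal": 0.95, "CF": 0.03, "SAR": 0.02}))
```

Lowering the threshold reduces radiologist workload at the cost of accepting less certain automated diagnoses.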

4.2 Future Directions
This study represents an initial step towards a CAD system to automatically
diagnose CT images as normal, CF, or SAR. Future research may continue
along several avenues in the areas of feature extraction, feature selection, and
feature classification. Although the SGLDM and RGLDM features discussed
here quantify the textural patterns associated with CF and SAR quite well,
continued research should determine additional texture descriptors that can aid
in diagnosis. For example, this set of features did not succeed in capturing
the subtle differences between the small SAR nodules and normal parenchyma,
and future research directions would include finding texture features that better
describe this subtlety. Another future direction would be to take a closer look at
the features that enter into the classification models together with radiologists, to be sure that the significant texture descriptors correlate well with radiologists' visual reads.
In addition to feature extraction, the feature selection procedure can also
benefit from continued research. In this study we used forward selection based
upon two criteria, misclassification error and Jaccard Index, but there are caveats associated with performing forward selection. First, adding a feature to a classification model sometimes does not aid much in classification, and may even reduce classification accuracy slightly, yet adding this feature may lead to better classification later on, when further features are added to the model [10]. This is typically a result of interactive effects among predictors, and CLT classifiers are particularly susceptible to this caveat [10]. Thus, rather than

terminating feature selection when the misclassification error begins to increase
or when the Jaccard Index no longer increases, future research may consider
whether cross-validation would lead to better classification models. Another
possibility would be a combinatorial approach to feature selection where a user
specifies a maximum number, p, of predictors to be included in a model and
then all possible combinations of at most p features are considered. Once again,
cross-validation can be used in combination with this combinatorial approach
to select the best-performing model.
A second problem with performing forward feature selection is that adding features that improve a model may lead to the inclusion of multiple collinear features. One way to circumvent this issue is by performing stepwise feature selection. At each step of the stepwise feature selection algorithm, in addition to considering adding a feature to the model, the algorithm also considers whether the most benefit results from removing a feature from the model. Future research should include trying stepwise feature selection to see if this leads to better, and less collinear, feature subsets, particularly for the LR and PNN classifiers, which include 5-8 features, some of which are not independent.
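One pass of the stepwise idea can be sketched as follows (Python for illustration; error_fn is a hypothetical stand-in for fitting a model on a feature subset and returning its misclassification error):

```python
def stepwise_step(included, remaining, error_fn):
    """One stepwise pass: try every single-feature add and every drop,
    returning the move that lowers the error the most (or None)."""
    best_move, best_err = None, error_fn(included)
    for f in remaining:                        # candidate additions
        e = error_fn(included + [f])
        if e < best_err:
            best_move, best_err = ("add", f), e
    for f in included:                         # candidate removals
        e = error_fn([g for g in included if g != f])
        if e < best_err:
            best_move, best_err = ("drop", f), e
    return best_move, best_err
```

Iterating this step until no move improves the error yields a stepwise-selected subset; dropping the removal loop recovers plain forward selection.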
In terms of classification there are a number of directions for future re-
search. We considered here five classification algorithms: LR, LDA, SVM, CLT,
and PNN classifiers. Though diverse, this is by no means an exhaustive list.
Other classification algorithms that may perform well include supervised clustering algorithms such as k-nearest neighbors classifiers, random forests, and
other neural network algorithms, such as back-propagation. Additionally, we
found that a majority-rules voting scheme outperformed each of the individual

classifiers considered here, and this majority-rules method is a very simplified
version of an ensemble learning scheme. Future research may consider more
complex ensemble learning, bagging, boosting, or meta-analysis methods for
combining the strengths of multiple classification models.
4.3 Conclusion
This study demonstrates the feasibility of using textural patterns present in
CT images to differentially diagnose CF and SAR. The SGLDM and RGLDM
features defined here capture the textures associated with these two pulmonary
diseases quite well. Classification results are very good, especially considering
the fact that we are performing diagnoses based solely upon CT image textures
and we are not considering other factors, such as pulmonary function tests and
sputum cultures. Nevertheless, this is only a preliminary study, and it will take
considerable progress and continued research before a CAD system ready for
clinical use can be developed.

APPENDIX A. Statistical Code (R)
R function to find the feature that, when added to an LR model, minimizes E:
findFeats <- function(p, added) {
  # requires the nnet package for multinom
  newfeat <- data.frame(trainset[, c(added, p)])
  fit <- multinom(disease ~ ., data = newfeat, maxit = 1000)
  predicted <- predict(fit, newfeat)
  mean(predicted != newfeat$disease)  # misclassification error
}
errors <- unlist(lapply(notadded, findFeats, added))
minerror <- min(errors)
newfeat <- match(minerror, errors)
R function to find the feature that, when added to an LDA model, minimizes E:

findFeats <- function(p, added) {
  # requires the MASS package for lda
  newfeat <- data.frame(trainset[, c(added, p)])
  fit <- lda(disease ~ ., data = newfeat)
  predicted <- predict(fit, newfeat)$class
  mean(predicted != newfeat$disease)  # misclassification error
}
errors <- unlist(lapply(notadded, findFeats, added))
minerror <- min(errors)
newfeat <- match(minerror, errors)
R function to find the feature that, when added to an SVM model, minimizes E:

findFeats <- function(p, added) {
  # requires the e1071 package for svm
  newfeat <- data.frame(trainset[, c(added, p)])
  fit <- svm(disease ~ ., data = newfeat, type = "C-classification")
  predicted <- predict(fit, newfeat)
  mean(predicted != newfeat$disease)  # misclassification error
}
errors <- unlist(lapply(notadded, findFeats, added))
minerror <- min(errors)
newfeat <- match(minerror, errors)
R function to find the feature that, when added to a CLT model, minimizes E:

findFeats <- function(p, added) {
  # requires the rpart package
  newfeat <- data.frame(trainset[, c(added, p)])
  fit <- rpart(disease ~ ., data = newfeat, method = "class")
  predicted <- predict(fit, newfeat, type = "class")
  mean(predicted != newfeat$disease)  # misclassification error
}
errors <- unlist(lapply(notadded, findFeats, added))
minerror <- min(errors)
newfeat <- match(minerror, errors)

PNN Code (Matlab)
Returns PNN decision and probabilities for three-class classification
function [probabilities, class] = ...
    pnnclassify(training_set, training_key, class_weights, pattern_weights, vec2classify)
% Five inputs:
%   'training_set' is an MxN matrix of N training vectors, each having M
%   features. (training_set = [Class1vecs Class2vecs Class3vecs])
%   'training_key' can be one of two things:
%     * A 1x3 vector whose elements represent the number of training
%       vectors per class. Thus, length(training_key) is equal to the
%       number of classes, and training_key(c) is equal to the number of
%       training vectors for class c.
%     * A scalar representing the number of classes. If 'training_key'
%       is a scalar then it is assumed that there is an equal number of
%       training patterns for each class.
%   'class_weights' is a 1x3 vector of weights for the decision layer of the
%   PNN. class_weights should reflect a priori probabilities. If
%   class_weights is an empty matrix then all weights are set to 1.0.
%   'pattern_weights' is a 1xN vector of weights for the summation layer of
%   the PNN. If pattern_weights is an empty matrix then all weights are
%   set to 1.0.
%   'vec2classify' is the Mx1 vector being classified.
% Two outputs:
%   'probabilities' is a 3x1 probability vector where probabilities(c)
%   represents the probability that the vector being classified belongs to
%   class c.
%   'class' represents the class to which the vector being classified is
%   assigned by the PNN. class is also the index of max(probabilities).

% Check that dimensions match up and that there are no NaNs among the data
[m, n] = size(training_set);
[x, y] = size(vec2classify);
if isequal(m, x) == 0
    disp('Dimension of vector being classified:'); disp(size(vec2classify));
    disp('Dimension of training vectors:'); disp(size(training_set));
    error('Dimensions of training vectors and vector to be classified do not match!');
end
for i = 1:x
    for j = 1:y
        if isnan(vec2classify(i,j)) == 1
            vec2classify(i,j) = 0;
        end
    end
end
for i = 1:m
    for j = 1:n
        if isnan(training_set(i,j)) == 1
            training_set(i,j) = 0;
        end
    end
end

% Get number of classes
training_key = cumsum(training_key);
if isscalar(training_key)
    numclasses = training_key;
else
    numclasses = length(training_key);
end
if numclasses ~= 3
    error('The number of classes is not 3!');
end
% Set class_weights equal to 1.0 if class_weights matrix is empty
if isempty(class_weights) == 1
    if isscalar(training_key) == 1
        class_weights = ones(1, numclasses);
    else
        class_weights = ones(size(training_key));
    end
end
% Set pattern_weights equal to 1.0 if pattern_weights matrix is empty
if isempty(pattern_weights) == 1
    pattern_weights = ones(1, n);
end
% Normalize training and test vectors to unit length
for column = 1:n
    training_set(:,column) = training_set(:,column)/sqrt(sum(training_set(:,column).^2));
end
for column = 1:y
    vec2classify(:,column) = vec2classify(:,column)/sqrt(sum(vec2classify(:,column).^2));
end
% Determine value for sigma (spreading parameter) according to procedure
% proposed by Ramakrishnan and Selvan, 2006
std_training_set = zeros(size(training_set));
for row = 1:m
    mu = mean(training_set(row,:));
    sd = std(training_set(row,:));
    std_training_set(row,:) = (training_set(row,:) - mu)/sd;
end
variances = sort(var(std_training_set));
if m == 1, sigma = .1;
else sigma = abs(variances(1) - variances(2));
end
if sigma > .1, sigma = .1; end
% Probabilistic Neural Network
if numclasses == 3
    if isscalar(training_key) == 1
        numclass1 = n/3;
        numclass2 = 2*n/3;
    else
        numclass1 = training_key(1);
        numclass2 = training_key(2);
    end
end
% Hidden pattern layer and summation layer
SsubA = 0; SsubB = 0; SsubC = 0;
for c = 1:numclass1
    Zsubc = exp((((training_set(:,c))')*(vec2classify) - 1)/(sigma^2));
    SsubA = SsubA + pattern_weights(c)*Zsubc;
end
for c = (numclass1+1):numclass2
    Zsubc = exp((((training_set(:,c))')*(vec2classify) - 1)/(sigma^2));
    SsubB = SsubB + pattern_weights(c)*Zsubc;
end
for c = (numclass2+1):n
    Zsubc = exp((((training_set(:,c))')*(vec2classify) - 1)/(sigma^2));
    SsubC = SsubC + pattern_weights(c)*Zsubc;
end
% Decision layer
probA = SsubA/(SsubA + SsubB + SsubC);
probB = SsubB/(SsubA + SsubB + SsubC);
probC = SsubC/(SsubA + SsubB + SsubC);
probabilities = [class_weights(1)*probA; class_weights(2)*probB; class_weights(3)*probC];
probabilities = probabilities/sum(probabilities);
[~, class] = max(probabilities);

function GLCMfeatures = circularGLCM(image)
% rotate image pi/8 radians 8 times
s = size(image);
if s(1) ~= s(2)
    error('Image must be a square matrix.')
else diameter = s(1);
end
images = zeros(diameter, diameter, 8);
images(:,:,1) = image;
images(:,:,2) = imrotate(image,-90*(1/4),'bilinear','crop');
images(:,:,3) = imrotate(image,-90*(2/4),'bilinear','crop');
images(:,:,4) = imrotate(image,-90*(3/4),'bilinear','crop');
images(:,:,5) = imrotate(image,-90,'bilinear','crop');
images(:,:,6) = imrotate(image,-90*(5/4),'bilinear','crop');
images(:,:,7) = imrotate(image,-90*(6/4),'bilinear','crop');
images(:,:,8) = imrotate(image,-90*(7/4),'bilinear','crop');
% get inscribed squares within each rotated image
midpoint = floor(diameter/2);
side = sqrt(2)*midpoint;
rectangle = [midpoint-floor(side/2-2) midpoint-floor(side/2-2) side-2 side-2];
squares = zeros(floor(side-1), floor(side-1), 8);
parfor sqr = 1:8, squares(:,:,sqr) = imcrop(images(:,:,sqr), rectangle); end
stats = {'Contrast','Correlation','Energy','Homogeneity','Variance', ...
    'Sumaverage','MaximumProbability','ClusterProminence'};
numStats = 8; o = length(1:(side-1)/2);
features = zeros(o, numStats);
% calculate co-occurrence features from each image
for d = 1:((side-1)/2)
    glcm = zeros(8, 8);
    for sqr = 1:8
        glcm = glcm + graycomatrix(squares(:,:,sqr),'offset',[-d 0],'graylimits', ...
            [min(squares(:)) max(squares(:))],'numlevels',8);
    end
    features(d,:) = getHarStats(glcm, stats);
end
GLCMfeatures = [features(1,:)'; mean(features)'; min(features)'; ...
    max(features)'; var(features)'; mad(features)'; skewness(features)'];

function HarFeats = getHarStats(glcm, stats)
statistics = mygraycoprops2(glcm, stats);
HarFeats = [statistics.Contrast statistics.Correlation ...
    statistics.Energy statistics.Homogeneity statistics.Variance ...
    statistics.Sumaverage statistics.MaximumProbability ...
    statistics.ClusterProminence];
s = length(HarFeats);
for q = 1:s, if isnan(HarFeats(q)), HarFeats(q) = 0; end; end

% Computes statistical features from GLCMs
function stats = mygraycoprops2(varargin)
allStats = {'Contrast','Correlation','Energy','Homogeneity','Variance', ...
    'Sumaverage','MaximumProbability','ClusterProminence'}; %#ok<NASGU>
glcm = varargin{1}; statNames = varargin{2};  % minimal input parsing

% Initialize output stats structure.
numStats = length(statNames);
numGLCM = size(glcm, 3);
empties = repmat({zeros(1, numGLCM)}, [numStats 1]);
stats = cell2struct(empties, statNames, 1);
for p = 1:numGLCM
    if numGLCM ~= 1, tGLCM = normalizeGLCM(glcm(:,:,p));
    else tGLCM = normalizeGLCM(glcm);
    end
    % Get row and column subscripts of GLCM, which correspond to
    % pixel values in the GLCM.
    s = size(tGLCM); [c, r] = meshgrid(1:s(1), 1:s(2)); r = r(:); c = c(:);
    % Calculate fields of output stats structure.
    for k = 1:numStats
        name = statNames{k};
        switch name
            case 'Contrast', stats.(name)(p) = calculateContrast(tGLCM,r,c);
            case 'Correlation', stats.(name)(p) = calculateCorrelation(tGLCM,r,c);
            case 'Energy', stats.(name)(p) = calculateEnergy(tGLCM);
            case 'Homogeneity', stats.(name)(p) = calculateHomogeneity(tGLCM,r,c);
            case 'Variance', stats.(name)(p) = calculateVariance(tGLCM,r,c);
            case 'Sumaverage', stats.(name)(p) = calculateSumaverage(tGLCM,r,c);
            case 'MaximumProbability', stats.(name)(p) = calculateMaximumProbability(tGLCM);
            case 'ClusterProminence', stats.(name)(p) = calculateClusterProminence(tGLCM,r,c);
        end
    end
end

function glcm = normalizeGLCM(glcm)
if any(glcm(:)), glcm = glcm ./ sum(glcm(:)); end

function C = calculateContrast(glcm,r,c)
k = 2; l = 1; term1 = abs(r - c).^k; term2 = glcm.^l;
term = term1 .* term2(:); C = sum(term);

function Corr = calculateCorrelation(glcm,r,c)
mr = meanIndex(r,glcm); Sr = stdIndex(r,glcm,mr);
mc = meanIndex(c,glcm); Sc = stdIndex(c,glcm,mc);
term1 = (r - mr) .* (c - mc) .* glcm(:); term2 = sum(term1);
ws = warning('off','MATLAB:divideByZero');
Corr = term2 / (Sr * Sc); warning(ws);

function S = stdIndex(index,glcm,m)
term1 = (index - m).^2 .* glcm(:); S = sqrt(sum(term1));

function M = meanIndex(index,glcm)
M = index .* glcm(:); M = sum(M);

function E = calculateEnergy(glcm)
foo = glcm.^2; E = sum(foo(:));

function H = calculateHomogeneity(glcm,r,c)
term1 = (1 + abs(r - c)); term = glcm(:) ./ term1; H = sum(term);

function Var = calculateVariance(glcm,r,c)
mc = meanIndex(c,glcm); mr = meanIndex(r,glcm);
mu = (mc + mr)/2; term1 = (r - mu) .^ 2;
term = term1 .* glcm(:); Var = sum(term);

function Sumavg = calculateSumaverage(glcm,r,c)
term1 = r .* glcm(:) + c .* glcm(:);
term = sum(term1); Sumavg = term/2;

function MaxProb = calculateMaximumProbability(glcm)
MaxProb = max(glcm(:));

function ClusterProminence = calculateClusterProminence(glcm,r,c)
mc = meanIndex(c,glcm); mr = meanIndex(r,glcm);
term1 = (r - mr + c - mc) .^ 4; term = term1 .* glcm(:);
ClusterProminence = sum(term);

RDM Code (Matlab)
function RDMfeatures = circularRDM(image)
% rotate image pi/8 radians 8 times
s = size(image);
if s(1) ~= s(2)
    error('Image must be a square matrix.')
else diameter = s(1);
end
images = zeros(diameter, diameter, 8);
images(:,:,1) = image;
images(:,:,2) = imrotate(image,-90*(1/4),'bilinear','crop');
images(:,:,3) = imrotate(image,-90*(2/4),'bilinear','crop');
images(:,:,4) = imrotate(image,-90*(3/4),'bilinear','crop');
images(:,:,5) = imrotate(image,-90,'bilinear','crop');
images(:,:,6) = imrotate(image,-90*(5/4),'bilinear','crop');
images(:,:,7) = imrotate(image,-90*(6/4),'bilinear','crop');
images(:,:,8) = imrotate(image,-90*(7/4),'bilinear','crop');
% get inscribed squares within each rotated image
midpoint = floor(diameter/2);
side = sqrt(2)*midpoint;
rectangle = [midpoint-floor(side/2-2) midpoint-floor(side/2-2) side-2 side-2];
squares = zeros(floor(side-1), floor(side-1), 8);
parfor sqr = 1:8, squares(:,:,sqr) = imcrop(images(:,:,sqr), rectangle); end
% rescale each square to 8 gray levels (the scaled image is the second
% output of graycomatrix)
scaledSquares = zeros(size(squares));
numLev = 8;
for sqr = 1:8
    im = squares(:,:,sqr);
    graylevels = [min(im(:)) max(im(:))];
    [glcm, scaledSquares(:,:,sqr)] = graycomatrix(im,'graylimits',graylevels, ...
        'numlevels',numLev); %#ok<ASGLU>
end
% create run difference matrices
r = 1:floor(floor(side-1)/2);
rdms = zeros(length(r), numLev, 8);
for sqr = 1:8
    im = scaledSquares(:,:,sqr);
    for rr = r
        for height = 1:floor(side-1)
            for pixel = 1:(floor(side-1)-rr)
                gdif = abs(im(pixel,height) - im(pixel+rr,height)) + 1;
                rdms(rr,gdif,sqr) = rdms(rr,gdif,sqr) + 1;
            end
        end
    end
end
% normalize run difference matrices
for rdm = 1:8
    parfor col = 1:length(r)
        rdms(col,:,rdm) = rdms(col,:,rdm)./sum(rdms(col,:,rdm));
    end
end
% get mean run difference matrix over 8 directions
rdm = mean(rdms, 3);
% extract features from run difference matrix
% according to Chen, et al., 2005, Ultrasound in Medicine and Biology
% begin by computing characteristic vectors
[rmax, jmax] = size(rdm);
dgd = zeros(1, jmax);
for jj = 1:jmax, dgd(1,jj) = sum(rdm(:,jj)); end
dod = zeros(1, rmax);
for rr = 1:rmax
    for jj = 1:jmax, dod(rr) = dod(rr) + rdm(rr,jj)*jj; end
end
dad = zeros(1, jmax);
for jj = 1:jmax
    for rr = 1:rmax, dad(jj) = dad(jj) + rdm(rr,jj)*rr; end
end
% now extract features
lde = 0; K = 2;
for jj = 1:jmax
    lde = lde + dgd(jj)*log(K/(jj-1));
    if isinf(lde), lde = 0; end
end
shp = 0; for jj = 1:jmax, shp = shp + dgd(jj)*(jj-1)^3; end
smg = 0; for jj = 1:jmax, smg = smg + dgd(jj)^2; end
smo = 0; for rr = 1:rmax, smo = smo + dod(rr)^2; end
ldel = 0; for jj = 1:jmax, ldel = ldel + dad(jj)*(jj-1)^2; end
RDMfeatures = [lde; shp; smg; smo; ldel];

function RadFeatures = NewReceptiveField(image)
[xc, yc] = size(image); xc = xc/2; yc = yc/2;
radius = 1:(floor(xc)-1); % Can shorten radius if want to use a smaller region
numstats = 5; % variance, sd, mad, skewness, kurtosis
numradii = length(radius);
Lfeature = zeros(numradii,numstats);
for rcount = 1:numradii
    % Get 16 points surrounding the central pixel at designated radius
    r = radius(rcount);
    p1 = image(round(xc+r),round(yc));
    p2 = image(round(xc+r*cos(pi/8)),round(yc+r*sin(pi/8)));
    p3 = image(round(xc+sqrt((r^2)/2)),round(yc+sqrt((r^2)/2)));
    p4 = image(round(xc+r*sin(pi/8)),round(yc+r*cos(pi/8)));
    p5 = image(round(xc),round(yc+r));
    p6 = image(round(xc-r*sin(pi/8)),round(yc+r*cos(pi/8)));
    p7 = image(round(xc-sqrt((r^2)/2)),round(yc+sqrt((r^2)/2)));
    p8 = image(round(xc-r*cos(pi/8)),round(yc+r*sin(pi/8)));
    p9 = image(round(xc-r),round(yc));
    p10 = image(round(xc-r*cos(pi/8)),round(yc-r*sin(pi/8)));
    p11 = image(round(xc-sqrt((r^2)/2)),round(yc-sqrt((r^2)/2)));
    p12 = image(round(xc-r*sin(pi/8)),round(yc-r*cos(pi/8)));
    p13 = image(round(xc),round(yc-r));
    p14 = image(round(xc+r*sin(pi/8)),round(yc-r*cos(pi/8)));
    p15 = image(round(xc+sqrt((r^2)/2)),round(yc-sqrt((r^2)/2)));
    p16 = image(round(xc+r*cos(pi/8)),round(yc-r*sin(pi/8)));
    points = [p1;p2;p3;p4;p5;p6;p7;p8;p9;p10;p11;p12;p13;p14;p15;p16];
    % Calculate 5 features from these points
    Lfeature(rcount,1) = var(points);
    Lfeature(rcount,2) = std(points);
    Lfeature(rcount,3) = mad(points);
    Lfeature(rcount,4) = skewness(points);
    Lfeature(rcount,5) = kurtosis(points);
end
% Split radii into radii groups of 5 radii each, and compute the mean
% features over all radii within each radii group
numRPerGroup = 5; numRGroups = floor(numradii/numRPerGroup);
groupFeatMat = zeros(numRGroups,numstats);
for rGroup = 1:numRGroups
    groupFeatMat(rGroup,:) = mean(Lfeature(((rGroup-1)*numRPerGroup+1):(rGroup*numRPerGroup),:),1);
end
% Compute the mean, min, max, etc. of each feature over the different
% radii groups
statMeans = mean(groupFeatMat,1); statMins = min(groupFeatMat,[],1);
statMaxs = max(groupFeatMat,[],1); statRngs = range(groupFeatMat,1);
statMeds = median(groupFeatMat,1); statVars = var(groupFeatMat,1);
statStds = std(groupFeatMat,0,1); statMads = mad(groupFeatMat,0,1);
statSkews = skewness(groupFeatMat,0,1);
% Compute the mean, min, max, etc. of the differences between values of
% each feature from the different radii groups
differences = [];
for rGroup = 1:numRGroups
    for rrGroup = (rGroup+1):numRGroups
        differences = [differences; groupFeatMat(rGroup,:) - groupFeatMat(rrGroup,:)];
    end
end
diffMean = mean(differences,1); diffMin = min(differences,[],1);
diffMax = max(differences,[],1); diffRng = range(differences,1);
diffMed = median(differences,1); diffVar = var(differences,1);
diffStd = std(differences,0,1); diffMad = mad(differences,0,1);
diffSkew = skewness(differences,0,1);
% Return desired radial features
RadFeatures = [statMeans statMeds statMins statMaxs statRngs statVars ...
    statStds statMads statSkews diffMean diffMed diffMin diffMax diffRng ...
    diffVar diffStd diffMad diffSkew];
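The unrolled p1 through p16 assignments above sample 16 points spaced pi/8 apart on a circle about the image center. A Python sketch of the same sampling using an explicit angle loop (the names `ring_points` and `image` are illustrative; of the five statistics only variance and standard deviation are shown, since MAD, skewness, and kurtosis are not in the Python standard library):

```python
import math
import statistics

def ring_points(image, r):
    # Sample 16 equally spaced points on a circle of radius r centered
    # on the image; image is a 2-D list indexed [row][col].
    xc, yc = len(image) / 2.0, len(image[0]) / 2.0
    pts = []
    for k in range(16):
        theta = k * math.pi / 8  # pi/8 steps give the 16 points above
        row = round(xc + r * math.cos(theta))
        col = round(yc + r * math.sin(theta))
        pts.append(image[row][col])
    return pts

# Toy 32x32 image with a diagonal texture pattern.
image = [[(i + j) % 7 for j in range(32)] for i in range(32)]
pts = ring_points(image, 5)
stats = [statistics.pvariance(pts), statistics.pstdev(pts)]
```

The diagonal directions (p3, p7, p11, p15 in the MATLAB) use sqrt(r^2/2), which is just r*cos(pi/4), so the angle loop reproduces all 16 positions.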

[1] M. Abehsera, D. Valeyre, P. Grenier, H. Jaillet, J. P. Battesti, and M. W.
Brauner. Sarcoidosis with pulmonary fibrosis: CT patterns and correlation
with pulmonary function. American Journal of Roentgenology, 174:1751-
1757, 2000.
[2] A. Agresti. An Introduction to Categorical Data Analysis. John Wiley and
Sons, Inc., Hoboken, NJ, 2007.
[3] Z. A. Aziz, A. U. Wells, D. M. Hansell, G. A. Bain, S. J. Copley, S. R. Desai,
S. M. Ellis, F. V. Gleeson, S. Grubnic, A. G. Nicholson, S. P. G. Padley,
K. S. Pointon, J. H. Reynolds, R. J. H. Robertson, and M. B. Rubens.
HRCT diagnosis of diffuse parenchymal lung disease: Inter-observer varia-
tion. Thorax, 59:506-511, 2004.
[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and
Regression Trees. CRC Press, Boca Raton, 1993.
[5] F. Chabat, G. Yang, and D. M. Hansell. Obstructive lung diseases: Texture
classification for differentiation at CT. Radiology, 228:871-877, 2003.
[6] W. Chen, R. Chang, S. Kuo, C. Chang, W. K. Moon, S. Chen, and D. Chen.
3-D ultrasound texture classification using run difference matrix. Ultra-
sound in Medicine and Biology, 31(6):763-770, 2005.
[7] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. Miscel-
laneous functions of the department of statistics (el071), TU Wien, 2010.
[8] R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for
image classification. IEEE Transactions on Systems, Man, and Cybernetics,
3:610-621, 1973.
[9] R. M. Haralick and L. G. Shapiro. Computer and Robot Vision. Addison-
Wesley Publishing Co., Boston, MA, 1992.
[10] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statisti-
cal Learning: Data Mining, Inference, and Prediction. Springer Series in
Statistics, New York, 2nd edition, 2009.

[11] T. H. Helbich, G. Heinz-Peer, D. Fleischmann, C. Wojnarowski, P. Wun-
derbaldinger, S. Huber, I. Eichler, and C. J. Herold. Evolution of CT find-
ings in patients with cystic fibrosis. American Journal of Roentgenology,
173:81-88, 1999.
[12] T. H. Helbich, G. Heinz-Peer, I. Eichler, P. Wunderbaldinger, M. Gotz,
C. Wojnarowski, R. C. Brasch, and C. J. Herold. Cystic fibrosis: CT
assessment of lung involvement in children and adults. Radiology, 213:537-
544, 1999.
[13] C. P. Hersh, G. R. Washko, F. L. Jacobson, R. Gill, R. S. Estepar, J. J.
Reilly, and E. K. Silverman. Interobserver variability in the determination
of upper lobe-predominant emphysema. Chest, 131:424-431, 2007.
[14] B. M. Keller, A. P. Reeves, T. V. Apanasovich, J. Wang, D. F. Yankelevitz,
and C. I. Henschke. Quantitative assessment of emphysema from whole
lung CT scans: Comparison with visual grading. Proceedings of the SPIE,
7260, 2009.
[15] N. L. Müller, P. Kullnig, and R. R. Miller. The CT findings of pulmonary
sarcoidosis: Analysis of 25 patients. American Journal of Roentgenology,
152:1179-1182, 1989.
[16] D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural and
Statistical Classification. Ellis Horwood Series in Artificial Intelligence, Ellis
Horwood, 1994.
[17] A. H. Mir, M. Hanmandlu, and S. N. Tandon. Texture analysis of CT images.
IEEE Engineering in Medicine and Biology, November/December:781-786,
1995.
[18] F. Newman. PNN notes, April, 2010.
[19] Y. S. Park, J. B. Seo, N. Kim, E. J. Chae, Y. M. Oh, S. D. Lee, Y. Lee, and
S. Kang. Texture-based quantification of pulmonary emphysema on high-
resolution computed tomography: Comparison with density-based quan-
tification and correlation with pulmonary function test. Investigative Ra-
diology, 43(6):395-402, 2008.
[20] M. Prasad, A. Sowmya, and P. Wilson. Multi-level classification of em-
physema in HRCT lung images. Pattern Analysis Applications, 12:9-20, 2009.

[21] S. Ramakrishnan and S. Selvan. Classification of brain tissues using mul-
tiwavelet transformation and probabilistic neural network. International
Journal of Simulation: Systems, Science and Technology, 7(9):9-25, 2006.
[22] B. D. Ripley. Feed-forward neural networks and multinomial log-linear
models, 2009.
[23] I. Sluimer, A. Schilham, M. Prokop, and B. van Ginneken. Computer
analysis of computed tomography scans: A survey. IEEE Transactions on
Medical Imaging, 25(4):385-405, 2006.
[24] I. Sluimer, P. F. van Waes, M. A. Viergever, and B. van Ginneken.
Computer-aided diagnosis in high resolution CT of the lungs. Medical
Physics, 30(12):3081-3090, 2003.
[25] L. Sorenson, S. B. Shaker, and M. de Bruijne. Quantitative analysis of
pulmonary emphysema using local binary patterns. IEEE Transactions on
Medical Imaging, 29(2):559-569, 2010.
[26] D. F. Specht. Probabilistic neural networks. Neural Networks, 3:109-118, 1990.
[27] T. Stavngaard, S. B. Shaker, K. S. Bach, B. C. Stoel, and A. Dirksen.
Quantitative assessment of regional emphysema distribution in patients
with chronic obstructive pulmonary disease (COPD). Acta Radiologica,
9:914-921, 2006.
[28] T. M. Therneau and B. Atkinson. Recursive partitioning and regression
trees, 2002.
[29] R. Uppaluri, E. A. Hoffman, M. Sonka, P. G. Hartley, G. W. Hunninghake,
and G. McLennan. Computer recognition of regional lung disease patterns.
American Journal of Respiratory and Critical Care Medicine, 160(2):648-
654, 1999.
[30] R. Uppaluri, T. Mitsa, M. Sonka, E. A. Hoffman, and G. McLennan.
Quantification of pulmonary emphysema from lung computed tomogra-
phy images. American Journal of Respiratory and Critical Care Medicine,
156:248-254, 1997.
[31] W. N. Venables and B. D. Ripley. Main package of Venables and Ripley's
MASS, 2010.

[32] P. Wasserman. Advanced Methods in Neural Computing. Van Nostrand
Reinhold, New York, 1993.
[33] Y. Xu, M. Sonka, G. McLennan, J. Guo, and E. A. Hoffman. MDCT-
based 3-D texture classification of emphysema and early smoking related
lung pathologies. IEEE Transactions on Medical Imaging, 25(4):464-475, 2006.