Citation
Robust estimation of censored mixture models

Material Information

Title:
Robust estimation of censored mixture models
Creator:
Santos, A. M
Publication Date:
Language:
English
Physical Description:
xviii, 100 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Mixture distributions (Probability theory) ( lcsh )
Expectation-maximization algorithms ( lcsh )
Gaussian processes ( lcsh )
Expectation-maximization algorithms ( fast )
Gaussian processes ( fast )
Mixture distributions (Probability theory) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 96-100).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by A.M. Santos.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
761011274 ( OCLC )
ocn761011274
Classification:
LD1193.L622 2011d S36 ( lcc )

Full Text
ROBUST ESTIMATION OF CENSORED MIXTURE MODELS
by
A. M. Santos
B.S., University of North Texas, 2002
M.S., University of North Texas, 2005
M.S., University of Colorado Denver, 2007
M.A., University of Colorado Denver, 2009
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
2011


This thesis for the Doctor of Philosophy
degree by
A. M. Santos
has been approved
by
Weldon Lodwick
Gary Grunwald


Santos, A. M. (Ph.D., Applied Mathematics)
Robust Estimation of Censored Mixture Models
Thesis directed by Professor Karen Kafadar
ABSTRACT
We examine the robustness and efficiency of a variant of the EM-algorithm
for estimating parameters of mixtures of heavy-tailed distributions. Specifically,
we use the biweight in an EM-like algorithm to estimate location, scale, and
proportion of each component of a mixture of h-distributions. This family of
long-tailed distributions, indexed by a tail-length parameter (h=0 corresponds
to the Gaussian), provides a measure by which robustness and efficiency rela-
tive to the Gaussian can be assessed. We compare our results to those using the
conventional EM-algorithm (in the context of a model that assumes Gaussian
distributions, producing maximum likelihood estimates) and with Scott's L2E
(2001) approach, and find that our variant is more consistent and tends to be
less biased than the alternatives. Guided by these results, we then consider the
problem of censored mixtures. We modify EM to use the correct likelihood for
a censored mixture, extend a restricted likelihood model for censored data to
mixtures (REML), and modify it further to use the biweight (RREML). Our
modified EM and RREML methods consistently outperform the conventional
EM-algorithm on censored data. Finally, we describe the application that mo-


tivated this research and examine the results of our methods in a simulation
experiment designed to mimic the application.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.


DEDICATION
This dissertation is dedicated to my husband Neil, for all his patience during
this graduate school adventure.


ACKNOWLEDGMENT
I would like to thank David W. Scott, who not only answered my questions and
upgraded his L2E code for me, but did so cheerfully and quickly.
Most of all, I would like to thank my adviser, Karen Kafadar.


CONTENTS
Figures ............................................................... ix
Tables............................................................... xviii
Chapter
1. Introduction........................................................ 1
2. Robustly Fitting Mixture Models..................................... 3
2.1 Theory........................................................... 3
2.1.1 History of robustly fitting mixtures............................ 3
2.1.2 M-estimates of location and scale............................... 4
2.1.3 Fitting Mixtures................................................ 8
2.2 Using L2E to estimate mixtures................................... 11
2.3 Robust modification of EM ....................................... 12
2.4 The family of h-distributions.................................... 14
2.5 Simulation Results.............................................. 19
2.5.1 Testing for Good Estimates..................................... 19
2.5.2 Two-component Mixtures......................................... 21
2.5.3 Three-component Mixtures ...................................... 28
2.5.4 Five-component Mixtures........................................ 41
2.5.5 Measuring Overall Fit.......................................... 48
2.6 Conclusion.............................................. 51
3. Fitting Censored Mixture Models..................................... 55


3.1 History of fitting censored data.................................. 55
3.2 Methods for fitting Censored Mixtures............................. 58
3.2.1 Maximum Likelihood via EM..................................... 58
3.2.2 Restricted Maximum Likelihood................................... 60
3.3 Simulation Results................................................ 62
3.3.1 10% Censoring................................................. 63
3.3.2 25% Censoring................................................. 65
3.4 Conclusion........................................................ 66
4. Case Study: Images of Biological Cells.............................. 74
4.1 Problem Description.............................................. 74
4.2 Choosing good starting values for parameter estimates........... 75
4.3 Simulation....................................................... 80
4.3.1 Results ........................................................ 81
4.4 Fitting Real Data................................................. 86
4.5 Conclusion........................................................ 89
5. Conclusions and Future Work......................................... 90
Appendix
A. A Comment about Computing.......................................... 92
B. Acronym Reference.................................................. 95
References............................................................. 96


FIGURES
Figure
2.1 The ρ and ψ functions that describe the Least Squares estimate of
location........................................................... 5
2.2 ψ-function for the Biweight......................................... 7
2.3 QQ plots of h-distributions (solid line corresponds to the Gaussian,
h = 0)............................................................ 15
2.4 QQ plots of h = 0.2 versus t-distributions with a variety of degrees
of freedom........................................................ 16
2.5 QQ plots of several h-distributions versus t-distributions with similar
shape............................................................. 17
2.6 Boxplots of mean estimates less the true values for two-component
mixtures. Components are in separate rows, with the first component
on the bottom, h values increase from left to right, so that the
Gaussian case is on the left and the heaviest tails are on the right. . 22
2.7 The difference between the EM and EM with biweight mean esti-
mates for the first component of two-component mixture. The header
of each panel indicates the h value............................... 23
2.8 Boxplots of weight estimates less the true values for two-component
mixtures. Components are in separate rows, with the first component
on the bottom, h values increase from left to right, so that the
Gaussian case is on the left and the heaviest tails are on the right. . 24


2.9 The difference between the EM and EM with biweight weight esti-
mates for the first component of two-component mixture. The header
of each panel indicates the h value............................ 25
2.10 Boxplots of normalized IQR estimates for two-component mixtures.
Components are in separate rows, with the first component on the
bottom, h values increase from left to right, so that the Gaussian
case is on the left and the heaviest tails are on the right. 26
2.11 The difference between the EM and EM with S_bi IQR estimates for
the first component of two-component mixture. The header of each
panel indicates the h value............................... 27
2.12 Boxplots of the mean estimates less true values for first three-
component mixture. Components are in separate rows, with the
first component on the bottom, h values increase from left to right,
so that the Gaussian case is on the left and the heaviest tails are on
the right...................................................... 28
2.13 The difference between the EM and EM with biweight mean esti-
mates for the first component of first three-component mixture. The
header of each panel indicates the h value..................... 29
2.14 Boxplots of the weight estimates less true values for first three-
component mixture. Components are in separate rows, with the
first component on the bottom, h values increase from left to right,
so that the Gaussian case is on the left and the heaviest tails are on
the right...................................................... 30


2.15 The difference between the EM and EM with biweight weight esti-
mates for the first component of first three-component mixture. The
header of each panel indicates the h value................. 31
2.16 Boxplots of normed IQR estimates for first three-component mixture.
Components are in separate rows, with the first component on the
bottom, h values increase from left to right, so that the Gaussian
case is on the left and the heaviest tails are on the right. 32
2.17 The difference between the EM and EM with S_bi IQR estimates for
the first component of first three-component mixture. The header of
each panel indicates the h value........................... 33
2.18 Boxplots of mean estimates less true values for the three-component
mixture with equal weights and spreads. Components are in separate
rows, with the first component on the bottom, h values increase from
left to right, so that the Gaussian case is on the left and the heaviest
tails are on the right.......................................... 35
2.19 The difference between the EM and EM with biweight mean esti-
mates for the first component of the three-component mixture with
equal weights and spreads. The header of each panel indicates the h
value........................................................... 36
2.20 Boxplots of weight estimates less true values for the three-component
mixture with equal weights and spreads. Components are in separate
rows, with the first component on the bottom, h values increase from
left to right, so that the Gaussian case is on the left and the heaviest
tails are on the right.......................................... 37


2.21 The difference between the EM and EM with biweight weight esti-
mates for the first component of three-component mixture with equal
weights and spreads. The header of each panel indicates the h value. 38
2.22 Boxplots of normed IQR estimates for the three-component mixture
with equal weights and spreads. Components are in separate rows,
with the first component on the bottom, h values increase from left
to right, so that the Gaussian case is on the left and the heaviest
tails are on the right........................................ 39
2.23 The difference between the EM and EM with S_bi IQR estimates for
the first component of three-component mixture with equal weights
and spreads. The header of each panel indicates the h value... 40
2.24 Boxplots of mean estimates less true values for five-component mix-
ture. Components are in separate rows, with the first component on
the bottom, h values increase from left to right, so that the Gaussian
case is on the left and the heaviest tails are on the right. 42
2.25 The difference between the EM and EM with biweight mean esti-
mates for five-component mixtures. The header of each panel indi-
cates the h value.......................................... 43
2.26 Boxplots of weight estimates less true values for five-component mix-
ture. Components are in separate rows, with the first component on
the bottom, h values increase from left to right, so that the Gaussian
case is on the left and the heaviest tails are on the right. 44


2.27 The difference between the EM and EM with biweight weight esti-
mates for the first component of five-component mixture. The header
of each panel indicates the h value............................ 45
2.28 Boxplots of normalized IQR estimates for five-component mixture.
Components are in separate rows, with the first component on the
bottom, h values increase from left to right, so that the Gaussian
case is on the left and the heaviest tails are on the right. 46
2.29 The difference between the EM and EM with S_bi IQR estimates for
the first component of five-component mixture. The header of each
panel indicates the h value................................. 47
2.30 Density plots of two-component mixtures with differing h values.
From top to bottom, h = 0,0.05,0.1,0.2 and 0.3................. 49
2.31 Hellinger distances of fits of two-component mixtures, h values in-
crease from left to right, so that the Gaussian case is on the left and
the heaviest tails are on the right............................ 50
2.32 Hellinger distances of fits of three-component mixtures. Two dis-
tances from EMbiwt (h = 0, h = 0.2) were greater than 0.3 and were
omitted from the second plot, h values increase from left to right,
so that the Gaussian case is on the left and the heaviest tails are on
the right...................................................... 52


2.33 Hellinger distances of fits of three-component mixtures with equal
spreads and weights. Distances greater than 0.3 were omitted from
the second plot: 6 for EM, 2 each for EMbiwt and EMSbi, and 7 for
L2E. h values increase from left to right, so that the Gaussian case
is on the left and the heaviest tails are on the right........... 53
2.34 Hellinger distances of fits of five-component mixtures. Distances
greater than 0.3 were omitted from the second plot: 15 for EM,
20 for EMbiwt, 4 for EMSbi, and 15 for L2E. h values increase from
left to right, so that the Gaussian case is on the left and the heaviest
tails are on the right.............................................. 54
3.1 Location estimates for all five h values, 10% censoring from each end,
50 simulation runs. Components are in separate rows, with the first
component on the bottom, h values increase from left to right, so
that the Gaussian case is on the left and the heaviest tails are on the
right............................................................... 64
3.2 Weight estimates for all five h values, 10% censoring from each end,
50 simulation runs. Components are in separate rows, with the first
component on the bottom, h values increase from left to right, so
that the Gaussian case is on the left and the heaviest tails are on the
right............................................................... 65


3.3 IQR estimates for all five h values, 10% censoring from each end,
50 simulation runs. Components are in separate rows, with the first
component on the bottom, h values increase from left to right, so
that the Gaussian case is on the left and the heaviest tails are on the
right.............................................................. 66
3.4 Location estimates for h = 0 and 0.2, 10% censoring from each end,
1000 simulation runs. Components are in separate rows, with the
first component on the bottom, h values increase from left to right,
so that the Gaussian case is on the left and the heaviest tails are on
the right............................................................. 67
3.5 Weight estimates for h = 0 and 0.2, 10% censoring from each end,
1000 simulation runs. Components are in separate rows, with the
first component on the bottom, h values increase from left to right,
so that the Gaussian case is on the left and the heaviest tails are on
the right............................................................. 68
3.6 IQR estimates for h = 0 and 0.2, 10% censoring from each end, 1000
simulation runs. Components are in separate rows, with the first
component on the bottom, h values increase from left to right, so
that the Gaussian case is on the left and the heaviest tails are on the
right................................................................. 69


3.7 Location estimates for h=0 and 0.2, 25% censoring from each end,
1000 simulation runs. Components are in separate rows, with the
first component on the bottom, h values increase from left to right,
so that the Gaussian case is on the left and the heaviest tails are on
the right.......................................................... 70
3.8 Weight estimates for h = 0 and 0.2, 25% censoring from each end,
1000 simulation runs. Components are in separate rows, with the
first component on the bottom, h values increase from left to right,
so that the Gaussian case is on the left and the heaviest tails are on
the right.......................................................... 71
3.9 IQR estimates for h=0 and 0.2, 25% censoring from each end, 1000
simulation runs. Components are in separate rows, with the first
component on the bottom, h values increase from left to right, so
that the Gaussian case is on the left and the heaviest tails are on the
right.............................................................. 72
3.10 IQR estimates for just the EM4C and RREML methods. Compo-
nents are in separate rows, with the first component on the bottom.
h values increase from left to right, so that the Gaussian case is on
the left and the heaviest tails are on the right................... 73
4.1 An example image of a slide of biological cells..................... 75
4.2 Example histogram of slide image pixel values....................... 76
4.3 A Gaussian Q-Q plot of the values from an image of a biological slide. 77
4.4 A Gaussian Q-Q plot of the uncensored values only from an image
of a biological slide.............................................. 78


4.5 Individual Gaussian Q-Q plots for the three sections of values in a
biological slide dataset........................................ 79
4.6 Boxplots of mean estimates less true values for data approximating
slide images. Components are in separate rows, with the first com-
ponent on the bottom, h values increase from left to right, so that
the Gaussian case is on the left and the heaviest tails are on the right. 81
4.7 Boxplots of weight estimates less true values for data approximating
slide images. Components are in separate rows, with the first com-
ponent on the bottom, h values increase from left to right, so that
the Gaussian case is on the left and the heaviest tails are on the right. 82
4.8 Boxplots of normed IQR estimates for data approximating slide im-
ages. The second figure has the extreme values removed to facilitate
comparing the bulk of the estimates. Components are in separate
rows, with the first component on the bottom, h values increase
from left to right, so that the Gaussian case is on the left and the
heaviest tails are on the right................................. 84
4.9 Scatterplots of the weight, mean, and IQR estimates for the 34 in-
stances where the third component's parameters included near-zero
weights......................................................... 85


TABLES
Table
2.1 True Parameters of Component Distributions....................... 20
2.2 True IQR values for all h and scale values used in our simulations . 21
2.3 Hellinger distance from the Gaussian case for two-component mix-
tures with differing h values...................................... 48
3.1 True parameter values of mixture components................... . . 63
4.1 True values for simulating data approximating slide images....... 80
4.2 Estimated distributions for 3 real slide images............... 86
4.3 Summary of simulation results on all combinations of tail weights.
ND indicates no difference between methods in the results. Based
on 10 runs only, with 3 components of weights 85%, 10%, 5%; means
275, 1300, 3500; SDs 20, 600, 370 (reflecting characteristics of slide
3 of real data; cf Table 4.2, p.86.)............................ 88


1. Introduction
Mixture models, commonly used in clustering and classification, attempt to
identify the underlying components that are present in a single dataset. Most
methods assume that these underlying components are Gaussian distributions.
Some methods allow for other specified parametric forms, but here we consider
the case where the distributions of the components may be close to Gaussian
but possibly heavier-tailed. Robust methods aim to achieve reasonably reliable
results not only when distributional assumptions are satisfied, but also when
they are not. Our goal is to find an estimation method that, while still making
the assumption of Gaussian components, performs well even when the data
deviate from this ideal. By using robust measures of location and scale, we
create a robust algorithm for estimating mixture models.
We assume that the data come not from a single population but possibly
multiple populations with unknown frequency; i.e., the pdf of our population is
$f(x; \theta) = \sum_{i=1}^{k} \pi_i f_i(x; \theta_i)$, where $\theta_i$ denotes the parameters of the $i$th pdf, $f_i$, and
$\pi_i$ is the proportion of the population composed of population $i$. We assume for
simplicity that the $f_i$ are the same shape (Gaussian) though the parameters $\theta_i$ will
differ.
The motivation for this research arose from an imaging study of tumors
[37]. The goal of that study was the classification of pixels into tumor cell
interior, tumor cell wall, and non-tumor regions to attain a reliable measure of
tumor volume. Pixel densities are measured via gray scales and so are discrete
integers (usually between 0 and 4095). Thus, rounding of density values as
well as censoring (at each end of the grey scale) affect the measurements and
hence potentially the validity of mixture estimation methods assuming Gaussian
distributions. Further, the distributions of these components appeared to have
longer tails than the Gaussian. For all these reasons (long tails, granularity, and
censoring), more robust methods for mixtures are needed for this and similar
studies.
In Chapter 2 we discuss the history of robust mixture estimation, modify the
Expectation-Maximization (EM) algorithm to make use of robust M-estimates,
and compare the performance of EM, L2E [44], and our robust EM algorithm.
Chapter 3 covers the history of censored data, our extensions of both MLE
and restricted maximum likelihood methods to the case of mixture models, and
compares the performance of these new methods with EM. Chapter 4 returns
to our motivating example, where we examine both a simulation experiment
mimicking that data and the estimates from fitting real data. We summarize our
conclusions and propose further work in Chapter 5.


2. Robustly Fitting Mixture Models
2.1 Theory
2.1.1 History of robustly fitting mixtures
Fitting mixtures started with Pearson's method of moments approach [32].
Cohen [8] found an iterative method that more efficiently produced the same
results as Pearson's solution. Tan and Chang [47] showed that maximum likeli-
hood (ML) results were generally better than moment-based methods. Concerns
about the maximum likelihood approach include the potential for an unbounded
likelihood which may produce inconsistent estimates. Quandt and Ramsey [38]
suggest a method based on the moment generating function to avoid estima-
tion problems with the likelihood, and show that it compares favorably to the
method of moments. But their method has a tendency to fail for computational
reasons. However, in his rebuttal to their paper, Bryant [4] cites Wolfe's work
[49] which addressed the issue of unbounded likelihood functions for unknown
parameters and found it to be extremely unlikely in practice. Moreover, maxi-
mum likelihood estimates for mixture models can be computed fairly easily by
the Expectation Maximization (EM) algorithm [10] and appear to perform best
for fitting Gaussian mixtures.
Despite the advantages over method of moments estimates, maximum likeli-
hood estimates are well known to be extremely sensitive to departures from the
assumed model and outliers in the data [26]. Campbell [5] shows that replac-
ing the usual maximum likelihood estimates in the EM procedure with robust
M-estimates using a Hampel ψ-function successfully protects against outliers.
Campbell's small simulation experiment (50 runs) considers cases of only 2 com-
ponents, each of which contains only 50 observations contaminated with only one
or two atypically large values. Despite the much smaller sample size (50) than
we consider here (10,000), that paper encourages us to consider M-estimates in
our modification. De Veaux and Krieger [9] demonstrate EM's sensitivity to
outliers and propose using t distributions in the E step and using the median
and (scaled) median absolute deviation from the median to introduce robustness
into the estimates.
Many people have used mixture models to fit non-normal components: e.g.,
exponentials [22], gammas [27], and binomial-Poisson mixtures [48]. See [31] for
more examples. Peel and McLachlan [35] suggest fitting t distributions instead
of Gaussians in order to achieve more robust results. If the components are
known to have heavier tails than the Gaussian, this method is perhaps the best
or most natural one to use. Still, Gaussian distributions are most familiar and
unless there is strong reason to suspect heavier tails, users are inclined to move
forward with the Gaussian assumption. For this reason, as well as for reasons
of computational efficiency, we prefer to develop an algorithm that provides
robust estimates of mixture parameters under the hopeful, but likely unrealistic,
Gaussian assumption.
2.1.2 M-estimates of location and scale
In this section we review M-estimates of location and scale; readers familiar
with this background may wish to skip to Section 2.1.3. Because MLEs are not
robust against even slight departures from model distributional assumptions, we
plan to modify the basic mixture-fitting algorithm using M-estimates.
The M-estimate $T_n(x_1, \ldots, x_n)$ for the function $\rho$ and the sample $x_1, \ldots, x_n$
from a distribution $F$ is the value of $T$ that minimizes the objective function
$\sum_{i=1}^{n} \rho(x_i; T)$ [25]:
$$\hat{\theta} = \arg\min_T \sum_{i=1}^{n} \rho(x_i; T)$$
If $\psi$, the derivative of $\rho$, exists almost everywhere, we may calculate $\hat{\theta}$ by
solving $\sum_{i=1}^{n} \psi(x_i; T) = 0$ for $T$. For example, in the case of Least Squares,
$\rho(u) = u^2$ and $\psi(u) \propto u$, so $\sum_{i=1}^{n} (x_i - T) = 0 \Rightarrow T = \bar{x}$.
Figure 2.1: The ρ and ψ functions that describe the Least Squares estimate
of location
An important concept for measuring robustness is the influence curve [18,
19]. It describes the effect on the estimate $T$ if a point contamination $\delta_x$ at $x$
with probability mass $\varepsilon$ is introduced into the data distribution, $F$. The
contaminated distribution is then $(1-\varepsilon)F + \varepsilon\delta_x$. To examine the proportional
change in the estimate, we divide by $\varepsilon$. The influence curve is then defined as
$$IC(x; F, T) = \lim_{\varepsilon \to 0} \frac{T\big((1-\varepsilon)F + \varepsilon\delta_x\big) - T(F)}{\varepsilon}$$
if the limit exists. Given a statistic and an underlying distribution, the influence
curve tells us how a small amount of contamination in the sample affects the
estimate given a large sample. For the sample mean, $IC(x; F, \bar{x}) = x$, so the in-
fluence of any point $x$, including $x \to \infty$, is unbounded. Clearly this unbounded
influence effect on $\bar{x}$ is highly undesirable.
For M-estimates with odd ψ-functions, assuming a symmetric $F$, $IC(x; F, T) \propto
\psi(x; T)$ and the shape of the influence curve is the same as the shape of the ψ-
function. If we find an influence curve that is highly resistant to contamination
(e.g., is bounded, insensitive to rounding, etc.), we can find its corresponding
ψ-function, which defines a statistic with these resistant properties [23]. This
method can be used to create robust estimates in the sense that they limit the
influence of an outlying $x$.
One way to calculate M-estimates is via a nonlinear root-finding search
algorithm for the equation $\sum_{i=1}^{n} \psi(u_i) = 0$, where $u_i = x_i - T$ and $T$ is a
location parameter. Alternatively, defining the function $w(\cdot)$ via $\psi(u) = u \cdot w(u)$
enables a direct iteration algorithm for calculating $T$:
$$T = \frac{\sum_{i=1}^{n} w(u_i)\, x_i}{\sum_{i=1}^{n} w(u_i)},$$
showing that $T$ is a weighted average of the data. Hence $w(\cdot)$ is called the
weight function, and $T$ is calculated as an iteratively re-weighted mean
(the weights $w(u_i)$ change with each updated iteration of $T$).
Figure 2.2: ψ-function for the Biweight
In particular, the biweight [2] is one estimate engineered by choice of influ-
ence curve and ψ-function to be robust. An M-estimator of location, it is the
solution $T_n$ of $\sum_i \psi(u_i) = 0$, where
$$\psi(u) = \begin{cases} u(1-u^2)^2 & |u| \le 1 \\ 0 & |u| > 1 \end{cases}, \qquad u_i = \frac{x_i - T_n}{c\,S_n}, \tag{2.1}$$
$c$ is a tuning constant, and $S_n$ is an estimator of scale, often taken to be the me-
dian absolute deviation from the median (MAD), scaled so that $1.5 \cdot \mathrm{MAD} \approx \sigma$
if the underlying population is Gaussian. Because the weights trail off to zero
for sufficiently large u (beyond c multiples of the estimated standard deviation,
Sn, of the sample), extreme outliers do not affect the biweight. Moreover, since
$\psi(u) \approx u$ for small $u$, the biweight performs well with distributions that are
Gaussian in the middle [23]. The smoothness of $\psi$ (and the corresponding
influence curve) guarantees a finite bound on the estimate under slight pertur-
bations of the data and its symmetry gives it a breakdown bound of 50% or
up to the breakdown of Sn [23]; up to 50% of the data could take on arbitrary
values without greatly changing the value of the estimate.
The biweight as an estimate of location may be calculated using the di-
rect iteration method as follows. Let $I_A(u)$ denote the indicator function, i.e.,
$I_A(u) = 1$ if $u \in A$, else zero. Let $M_0$ be a robust estimate of location, usu-
ally the median. Let $S$ be a robust estimate of scale, usually MAD, and set a
tolerance criterion $\varepsilon > 0$.
1. Set $u_i = \frac{x_i - M_0}{cS}$
2. $w_i = (1 - u_i^2)^2\, I_{[-1,1]}(u_i)$
3. $M_1 = \frac{\sum_i w_i x_i}{\sum_i w_i}$
4. If $|M_0 - M_1| > \varepsilon$, set $M_0 = M_1$ and repeat steps 1-3; else $T = M_1$.
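In modern software this direct iteration is only a few lines. The following Python sketch implements steps 1-4 with the median and MAD as starting values; the function name and the tuning constant c = 6 are illustrative assumptions, since the text does not fix a value of c here.

```python
import numpy as np

def biweight_location(x, c=6.0, eps=1e-6, max_iter=100):
    """Direct-iteration biweight estimate of location (steps 1-4 above).
    c = 6.0 with MAD scaling is a common tuning choice, assumed for illustration."""
    x = np.asarray(x, dtype=float)
    M0 = np.median(x)                          # robust start: the median
    S = np.median(np.abs(x - M0))              # robust scale: the MAD
    if S == 0:
        return M0
    for _ in range(max_iter):
        u = (x - M0) / (c * S)                 # step 1
        w = (1 - u**2)**2 * (np.abs(u) <= 1)   # step 2: biweight weights
        M1 = np.sum(w * x) / np.sum(w)         # step 3: weighted mean
        if abs(M0 - M1) <= eps:                # step 4: convergence test
            return M1
        M0 = M1
    return M0

# Example: the mean is pulled far from 10 by three gross outliers; the biweight is not.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 2, 1000), [1e4, -1e4, 5e3]])
print(np.mean(x), biweight_location(x))
```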
2.1.3 Fitting Mixtures
Expectation Maximization (EM) has been the standard method of estimat-
ing mixture models for much of the last 30 years [10, 31]. It is fairly simple to
implement, not terribly sensitive to starting values, and in its general form can
be used to fit any parametric model, not just a mixture model.


The general EM algorithm for estimating a set of parameters, 9, as described
in [15], proceeds as follows:
1. Choose an initial estimate of $\theta$, $\theta^{(0)}$
2. For $t = 1, 2, \ldots$
(a) Expectation: Calculate $Q(\theta \mid \theta^{(t-1)}) = E[\log L(\theta; x) \mid x, \theta^{(t-1)}]$
(b) Maximization: $\theta^{(t)} = \arg\max_\theta Q(\theta \mid \theta^{(t-1)})$
The sequence of estimates, $\{\theta^{(t)}\}$, converges to the maximum likelihood
estimate [10]. Because the rate of convergence is linear, the number of iterations
can be large, especially with starting values far from the solution.
For a two-component normal mixture model, the likelihood for a single ob-
servation, x, is
$$f(x) = (1-\pi)\,\phi_{\theta_1}(x) + \pi\,\phi_{\theta_2}(x),$$
where $\phi_{\theta_i}$ denotes the normal density with parameters $\theta_i = (\mu_i, \sigma_i^2)$. Then $\theta =
(\theta_1, \theta_2, \pi)$ and the log-likelihood for a sample of size $N$ is (cf. [20]):
$$\ell(\theta; X) = \sum_{i=1}^{N} \log\big[(1-\pi)\,\phi_{\theta_1}(x_i) + \pi\,\phi_{\theta_2}(x_i)\big],$$
where $X = (x_1, \ldots, x_N)$. When the likelihoods for each observation are multi-
plied together, and then logged, the sum of terms inside the logarithm makes
it difficult to maximize this log-likelihood directly. To solve for $\theta$ via EM, we
consider unobserved latent variables $\Delta_i$ such that $\Delta_i = 0$ if $x_i$ comes from the
first component, $\Delta_i = 1$ if $x_i$ comes from the second component. The new
formulation of the log-likelihood is
$$\ell(\theta; X, \Delta) = \sum_{i=1}^{N} \big[(1-\Delta_i)\log f_1(x_i) + \Delta_i \log f_2(x_i)\big] + \sum_{i=1}^{N} \big[(1-\Delta_i)\log(1-\pi) + \Delta_i \log \pi\big]$$
If we knew the values of $\Delta_i$, we would proceed in the usual fashion by esti-
mating $\mu_j$ and $\sigma_j^2$ as the mean and sample variance of observations in component
$j$, and estimate $\pi$, the proportion of observations in the second component, as
$\hat{\pi} = \sum_i \Delta_i / N$. In the absence of knowledge of the values of $\Delta_i$, in each iteration of
the EM procedure, we compute the expectation of each $\Delta_i$:
$$\gamma_i(\theta) = E(\Delta_i \mid \theta, X) = \Pr(\Delta_i = 1 \mid \theta, X)$$
The EM algorithm then proceeds as follows. Let $f_j(\cdot)$ denote the pdf of a
Gaussian with mean $\mu_j$ and variance $\sigma_j^2$. Choose initial estimates of these five
parameters (based on expert judgement or on previous data). Using these initial
estimates of $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$, $\hat{\pi}$:
1. Expectation:
Calculate the estimated values of $\gamma_i$, $i = 1, \ldots, N$, called the EM-weights or
responsibilities:
$$\hat{\gamma}_i = \frac{\hat{\pi} f_2(x_i)}{(1-\hat{\pi}) f_1(x_i) + \hat{\pi} f_2(x_i)}, \qquad i = 1, \ldots, N$$
2. Maximization:
Using $\hat{\gamma}_i$, estimate $\theta$ as the MLEs:
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1-\hat{\gamma}_i)\,x_i}{\sum_{i=1}^{N} (1-\hat{\gamma}_i)}, \qquad \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1-\hat{\gamma}_i)(x_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1-\hat{\gamma}_i)},$$
$$\hat{\mu}_2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i\,x_i}{\sum_{i=1}^{N} \hat{\gamma}_i}, \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i (x_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \hat{\gamma}_i},$$
and $\hat{\pi} = \sum_{i=1}^{N} \hat{\gamma}_i / N$.
3. Repeat 1. and 2. until $\max_j\big(|\hat{\mu}_j^{(n)} - \hat{\mu}_j^{(n-1)}|,\, |\hat{\sigma}_j^{(n)} - \hat{\sigma}_j^{(n-1)}|\big) < \varepsilon$.
This algorithm can be extended easily to $k$ components by setting $\Delta_{ij} = 1$ if
$x_i$ belongs to component $j$ ($\Delta_{ik} = 1 - \sum_{j=1}^{k-1} \Delta_{ij}$). The process stays almost the
same, apart from more components to compute at each stage of the iteration:
1. Expectation:
$$\hat{\gamma}_{ij} = \frac{\hat{\pi}_j f_j(x_i)}{\sum_{l=1}^{k} \hat{\pi}_l f_l(x_i)}$$
2. Maximization:
$$\hat{\mu}_j = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ij}\,x_i}{\sum_{i=1}^{N} \hat{\gamma}_{ij}}, \qquad \hat{\sigma}_j^2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ij}(x_i - \hat{\mu}_j)^2}{\sum_{i=1}^{N} \hat{\gamma}_{ij}}, \qquad \hat{\pi}_j = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ij}}{N}$$
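As a concrete illustration of these E and M steps, here is a minimal Python sketch of the k-component updates. The array layout, starting values, and stopping rule are our own choices for illustration, not the thesis's implementation.

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, mu, sigma, pi, tol=1e-6, max_iter=500):
    """Plain EM for a k-component Gaussian mixture (E and M steps above).
    mu, sigma, pi are length-k arrays of starting values, assumed given."""
    x = np.asarray(x, dtype=float)
    mu, sigma, pi = (np.asarray(v, dtype=float) for v in (mu, sigma, pi))
    for _ in range(max_iter):
        # E step: gamma[i, j] = pi_j f_j(x_i) / sum_l pi_l f_l(x_i)
        dens = pi * norm.pdf(x[:, None], mu, sigma)       # N x k densities
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: EM-weighted means, variances, and mixing proportions
        nj = gamma.sum(axis=0)
        mu_new = (gamma * x[:, None]).sum(axis=0) / nj
        sigma_new = np.sqrt((gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / nj)
        pi_new = nj / len(x)
        done = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max()) < tol
        mu, sigma, pi = mu_new, sigma_new, pi_new
        if done:
            break
    return mu, sigma, pi
```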
2.2 Using L2E to estimate mixtures
Scott [44] proposed a method for fitting parametric models based on mean
integrated squared error and used it to fit parameters for various models, in-
cluding mixtures.
If $f_\theta$, with parameters $\theta$, describes the model you wish to fit, let $f$ denote the
true density, and consider minimizing the mean integrated squared error with
respect to $\theta$:
$$\hat{\theta} = \arg\min_\theta \int \big[f_\theta(x) - f(x)\big]^2 dx = \arg\min_\theta \left[\int f_\theta(x)^2\,dx - 2\int f_\theta(x) f(x)\,dx + \int f(x)^2\,dx\right]$$
Then because the third term does not include $\theta$ and will have no impact on the
arg min, it can be removed. Assuming we have chosen a functional form for $f_\theta$,
and estimating $\int f_\theta(x) f(x)\,dx = E[f_\theta(X)]$ by its sample average, this leads to:
$$\hat{\theta}_{L2E} = \arg\min_\theta \left[\int f_\theta^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n} f_\theta(x_i)\right]$$
Because it maximizes (minimizes the negative of) a sum of likelihoods in-
stead of the product, L2E is more robust to outliers. Conveniently, L2E pa-
rameters are consistent and asymptotically normal. Under the true model, L2E
estimates are less efficient than MLEs (larger asymptotic variance), but they
also tend to be more robust [45]. In particular, the influence curve of L2E
estimates has a shape that is similar to that of the biweight, which has been
demonstrated to be robust for many situations. This similarity makes L2E a
particularly appropriate choice for comparison with our robust modification of
EM in this study.
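To make the criterion concrete, the following Python sketch fits a single Gaussian by L2E, using the closed form $\int \phi(x; \mu, \sigma)^2\,dx = 1/(2\sigma\sqrt{\pi})$. This is a one-component illustration of the idea only, not Scott's mixture implementation.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def l2e_normal(x):
    """L2E fit of one Gaussian: minimize int f_theta^2 - (2/n) sum_i f_theta(x_i)."""
    x = np.asarray(x, dtype=float)
    def crit(par):
        mu, log_sigma = par
        sigma = np.exp(log_sigma)              # keeps sigma positive
        return 1 / (2 * sigma * np.sqrt(np.pi)) - 2 * norm.pdf(x, mu, sigma).mean()
    res = minimize(crit, x0=[np.median(x), np.log(x.std())], method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])          # (mu, sigma)
```

Because the criterion sums density values rather than log-densities, a far outlier contributes essentially zero instead of an unboundedly negative log term, which is the source of the robustness noted above.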
2.3 Robust modification of EM
Because MLE and L2E are model-dependent, we consider an approach that
uses biweights in an attempt to be robust to model misspecification. Our EM
with biweight is calculated as follows. We first execute one iteration of standard
EM to get initial estimates for the EM-weights, means and standard deviations.
(We start with a single iteration of EM for convenience and to give all methods
equal starting values, but for a truly robust method, robust measures should be
used to generate starting values.) Then for j = 1,2,... ,k:
1. Expectation:
$$\hat{\gamma}_{ij} = \frac{\hat{\pi}_j f_j(x_i)}{\sum_{l=1}^{k} \hat{\pi}_l f_l(x_i)} \tag{2.2}$$
2. Maximization: For each component, let the estimate of the mean, $\hat{\mu}_j$, be
a biweight with starting values $M = \hat{\mu}_{j,MLE}$, $S = \hat{\sigma}_{j,MLE}$ and modified
weights to include the EM-weights: $w_{ij} = \hat{\gamma}_{ij}(1 - u_i^2)^2 I_{[-1,1]}(u_i)$,
$$\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{N} w_{ij}(x_i - \hat{\mu}_j)^2}{\sum_{i=1}^{N} w_{ij}}, \qquad \hat{\pi}_j = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ij}}{N} \tag{2.3}$$
(2.3)
Using the biweight as our measure of location, the literature ([23, 29]) in-
cludes some related measures of scale to replace the non-robust standard devi-
ation. We consider the two respective scale estimates, modified to include the
EM-weights, $\hat{\gamma}_{ij}$:
$$S_{bi,j} = \frac{cS \left(\sum_i \hat{\gamma}_{ij}\right)^{1/2} \left(\sum_i \hat{\gamma}_{ij}\,\psi(u_i)^2\right)^{1/2}}{\left|\sum_i \hat{\gamma}_{ij}\,\psi'(u_i)\right|} \tag{2.4}$$
from Hoaglin et al. [23] and
$$A_j = \frac{\left(\sum_i \hat{\gamma}_{ij}\right)^{1/2} \left(\sum_i \hat{\gamma}_{ij}(x_i - M_{bi})^2 (1 - u_i^2)^4\right)^{1/2}}{\left|\sum_i \hat{\gamma}_{ij}(1 - u_i^2)(1 - 5u_i^2)\right|} \tag{2.5}$$
due to Lax [29].
Computing these measures at each EM iteration for each component added
a noticeable amount of time to the simulations. Additionally, in their definitions,
they are not updated with each iteration of the biweight calculation, but are
instead defined using the final weights. For these reasons, we compute them
only once for each estimate, at the end of the iterative procedure.
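A sketch of how the EM-weighted scale of Equation (2.5) might be computed once per component after the biweight iterations converge follows; the function and argument names are illustrative, and c plays the same role as in the location step.

```python
import numpy as np

def biweight_scale_A(x, gamma, M, S, c=6.0):
    """EM-weighted version of Lax's A-estimate of scale, Equation (2.5).
    x: data; gamma: final EM-weights for this component;
    M: biweight location; S: robust scale (e.g. MAD); c: tuning constant (assumed)."""
    u = (x - M) / (c * S)
    inside = np.abs(u) < 1                     # terms vanish for |u| >= 1
    g, u2 = gamma[inside], u[inside] ** 2
    num = np.sqrt(gamma.sum() * np.sum(g * (x[inside] - M) ** 2 * (1 - u2) ** 4))
    den = np.abs(np.sum(g * (1 - u2) * (1 - 5 * u2)))
    return num / den
```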
2.4 The family of h-distributions
To evaluate the performance of the biweight-EM on mixtures of non-
Gaussian distributions we simulate components from Tukey's h-distribution fam-
ily of heavy-tailed distributions [24]. The parameter $h$ measures the heaviness
of the tails; $h = 0$ corresponds to a Gaussian distribution. The tails can be-
come quite heavy; $h = 0.975$ closely approximates the Cauchy distribution. See
Figure 2.3 for quantile-quantile (QQ) plots demonstrating the effect of $h$ on tail
heaviness. An $h$ value of 0.3 produces very heavy tails, giving a 0.95 quantile 1.5
times that of the Gaussian, while still having the appearance of a bell-shaped
curve; $h < 0$ produces distributions with tails thinner than the Gaussian. With
the addition of $a$, a location parameter, and $b$, a scale parameter, members of
the h-distribution family are defined as follows:
$$Z \sim N(0, 1), \qquad Y_h(Z) = Z e^{hZ^2/2}, \qquad X(h, a, b) = a + b \times Y_h(Z)$$
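The definition translates directly into a sampler; a minimal Python sketch (the function name and generator defaults are our choices):

```python
import numpy as np

def rh(n, h, a=0.0, b=1.0, rng=None):
    """Draw n samples from Tukey's h-distribution X(h, a, b) = a + b * Z * exp(h Z^2 / 2)."""
    rng = rng if rng is not None else np.random.default_rng()
    z = rng.standard_normal(n)
    return a + b * z * np.exp(h * z**2 / 2)
```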
Headrick et al. [21] derived the pdf and cdf of the h-distribution as follows.
The pdf and cdf of a standard Gaussian, Z, are:
$$f_Z(z) = (2\pi)^{-1/2} \exp(-z^2/2)$$
$$F_Z(z) = P(Z \le z) = \int_{-\infty}^{z} (2\pi)^{-1/2} \exp(-w^2/2)\,dw, \qquad -\infty < z < +\infty.$$


Figure 2.3: QQ plots of h-distributions versus expected Gaussian quantiles
(solid line corresponds to the Gaussian, h = 0; curves from bottom to top:
h = 0.0, 0.1, 0.2, ..., 0.5)
Then let $z = (x, y)$ be an auxiliary variable for mapping the parametric
curves of the pdf and cdf as
$$f: z \to \mathbb{R}^2 := f_Z(z) = f_Z(x, y) = \big(z,\, f_Z(z)\big)$$
$$F: z \to \mathbb{R}^2 := F_Z(z) = F_Z(x, y) = \big(z,\, F_Z(z)\big).$$
The quantile function for h-distributions is
$$q(z) = z \exp(hz^2/2)$$
with derivative
$$q'(z) = \exp(hz^2/2)(1 + hz^2).$$
Let the compositions $f \circ q$ and $F \circ q$ map the parametric curves of $f_{q(z)}(q(z))$
and $F_{q(z)}(q(z))$, where $q(z) = q(x, y)$; then the pdf and cdf of the h-distribution
are
$$f \circ q: q(z) \to \mathbb{R}^2 := f_{q(z)}(q(z)) = f_{q(z)}(q(x, y)) = \big(q(z),\, f_Z(z)/q'(z)\big) \tag{2.6}$$
$$F \circ q: q(z) \to \mathbb{R}^2 := F_{q(z)}(q(z)) = F_{q(z)}(q(x, y)) = \big(q(z),\, F_Z(z)\big) \tag{2.7}$$
Figure 2.4: QQ plots of h = 0.2 versus t-distributions with a variety of degrees
of freedom (panels: t(2), t(3), t(4), t(5)).
Figure 2.5: QQ plots of several h-distributions versus t-distributions with
similar shape (panels include h = 0.05 vs. t(19) and h = 0.1 vs. t(9)).
While h-distributions may not be the most widely known family of heavy-
tailed distributions, we prefer to use them in this application because they in-
clude the Gaussian as a special case, and because the h parameter is continuous,
allowing us to move smoothly away from the assumption of normality. Although
Student's t distributions are defined for fractional degrees of freedom, the density
with fractional degrees of freedom is awkward to compute. Double expo-
nentials and other generalized Laplacian distributions are much more peaked
than the Gaussian, not allowing us to consider the behavior of the algorithm
when the distributions are nearly Gaussian. Another advantage of Tukey's h-
distribution is that its quantiles are easily calculated from the quantiles of Z,
a standard Gaussian, e.g., $x_p = a + b\,z_p e^{h z_p^2/2}$. The h-distributions have been
used in applications where the data appear nearly Gaussian, but are known to
be heavy-tailed [11, 28].
As Student's t distribution is the best known family of heavy-tailed distri-
butions, we look briefly at how it compares to the h-distribution over the range
of values we will be using. Since $h = 0$ and $t$ with $\infty$ degrees of freedom correspond
to the Gaussian, and $h = 1$ and $t$ with 1 degree of freedom correspond to the
Cauchy, the relationship should be very nearly $1/h$ degrees of freedom to find the
corresponding $t$ distribution. Figure 2.4 explores this relationship for $h = 0.2$
and it appears that $1/h - 1$ is a better fit. Figure 2.5 shows that this concordance
works fairly well for $h < 0.3$, but that 1 is too large of a correcting factor for
$h = 0.3$.
We consider how the interquartile range (IQR) of an h-distribution changes
as $h$ increases. Since we can easily calculate the quantiles of an h-distribution,
we can find an equation for the IQR as a function of the standard Gaussian IQR,
$IQR_Z$, and $h$:
$$IQR_h = b \cdot IQR_Z \cdot \exp\!\left(\frac{h \cdot IQR_Z^2}{8}\right)$$
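The formula can be checked numerically against Table 2.2; for instance, with b = 200 (a short Python verification):

```python
import numpy as np
from scipy.stats import norm

iqr_z = norm.ppf(0.75) - norm.ppf(0.25)        # 1.349, the standard Gaussian IQR
for h in [0, 0.05, 0.1, 0.2, 0.3]:
    print(h, 200 * iqr_z * np.exp(h * iqr_z**2 / 8))
# h = 0.05 gives 272.89, matching the b = 200 column of Table 2.2
```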


2.5 Simulation Results
For this experiment, we compared three methods: regular EM, our EM
augmented with biweight measures (EM-bi), and L2E. Table 2.1 presents the
parameters of the four mixtures in this study. We considered mixtures of k = 2,
3 and 5 components, and h values of 0, 0.05, 0.1, 0.2, 0.3. For each mixture and
h-value combination we generated 1,000 samples of size 10,000 from the mixture
and fit them with each of the three methods.
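For concreteness, one way such a sample might be drawn (a sketch assuming the two-component parameters of Table 2.1 with h = 0.2; the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
weights, means, scales, h = [0.444, 0.556], [1000, 3000], [200, 300], 0.2
comp = rng.choice(len(weights), size=10_000, p=weights)   # component labels
z = rng.standard_normal(10_000)
x = np.array(means)[comp] + np.array(scales)[comp] * z * np.exp(h * z**2 / 2)
```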
2.5.1 Testing for Good Estimates
In each case, the means of the components and their respective weights in
the mixture are unchanged by increasing the tail length. A good method should
be able to find accurate estimates of the means for any value of h. Each method
is fitting Gaussian distributions to the components, including an estimate of
the standard deviation as a measure of scale. Since the true underlying data
consist of h-distributions, which do not have a directly comparable scale estimate
to the standard deviation, a comparison of scale estimates in terms of standard
deviation is not appropriate. We use IQR instead as a comparison scale measure
for assessing accuracy of scale estimates. It seems reasonable that a Gaussian
component that is fitting an h component should have the same IQR as the h
component. Because we can compute the quantiles of the h distribution, we can
directly compute the true IQR for each h component. See Table 2.2 for the IQR
values for each h and scale value combination we used. (Table 2.2 also includes
the factor that the standard deviation is multiplied by to get the IQR by h value
Two Components Mean Weight Std. Dev.
1000 0.444 200
3000 0.556 300
Three Components
300 0.333 200
1500 0.222 400
3000 0.444 300
Three Components, Equal Scales and Weights
300 0.333 300
1500 0.333 300
3000 0.333 300
Five Components
300 0.200 200
1500 0.150 400
3000 0.300 300
4000 0.150 200
6000 0.200 500
Table 2.1: True Parameters of Component Distributions


h   IQR/SD factor   SD=200   SD=300   SD=400   SD=500
0 1.349 269.80 404.70 539.60 674.50
0.05 1.364 272.89 409.33 545.77 682.22
0.1 1.380 276.01 414.01 552.02 690.02
0.2 1.412 282.36 423.54 564.72 705.90
0.3 1.444 288.85 433.28 577.71 722.14
Table 2.2: True IQR values for all h and scale values used in our simulations
for reference.)
First, we will discuss how well each method does on these metrics: accuracy
and variance of estimates of the means, weights and IQRs. We then will consider
the overall fit of the Gaussian models to the data.
2.5.2 Two-component Mixtures
Figures 2.6 and 2.8 show the estimated values less the true values of the
means and weights respectively of each mixture component. For the means, all
three methods perform well until the tails become very heavy. Only at h = 0.3 do
the plain EM and L2E methods show gross bias and variability in the estimates
of the two means. EM-bi displays the desired behavior of estimating unbiasedly
the means even for very heavy tails. Similar performance arises in the weight
estimates, except that EM and L2E estimates become noticeably more variable
for slightly less heavy tails, $h = 0.2$. Figure 2.7 shows that the mean estimates for
the first component from EM and EM-bi have very little difference and remain
centered around the true value for h values as high as 0.2. Similarly, Figure 2.9


Figure 2.6: Boxplots of mean estimates less the true values for two-component
mixtures. Components are in separate rows, with the first component on the
bottom, h values increase from left to right, so that the Gaussian case is on the
left and the heaviest tails are on the right.
shows that weights differ very little for the first component, even though there
is slightly more variation in the estimates for higher values of h.
Since we considered multiple scale measures for EM-bi, Figure 2.10 shows
the IQR results for both EM-bi with MLE standard deviation (labeled EMbiwt)
and EM-bi with S_bi (labeled EMSbi) calculated as shown in Equation 2.4. The
estimates are normalized against the true values so that 0 indicates a perfect fit.
When h = 0 and the components are Gaussian, EM and L2E correctly estimate
the true IQR. EMbiwt underestimates the IQR and EMSbi overestimates it.
But with heavier tails, even as low as h = 0.05, EM and L2E overestimate the


Figure 2.7: The difference between the EM and EM with biweight mean
estimates for the first component of two-component mixture. The header of
each panel indicates the h value.
IQR, and for h > 0.1, they overestimate it more than EMSbi does. This can
be seen more clearly in Figure 2.11 which shows how the difference between the
EM IQR and the S_bi IQR increases for increasing values of h. Although L2E
does correctly estimate the IQR on average for h = 0.3, it can occasionally lead
to very poor estimates (45 out of 5000, or 0.9% of the values in the 5th panel of
Figure 2.8 are severe outliers). EM-bi slightly overestimates IQR in this case,
but is much less variable than EM or L2E with either scale measure.


Figure 2.8: Boxplots of weight estimates less the true values for two-component
mixtures. Components are in separate rows, with the first component on the
bottom, h values increase from left to right, so that the Gaussian case is on the
left and the heaviest tails are on the right.


Figure 2.9: The difference between the EM and EM with biweight weight
estimates for the first component of two-component mixture. The header of
each panel indicates the h value.


Figure 2.10: Boxplots of normalized IQR estimates for two-component mix-
tures. Components are in separate rows, with the first component on the bottom.
h values increase from left to right, so that the Gaussian case is on the left and
the heaviest tails are on the right.


Figure 2.11: The difference between the EM and EM with S_bi IQR estimates
for the first component of two-component mixture. The header of each panel
indicates the h value.


2.5.3 Three-component Mixtures
Figure 2.12: Boxplots of the mean estimates less true values for first three-
component mixture. Components are in separate rows, with the first component
on the bottom, h values increase from left to right, so that the Gaussian case is
on the left and the heaviest tails are on the right.
Figure 2.12 displays the mean estimates of our first three-component mix-
ture. For small h values (h < 0.1) we see a similar phenomenon to that observed
in the two-component mixture: all three methods are unbiased, but EM and


Figure 2.13: The difference between the EM and EM with biweight mean
estimates for the first component of first three-component mixture. The header
of each panel indicates the h value.

L2E are more variable. For h = 0.2 and 0.3, EM-bi's estimates for the mean
of the center component are more variable than those of the other two meth-
ods. When we consider the weight estimates as well (Figure 2.14), we see that
this increase in variability coincides with positive bias in the weight estimate
of that component. The weight estimates produced by EM and L2E also show
positive bias toward the second component, even for h values as low as 0.05.
This bias is shifting weight away from the third component, resulting in positive
bias in the mean estimates for the second component as well. The variability
in the mean estimate from EM-bi for the second component may be due to


Figure 2.14: Boxplots of the weight estimates less true values for first three-
component mixture. Components are in separate rows, with the first component
on the bottom, h values increase from left to right, so that the Gaussian case is
on the left and the heaviest tails are on the right.


Figure 2.15: The difference between the EM and EM with biweight weight
estimates for the first component of first three-component mixture. The header
of each panel indicates the h value.
that component's relatively large spread, low true weight, or the closeness of
the true means. Because these results may arise as an artifact of the specific
three-component mixture, we further investigate its performance with an alter-
native three-component mixture. Figure 2.13 shows the difference in the EM
and EM-bi mean estimates for the first component, where Figure 2.12 shows
no difference, and verifies that the differences for that component are small for
h < 0.2.
The estimated IQRs of each component appear in Figure 2.16. EM and L2E
are unbiased and have low variability in the Gaussian case, while EM-bi with


Figure 2.16: Boxplots of normed IQR estimates for first three-component
mixture. Components are in separate rows, with the first component on the
bottom, h values increase from left to right, so that the Gaussian case is on the
left and the heaviest tails are on the right.
MLE standard deviation underestimates the spread and EM-bi with S_bi slightly
overestimates the spread. When h = 0.05 and h = 0.1, both EM-bi cases remain
close to the true IQR of the distribution, but EM and L2E overestimate the
spread of the second component and become more variable in their estimates.
For h = 0.2 and h = 0.3, the weight estimates in Figure 2.14 showed that
all methods assign too much weight to the second component, i.e., too many
observations from the end components were assigned to the second component.
The inclusion of these observations from the ends of the mixture gives strongly
overestimated IQRs. Figures 2.15 and 2.17 show in more detail the dynamics of
the first component estimates: for h = 0 and 0.05 the difference between EM and


Figure 2.17: The difference between the EM and EM with S_bi IQR estimates
for the first component of first three-component mixture. The header of each
panel indicates the h value.
EM-bi is small; for h = 0.1, EM consistently underweights and underestimates
the IQR of the first component, and for h = 0.2 both estimates are more variable
and tend to underestimate the first component.
To determine if parametric choices in the three-component model affect per-
formance of the estimators, we also generated a mixture of three components
with equal weights and spreads. The performance of these estimators in this
case is much more similar to the behavior we observed in the two-component
mixture. For h < 0.1, the mean (Figure 2.18) and weight (Figure 2.20) estimates
are nearly unbiased for all three methods. Also, for these low h values, the EM


and EM-bi estimates are very similar (Figures 2.19 and 2.21). When tails are
heavier, EM and L2E estimates become much more variable, but EM-bi esti-
mates continue to have low variability. The IQR estimates, Figure 2.22, again
show EM and L2E perform well in the Gaussian case, that EM-bi with MLE
estimates tend to underestimate the spread, and EM-bi with S_bi overestimates
the spread. As h increases, EM and L2E overestimate the spread of the first
and third components, underestimate the spread of the second component, and
become increasingly variable in their estimates. By h = 0.1, the EM-bi estimator's
IQRs are more positively biased than those with S_bi. The EM-bi estimates
of the means and weights are much less variable for higher h values than those
from EM and L2E. (See the comparison in Figure 2.23.)


Figure 2.18: Boxplots of mean estimates less true values for the three-
component mixture with equal weights and spreads. Components are in sep-
arate rows, with the first component on the bottom, h values increase from left
to right, so that the Gaussian case is on the left and the heaviest tails are on
the right.


Figure 2.19: The difference between the EM and EM with biweight mean
estimates for the first component of the three-component mixture with equal
weights and spreads. The header of each panel indicates the h value.


Figure 2.20: Boxplots of weight estimates less true values for the three-
component mixture with equal weights and spreads. Components are in sep-
arate rows, with the first component on the bottom, h values increase from left
to right, so that the Gaussian case is on the left and the heaviest tails are on
the right.


Figure 2.21: The difference between the EM and EM with biweight weight
estimates for the first component of three-component mixture with equal weights
and spreads. The header of each panel indicates the h value.


Figure 2.22: Boxplots of normed IQR estimates for the three-component
mixture with equal weights and spreads. Components are in separate rows,
with the first component on the bottom, h values increase from left to right, so
that the Gaussian case is on the left and the heaviest tails are on the right.


Figure 2.23: The difference between the EM and EM with S_bi IQR estimates
for the first component of three-component mixture with equal weights and
spreads. The header of each panel indicates the h value.


2.5.4 Five-component Mixtures
As with two and three-component mixtures, EM and L2E estimates (Fig-
ures 2.24, 2.26, and 2.28) are more variable than EM-bi estimates for h > 0.1.
For h as low as 0.1, EM and L2E estimates of the weight and spread of the
second component tend to be positively biased. The comparison plots of EM
versus EM-bi estimates of the first component, Figures 2.25, 2.27, and 2.29,
show that all differences are very small until h = 0.1 and that the estimates
increase greatly in variability for h = 0.2. EM with biweight estimates are much
less variable until h = 0.3, when the estimates for the second component spread
over a greater range, though still not as far as those of EM and L2E.


Figure 2.24: Boxplots of mean estimates less true values for the five-component mixture. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 2.25: The difference between the EM and EM with biweight mean estimates for the first component of the five-component mixture. The header of each panel indicates the h value.
Figure 2.26: Boxplots of weight estimates less true values for the five-component mixture. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 2.27: The difference between the EM and EM with biweight weight estimates for the first component of the five-component mixture. The header of each panel indicates the h value.
Figure 2.28: Boxplots of normalized IQR estimates for the five-component mixture. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 2.29: The difference between the EM and EM with biweight IQR estimates for the first component of the five-component mixture. The header of each panel indicates the h value.
2.5.5 Measuring Overall Fit
Table 2.3: Hellinger distance from the Gaussian case for two-component mixtures with differing h values

            dist from h = 0
h = 0.05         0.05
h = 0.1          0.08
h = 0.2          0.12
h = 0.3          0.31
Hellinger distance is one of many possible measures of distance between functions; it is defined as

$$H(f, g) = \left\{ \frac{1}{2} \int \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 \, dx \right\}^{1/2} \quad (2.8)$$
with the property that $0 \le H(f, g) \le 1$. As an example, Figure 2.30 shows density plots for all the two-component mixtures we considered in our experiment, while Table 2.3 shows the Hellinger distance from the Gaussian case for each h value.
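Equation 2.8 is straightforward to approximate numerically from density estimates on a common grid. The following R sketch illustrates the computation; the function name and grid settings are our illustrative choices, not taken from the thesis code.

hellinger <- function(x, y, n.grid = 512) {
  # Estimate both densities on the same grid, then approximate the
  # integral in Equation 2.8 by a Riemann sum.
  lims <- range(c(x, y))
  fx <- density(x, from = lims[1], to = lims[2], n = n.grid)
  fy <- density(y, from = lims[1], to = lims[2], n = n.grid)
  dx <- fx$x[2] - fx$x[1]
  sqrt(0.5 * sum((sqrt(fx$y) - sqrt(fy$y))^2) * dx)
}

# Example: distance between a Gaussian sample and an h = 0.3 sample,
# using the g-and-h construction with g = 0 (X = Z exp(h Z^2 / 2))
z <- rnorm(10000)
h <- 0.3
hellinger(z, z * exp(h * z^2 / 2))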
Headrick et al. [21] derived closed-form expressions for the pdf and cdf of the h-distribution family (Equations 2.6 and 2.7), but for simplicity of computation we found it more convenient to work with simulated estimates of the densities based on a sample of 10,000 observations from each component density (density.fd from [39]). We considered three possible scale measures for EM-bi: the MLE standard deviation, Lax's A, and S_bi. Lax's A, as modified for this procedure, gave wildly large estimates of the spread and so is not shown.

Figure 2.30: Density plots of two-component mixtures with differing h values. From top to bottom, h = 0, 0.05, 0.1, 0.2, and 0.3.
In almost all cases, EM and L2E results are very similar. EM-bi with S_bi consistently outperforms EM-bi with the MLE standard deviation. When h = 0, EM and L2E slightly outperform EM-bi, being about 0.01 smaller in Hellinger distance. As soon as the distribution deviates from the Gaussian, EM-bi fits are as good as or better than those given by EM or L2E.

Figure 2.31: Hellinger distances of fits of two-component mixtures. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.

For the two-component
mixture (Figure 2.31) with heaviest tails, h = 0.3, EM performs as well as EM-bi on average, but is much more variable, just as the parameter estimates in Figures 2.6 and 2.8 show. For the three-component mixture with the uneven components, Figure 2.32, EM and L2E have a slight edge over EM-bi even with heavier tails, but the difference is small. The mixture of three equal components, Figure 2.33, is fit slightly better by EM-bi. In the five-component mixture (Figure 2.34), EM-bi does have a few outliers when h = 0.2, but the bulk of its estimates fit the true underlying distributions better, and with less variance, than the EM and L2E estimates. Overall, EM-bi (and the robust scale measure, S_bi) provides a less variable and often closer fit to the underlying data, especially when the true distribution deviates from the Gaussian.
2.6 Conclusion
EM-bi consistently performed almost as well as, or better than, classical EM or L2E, with two, three, or five components, even when h = 0 and the components are truly Gaussian. While L2E is robust in many situations, that characteristic does not always carry over to estimates of mixture models. If the Gaussian model is known to hold, then EM is still the best method to use. Such definitive knowledge is often unavailable in applications, so EM-bi offers good insurance at little cost in the bias or precision of its estimates.
Figure 2.32: Hellinger distances of fits of three-component mixtures. Two distances from EMbiwt (h = 0, h = 0.2) were greater than 0.3 and were omitted from the second plot. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 2.33: Hellinger distances of fits of three-component mixtures with equal
spreads and weights. Distances greater than 0.3 were omitted from the second
plot: 6 for EM, 2 each for EMbiwt and EMSbi, and 7 for L2E. h values increase
from left to right, so that the Gaussian case is on the left and the heaviest tails
are on the right.
Figure 2.34: Hellinger distances of fits of five-component mixtures. Distances
greater than 0.3 were omitted from the second plot: 15 for EM, 20 for EMbiwt,
4 for EMSbi, and 15 for L2E. h values increase from left to right, so that the
Gaussian case is on the left and the heaviest tails are on the right.
3. Fitting Censored Mixture Models
3.1 History of Fitting Censored Data
A dataset is called censored if the values of one or more observations at one or both extremes are unknown, but the number of these observations, and which extreme they lie at or beyond, is known. Censoring comes in one of two types. In Type 1 censoring, the censoring point is known, but the values beyond it are not. For example, a dataset with an income variable for which annual incomes less than $10,000 are recorded only as "less than $10,000" is a Type 1 censored dataset. Type 2 censoring arises when the number of censored observations is fixed in advance, and again the values of the censored observations are unknown. For example, a biological lab that ends an experiment once 20 specimens have died, recording the lifetimes of all remaining specimens as greater than the time of death of the 20th specimen (even if each specimen had a different start date), produces a Type 2 censored dataset. The problem that motivated this research involved Type 1 censoring, so we focus on that case.
The history of censored data begins with the study of truncated data: datasets where the population takes on values outside the range that can be measured. An example of truncated data is willingness to donate to a charity whose minimum donation level is fixed above zero. There are no responses from people whose willingness to pay falls below the minimum, so we know neither their willingness to donate nor how many of them there are.
Truncated data were considered in a statistical paper by Galton [14], who investigated qualification finishing times for horses, leaving out all horses whose times were greater than that needed to qualify. Using the mode to estimate location, and sample quartiles to estimate spread, his results were close enough to those expected under his Gaussian model that he was content with that method. Pearson [33] was dissatisfied with Galton's method, and instead fit parabolas to logarithms of the sample frequencies. Later, Pearson and Lee [34] developed a solution using the method of moments. Fisher [12] derived the maximum likelihood estimators for data truncated at only one extreme, and found them to be the same as those studied by Pearson and Lee [34] in the case of the Gaussian.
According to Cohen [6], the first person to handle censored data was Stevens [3], who derived maximum likelihood equations for Type 1 censored samples from Gaussian populations. Cohen [7, 8, 6] provided much of the subsequent work, including deriving the MLEs, calculating asymptotic variance estimates, and creating tables to aid in finding the MLE solutions for both Type 1 and Type 2 censoring of Gaussian data; see [6] for a full list. His later work considers the problem of censoring for many different distributions, including log-normals, Weibulls, negative binomials, and gammas.
Persson and Rootzen [36] developed a method for approximating the ML estimates for Type 1 and Type 2 samples that are censored at only one extreme. Their simplified estimates, called restricted maximum likelihood estimators, do not require the use of tables or an optimization procedure to produce results. Schneider [42] extended those results to the case of censoring at both extremes.
Saw [41] suggested a bias correction for the ML estimates, providing tables for several censoring values. Schneider [43] generalized the correction to a formula that approximates the tables linearly, and showed that the correction is also helpful for the restricted maximum likelihood estimators.
Atkinson [1] considers the problem of censored mixture data from Gaussian populations. He compares EM with two hybrid variants of EM that use different optimization procedures for some of the steps. Using censoring points at 76%, 32%, and 14% of the data, and sample sizes of 50, 100, and 200 data points, he finds that all three methods converge poorly on censored data. He tried each method on each combination of censoring and sample size until 500 convergent estimates were obtained, and notes that "Typically, this process produced at least 10,000 fresh data sets in an attempt to obtain 500 convergent estimates for each algorithm." He does find that the hybrid EM method with Newton-Raphson steps is more accurate and efficient than EM.
Reid [40] investigated the use of robust estimators for censored data by deriving the influence curve of the Kaplan-Meier estimate of the distribution function. Green and Crowley [16] built on Reid's work by comparing the Kaplan-Meier median to alternative robust measures, and suggested that trimmed means are good robust estimates for censored data. This early work focused primarily on Type 2 censoring, which is more common in survival analysis, rather than the Type 1 (fixed cutoff) censoring of more interest here. More relevant to our work is that of Singh and Nocerino [46], who also had Type 1 censored data. They compare trimmed means, Winsorized means, OLS methods for replacing censored values, EM for replacing censored values, Cohen's ML estimates, and Schneider's bias-corrected RMLE estimates using a robust measure of location instead of the sample mean. Their simulation experiments were small (N = 5 to 25), included left-censoring, and sometimes included large outliers. For N > 10, methods that replaced the censored values had high MSEs. MLE and RMLE did consistently well, but RMLE often outperformed MLE. None of these studies considered censoring in the context of mixtures.
3.2 Methods for Fitting Censored Mixtures
In this section we develop two new methods for fitting censored mixtures of Gaussians. The first method finds the maximum likelihood estimates by applying EM to the censored likelihood. The second method extends closed-form results for censored values to the case of mixtures, while still using EM to estimate the probabilities that each observation belongs to each component. This method includes a robust variation.
3.2.1 Maximum Likelihood via EM
Consider a $k$-component mixture with censoring on both ends. On the left, $l$ observations are censored at $x_l$, and on the right, $u$ observations are censored at $x_u$. For convenience of indexing, assume that the $\{x_i\}$ are sorted so that $x_1 < x_2 < \cdots < x_n$.

The likelihood for a censored distribution with cdf $F(x)$ and pdf $f(x)$ parameterized by $\theta$ is

$$\mathcal{L}(X; \theta) = F(x_l)^l \, [1 - F(x_u)]^u \prod_{i=l+1}^{n-u} f(x_i)$$

In the case where $f(x)$ is a mixture of $k$ distributions, each with parameters $\theta_j$ and mixing proportion $\pi_j$, $\theta = \{\theta_j, \pi_j\}_{j=1}^{k}$ and this can be written as:

$$\mathcal{L}(X; \theta) = \Big( \sum_{j=1}^{k} \pi_j F_j(x_l) \Big)^{l} \, \Big[ 1 - \sum_{j=1}^{k} \pi_j F_j(x_u) \Big]^{u} \prod_{i=l+1}^{n-u} \sum_{j=1}^{k} \pi_j f_j(x_i)$$
Since we do not know to which component each $x_i$ belongs, we cannot directly estimate this likelihood. Let $\Delta_{ij}$ be 1 if $x_i$ belongs to component $j$, and 0 otherwise. Then we can write the likelihood as:

$$\mathcal{L}(X; \theta) = \prod_{i=1}^{l} \sum_{j=1}^{k} \Delta_{ij} \pi_j F_j(x_l) \;\times\; \prod_{i=n-u+1}^{n} \sum_{j=1}^{k} \Delta_{ij} \pi_j (1 - F_j(x_u)) \;\times\; \prod_{i=l+1}^{n-u} \sum_{j=1}^{k} \Delta_{ij} \pi_j f_j(x_i)$$
with log-likelihood:

$$\ell(X; \theta) = \sum_{i=1}^{l} \log \Big( \sum_{j=1}^{k} \Delta_{ij} \pi_j F_j(x_l) \Big) + \sum_{i=n-u+1}^{n} \log \Big( \sum_{j=1}^{k} \Delta_{ij} \pi_j (1 - F_j(x_u)) \Big) + \sum_{i=l+1}^{n-u} \log \Big( \sum_{j=1}^{k} \Delta_{ij} \pi_j f_j(x_i) \Big)$$
Using the property that $\Delta_{ij} = I_{\{j\}}(i)$ tells us which component $x_i$ belongs to, we can rearrange slightly:

$$\ell(X; \theta) = \sum_{i=1}^{l} \sum_{j=1}^{k} \log \big( \Delta_{ij} \pi_j F_j(x_l) \big) + \sum_{i=n-u+1}^{n} \sum_{j=1}^{k} \log \big( \Delta_{ij} \pi_j (1 - F_j(x_u)) \big) + \sum_{i=l+1}^{n-u} \sum_{j=1}^{k} \log \big( \Delta_{ij} \pi_j f_j(x_i) \big)$$
Then the likelihood for each component can be maximized separately, and we can solve for the ML estimates of $\theta$ using the following EM algorithm.

1. Choose $\theta^{(0)}$.

2. While $\|\theta^{(n)} - \theta^{(n-1)}\| > \epsilon$:

(a) E-step: Compute $\gamma_{ij} = E[\Delta_{ij} \mid \theta^{(n-1)}]$.

(b) M-step: For each component $j$, estimate new values of $\theta$ by maximizing

$$\ell_j(\theta; X) = \sum_{i=1}^{l} \gamma_{ij} \log F_j(x_l) + \sum_{i=n-u+1}^{n} \gamma_{ij} \log(1 - F_j(x_u)) + \sum_{i=l+1}^{n-u} \gamma_{ij} \log(f_j(x_i))$$
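A condensed R sketch of this algorithm for Gaussian components follows. Here x holds the uncensored observations, (xl, l) and (xu, u) the censoring points and counts, and the M-step is carried out numerically with optim; the function names and the use of a general-purpose optimizer are our illustrative choices, not the thesis implementation.

em4c <- function(x, xl, l, xu, u, mu, sg, pi, max.iter = 200, eps = 1e-6) {
  k <- length(mu)
  for (it in 1:max.iter) {
    old <- c(mu, sg, pi)
    # E-step: gamma_ij = E[Delta_ij | theta]. Censored observations all
    # sit at xl (or xu), so each group shares one set of probabilities.
    f  <- sapply(1:k, function(j) pi[j] * dnorm(x, mu[j], sg[j]))
    g  <- f / rowSums(f)
    gl <- pi * pnorm(xl, mu, sg)
    gl <- gl / sum(gl)
    gu <- pi * pnorm(xu, mu, sg, lower.tail = FALSE)
    gu <- gu / sum(gu)
    # M-step: maximize each component's weighted censored log-likelihood
    for (j in 1:k) {
      nll <- function(th) {
        if (th[2] <= 0) return(1e10)   # keep sigma positive
        -(l * gl[j] * pnorm(xl, th[1], th[2], log.p = TRUE) +
          u * gu[j] * pnorm(xu, th[1], th[2], lower.tail = FALSE,
                            log.p = TRUE) +
          sum(g[, j] * dnorm(x, th[1], th[2], log = TRUE)))
      }
      th <- optim(c(mu[j], sg[j]), nll)$par
      mu[j] <- th[1]; sg[j] <- th[2]
    }
    # Mixing proportions: total weighted count per component over n
    pi <- (l * gl + u * gu + colSums(g)) / (l + u + length(x))
    if (max(abs(c(mu, sg, pi) - old)) < eps) break
  }
  list(mu = mu, sigma = sg, pi = pi)
}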
3.2.2 Restricted Maximum Likelihood
While maximum likelihood is a feasible solution to this problem, the explicit maximization it requires is computationally intensive, which introduces errors of its own. Moreover, as we have shown earlier, maximum likelihood methods are not robust to data that do not fit the distributional assumptions of the model. In this section we show a modification of maximum likelihood that attempts to address both of these problems.
Persson and Rootzen [36] suggested using a restricted maximum likelihood approach for censored data to avoid the computational problems involved in direct maximization. Their method, for one-sided censoring, uses the fraction of censored observations as an estimate of F(x), removing the cdf from the likelihood to be maximized and making it possible to compute estimates directly. Schneider [42] later extended Persson and Rootzen's method to problems with censoring on both ends. Haas and Scheff [17], Lechner [30], and Singh and Nocerino [46] all found that these REML methods compared favorably to MLE in bias and MSE as well as in ease of computation. All three studies found that REML and MLE performed better than the competing methods for censored data, which replace the censored values in various ways. Additionally, Haas and Scheff [17] showed that REML was the most robust to departures from Gaussian assumptions. Singh and Nocerino [46] modified REML to use M-estimates in place of the sample mean and standard deviation, and found this modification to be robust to outliers.

To use REML in the context of mixture models, we modify Schneider's equations for a single component (j fixed) to include the EM weights as follows:
For component $j$, define the weighted counts

$$\tilde{n}_j = \sum_{i=1}^{n} \gamma_{ij}, \qquad \tilde{l}_j = \sum_{i=1}^{l} \gamma_{ij}, \qquad \tilde{u}_j = \sum_{i=n-u+1}^{n} \gamma_{ij},$$

and the weighted censored fractions $\omega_l = \tilde{l}_j / \tilde{n}_j$ and $\omega_u = \tilde{u}_j / \tilde{n}_j$. Let $\bar{y}$ and $s^2$ be the $\gamma_{ij}$-weighted mean and variance of the uncensored observations,

$$\bar{y} = \frac{\sum_{i=l+1}^{n-u} \gamma_{ij} x_i}{\tilde{n}_j - \tilde{l}_j - \tilde{u}_j}, \qquad s^2 = \frac{\sum_{i=l+1}^{n-u} \gamma_{ij} (x_i - \bar{y})^2}{\tilde{n}_j - \tilde{l}_j - \tilde{u}_j},$$

and let $Z_l$ and $Z_u$ be the correction terms at the two censoring points,

$$Z_l = \frac{\phi(\Phi^{-1}(\omega_l))}{1 - \omega_l - \omega_u}, \qquad Z_u = \frac{\phi(\Phi^{-1}(1 - \omega_u))}{1 - \omega_l - \omega_u}.$$

With

$$C = \omega_l Z_l (\bar{y} - x_l) + \omega_u Z_u (x_u - \bar{y}), \qquad s^* = \tfrac{1}{2} C + \tfrac{1}{2} (C^2 + 4 s^2)^{1/2},$$

the restricted maximum likelihood estimates are

$$\hat{\mu}_{REML} = \hat{\mu}_{MLE} - s^* (Z_l - Z_u),$$
$$\hat{\sigma}_{REML} = \left( \hat{\sigma}^2_{MLE} - s^{*2} \left( \omega_l Z_l - \omega_u Z_u - (Z_l - Z_u)^2 \right) \right)^{1/2}.$$

Following Singh and Nocerino, we consider replacing the sample mean and standard deviation in these formulas with robust estimates. The resulting robust REML (RREML) uses the biweight location $\bar{x}_{bi}$ and $S_{bi}$:

$$\hat{\mu}_{RREML} = \bar{x}_{bi} - s^* (Z_l - Z_u),$$
$$\hat{\sigma}_{RREML} = \left( S^2_{bi} - s^{*2} \left( \omega_l Z_l - \omega_u Z_u - (Z_l - Z_u)^2 \right) \right)^{1/2}.$$
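For reference, one common form of the biweight location and of S_bi (the square root of the biweight midvariance) is sketched below in R. The tuning constants c = 6 and c = 9, and the use of the raw (unscaled) MAD, are conventional choices in the spirit of Hoaglin et al. [23] and are assumptions here; they may differ from the constants used in our implementation. As noted above, RREML applies these only to the uncensored observations.

# Iterated biweight location; c.t = 6 is a conventional tuning constant
biweight.loc <- function(x, c.t = 6, iter = 10) {
  t <- median(x)
  for (i in 1:iter) {
    u <- (x - t) / (c.t * mad(x, center = t, constant = 1))
    w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)
    t <- sum(w * x) / sum(w)
  }
  t
}

# S_bi via the biweight midvariance; c.s = 9 is conventional
s.bi <- function(x, c.s = 9) {
  m <- median(x)
  u <- (x - m) / (c.s * mad(x, center = m, constant = 1))
  ok <- abs(u) < 1
  num <- length(x) * sum(((x - m)^2 * (1 - u^2)^4)[ok])
  den <- sum(((1 - u^2) * (1 - 5 * u^2))[ok])^2
  sqrt(num / den)
}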
3.3 Simulation Results
We choose a three-component mixture for our censoring simulations because the effects of the censoring should be felt mostly by the end components, but we also want to test that the methods do well on the relatively uncensored middle component. The true parameters of the components generated in these simulations can be found in Table 3.1.

Table 3.1: True parameter values of mixture components.

Mean    Weight    Std. Dev.
 300    0.333       200
1500    0.222       400
3000    0.444       300
Preliminary simulations (see Figures 3.1, 3.2, and 3.3) showed that changes in the h values used to simulate the data had very little effect on the estimates, regardless of the method used. Taking this into account, we will show only results for h = 0 and h = 0.2 for the remainder of this section.
3.3.1 10% Censoring
We consider four possible methods for fitting censored mixtures: EM, using the usual assumption of a mixture of Gaussians; EM4C, the modified EM described above, which uses the likelihood for a censored mixture of Gaussians; REML, adjusted from the case of a single Gaussian distribution to work with mixtures; and RREML, the same REML adjustment, but using the biweight and S_bi as the location and scale measures. In the case of RREML, the biweight and S_bi are computed using just the uncensored observations.
Our first experiment censors 10% of the observations on either end of the mixture. With 9,000 observations, this removes about one third of the data from the first component and about one fourth of the data from the third component.

Figure 3.1: Location estimates for all five h values, 10% censoring from each end, 50 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
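To make the design concrete, the following R sketch generates one such dataset: a three-component mixture of h-distributions with the Table 3.1 parameters, censored at both ends. The h-distribution is generated as X = mu + sigma * Z * exp(h * Z^2 / 2) for standard Gaussian Z; taking the censoring points as the empirical 10% and 90% quantiles is our assumption about how the cutoffs were set.

# One simulated dataset: three-component h-mixture (Table 3.1), n = 9000,
# with 10% censored at each end (quantile-based cutoffs are an assumption)
rhdist <- function(n, mu, sigma, h) {
  z <- rnorm(n)
  mu + sigma * z * exp(h * z^2 / 2)
}
n  <- 9000
mu <- c(300, 1500, 3000)
w  <- c(0.333, 0.222, 0.444)
s  <- c(200, 400, 300)
comp <- sample(1:3, n, replace = TRUE, prob = w)
x    <- rhdist(n, mu[comp], s[comp], h = 0.2)
xl <- quantile(x, 0.10)
xu <- quantile(x, 0.90)
cens <- list(x = x[x > xl & x < xu],    # uncensored values
             xl = xl, l = sum(x <= xl), # left censoring point and count
             xu = xu, u = sum(x >= xu)) # right censoring point and count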
Figure 3.4 displays boxplots of the location estimates less their true values. REML performs poorly, but the other three methods are consistently close to the true values for all components and both h values. The weight and scale estimates, Figures 3.5 and 3.6, reveal that EM does a poor job of fitting these parameters. EM4C estimates the weight and scale well for the Gaussian case, but does slightly less well on the second component for h = 0.2. RREML performs well on all parameters.
3.3.2 25% Censoring
Figure 3.2: Weight estimates for all five h values, 10% censoring from each end, 50 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
For the second experiment, we increase the censoring to 25% of the data
from each end, removing half of the total data. In this mixture, that removes
more than two thirds of the first component and more than half of the third
component. REML performed so wildly and inconsistently in the preliminary
simulations at this censoring level that it was left out of the results.
EM failed to converge for h = 0.2 when subject to heavy censoring. For the
most part, both EM4C and RREML produced good estimates of the parameters.
RREML did have 14 outliers where it could not distinguish the third component
from the second component, a failure rate of 1.4%. Figures 3.7, 3.8, 3.9, and
3.10 show boxplots of the estimates. In this case, EM4C appeared to perform equally well for the heavier-tailed case, h = 0.2.

Figure 3.3: IQR estimates for all five h values, 10% censoring from each end, 50 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
3.4 Conclusion
While it appears that, to some extent, censoring mitigates the effect of heavy tails by cutting off the tails themselves, we did see that RREML performed substantially and consistently better than REML alone. Since the biweight is robust to many kinds of data problems, it handles having a large amount of the data take on values too close together fairly well, seldom producing wildly out-of-range estimates. EM4C performed surprisingly well. It does assume Gaussian components, and in the lower-censoring case it did slightly worse on heavy-tailed components, but it performed very well under heavy censoring.
Figure 3.4: Location estimates for h = 0 and 0.2, 10% censoring from each end, 1000 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
This may be a case of the heavy-tail effects being masked by the censoring.
Both EM4C and RREML appear to be viable methods for fitting censored
mixtures. In the following section, we will consider their performance on data
derived from a real-world problem.
Figure 3.5: Weight estimates for h = 0 and 0.2, 10% censoring from each end, 1000 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 3.6: IQR estimates for h = 0 and 0.2, 10% censoring from each end, 1000 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 3.7: Location estimates for h = 0 and 0.2, 25% censoring from each end, 1000 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 3.8: Weight estimates for h = 0 and 0.2, 25% censoring from each end, 1000 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 3.9: IQR estimates for h = 0 and 0.2, 25% censoring from each end, 1000 simulation runs. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figure 3.10: IQR estimates for just the EM4C and RREML methods. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
4. Case Study: Images of Biological Cells
4.1 Problem Description
Peskin et al., whose study of tumor imaging [37] provided our initial motivation, also supplied the dataset that can be modeled as a censored mixture: a collection of images of biological cells. In laboratories across the country, slides of cells are prepared and then photographed. The light level, focus, and number of cells per slide can vary greatly across images; see Figure 4.1 for an example. The National Institute of Standards and Technology (NIST) then has the problem of trying to identify the types of cells from the images. The current process is for a technician to examine the image over a lightbox and determine the shape of each cell by taking measurements from the image edge to the edge of the cell. An automated process to find the cell edges would be more consistent and much faster.
Figure 4.1: An example image of a slide of biological cells.

Because the images are black and white, each pixel takes on a single value between 0 and 4095. If we consider the distribution of these values, as in the histogram in Figure 4.2, the data appear to follow a mixture distribution with three components: a large component of dark (low-valued) slide background, a broad middle component for the fuzzy edge of the cell, and a concentrated component of high values for the center of each cell. Depending on the light level of the image, this mixture may be censored at both ends. Peskin et al. [37] ignored the censoring and used regular EM to estimate Gaussian mixtures. However, EM would often fail to converge, forcing them to fall back on traditional methods to identify the cells in the images. As there is no reason to believe that the components of this mixture are Gaussian, and with the substantial censoring of the data, this problem motivated our research into new methods for censored mixtures. By fitting these data with our methods, we can estimate the probability that a particular pixel is on the edge of a cell. These probable edges can then be used to determine the shape, and thus the type, of a cell.
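Once a censored mixture has been fit, the edge probability for a pixel is simply the posterior weight of the middle component. A minimal R sketch, assuming a fitted three-component Gaussian mixture stored in a list fit (with elements pi, mu, and sigma, an assumed structure) and component 2 as the edge component:

# Posterior probability that each pixel value lies in the edge (second)
# component of the fitted mixture
edge.prob <- function(x, fit) {
  d <- sapply(1:3, function(j) fit$pi[j] * dnorm(x, fit$mu[j], fit$sigma[j]))
  d <- matrix(d, ncol = 3)
  d[, 2] / rowSums(d)
}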
4.2 Choosing Good Starting Values for Parameter Estimates
The data in this application are not balanced across components of similar size like the scenarios we examined earlier, so we need a more sophisticated method to choose starting values for the EM algorithm.

Figure 4.2: Example histogram of slide image pixel values.

Following Fowlkes [13], we start by considering a Gaussian Q-Q plot of the data. Figure 4.3 shows
three distinct sections, just as we saw in the histograms. Figure 4.4 shows the
Gaussian Q-Q plot for just the uncensored values, which still displays three
distinct sections. While we expect that the components overlap to some degree,
separating out the three obvious sections will give us data that largely belongs
to a single component, helping us find starting values for EM. Once we have
each section, we can apply Gaussian Q-Q plots to each section (an example is
shown in Figure 4.5) and fit a line to the Q-Q plot to find starting values for the
means and standard deviations. Because we are using a resistant method to fit
the line, we do not require that the breakpoints be perfectly chosen.

Figure 4.3: A Gaussian Q-Q plot of the values from an image of a biological slide.
To choose the breakpoints between sections, we fit a resistant line to a subset of the Q-Q data, starting with the smallest values. We then examine the residuals of that fitted line to identify the area where the Q-Q plot curves; this gives us our first breakpoint. We repeat the process, fitting the resistant line to a subset of the points after the first breakpoint and looking for large residuals only in the upper portion of the fitted area. If the mixture were composed of more than three components, we would continue in the same fashion to find more breakpoints. We use the iteratively calculated resistant line from Hoaglin et al. [23] to fit the Q-Q values of each section and find the starting means and standard deviations, and we use the proportion of data in each section as starting values for the component mixture proportions. A sketch of this procedure appears below.

Figure 4.4: A Gaussian Q-Q plot of the uncensored values only from an image of a biological slide.
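The following R sketch shows the section-wise fits, given breakpoints already chosen. stats::line() fits Tukey's resistant line and stands in here for the iterated version of Hoaglin et al. [23]; the function and argument names are illustrative.

# Starting values from resistant-line fits to each section's Q-Q plot.
# 'breaks' holds the breakpoints between sections (assumed already found).
qq.starts <- function(x, breaks) {
  x <- sort(x)
  q <- qnorm(ppoints(length(x)))       # theoretical Gaussian quantiles
  section <- cut(x, c(-Inf, breaks, Inf))
  fits <- lapply(split(seq_along(x), section),
                 function(i) line(q[i], x[i]))
  list(mu    = sapply(fits, function(f) coef(f)[1]),  # intercepts
       sigma = sapply(fits, function(f) coef(f)[2]),  # slopes
       pi    = as.numeric(table(section)) / length(x))
}

# Usage with two breakpoints separating three sections (values assumed):
# qq.starts(pixel.values, breaks = c(700, 2500))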
Figure 4.5: Individual Gaussian Q-Q plots for the three sections of values in
a biological slide dataset.
Table 4.1: True values for simulating data approximating slide images

Mean    Weight    Std. Dev.
 275     0.85        20
1300     0.10       900
3500     0.05       370
4.3 Simulation
In order to test how well our methods work against data distributed like
the slide image values, we used our starting method on a real dataset to pick
values for a simulation of three components. The values used for this simulation
experiment appear in Table 4.1. The previous chapter showed that RREML
and EM4C were the most effective methods with censored data, so we only
consider those two methods here. Each image in the real data is composed of
more than 300,000 pixels. We used a sample size of 100,000 in these simulations
to investigate the complications that may arise from a large sample size.
4.3.1 Results
Figure 4.6: Boxplots of mean estimates less true values for data approximating slide images. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
Figures 4.6, 4.7, and 4.8 show the normalized estimates for the means, weights, and IQRs, respectively. When h = 0, both methods generally perform very well on all measures. RREML conflated the second and third components 34 times out of 1000 simulation runs. As in the previous chapter, RREML either finds good estimates of the true values or collapses two of the components. By checking for estimated weights very close to zero, it is simple to find the few occasions when RREML does not find all of the components.

Figure 4.7: Boxplots of weight estimates less true values for data approximating slide images. Components are in separate rows, with the first component on the bottom. h values increase from left to right, so that the Gaussian case is on the left and the heaviest tails are on the right.
The results for h = 0.2 are interesting for both methods. Figure 4.6 shows that both methods underestimate the true center of the second component. The reason for this becomes clear when examining the weight and IQR estimates (Figures 4.7 and 4.8). By increasing the tail size of all the components, but especially of that extremely large first component, 85% of the mixture, we get many more observations that appear to be part of the second two compo-