Citation
Identification of multivariate outliers in large data sets

Material Information

Title:
Identification of multivariate outliers in large data sets
Creator:
Werner, Mark
Publication Date:
2003
Language:
English
Physical Description:
225 leaves : ; 28 cm

Subjects

Subjects / Keywords:
Outliers (Statistics) ( lcsh )
Multivariate analysis ( lcsh )
Multivariate analysis ( fast )
Outliers (Statistics) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 219-225).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Mark Werner.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
53987746 ( OCLC )
ocm53987746
Classification:
LD1190.L622 2003d W47 ( lcc )

Full Text
IDENTIFICATION OF MULTIVARIATE OUTLIERS
IN LARGE DATA SETS
by
Mark Werner
B.Sc., Applied Mathematics and Physics, University of Stellenbosch, 1993
B.Sc.(Hons.), Applied Mathematics, University of Stellenbosch, 1994
M.Sc., Applied Mathematics, University of Stellenbosch, 1996
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
2003


by Mark Werner
All rights reserved.


This thesis for the Doctor of Philosophy
degree by
Mark Werner
has been approved
by
Burt Simon
Date


Werner, Mark (Ph.D., Applied Mathematics)
Identification of Multivariate Outliers
in Large Data Sets
Thesis directed by Professor Karen Kafadar
ABSTRACT
In this thesis a new algorithm is proposed for detecting outliers in large and
very large data sets. This procedure uses Tukey's biweight function to assign
weights to data values in each dimension, then reassigns a weight of one to those
points with weight above a certain cutoff value and zero to those below. The
sample mean and covariance can be efficiently calculated over those observations
with weight equal to one, leading to robust Mahalanobis distances for all the
observations. We estimate the density of these Mahalanobis distances and deter-
mine a rejection point where the slope of the density is sufficiently close to zero.
All observations with Mahalanobis distance greater than this rejection point are
declared outliers. This algorithm demonstrates very good outlier identification
properties, especially on non-normal data, an area where most existing methods
do not perform as well. We successfully model and predict its error rates for a
mixture of chi-squared distributions and various parameters such as amount of
contamination, separation of chi-squared distributions, and data dimension. It is
computationally fast, and is very capable of analyzing data sets containing both a
large number of observations and high dimensions. To analyze the performance of
the robust estimation procedure from a theoretical perspective, we also calculate
several important asymptotic robustness properties at the standard Normal dis-
tribution. Among others, this estimator possesses a low gross error sensitivity,
a high relative efficiency, and a high breakdown point, as well as considerable
flexibility that enables it to adapt to a variety of data situations and contami-
nation levels. Compared to other methods that we have examined, this outlier
algorithm is considerably faster, and with special reference to non-normal data,
offers excellent all-round performance.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Karen Kafadar


DEDICATION
This thesis is dedicated to a wonderful Mom and Dad, who lovingly sup-
ported me throughout my academic career: To my Mom, who never faltered in
her belief that I would not only finish, but finish well. And to my Dad, who
regretted not being able to take a course in Statistics, and who told me, "What-
ever else you study at University, be sure to take a course in Statistics; it may
be one of the most useful courses you will ever take." Thanks, Mom and Dad!


ACKNOWLEDGMENT
I would like to sincerely thank and acknowledge the significant role that the
following people undertook in helping this thesis to materialize.
My adviser, Prof. Karen Kafadar, for her insightful suggestions regard-
ing how to approach many of the problems addressed in this thesis. Her
knowledgeable advice frequently led me to fruitful research avenues, and
by example showed me how to conduct high quality statistical research. I
also greatly appreciate her patience in reviewing and providing construc-
tive criticism during the preliminary drafts of this thesis.
My thesis committee for their helpful suggestions, moral support and con-
stant encouragement to "finish and get out of here!"
The administrative staff, Marcia Kelly, Liz Lhotka and Dana Franklin,
for their continual helpfulness, especially regarding administrative matters,
and for bearing with me when I missed the odd deadline.
The faculty of the CU-Denver Mathematics Department for imparting
some of their knowledge to me, always encouraging me to aim higher and
still higher, and for providing me the opportunity to obtain a competitive
degree.


CONTENTS
Figures ............................................................... xi
Tables............................................................... xiii
Chapter
1. Introduction......................................................... 1
1.1 Univariate Outliers................................................ 4
1.2 Multivariate Outliers.............................................. 7
2. Previous Outlier Identification Methods............................. 12
2.1 The Influence Function............................................ 14
2.2 M-estimators...................................................... 19
2.2.1 Monotone M-estimators........................................... 19
2.2.2 Redescending M-estimators....................................... 22
2.2.3 Computational Issues of M-Estimators............................ 25
2.3 S-Estimators...................................................... 26
2.4 Combinatorial Estimators.......................................... 28
2.5 Compound Estimators............................................... 31
2.6 BACON............................................................. 32
2.7 Non-Distance Based Methods........................................ 34
2.7.1 The Stahel-Donoho Estimator..................................... 35
2.7.2 Kurtosis........................................................ 37
2.8 Types of Outliers................................................. 39


3. The Development of a New Method................................. 42
3.1 Preliminary Investigation..................................... 42
3.2 The Development of a New Method............................... 46
3.2.1 Obtaining Robust Estimates..................................... 46
3.2.2 Determining the Rejection Boundary............................... 53
3.2.3 Computing the Sample Density................................... 58
3.2.4 Modifications to the Original Proposal...................... 60
3.3 A Summary of Our Proposed Method.............................. 71
4. Results from the Proposed Method................................ 73
4.1 Computational Details............................................ 73
4.2 Choice of Parameters for our Method............................ 79
4.3 Skewed Data: A Mixture of χ² Distributions..................... 85
4.3.1 A Mixture of χ²'s with δ² = 12.5............................... 85
4.3.2 A Mixture of χ²'s with δ² = 32 ................................ 90
4.3.3 A Mixture of χ²'s with δ² = 50 ................................ 95
4.3.4 A Mixture of χ²'s with δ² = 112.5.............................. 99
4.3.5 A Mixture of χ²'s with δ² = 200 .............................. 106
4.4 Symmetric Data: A Mixture of Gaussians........................ 111
4.4.1 An Investigation of Shift Outliers.......................... 111
4.4.2 A Further Investigation of Shift Outliers................... 123
4.5 Correlated Data............................................... 125
4.5.1 A Mixture of Correlated Normals.............................. 127
4.5.2 A Mixture of Correlated Poissons............................. 132
IX


4.6 Examination of a Real Data Set.................................. 137
4.7 Computational Time.............................................. 141
4.8 Predicting Our Model's Performance.............................. 147
5. Theoretical Properties and Future Work.......................... 167
5.1 The Influence Function for Our Estimator........................ 168
5.1.1 Examining the Accuracy of our Estimator..................... 180
5.2 The Calibration Stage........................................... 188
5.3 Different Distributions......................................... 189
5.4 An Optimization Approach ..................................... 191
5.5 Conclusions..................................................... 193
Appendix
A. Abbreviated Fortran code for the proposed algorithm............. 195
References......................................................... 219


FIGURES
Figure
1.1 Bivariate outliers that are not univariate outliers............ 9
2.1 ρ- and ψ-functions for the sample median and mean........... 21
2.2 ψ-functions for the biweight and other robust estimators....... 23
3.1 Robust Mahalanobis distances with estimated density and slope . . 56
3.2 Weighting function for the biweight and our proposed method ... 63
4.1 Outliers from χ²_{10} with Mahalanobis distances...................... 87
4.2 Outliers from χ²_{13} with Mahalanobis distances...................... 92
4.3 Outliers from χ²_{15} with Mahalanobis distances...................... 97
4.4 Outliers from χ²_{20} with Mahalanobis distances..................... 103
4.5 Outliers from N_p(3·1, I) with Mahalanobis distances............. 112
4.6 Outliers from N_p(2·1, I) with Mahalanobis distances............. 118
4.7 Pairwise scatter plots of wire elements....................... 138
4.8 Pairwise scatterplots of log(wire elements)................... 139
4.9 Robust Mahalanobis distances for all wire elements.............. 142
4.10 Robust Mahalanobis distances for non-zero wire elements......... 143
4.11 Outlier error rates for δ_c = 5.................................. 151
4.12 Outlier error rates for δ_c = 8.................................. 152
4.13 Outlier error rates for δ_c = 10................................. 153
4.14 Inlier error rates for δ_c = 5................................... 157


4.15 Inlier error rates for δ_c = 8................................... 158
4.16 Inlier error rates for δ_c = 10.................................. 159
4.17 Scaled actual versus fitted outlier error rates..................... 162
4.18 Scaled actual versus fitted inlier error rates ..................... 163
4.19 Actual vs. predicted outlier errors ................................ 165
4.20 Actual vs. predicted inlier errors.................................. 166
5.1 ψ-function for our robust estimation procedure .................... 175
5.2 Accuracy of the robust location estimate.............................. 185
5.3 Accuracy of the robust scale estimate................................. 186


TABLES
Table
3.1 Sample 2x2 table illustrating the notation used in this thesis ... 43
3.2 Preliminary simulations with multivariate normal outliers ....... 45
3.3 Preliminary simulations with χ² outliers........................... 47
3.4 Examining the effect of the Winsorization divisor................ 67
3.5 Examining the effect of Winsorization............................ 70
4.1 Examining the effect of cutoff, Winsorization and MINWT.......... 80
4.2 Simulation results for χ²_{10} outliers at p = 5, part 1.............. 88
4.3 Simulation results for χ²_{10} outliers at p = 5, part 2.............. 88
4.4 Simulation results for χ²_{8.54} outliers at p = 10, part 1 .......... 89
4.5 Simulation results for χ²_{8.54} outliers at p = 10, part 2 .......... 89
4.6 Simulation results for χ²_{7.5} outliers at p = 20, part 1............ 91
4.7 Simulation results for χ²_{7.5} outliers at p = 20, part 2............ 91
4.8 Simulation results for χ²_{13} outliers at p = 5, part 1.............. 93
4.9 Simulation results for χ²_{13} outliers at p = 5, part 2.............. 93
4.10 Simulation results for χ²_{10.66} outliers at p = 10, part 1.......... 94
4.11 Simulation results for χ²_{10.66} outliers at p = 10, part 2.......... 94
4.12 Simulation results for χ²_{9} outliers at p = 20, part 1.............. 96
4.13 Simulation results for χ²_{9} outliers at p = 20, part 2.............. 96
4.14 Simulation results for χ²_{15} outliers at p = 5, part 1.............. 98


4.15 Simulation results for χ²_{15} outliers at p = 5, part 2.................... 98
4.16 Simulation results for χ²_{12.07} outliers at p = 10, part 1................ 100
4.17 Simulation results for χ²_{12.07} outliers at p = 10, part 2................ 100
4.18 Simulation results for χ²_{10} outliers at p = 20, part 1................... 101
4.19 Simulation results for χ²_{10} outliers at p = 20, part 2................... 101
4.20 Simulation results for χ²_{20} outliers at p = 5, part 1.................... 104
4.21 Simulation results for χ²_{20} outliers at p = 5, part 2.................... 104
4.22 Simulation results for χ²_{15.61} outliers at p = 10, part 1................ 105
4.23 Simulation results for χ²_{15.61} outliers at p = 10, part 2................ 105
4.24 Simulation results for χ²_{12.5} outliers at p = 20, part 1 ................ 107
4.25 Simulation results for χ²_{12.5} outliers at p = 20, part 2 ................ 107
4.26 Simulation results for χ²_{25} outliers at p = 5, part 1.................... 108
4.27 Simulation results for χ²_{25} outliers at p = 5, part 2.................... 108
4.28 Simulation results for χ²_{25} outliers at p = 10, part 1................... 109
4.29 Simulation results for χ²_{25} outliers at p = 10, part 2................... 109
4.30 Simulation results for χ²_{25} outliers at p = 20, part 1................... 110
4.31 Simulation results for χ²_{25} outliers at p = 20, part 2................... 110
4.32 Simulation results for N_p(3·1, I) outliers at p = 5, part 1 ........... 113
4.33 Simulation results for N_p(3·1, I) outliers at p = 5, part 2 ........... 113
4.34 Simulation results for N_p(2.12·1, I) outliers at p = 10, part 1 . . . 115
4.35 Simulation results for N_p(2.12·1, I) outliers at p = 10, part 2 . . . 115
4.36 Simulation results for N_p(1.5·1, I) outliers at p = 20, part 1 . . . . 116
4.37 Simulation results for N_p(1.5·1, I) outliers at p = 20, part 2 . . . . 116


4.38 Simulation results for N_p(2·1, I) outliers at p = 5, part 1....... 119
4.39 Simulation results for N_p(2·1, I) outliers at p = 5, part 2 ........... 119
4.40 Simulation results for N_p(1.41·1, I) outliers at p = 10, part 1 . . . 120
4.41 Simulation results for N_p(1.41·1, I) outliers at p = 10, part 2 . . . 120
4.42 Simulation results for N_p(1·1, I) outliers at p = 20, part 1.............. 121
4.43 Simulation results for N_p(1·1, I) outliers at p = 20, part 2............... 121
4.44 Inlier error rates for shift outliers as described in [44]......... 126
4.45 Simulation results for N_p(3.46·1, C_5) outliers at p = 5, part 1 . . . 129
4.46 Simulation results for N_p(3.46·1, C_5) outliers at p = 5, part 2 . . . 129
4.47 Simulation results for N_p(3.317·1, C_10) outliers at p = 10, part 1 . 130
4.48 Simulation results for N_p(3.317·1, C_10) outliers at p = 10, part 2 . 130
4.49 Simulation results for N_p(3.24·1, C_20) outliers at p = 20, part 1 . . 131
4.50 Simulation results for N_p(3.24·1, C_20) outliers at p = 20, part 2 . . 131
4.51 Simulation results for Poisson(25) outliers at p = 5, part 1....... 133
4.52 Simulation results for Poisson(25) outliers at p = 5, part 2....... 133
4.53 Simulation results for Poisson(24.36) outliers at p = 10, part 1 . . . 135
4.54 Simulation results for Poisson(24.36) outliers at p = 10, part 2 . . . 135
4.55 Simulation results for Poisson(24) outliers at p = 20, part 1 .... 136
4.56 Simulation results for Poisson(24) outliers at p = 20, part 2 .... 136
4.57 Computation times for our algorithm, in seconds ....................... 144
4.58 Computation times for BACON*, in seconds............................... 146
4.59 Computation times for Kurtosisl*, in seconds........................... 146
4.60 Outlier error rates for regression model: ε = 5%, 10%................... 149


4.61 Outlier error rates for regression model: ε = 15%, 20% ........... 150
4.62 Inlier error rates for regression model: ε = 5%, 10%............... 155
4.63 Inlier error rates for regression model: ε = 15%, 20% ............. 156
5.1 Asymptotic Numerical Robustness Properties of Various Estimators 178
5.2 Accuracy of our robust estimator for ε = 5%, 10%................... 183
5.3 Accuracy of our robust estimator for ε = 20%, 30%.................. 184


1. Introduction
Awareness of outliers has existed for at least several hundred years. In 1620,
Francis Bacon [4] wrote,
Whoever knows the ways of Nature will more easily notice her de-
viations; and, on the other hand, whoever knows her deviations will
more accurately describe her ways.
Even as early as the first published work on least squares in 1805, Legendre
[48] explicitly mentioned rejection of outliers to improve accuracy and reduce
error:
If among these errors are some which appear too large to be ad-
missible, then those equations which produced these errors will be
rejected, as coming from too faulty experiments, and the unknowns
will be determined by means of the other equations, which will then
give much smaller errors.
Building on this idea, in 1887 Edgeworth [20] elegantly stated:
The method of Least Squares is seen to be our best course when
we have thrown overboard a certain portion of our data a sort
of sacrifice which often has to be made by those who sail upon the
stormy seas of Probability.


The outlier challenge is one of the earliest of statistical interests, and since
nearly all data sets contain outliers of varying percentages, it continues to be
one of the most important. Sometimes outliers can grossly distort the statistical
analysis, at other times their influence may not be as noticeable. Statisticians
have accordingly developed numerous algorithms for the detection and treatment
of outliers, but most of these methods were developed for univariate data sets.
This thesis focuses instead on multivariate outlier detection.
Especially when using some of the common summary statistics such as the
sample mean and variance, outliers can cause the analyst to reach a conclusion
totally opposite to the one that would be reached if the outliers weren't present. For example, a hypothesis
might or might not be declared significant due to a handful of extreme outliers.
In fitting a regression line via least squares, outliers can sufficiently alter the
slope so as to induce a sign change. For a graphical illustration of the latter, see
[70, p. 5].
Sometimes outliers can draw attention to important facts, or lead to new
discoveries, e.g. Rayleigh's discovery of argon resulting from unexpected dif-
ferences in measured weights of nitrogen [82, pp. 49-53]. During one of my
consulting projects, I was asked to evaluate patient satisfaction across various
categories (quality of physician care, attentiveness of nurses, etc.) from different
hospitals. One of the hospitals consistently received the highest score, in terms
of patient satisfaction, across all categories except for smoothness of billing pro-
cess. Since it was almost flawless in every other category, I thought there had
to be a mistake in the data. After discussing this peculiarity with the project


coordinator, I learned that this hospital had recently switched to a different ac-
counting package, which led to many incorrect accounts and frustration among
the patients. With reference to the distribution of patient satisfaction indices,
the billing score from this hospital was clearly an outlier: it was an extremely
unusual data point. We certainly did not want to discard this point; on the con-
trary, it highlighted a very important fact: this hospital needed to significantly
improve its billing process, or risk losing business. Often it will be important to
calculate how unlikely a particular observation is, but in this case simple visual
inspection was sufficient. It was certainly beneficial to become aware of this
outlier. The hospital took appropriate action and when I analyzed the following
year's data, their billing score was on par with the rest of their scores.
Another example where outliers themselves are of primary importance in-
volves air safety, as discussed in Kafadar and Morris [46]. On one of the four
hijacked flights on September 11, 5 of the 80 passengers satisfied the follow-
ing criteria: they were not U.S. citizens, they had connections to a particular
foreign country, had purchased one-way tickets at the gate, had paid in cash,
and none had checked luggage. One or two passengers satisfying these criteria
might not be too rare, but five passengers on the same flight, especially with
the benefits of hindsight, would now seem to be highly unusual. We of course
fully recognize the fact that before September 11, these passengers might not
have raised sufficient concern to spur further investigation because this type of
attack had rarely been encountered before. Some level of concern was raised by
an automated screening process, but the recommended action was to re-screen


the checked bags. Both the identification of outliers and the appropriate action
are therefore important. It would be ideal to research and identify criteria that
potentially troublesome passengers might possess; a successful (and fast) outlier
detection algorithm is imperative to correctly identify outliers amongst large
masses of regular passenger data, so that a human agent can be alerted to the
possibility of trouble and investigate the matter in person. Further applications
of outlier identification to homeland security are described in an article by Banks
[5]. Our goal in this thesis is to identify the outliers in a data set as successfully
as possible, after which the analyst can decide what to do with them.
1.1 Univariate Outliers
In univariate data, the concept of outlier seems relatively simple to define.
Outliers are those points located far away from the majority of the data; they
probably do not follow the assumed model. A simple plot of the data, such
as a scatterplot, stem-and-leaf plot, QQ-plot, etc., can often reveal which points
are outliers. This is sometimes called the "interocular test" because it hits you
between the eyes.
Relying solely on non-visual means, however, even univariate outliers might
not be as easy to identify as would appear at first sight. Consider a simple
situation: identifying outliers in a normally distributed data set. A naive outlier
identification rule might be to calculate the sample mean and standard deviation,
and classify as outliers all points further than 2 or 3 standard deviations away
from the mean. For Gaussian data, this would label as outliers those points
far enough from the main data that we think there is only a 4.56% or 0.26%


chance, respectively, that they belong to this distribution. For n truly Gaussian
observations, this would also cause 0.0456n and 0.0026n regular data points to be
labeled as outliers false positives. Conversely, outliers that an algorithm fails
to detect are referred to as false negatives. It is an unfortunate fact, however,
that the presence of two or more outliers could render some or most of the outliers
invisible to this method. If there is one or more distant outlier and one or more
not so distant outlier in the same direction, the more distant outlier(s) could
significantly shift the mean in that direction, and also increase the standard
deviation, to such an extent that the lesser outlier(s) falls less than 2 or 3
standard deviations from the sample mean, and goes undetected. This is called
the masking effect, and results in this particular method and all related methods
being unsuitable for use as outlier identification techniques. We illustrate with
a simple example, borrowed from Becker and Gather [8].
Consider a data set of 20 observations taken from a N(0,1) distribution:
-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18, 0.30
0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.1, 17.6,
where the latter two observations were originally 0.81 and 1.76, but the decimal
points were entered at the wrong place. It seems clear that these 2 observations
should be labeled as outliers; let us apply the above method. The mean of this
data set is 1.27 while the standard deviation is 4.35. Two standard deviations
from the mean, towards the right, would be 9.97, while three standard deviations
would be 14.32. Both criteria regard the point 8.1 as expected with reasonable
probability and do not consider it an outlier. Additionally, the three standard
deviation boundary for detecting outliers seems rather extreme for a N(0,1)


data set; surely a point wouldn't have to be as large as 14.32 to be classified as an
outlier. The masking effect occurs quite commonly in practice and we conclude
that outlier methods based on classical statistics are unsuitable for general use,
particularly in situations requiring non-visual techniques such as multivariate
data. It is worth noting, however, that if instead of the sample mean and
standard deviation, robust estimates of location and scale were used (such as
the sample median and the median absolute deviation, MAD), both outliers would
be detected without difficulty.
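To make the comparison above concrete, here is a minimal sketch in Python (the thesis's own code, per Appendix A, is in Fortran; this fragment is purely illustrative). It applies the naive mean/standard-deviation rule and then the median/MAD rule, using the 1.483 scaling for the MAD discussed in Section 2.2.3, to the 20 observations listed above.

    import numpy as np

    # The 20 observations above; the last two (8.1, 17.6) are decimal-point errors.
    x = np.array([-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18, 0.30,
                  0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.1, 17.6])

    # Naive rule: flag points more than 2 (or 3) standard deviations from the mean.
    mean, sd = x.mean(), x.std(ddof=1)
    print(mean, sd)                              # roughly 1.27 and 4.35, as in the text
    print(x[np.abs(x - mean) > 2 * sd])          # only 17.6 is flagged; 8.1 is masked

    # Robust rule: replace the mean and standard deviation by the median and 1.483*MAD.
    med = np.median(x)
    s_robust = 1.483 * np.median(np.abs(x - med))
    print(x[np.abs(x - med) > 3 * s_robust])     # both 8.1 and 17.6 now stand out

The cutoff multiples (2 or 3) are the same ones discussed in the text; the point of the sketch is only that the robust location/scale pair is not dragged toward the gross errors, so neither of them is masked.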
If it is known that at most one outlier is present, the naive method would
find it. An improved version of this rule forms the basis of Grubbs' test for a
single outlier [29], which measures how many standard deviations away from the
mean the smallest and largest observations are located, and can be extended to
Grubbs' test for k outliers, where k has to be known in advance. Thode [37]
compares Grubbs' test and a variety of other outlier tests, including those that do
not require k to be specified in advance, but we believe that better methods
exist for both univariate and multivariate outliers.
Related to the masking effect is the concept of breakdown point of an esti-
mator. Intuitively, the breakdown point is the least amount of contamination
that could change the estimator so much that it fails to meaningfully describe
the data set. As originally defined, the breakdown point ε* depends on the
distance between two distributions (e.g. the Prohorov or Kolmogorov distance,
or total variation distance [35, p. 97]), but an empirically more useful version
is the finite-sample breakdown point ε*_n, formally defined with reference to the
estimator T_n = T(x_1, ..., x_n) as

    \varepsilon_n^*(T_n; x_1, \ldots, x_n) := \frac{1}{n} \max\Big\{ m : \max_{i_1, \ldots, i_m} \; \sup_{y_{i_1}, \ldots, y_{i_m}} \| T_n(z_1, \ldots, z_n) \| < \infty \Big\},

where the sample (z_1, ..., z_n) is obtained by replacing the m data points
x_{i_1}, ..., x_{i_m} with arbitrary values y_{i_1}, ..., y_{i_m}. In most cases, lim_{n→∞} ε*_n = ε*.
The highest possible value of the breakdown point is 50%, because if more than
half the data were outliers, it would be unclear which data were from the main
distribution and which were outliers.
Neither the sample mean nor the sample variance has a high breakdown
point. In fact, since it takes only one extremely distant point to shift or in-
crease these respective estimators so far as to be essentially useless, both have a
breakdown point of 0 (this holds for the finite-sample breakdown point as well). On the other
hand, the median has a breakdown point of 50%, because half of the observa-
tions (strategically selected to be as malicious as possible) need to be moved
arbitrarily far away before the median becomes unusable.
1.2 Multivariate Outliers
Multivariate outliers pose challenges that do not occur with univariate data.
For instance, visual methods simply do not work. Even plotting the data in
bivariate form with a systematic rotation of coordinate pairs will not help. It
is possible (and occurs frequently in practice) that points which are outliers
in bivariate space are not outliers in either of the two univariate subsets;
see Figure 1.1. Generalization to higher dimensions leads to the fact that a


multivariate outlier does not have to be an outlier in any of its univariate or
bivariate coordinates, at least not without some kind of transformation, such as
a 45° rotation of the axes in Figure 1.1. Gnanadesikan and Kettenring [28] state
that visual detection of multivariate outliers is virtually impossible because the
outliers "do not stick out on the end."
A successful method of identifying outliers in all multivariate situations
would be ideal, but is unrealistic. By successful, we mean both highly sensitive
(the ability to detect genuine outliers) and highly specific (the ability not to
mistake regular points for outliers). In the words of Gnanadesikan and Ketten-
ring [28], it would be fruitless to search for a truly omnibus outlier detection
procedure. Thus we introduce the problem of multivariate outlier detection.
Intuitively, we label a point an outlier because it is sufficiently far away
from the majority of the data. An important tool for quantifying "far away" is
the Mahalanobis distance, defined as

    MD_i = \sqrt{ (x_i - T(X))^T C(X)^{-1} (x_i - T(X)) }

for each point x_i, where T(X) is the coordinate-wise sample mean of the data
set X, and C(X) is the sample covariance matrix. Clearly, the Mahalanobis
distance relies on classical location and scale estimators. As such, it is subject
to the masking effect, and is not suitable for general use in contaminated data.
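As a concrete illustration of this definition, the following Python sketch computes the classical Mahalanobis distances; the function name and the simulated data are illustrative assumptions, and the robust methods developed later in the thesis would replace T(X) and C(X) with robust estimates passed through the same formula.

    import numpy as np

    def mahalanobis_distances(X, location=None, scatter=None):
        """MD_i = sqrt((x_i - T)' C^{-1} (x_i - T)) for every row x_i of X.

        By default T(X) and C(X) are the classical sample mean and covariance;
        a robust location/scatter pair can be passed in instead."""
        X = np.asarray(X, dtype=float)
        T = X.mean(axis=0) if location is None else location
        C = np.cov(X, rowvar=False) if scatter is None else scatter
        diff = X - T
        # Solve C z = diff^T instead of inverting C explicitly.
        z = np.linalg.solve(C, diff.T)
        return np.sqrt(np.einsum('ij,ji->i', diff, z))

    # Example: bivariate Gaussian data plus a few planted outliers.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(200, 2)), rng.normal(loc=6.0, size=(5, 2))])
    print(mahalanobis_distances(X)[-5:])   # distances of the planted outliers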
There are several qualities we would like an estimator to possess. We would
like it to have a high breakdown point, but this could be outweighed by other
criteria. (If we were forced to use an estimator with breakdown point near 50%,


Figure 1.1: Bivariate outliers that are not univariate outliers


we should probably first do something else to clean the data.) For realistic
applications, a breakdown point of 20% is usually satisfactory.
Another desirable property of an estimator is affine equivariance. A location
estimator T_n ∈ ℝ^p is affine equivariant if and only if for any vector b ∈ ℝ^p and
any nonsingular p × p matrix A,

    T_n(AX + b) = A T_n(X) + b.

A scale estimator C_n ∈ PDS(p) (the set of positive-definite symmetric p × p
matrices) is affine equivariant if and only if for any vector b ∈ ℝ^p and any
nonsingular p × p matrix A,

    C_n(AX + b) = A C_n(X) A^T.
If an estimator is affine equivariant, stretching or rotating the data wont
affect the estimator. Dropping this requirement greatly increases the number of
available estimators, and in many cases, non-affine equivariant estimators have
superior performance to affine equivariant estimators.
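A quick numerical check of the two identities above, using the classical sample mean and covariance (which are affine equivariant, whatever their other shortcomings); the matrices generated here are arbitrary illustrations, not data from the thesis.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))                # n = 100 observations, p = 3
    A = rng.normal(size=(3, 3)) + 3 * np.eye(3)  # a nonsingular p x p matrix
    b = rng.normal(size=3)
    Y = X @ A.T + b                              # each observation x_i -> A x_i + b

    # T_n(AX + b) = A T_n(X) + b for the sample mean
    print(np.allclose(Y.mean(axis=0), A @ X.mean(axis=0) + b))
    # C_n(AX + b) = A C_n(X) A^T for the sample covariance
    print(np.allclose(np.cov(Y, rowvar=False), A @ np.cov(X, rowvar=False) @ A.T))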
A further important and desirable feature of an estimator is a computation-
ally fast algorithm to compute it. It is not uncommon to measure data set sizes
in terabytes or petabytes, and some real time applications require the outliers
to be detected within seconds or at most a few minutes (as in the case of air
passenger check-in). In these situations, slow algorithms such as the Minimum
Volume Ellipsoid (discussed on p. 28) are essentially useless, that is, algorithms
that are theoretically capable of locating the outliers, but would take millennia


to do so.
As previously mentioned, outlier detection in all data situations by the same
algorithm is not possible, but we can still aim for a reasonable diversity. This is
called the breadth of the algorithm, and we therefore strive to develop an algorithm
with a large breadth.
No estimator possesses all of these properties: high breakdown point,
affine equivariance, computational speed, and breadth. Rocke and Woodruff [68]
stated that it is entirely appropriate to develop special methods to handle special
cases. For large multivariate data sets, computational speed seems to be one of
the most difficult goals to achieve.
In this thesis, I examine some of the leading outlier detection methods, and
also propose a new algorithm, comparing and evaluating its success and speed
in a variety of situations, and investigate some useful numerical properties of its
robust estimator. Specifically, in chapter two, I discuss existing outlier detection
methods, pointing out strengths and weaknesses and identifying areas where
improvements are needed. In chapter three, I propose a new algorithm that has
several significant advantages. In chapter four, I compare the performance of the
proposed algorithm to existing methods in a variety of data situations, and in
chapter five, I perform a theoretical analysis of the robust estimation procedure
and discuss avenues for future research and possible improvement.


2. Previous Outlier Identification Methods
Outlier identification and robust location/scale estimation are closely re-
lated. Recall that the classical Mahalanobis distance has a breakdown point of
0. To increase this breakdown point, Campbell [11] proposed the use of robust
estimators, and specifically, a class called M-estimators, for T(X) and C(X).
This step marked an important improvement in identifying outliers, with the ro-
bust version not suffering from the masking effect like classical version. Clearly,
the performance of this robust version depends on the specific T(X) and C(X).
As initially introduced, M-estimators were monotone functions and were shown
by Maronna [52] to have a breakdown point no larger than This is ob-
viously an improvement over the classical version, but still quite inadequate
in higher dimensions with p = 20, the breakdown point is less than 5%.
Later, more sophisticated M-estimators were developed with breakdown point
approaching 50%. Nevertheless, using the Mahalanobis distance with robust
estimators of location and scale was an important conceptual breakthrough and
led to significantly improved versions of the Mahalanobis distance. Obtaining
robust location and scale estimates remains one of the most difficult challenges
in outlier identification; without them, all distance-based methods would fail
(and these constitute the majority of outlier identification algorithms).
Using a robust estimate of the location and scale of the primary distribution,
the modified Mahalanobis distance can provide a reasonably good estimate of
how far a particular observation x_i is from the center of this distribution. A suf-


ficiently large value of this distance indicates that the observation x_i is likely an
outlier. In 1996, Rocke and Woodruff [66] stated that all known affine equivari-
ant methods of solving the outlier identification problem consist of the following
two stages:
Phase I: Estimate a location and scale.
Phase II: Calibrate the scale estimate so that it can be used to suggest
which points are far enough from the location estimate to be considered
outliers.
Several methods yield acceptably accurate estimates of the mean, but per-
formance regarding robust estimation of the covariance matrix varies widely.
Additionally, the Mahalanobis distance is much more sensitive to small changes
(errors) in the covariance estimate C(X) than it is to changes in the mean
estimate T(X).
Quantifying the magnitude of Mahalanobis distance at which an observation
is labeled an outlier is another critical element of these algorithms. In some sit-
uations, such as normally distributed data, the distribution of squared classical
Mahalanobis distances follows an F distribution, as shown in [43, ch. 5] and [51,
pp. 73-75], but not in all situations. Hadi et al. [32] develop robust location
and scale estimates and accordingly robust Mahalanobis distances for all the
observations, classifying those points as outliers which have robust Mahalanobis
distance larger than a χ_p-percentile-based cutoff scaled by c_{npr}, where c_{npr} is a correction factor based on n,
p and r (the number of points on which the location and scale estimates were
based). We address this issue in detail in section 3.2.2; our study reveals that


while this correction factor works well for the classic Mahalanobis distance on
normally distributed data, it can be quite inaccurate (too high) for modified
Mahalanobis distances from non-normal data and fails to detect the majority of
outliers even when an apparently clear separation exists between both groups.
2.1 The Influence Function
For a given sample of size n, Hoaglin et al. [38] (and to a lesser extent,
Thode [37]) discuss the sensitivity curve SC(x; T_n), which is computed by cal-
culating the estimator T_n with and without an observation at the point x, and
is proportionate to the size of the sample:

    SC(x; T_n) = n \big( T_n(x_1, \ldots, x_{n-1}, x) - T_{n-1}(x_1, \ldots, x_{n-1}) \big).
The sensitivity curve describes the effect of an individual observation on the
estimator Tn for a specific data set. It is not hard to compute empirically, but
is not of much theoretical use. When we let the sample size n tend towards
infinity, the resulting limit is called the influence function. In the absence of
a sample data set, the influence function is defined with reference to a specific
distribution, F.
Hampel [33] introduced the concept of the influence function which he and
his collaborators popularized in a book [35]. Although the influence function
approach has been used primarily (but not exclusively) on univariate data, our
experience has been that it can be advantageous from a computational point of
view to examine each component separately. We can then combine the respec-


tive estimates to calculate the Mahalanobis distances and locate the outliers in
multivariate space.
Suppose we have a univariate sample (x_1, ..., x_n) with empirical distribution
F_n. Letting Δ_x be the point mass 1 at x, we can define F_n by F_n = (1/n) Σ_{i=1}^n Δ_{x_i}.
To estimate the location parameter θ of the underlying distribution F, we
consider the statistics T_n = T(x_1, ..., x_n) = T(F_n). We can think of these
as a sequence of statistics {T_n; n ≥ 1}, one for each possible sample size
n. Asymptotically, we consider estimators which are functionals. That is,
lim_{n→∞} T_n(x_1, ..., x_n) = T(F), where the convergence is in probability, and
we say that T(F) is the asymptotic value of {T_n; n ≥ 1} at F. The influence
function IF of the estimator T at the distribution F is given by

    IF(x; T, F) = \lim_{\varepsilon \to 0} \frac{ T\big( (1 - \varepsilon) F + \varepsilon \Delta_x \big) - T(F) }{ \varepsilon }.

If we replace F by F_{n-1} ≈ F and put ε = 1/n, we can see that IF measures
approximately n times the change in T caused by an additional observation at
x when T is applied to a large sample of size n − 1. As discussed in Hampel
et al. [35], the importance of the influence function lies in the fact that it
can describe the effect of an infinitesimal contamination at the point x on the
estimate T, standardized by the mass of the contaminant. It gives us a picture
of the asymptotic bias caused by contamination in the data.
The influence function has a number of very appealing properties. Ham-
pel [33] shows that under conditions usually met in practice, Tn is asymptot-
ically normal with asymptotic variance equal to the integral of IF2 over F;


Huber [40] also discusses these regularity conditions. That is, the distribution
of √n [T_n − T(F)] tends to N(0, V(T, F)), where V(T, F) is the asymptotic
variance and is equal to

    V(T, F) = \int IF(x; T, F)^2 \, dF(x).

This formula shows that a bounded influence function implies a bounded
asymptotic variance, and cautions against using estimators with unbounded
influence functions (for example, the sample mean).
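As a standard worked example of these formulas (a textbook calculation, not a result specific to this thesis): take T to be the mean functional, T(F) = ∫ y dF(y) = μ. Then

    \begin{align*}
    T\big((1-\varepsilon)F + \varepsilon \Delta_x\big) &= (1-\varepsilon)\mu + \varepsilon x, \\
    IF(x; T, F) &= \lim_{\varepsilon \to 0} \frac{(1-\varepsilon)\mu + \varepsilon x - \mu}{\varepsilon} = x - \mu, \\
    V(T, F) &= \int (x - \mu)^2 \, dF(x) = \sigma^2,
    \end{align*}

so the influence function of the mean is unbounded and its asymptotic variance is the usual σ², consistent with the ψ-function for the mean shown in Figure 2.1 (where μ = 0 at the standard normal).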
Another important property of the influence function is the supremum of the
absolute value, from which Hampel et al. [35] define the gross-error sensitivity
γ* of T at F as

    \gamma^*(T, F) = \sup_x | IF(x; T, F) |,

where the supremum is over all x where IF(x; T, F) exists. The gross-error
sensitivity measures the worst influence which a small amount of contamination
can have on the value of an estimator, and can be regarded as an upper bound
on the asymptotic bias of the estimator T. Ideally, γ*(T, F) is finite. In many
cases, putting a bound on γ*(T, F) is the first step in making T robust.
The influence function can also tell us more about T through the local-shift
sensitivity λ*. Slight changes in some observations, perhaps due to rounding
or grouping, could have a measurable effect on the estimate. The effect of
shifting an observation slightly from the point x to a neighboring point y can
be measured by means of IF(y; T, F) − IF(x; T, F), because an observation is


added at y and removed at x. This can be approximated by the slope of the
influence function. λ* is the worst of these normalized differences:

    \lambda^* = \sup_{x \neq y} \frac{ | IF(y; T, F) - IF(x; T, F) | }{ | y - x | }.
Recall that the earliest treatment of outliers was to simply reject them
entirely, as in Legendre [48] and Edgeworth [20]. In terms of the influence
function, this means that the IF vanishes completely in a certain region (usually
beyond a certain distance from the location estimate). In other words, if the IF
is zero in some region, then all the observations in that region have no effect on
T. For a symmetric F (and putting the center of symmetry at zero), Hampel et
al. [35] define the rejection point ρ* as

    \rho^* = \inf\{ r > 0 : IF(x; T, F) = 0 \text{ when } |x| > r \}.

If no such r exists, then ρ* = ∞ by definition. All observations farther away
than ρ* are rejected completely.
Figure 2.1 shows a scaled version of the influence function, the ψ-function,
for the sample mean and median. For T equal to the sample mean, it can be
shown that IF(x; T, F) = x, the asymptotic efficiency e = 1 (the reciprocal of
the variance), the maximum possible efficiency, and the local-shift sensitivity
λ* = 1, also the lowest possible value for meaningful estimators. However, both
the gross-error sensitivity γ* = ∞ and the rejection point ρ* = ∞, which are
very serious shortcomings in the presence of outliers. We deduce that the mean
is least susceptible to rounding errors and is very efficient, but can break down


completely if even just a single far-away outlier is present.
The influence function of the sample median has a discontinuity at the origin due to
the fact that it changes discontinuously if one of the two center values is changed,
and is horizontal (but non-zero) everywhere else. The local-shift sensitivity is
λ* = ∞ because of this discontinuity, while at F = N(0,1) the median has
efficiency e = 2/π ≈ 0.637 and for F = N(μ, σ²) an asymptotic efficiency of
e = 2/π as well. Hampel et al. [35] show that the gross-error sensitivity of the median
is γ* = 1.253, which is in fact the minimum value at F = N(0,1). Regardless
of the median's other shortcomings, no estimator can beat the median when
it comes to robustness. Finally, the rejection point ρ* = ∞, which may seem
surprising at first glance, but the median does not reject outliers; it only lessens
their influence, to a considerable degree. It counts rather than measures all
observations except for the middle one or two, respectively, for an odd or
even sample size.
Based on the above comparison, we can easily conclude that the median is
much more robust than the mean, at the cost of efficiency and local-shift sensitiv-
ity. There is still room for improvement over the median, however, particularly
regarding its efficiency e = 2/π ≈ 0.637. Using the IF-based criteria mentioned
thus far, we explicitly state four goals that we would like our estimator to attain:
Possess a low gross-error sensitivity γ*.
Possess a finite rejection point ρ*.
Possess efficiency e close to 1.


Possess a low local-shift sensitivity λ*.
Having developed these criteria with which we can evaluate the performance
of various estimators, we now examine several candidates. We start with a class
of estimators which, although not too successful themselves, paved the way for
other, modified and improved, estimators.
2.2 M-estimators
2.2.1 Monotone M-estimators
The maximum likelihood estimator (MLE) is defined as the value of T_n =
T(x_1, ..., x_n) that maximizes ∏_{i=1}^n f(x_i, T_n), where f is the estimated density of
F and is implicitly dependent on the estimated parameter T_n. In practice, we
usually calculate the MLE as the value of T_n such that

    T_n = \arg\min_T \sum_{i=1}^{n} [ - \ln f(x_i, T) ].

Huber [39] generalized this minimization problem to

    T_n = \arg\min_T \sum_{i=1}^{n} \rho(x_i, T).

Suppose that ρ has derivative ψ, ψ(x, θ) = (∂/∂θ) ρ(x, θ). Then the estimate T_n will
also satisfy

    \sum_{i=1}^{n} \psi(x_i, T_n) = 0,


although a Tn can exist as a solution to the latter equation without solving the
minimization problem. An estimator Tn defined by either of these two equations
is called an M-estimator. Huber proposed the name M-estimator because these
are generalized maximum likelihood estimators. Of these two equations, the
ψ-function is more useful in practice. In fact, Hampel et al. [35] show an
important connection between the IF and the ψ-function:

    IF(x; T, F) = \frac{ \psi(x, T(F)) }{ - \int \frac{\partial}{\partial \theta} \psi(y, \theta) \big|_{\theta = T(F)} \, dF(y) }.

That is, the ψ-function is simply a scalar multiple of the IF. Within this
framework, we present the ρ- and ψ-functions of the sample mean and median
in Figure 2.1. For the sample mean, ρ(x) = x² and ψ(x) = x (ignoring the
constant 2 as is customary, since it doesn't affect the minimization problem),
while for the sample median, ρ(x) = |x| and ψ(x) = sgn(x).
Recall that the sample median has good robustness properties, but suffers
from low efficiency and infinite local-shift sensitivity λ*, while the sample mean
has the highest possible efficiency at F = N(0,1) and the lowest possible λ*.
Huber [39] combined these properties to produce what he called the Huber esti-
mator:

    \psi_b(x) := \min\{ b, \max\{ x, -b \} \} = x \, \min\!\left( 1, \frac{b}{|x|} \right),

for 0 < b < ∞. Huber's estimator has a breakdown point of 50%, just like the
sample median, but has much better local-shift sensitivity. For b → ∞, it also has the same optimal efficiency as the sample mean. The rejection point


Figure 2.1: ρ- and ψ-functions for the sample median and mean


ρ*, however, is still unbounded because for |x| > b, the IF resembles that of the
sample median. We present a graphical version in Figure 2.2, together with 3
other robust estimators.
It would be beneficial to take note of Winsor's principle, quoted by Tukey
[81, p. 457], which states that all distributions are approximately normal in
the middle. Applied to M-estimators, this calls for the middle portion of their
ψ-functions to resemble the function that is best/most efficient for Gaussian
data, namely the sample mean. Thus, ψ-functions with optimal efficiency will
be approximately linear near the origin: ψ(u) ≈ u. At this point, we can refine
some of the properties that we would like our idealised estimator to possess:
We would like the IF to be linear in a vicinity of zero: this will provide
good efficiency (close to e = 1) as well as small local-shift sensitivity λ* for
those observations near the center (assuming a symmetric distribution).
We would like the IF to be bounded.
We would like the IF to return to zero (and remain at zero) for some dis-
tance sufficiently far away from the origin, resulting in a finite rejection
point ρ*.
The class of M-estimators that satisfies the latter objective is accordingly
called redescending M-estimators.
2.2.2 Redescending M-estimators
Redescending M-estimators are those M-estimators that are able to reject
extreme outliers completely. That is, there exists an r, 0 < r < ∞, such that


Figure 2.2: ψ-functions for the biweight and other robust estimators (panels: Andrews' sine function, Huber's estimator, Hampel's estimator, and the biweight)


    \psi(x) = 0 \quad \text{for all } |x| > r.
One of the earliest proposals for a redescending M-estimator was Hampel's
three-part redescending M-estimator, first introduced by F.R. Hampel in the
Princeton Robustness Study [3], an extensive theoretical and Monte Carlo study
of different robust estimators published in 1972. Hampel's estimator is defined
by

    \psi_{a,b,r}(x) = x,                                   0 \le |x| \le a
                    = a \, \mathrm{sign}(x),               a \le |x| \le b
                    = a \, \frac{r - |x|}{r - b} \, \mathrm{sign}(x),   b \le |x| \le r
                    = 0,                                   |x| > r,

where 0 < a ≤ b < r < ∞. It performed well in the Princeton Robustness Study, and is shown in Figure 2.2.
The lack of differentiability of ψ_{a,b,r} is not ideal, however, and a smooth
ψ-function would be preferred. This led to the development of Andrews' sine
function, also introduced in the Princeton Robustness Study [3] and defined by

    \psi_{\sin(a)}(x) = \sin(x/a) \, I_{[-a\pi, \, a\pi]}(x),

where I(x) is the indicator function. We graph the ψ-function in Figure 2.2.
In 1974 Beaton and Tukey [7] proposed the biweight function (short for
bisquare weight), a smooth function that has been successfully used in a wide
variety of applications, and depicted in Figure 2.2. In the context of ψ-functions,
it can be defined as

    \psi_{bi(r)}(x) = x \, (r^2 - x^2)^2 \, I_{[-r, r]}(x).
The term bisquare refers to the square of the second term, which ensures
continuity for both ψ(x) and ψ′(x) (useful during certain optimization routines).
The biweight has been used in applications as diverse as biostatistics (quantal
bioassay, Miller and Halpern [56]), materials science (regression of rubber
abrasion loss, Cleveland [14]) and economics (computation of the Nigerian cur-
rency exchange rate, Okafor [58]). It has also encouraged further research
regarding various properties of the ψ-function and accordingly forms the ba-
sis of several successful algorithms, including the biflat estimate introduced by
Rocke [64].
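A small sketch of the ψ-functions discussed in this section (Huber, Hampel three-part, Andrews' sine, and the biweight), written in Python for illustration; the tuning constants used here are common textbook choices, not values taken from this thesis.

    import numpy as np

    def psi_huber(x, b=1.345):
        # psi_b(x) = x * min(1, b/|x|): linear near zero, constant beyond +/- b.
        return np.clip(x, -b, b)

    def psi_hampel(x, a=1.7, b=3.4, r=8.5):
        # Hampel's three-part redescending psi-function, 0 < a <= b < r.
        ax = np.abs(x)
        return np.where(ax <= a, x,
               np.where(ax <= b, a * np.sign(x),
               np.where(ax <= r, a * (r - ax) / (r - b) * np.sign(x), 0.0)))

    def psi_andrews(x, a=1.339):
        # Andrews' sine function: sin(x/a) on [-a*pi, a*pi], zero outside.
        return np.where(np.abs(x) <= a * np.pi, np.sin(x / a), 0.0)

    def psi_biweight(x, r=4.685):
        # Tukey's biweight: x * (r^2 - x^2)^2 inside [-r, r], zero outside.
        return np.where(np.abs(x) <= r, x * (r**2 - x**2)**2, 0.0)

    grid = np.linspace(-10, 10, 9)
    print(psi_biweight(grid))   # redescends to exactly zero beyond |x| = r

Only the Huber function fails to redescend: it is bounded but never returns to zero, which is exactly the distinction between a finite and an infinite rejection point ρ*.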
2.2.3 Computational Issues of M-Estimators
Computing an M-estimator involves a numerical optimization problem
which frequently does not have a unique solution. Upon re-examination of the
defining equation,
    \sum_{i=1}^{n} \psi(x_i, T_n) = 0,

it is clear that for redescending M-estimators, the LHS will vanish for all |x| suf-
ficiently large. It might be possible to find the global minimum of \sum_{i=1}^{n} \rho(x_i, T_n)
instead, but this is not always computationally feasible. Rocke and Woodruff
[84] develop a compound estimator, formally introduced on page 31, which con-
sists of two stages: the first utilizes a combinatorial estimator such as MVE
or MCD, defined on page 28, to locate a good starting point, which can then


be used for the minimization stage. Hampel et al. [35] propose to either (a)
select the solution which is nearest to the sample median, or (b) use Newtons
method starting with the sample median. Our experience with the biweight has
been that starting from the sample median, convergence to the desired value is
successfully and quickly achieved.
M-estimators are not invariant with respect to scale, as defined up to this
point. Fortunately, scale invariance can easily be achieved by defining T_n as the
solution to

    \sum_{i=1}^{n} \psi\!\left( \frac{x_i - T_n}{S_n} \right) = 0,

where S_n is a robust estimate of scale determined beforehand. The median
absolute deviation (MAD) has the maximal breakdown point of 50% and is recommended
for use in M-estimators by Hampel et al. [35, p. 105], and by Huber [40, p.
107], who called the MAD "the single most useful ancillary estimate of scale."
Because the expected value of the MAD is 0.6745σ for a Gaussian distribution,
a robust scale estimate based on the MAD is usually defined as

    S_n = 1.483 \, \mathrm{MAD}\{x_i\} = 1.483 \, \mathrm{median}_i\{ \, | x_i - \mathrm{median}_j\{x_j\} | \, \}.
See, for example, Mosteller and Tukey [57, p. 205]. Various software packages
scale the MAD differently; the Splus routine mad defines the MAD as above.
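The pieces above (a redescending ψ, the sample median as a starting point, and S_n = 1.483·MAD as the ancillary scale) combine into a simple iteratively reweighted estimate of location. The following Python sketch is illustrative only (it is not the thesis's Fortran implementation), and the tuning constant c is an assumption.

    import numpy as np

    def biweight_location(x, c=6.0, tol=1e-8, max_iter=100):
        """Univariate biweight M-estimate of location.

        Starts from the sample median, keeps S_n = 1.483*MAD fixed, and iterates
        the weighted mean with biweight weights w(u) = (1 - u^2)^2 for |u| < 1,
        where u_i = (x_i - T) / (c * S_n)."""
        x = np.asarray(x, dtype=float)
        T = np.median(x)
        S = 1.483 * np.median(np.abs(x - T))     # robust ancillary scale
        if S == 0:
            return T
        for _ in range(max_iter):
            u = (x - T) / (c * S)
            w = np.where(np.abs(u) < 1, (1 - u**2)**2, 0.0)
            T_new = np.sum(w * x) / np.sum(w)
            if abs(T_new - T) < tol:
                break
            T = T_new
        return T

    x = np.array([-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18, 0.30,
                  0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.1, 17.6])
    print(np.mean(x), biweight_location(x))   # the two gross errors receive weight zero

Starting from the median, as suggested above, the iteration converges quickly on this data set and the decimal-point errors 8.1 and 17.6 fall outside the biweight's rejection region throughout.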
2.3 S-Estimators
A class of estimators related to M-estimators are the S-estimators, intro-
duced by Rousseeuw and Yohai [72] and so called because they are based upon


minimization of a scale statistic. An important distinction between S- and M-
estimators is that S-estimators provide simultaneous estimates of both location
and scale, while M-estimators produce estimates for only one of these quantities.
For a multivariate data set (x_1, ..., x_n), the S-estimate of location and scale is
defined as the vector T ∈ ℝ^p and matrix C ∈ PDS(p) that jointly solve

    \min |C| \quad \text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} \rho\!\left( \sqrt{ (x_i - T)^T C^{-1} (x_i - T) } \right) = b_0,

for b_0 a tuning constant. The constraint equation is often compactly written as

    \frac{1}{n} \sum_{i=1}^{n} \rho(d_i) = b_0,

where d_i is a robust version of the Mahalanobis distance. ρ is usually chosen to
be a scaled version of some base function, such as the biweight. Lopuhaa [50]
showed that the breakdown point is given by the ratio of b_0 to the maximum of
ρ. Rocke [64] showed that S-estimators in high dimension can be very sensitive
to outliers, even with high breakdown point, illustrating yet again the difficulty
of obtaining a robust estimate of multivariate scale.
Multivariate S-estimators pose a significant computational challenge. The
S-estimator is defined as the location estimate T and PDS matrix C that achieve
the unique global minimum; it can be very difficult to guarantee finding a global
minimum. In practical settings, the computed S-estimator may differ greatly


from the theoretical S-estimator. This is illustrated with the bushfire data
set. Originally published in a CSIRO report [12] and reprinted in Maronna and
Yohai [53], this data set helped to locate bushfire scars in Australia by satellite
measurements of n = 38 pixels in p = 5 frequency bands. Using the same
theoretical S-estimator, Becker and Gather [8] labelled different observations
as outliers than Maronna and Yohai [53]. Becker and Gather [8] ascribe this
to the fact that different computational algorithms were used to minimize the
objective function, and that due to a different starting point, Maronna and
Yohai's algorithm may have converged to a local minimum instead of the global
minimum. In the words of Rocke and Woodruff [84], "the algorithm is the
estimator."
For difficult minimization problems like S-estimation, a good starting point
is of the utmost importance; it is worth investing considerable time and effort
in pursuit of such. Rocke and Woodruff [84] proposed a procedure called a com-
pound estimator, which obtains initial estimates of location and scale through
one type of robust estimator, which are then used as starting points for the more
powerful S-estimator. The estimator that Rocke and Woodruff use to obtain
their initial estimates is a combinatorial estimator, described in the following
section.
2.4 Combinatorial Estimators
In 1985 Rousseeuw [69] introduced the Minimum Volume Ellipsoid (MVE)
and Minimum Covariance Determinant (MCD). These are both affine equivari-
ant estimators with high breakdown points of (⌊n/2⌋ − p + 1)/n, approaching 1/2 for large
n. The location estimate provided by the MVE is the center of the ellipsoid
of minimal volume containing approximately half of the points in X, while the
covariance estimate is provided by the shape matrix of this ellipse. Lopuhaa and
Rousseeuw [50] add a small modification, which results in a slightly higher break-
down point, and which calculates the location estimate T ∈ ℝ^p
and covariance estimate C ∈ PDS(p) according to

    \min |C| \quad \text{subject to} \quad \#\{ i : (x_i - T)^T C^{-1} (x_i - T) \le c^2 \} \ge \left\lfloor \frac{n + p + 1}{2} \right\rfloor,

where # denotes cardinality. For Gaussian data, Lopuhaa and Rousseeuw [50]
show that a good value for c² is χ²_{0.5,p}, the value for which P_{N(μ,Σ)}{ (X − μ)^T Σ^{-1} (X − μ) ≤ c² } = 0.5.
Rousseeuw and Leroy [70, p. 259] give an algorithm for the MVE.
The MCD is similar to the MVE and has the same objective function. The
difference is that in the constraint, the MCD requires only that the estimate be
formed using half the points, instead of the ellipse containing half the points. It
follows that both the MVE and MCD are too cumbersome to compute precisely,
except in small data sets. Several different algorithms have been proposed,
but of necessity they can be only approximations. These are usually based on
various optimization techniques applied to smaller subsamples of the data, and
as such can never guarantee the desired result. Standard optimization techniques
cannot be applied in the usual sense because the criteria are nonconvex and


nondifferentiable. There has been some innovative research done in this regard,
with optimization methods ranging from Glover's tabu search [26], [27], applied
to the MCD by Rocke and Woodruff [84], to a steepest descent algorithm with
random restarts proposed by Hawkins [36]. This is another case where "the
algorithm is the estimator."
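To make the flavor of these approximate algorithms concrete, here is a deliberately crude Python sketch of a random-subset search for the MCD criterion (minimize |C| over covariances computed from roughly half of the points). It is illustrative only and is not the algorithm of Hawkins [36], Rocke and Woodruff [84], or any other published routine.

    import numpy as np

    def approximate_mcd(X, n_trials=500, seed=0):
        """Crude random-subset search for the MCD criterion; illustrative only."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        h = (n + p + 1) // 2                       # size of the retained half-sample
        best_det, best = np.inf, None
        for _ in range(n_trials):
            idx = rng.choice(n, size=p + 1, replace=False)   # small random start
            T, C = X[idx].mean(axis=0), np.cov(X[idx], rowvar=False)
            try:
                d2 = np.einsum('ij,ij->i', (X - T) @ np.linalg.inv(C), X - T)
            except np.linalg.LinAlgError:
                continue                           # degenerate subset; try another
            keep = np.argsort(d2)[:h]              # half-sample closest to this start
            T, C = X[keep].mean(axis=0), np.cov(X[keep], rowvar=False)
            det = np.linalg.det(C)
            if det < best_det:
                best_det, best = det, (T, C)
        return best

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(size=(500, 5)), rng.normal(loc=8.0, size=(50, 5))])
    T, C = approximate_mcd(X)
    print(np.round(T, 2))                          # close to the main population's center

Because it searches over random subsets, such a procedure can only approximate the true minimizer, which is exactly the sense in which "the algorithm is the estimator."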
It later became apparent that the MCD was vastly superior. Rocke and
Woodruff [84] found that with p = 25 and 15% contamination, the MCD took 32
seconds to produce good enough starting values for the S-estimator to converge,
whereas after 8,192 seconds the MVE had still not converged. For the 20 x 5
wood specific gravity data set [70, p. 243], Cook and Hawkins [15] had to
examine 57,000 cases (ellipses) before finding the MVE. Since this was still
not guaranteed to be the correct answer, they allowed the algorithm to run to
completion, i.e., full enumeration, which took 50 minutes on an IBM PC/AT.
Even allowing for this outdated computer, we could certainly wish for better
performance on a mere 20 × 5 data set.
It appears that an inherent weakness of both the MVE and MCD is that the
computation time grows too fast for them to be useful in moderate and large data sets.
Hawkins [36] proposed an efficient version of the MVE where the search time
grows with pn. This leads to what Rocke and Woodruff [84] call the "perverse"
situation, in which the analysis can actually be made worse by the collection
of more data: as the size of the data set grows, so too does the deterioration in
the algorithm's performance.
2.5 Compound Estimators
In 1994, Rocke and Woodruff [84] proposed a new class of estimators called
compound estimators, which utilize the complementary strengths of combinatorial
and S-estimators. To remove the paradox of additional data degrading
the estimator's performance, they randomly partition the data into ⌊n/γ(p)⌋ cells,
where γ(p) is the minimum number of observations required in each cell. They
then apply the MCD algorithm to each cell, constrained by a predetermined
time limit. If the researcher can afford to use t seconds for this stage of the
algorithm, the MCD algorithm is only allowed to spend t/⌊n/γ(p)⌋ seconds searching
each cell. This will result in ⌊n/γ(p)⌋ different location and scale estimates; the best
of these is chosen as the starting point for the S-estimator algorithm, calculated
on the unpartitioned data set. The final solution is given by the S-estimator;
that is, the location and scale estimates producing the smallest determinant of
the covariance matrix, |C|.
Successful execution of this method requires coming fairly close to the desired
location and shape in the first stage (MCD applied to each data cell), which
would lead the S-estimation algorithm to converge to the correct root. More
data would help, not hinder, this estimator, as there is an increased chance that
one of the initial ⌊n/γ(p)⌋ cells would contain fewer outliers, provide a good starting
point, and aid convergence to the desired minimum. (Only one such "lucky" cell
is required.) Rocke and Woodruff distribute C code for this algorithm, which
they call MULTOUT.
2.6 BACON
Hadi et al. [32] describe a fast, robust method called BACON (Blocked
Adaptive Computationally-efficient Outlier Nominator). BACON is an iterative
method, the core of which consists of identifying a basic subset presumed to
be outlier free, from which the sample mean and covariance matrix can be
estimated and robust Mahalanobis distances then computed.
To construct the first basic subset, Hadi et al. [32] provide two options. The
first simply computes the sample mean and covariance of all n observations, followed
by the Mahalanobis distances (or discrepancies d_i, as Hadi et al. [32]
call them). Those observations with the cp smallest d_i's form the initial basic
subset, where c = 5 is typical and not very influential in the algorithm's convergence.
This procedure is affine equivariant but not at all robust; our experience
has been that in data sets containing a substantial number of outliers, the algorithm
falsely converges and fails to detect them. The second option computes
the coordinate-wise median m, and thereafter ||x_i - m|| for all n observations.
Those cp observations with the least distance to m form the first basic subset.
This start is robust, but not affine equivariant. Subsequent iterations are affine
equivariant, however, so the overall algorithm is nearly affine equivariant. We
list the steps required for BACON as follows:
1. Select an initial basic subset according to either of the two options, and
calculate the sample mean T_b and covariance C_b of this subset.
2. Compute the discrepancies d_i according to

    d_i(T_b, C_b) = sqrt((x_i - T_b)^T C_b^{-1} (x_i - T_b)).
3. Set the new basic subset to contain all points with discrepancy d_i less
than c_npr χ_{p,α/n}, where χ_{p,α/n} is the (1 - α/n) percentile of the χ_p distribution
and c_npr = c_np + c_hr is a correction factor, with h = ⌊(n + p + 1)/2⌋, r the size
of the current basic subset, c_hr = max{0, (h - r)/(h + r)}, and
c_np = 1 + (p + 1)/(n - p) + 2/(n - 1 - 3p).
4. Repeat steps 2 and 3 until the size of the basic subset remains constant.
5. Nominate the observations excluded by the final basic subset as outliers.
Hadi et al. [32] add the correction factor in step 3 because when the basic subset
is small relative to n, the elements of the covariance matrix tend to be smaller than they should be.
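To make the flow of steps 1-5 concrete, the following is a minimal NumPy sketch of the BACON iteration using the robust (coordinate-wise median) start. It is only an illustration under the description above, not the authors' implementation; the function name, the default alpha, and the exact form of the corrected percentile are our assumptions.

    import numpy as np
    from scipy.stats import chi2

    def bacon_sketch(X, c=5, alpha=0.05, max_iter=50):
        """Schematic BACON iteration (robust start); not the authors' code."""
        n, p = X.shape
        # Step 1: initial basic subset = cp points closest to the coordinate-wise median
        m = np.median(X, axis=0)
        basic = np.argsort(np.linalg.norm(X - m, axis=1))[:c * p]
        for _ in range(max_iter):
            Tb = X[basic].mean(axis=0)
            Cb = np.cov(X[basic], rowvar=False)
            # Step 2: discrepancies d_i(Tb, Cb)
            diff = X - Tb
            d = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Cb), diff))
            # Step 3: corrected chi percentile threshold (assumed form of the constants)
            r = basic.size
            h = (n + p + 1) // 2
            c_np = 1 + (p + 1) / (n - p) + 2 / (n - 1 - 3 * p)
            c_hr = max(0.0, (h - r) / (h + r))
            thresh = (c_np + c_hr) * np.sqrt(chi2.ppf(1 - alpha / n, p))
            new_basic = np.where(d < thresh)[0]
            # Step 4: stop when the basic subset size no longer changes
            if new_basic.size == r:
                break
            basic = new_basic
        # Step 5: nominate points outside the final basic subset as outliers
        return np.setdiff1d(np.arange(n), basic)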
This method is a faster version of a forward search algorithm proposed by
Hadi in [30] and modified in [31]. These methods are similar to BACON but,
in step 3, add only the single point with the least d_i (provided it is less than a
certain threshold value). Several variations of forward search algorithms have
been examined in the literature, e.g., in [37] and [6], but they usually converge
very slowly for large n. Similarly, backward search algorithms (also discussed
in [37] and [6]) have been proposed which start with all n observations, and
at each step discard the single observation with the furthest distance, provided
it is above a certain minimum threshold. BACON is a significant computational
improvement because it can move large blocks of observations in a single
step. Our study of BACON indicates that it is indeed computationally fast, and
demonstrates very good success amongst normally distributed data, but does
not perform as well on non-normal data. We will analyze its performance in
greater detail in chapter 4.
2.7 Non-Distance Based Methods
All of the methods mentioned so far base their outlier identification proce-
dures on a robust version of the Mahalanobis distance. While this is certainly the
most popular approach, it is not the only one. In 1974 Friedman and Tukey [25]
introduced a novel procedure which they called projection pursuit. Their pur-
pose was to find interesting structure in the data or unexpected characteristics
that might not be readily apparent. In addition to several other applications not
originally envisioned by Friedman and Tukey, including even multiple regression
[24], projection pursuit also lends itself very well to outlier identification.
For data sets following an elliptically symmetric distribution like the multivariate
normal, the usual summary statistics such as the sample mean, covariance,
etc. provide enough information to understand the data reasonably well. With other data
sets, however, important information or structure is frequently not adequately
captured by these statistics. The goal is thus to pursue projections in which the
data reveal interesting structure when viewed by the researcher in low dimension,
relying on what Friedman and Tukey [25] call the human gift for pattern
recognition. The projection pursuit algorithm assigns a numerical index, called
the projection index, to certain one- or two-dimensional projections of the data,
characterizing the amount of structure present in that projection as measured,
for example, through variation in the data density. This projection index is then
numerically maximized with respect to the parameters defining it to obtain the
most interesting projection. The projection index can, and should, according
to Huber [41], be affine equivariant. After a certain projection has been viewed,
the particular structure that it revealed is removed, so that other interesting
projections can be displayed, similar to principal components analysis, which is
actually a special case of projection pursuit.
In a comprehensive invited article, Huber [41] states that projection pursuit
methods are able to ignore noisy and information-poor representations. This
gives them a distinct advantage over methods based on inter-point distances,
such as multidimensional scaling and clustering techniques, which have both
been used for outlier identification; see [78], [59] and [67]. Friedman and Tukey
[25] originally decided to define data as uninteresting if it appeared to have a
normal distribution; Huber [41] gave heuristic arguments to support this idea.
To highlight interesting features of the data, they therefore define the projection
index to measure, and subsequently maximize, departure from normality. Due to
a renewal of interest in projection pursuit in the 1980s, Friedman [23] updated
the original algorithm. According to Huber [41], projection pursuit methods
have primarily one serious drawback: their high demand on computational
time. This will subsequently be shown to be quite severe indeed.
2.7.1 The Stahel-Donoho Estimator
The most well-known outlier identification method based upon the projection
pursuit concept is the Stahel-Donoho estimator, an affine equivariant estimator
with a breakdown point approaching 50%. This estimator was developed
by Stahel and Donoho in [79] and [19], and became better known when Maronna
and Yohai [53] published an analysis of it in 1995.
The Stahel-Donoho estimator is defined as a weighted mean and a weighted
covariance matrix, where each observation's weight is inversely proportional to
how outlying it is. This outlyingness measure is based upon the projection
pursuit idea that if a point is a multivariate outlier, there must be some one-dimensional
projection of the data in which this point is a univariate outlier.
Using a particular observation as a reference point, the Stahel-Donoho algorithm
determines which directions have optimal values for a pair of robust, univariate
location/scale estimators, then uses these estimators to assign weights to
the other points. Employing the median and the MAD (for a breakdown point
approaching 50%) necessitates solving a global optimization problem with discontinuous
derivatives for each observation. Selecting smooth estimators does
not help very much because this results in the objective function possessing
several local maxima, so that gradient search methods are rendered unsuitable.
Maronna and Yohai [53] infer that the reason not much attention has been paid
to the Stahel-Donoho estimator is the existence of other more tractable estimators such
as the MVE and S-estimators. Maronna and Yohai [53] present an algorithm
for this estimator, which works very well for p = 2, and state further that no
known feasible method exists for large p. A simulation study by Juan and Prieto
[44] with n = 100 and p = 5, 10, 20 comparing the MCD, MULTOUT, and the
Stahel-Donoho estimator showed the latter outperforming the other two estimators,
although they all approached a 100% failure rate at contamination level
ε = 10% for p = 20, at ε = 15% for p ≥ 10, and at ε = 20% for p ≥ 5. We
conclude that this method is not viable for large data sets, but it has given rise
to at least one other successful method of outlier identification.
2.7.2 Kurtosis
One way of reducing the computational cost of the Stahel-Donoho estimator
is to reduce the number of projections that need to be examined. Pena and
Prieto [60] propose a method which they call Kurtosis1. Kurtosis1 involves projecting
the data onto a set of 2p directions, where these directions are chosen to
maximize and minimize the kurtosis coefficient of the data along them. Kurtosis
is a measure of how peaked or flat a distribution is. Data sets with high kurtosis
tend to have a sharp peak near the mean, decline rather rapidly, and have
heavy tails, while data sets with low kurtosis tend to have a flattened peak near
the mean; the uniform distribution is an extreme example. The kurtosis coefficient
is also inversely related to bimodality. A small number of outliers would
thus cause heavy tails and a larger kurtosis coefficient, while a large number
of outliers would start introducing bimodality and the kurtosis would decrease.
Viewing the data along those projections that have the maximum and minimum
kurtosis values would display the outliers in a more recognizable representation.
For large data sets, the Stahel-Donoho algorithm requires projecting the
data onto randomly generated projections because full enumeration is infeasible,
and requires investigation along a very large number of projections in order to
be successful. By drastically reducing these directions, Kurtosis1 aims to signif-
icantly improve computational speed without sacrificing accuracy. Successfully
solving the optimization routine to find the maximum or minimum kurtosis val-
ues necessitates reaching the global maximum or minimum. Since this cannot be done
efficiently, we have to settle for a set of p local maximizers and p local minimizers.
Pena and Prieto [60] show that computing a local maximizer/minimizer
corresponds to finding either (a) the direction from the center of the data straight
to the outliers, which is exactly what we are looking for, or else (b) a direction
orthogonal to it. They then project the data onto a subspace orthogonal to
the computed directions and rerun the optimization routine. This process is
repeated p times, so in total 2p directions are examined, and we expect with
high probability that the outliers lie along at least one of these directions.
For each of these 2p directions, Pena and Prieto [60] use a backward search
algorithm based on the univariate median and MAD to find potential outliers.
Based upon the sample mean and covariance of all points not labeled as potential
outliers, robust Mahalanobis distances MD_i are calculated for all the
observations. Those points with MD_i² < χ²_{0.99,p} are not considered outliers and
are put back in the set of inliers. This process is repeated until convergence
(usually only a few iterations).
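For concreteness, the quantity being maximized or minimized over directions is simply the univariate kurtosis coefficient of the projected data. A minimal sketch (our own illustration, not Pena and Prieto's optimization code) is:

    import numpy as np

    def projection_kurtosis(X, u):
        """Kurtosis coefficient of the data projected onto direction u."""
        u = u / np.linalg.norm(u)          # unit-length direction
        z = X @ u                          # one-dimensional projection
        z = z - z.mean()
        return np.mean(z**4) / np.mean(z**2)**2   # fourth standardized moment

    # Example: kurtosis along a random direction for a 1,000 x 5 data set
    rng = np.random.default_rng(0)
    X = rng.chisquare(5, size=(1000, 5))
    u = rng.standard_normal(5)
    print(projection_kurtosis(X, u))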
Our study of this method shows that it does indeed do a good job of detecting
outliers, for a wide variety of outlier types and data situations, although it
tends to produce a relatively large number of false positives. An interesting
feature is that its success rate remains approximately the same for outliers both
near and far, whereas distance-based methods demonstrate a marked increase
in performance for distant outliers. Although it is a significant improvement
over the Stahel-Donoho estimator, computational speed still remains a serious
weakness, especially with regard to p.
2.8 Types of Outliers
Certain types of outliers are easier to find than others. Some algorithms
generally succeed at detecting a certain type of outlier, but fail miserably with
another type of outlier. Clearly there are an infinite number of outlier types, but
some are more common than others. It is common practice when generating data
for the purpose of testing outlier algorithms to assume the primary distribution
is an uncorrelated multivariate normal, and that the outliers come from another
multivariate normal with mean μ and covariance matrix Σ. If the estimator is
affine equivariant, we can assume without loss of generality that the primary
distribution has mean 0 and covariance I. Rocke and Woodruff [65] mention a
useful framework for multivariate normal outlier types when the primary
distribution is N_p(0, I):
Shift Outliers: The outliers have distribution N_p(μ, I).
Point Outliers: These outliers form a densely concentrated cluster, tending
towards a point mass.
Symmetric Linear Outliers: Given a unit vector u, the outliers have distribution
u N_p(0, Σ), for arbitrary Σ.
Asymmetric Linear Outliers: Given a unit vector u, and its coordinate-wise
absolute value |u|, the outliers have distribution |u| N_p(0, Σ).
Radial Outliers: The outliers have distribution N_p(0, kI), where k > 1.
Linear outliers are a commonly used outlier type. Radial outliers are too
easy to be useful in revealing differences between outlier algorithms (all of the outliers
will usually be found by all of the algorithms). Rocke and Woodruff [66] show
that for distance-based methods, point outliers and shift outliers are the most
difficult to detect. For shift outliers, the expectation of the Mahalanobis distance
between the outlier and primary distribution means is least when the covariance
matrices are similar.
Point outliers are also difficult for most distance-based methods, and these
outliers are perhaps best detected with algorithms specifically designed to detect
such densely concentrated outlier clusters. Standard clustering techniques as
described in, for example, Johnson and Wichern [43, ch. 12], do not appear to
perform very well in detecting clusters of this type. In order to detect densely
concentrated clusters, several authors have developed special techniques for this
purpose, such as Pena and Prieto [59] and Rocke and Woodruff [67].
We briefly comment on the near-exclusive use of the multivariate normal
distribution when generating sample data sets for outlier algorithms. Our study
shows that an algorithm's performance can change drastically when switching
to a different type of distribution. For example, BACON demonstrates virtually
perfect performance on a variety of normally distributed outlier types, but when
analyzing data from a mixture of χ² distributions, shows a near 100% failure rate
in many situations. There is no guarantee that observed data follow a normal
distribution in practice, and we feel that, in general, more testing of algorithms
on non-normal data should be conducted.
We thus conclude our tour of past and current outlier methods and in the
following chapter, propose a new algorithm for outlier detection.
3. The Development of a New Method
3.1 Preliminary Investigation
As a preliminary step to developing a new outlier detection method, we
briefly examined and compared existing methods to determine possible areas of
improvement. We selected four algorithms that appeared to have good poten-
tial for finding outliers: MCD (using an improved version called fastMCD [71]
implemented as cov.mcd in Splus), BACON (a DOS executable version of the
code was obtained from the author, Prof. A.S. Hadi), MULTOUT, the hybrid
algorithm of Rocke and Woodruff (a C version of the code was obtained from the
authors' website), and Kurtosis1 (a MATLAB version of the code was obtained
from the authors). In accordance with most other authors, such as Rocke and
Woodruff [66], we use a multivariate version of Tukey's contaminated Normal
model [81], (1 - ε)N(0, I) + εN(δe, λI), where e = (1, ..., 1)^T and δ = 10.
We considered p = 5, 10, 20 and ε = 10%, 20%, 30%, 40%. The values of λ we
consider are λ = 0.1 and λ = 1.0, which correspond to a concentrated cluster of
outliers (point outliers) and shift outliers, respectively. Radial outliers, λ = 5.0,
are not very interesting or informative since all the methods yielded 2% error
or less. None of the algorithms except cov.mcd were suitable for repeated simulations
in the given format, so for each parameter combination we carried out
10 simulations with n = 1,000 points, generated by the Splus random number
generator. The exception was Kurtosis1, which due to MATLAB's restriction
on large data sets, was designed for only n = 750; these points were randomly
drawn from the original set of 1,000 points.
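As an illustration of the simulation setup just described, a short NumPy sketch for generating one contaminated Normal sample follows; the function name, the seed, and the labeling convention are ours.

    import numpy as np

    def contaminated_normal(n, p, eps, delta=10.0, lam=1.0, seed=0):
        """(1 - eps) N(0, I) + eps N(delta * e, lam * I), with e = (1,...,1)^T."""
        rng = np.random.default_rng(seed)
        n_out = int(round(eps * n))
        inliers = rng.standard_normal((n - n_out, p))
        outliers = delta + np.sqrt(lam) * rng.standard_normal((n_out, p))
        X = np.vstack([inliers, outliers])
        labels = np.r_[np.zeros(n - n_out, int), np.ones(n_out, int)]  # 1 = outlier
        return X, labels

    # Example: point outliers (lam = 0.1) with 30% contamination in 5 dimensions
    X, labels = contaminated_normal(n=1000, p=5, eps=0.30, lam=0.1)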
Frequently, results of this type are presented in a 2 x 2 table showing success
and failure in identifying outliers and non-outliers, which we henceforth call
inliers. It is also possible to calculate various indices such as sensitivity, specificity
and positive/negative predictive power, terms which are commonly used in the
medical literature. Throughout this thesis, we present the percentage false negatives
followed by the percentage false positives in the same cell; we henceforth
refer to these respective values as the outlier error rate and inlier error rate.
We propose these names because the outlier error rate specifies the percentage
errors recorded within the group of true outliers, and similarly for the inlier
error rate. Note that these respective error rates correspond to the complement
of the sensitivity and the complement of the specificity. This is more compact
than a series of 2 x 2 tables and allows us to easily focus on bringing the error
rates down to zero. Nevertheless, 2 x 2 tables are very descriptive and frequently
used in the existing literature, so we illustrate the connection between our notation
Table 3.1: Sample 2 x 2 table illustrating the notation used in this thesis

                     Predicted Outliers   Predicted Inliers
    True Outliers            a                    b
    True Inliers             c                    d

    Outlier Error Rate = b/(a + b).   Inlier Error Rate = c/(c + d).
and a standard 2 x 2 table in Table 3.1.
Results from these simulations are shown in Table 3.2. The values 50.0/10.3
in the upper left cell of this table mean that for this simulation, MCD failed
to identify 50% of the outliers, and mistakenly identified 10.3% of the regular
points [inliers] as outliers. This corresponds to sensitivity 50.0%, specificity
89.7%, positive predictive value 67.6% and negative predictive value 80.7%.
Computational time was not a big issue for these simulations due to the
small data size. For BACON, we used the second option of obtaining an initial
subset (coordinate-wise medians) because performance with the other option
was markedly inferior. MULTOUT's C code took slightly longer to execute
than Kurtosis1's MATLAB code (on the order of one minute), which might
be a source of concern for larger data sets, but did not pose any difficulties for
n = 1,000. Examination of Table 3.2 reveals that both MCD and MULTOUT
are incapable of detecting concentrated outlier clusters. (This could be due to
the algorithms using an initial value originating from a cell consisting solely of
outliers, which would have a significantly smaller covariance than a cell consisting
solely of inliers, followed by convergence to the wrong solution.) BACON
demonstrates a perfect score, while Kurtosis1 is usually quite accurate, with a
few exceptions. The exceptions for Kurtosis1 occur when it missed all or nearly
all of the outliers on one or two of the ten runs. For δ = 10, however, near perfect
performance should be expected as there is virtually zero overlap between
the inlier and outlier groups.
Table 3.2: Preliminary simulations with multivariate normal outliers

                               MCD        BACON   MULTOUT    Kurtosis1
    p = 5   ε = 0.3  λ = 0.1   50.0/10.3  0/0     0/0.7      0/0.8
                     λ = 1     0/2.7      0/0     0/0.8      0/0.5
            ε = 0.4  λ = 0.1   100/35.6   0/0     100/16.9   0/0.9
                     λ = 1     0/2.5      0/0     0/0.8      10.0/0.7
    p = 10  ε = 0.2  λ = 0.1   50.0/7.6   0/0     0/1.0      0/1.0
                     λ = 1     0/2.7      0/0     0/0.8      0/1.0
            ε = 0.3  λ = 0.1   100/22.7   0/0     70.0/5.8   0/0.9
                     λ = 1     0/2.8      0/0     0/1.0      0/1.0
            ε = 0.4  λ = 0.1   100/49.1   0/0     100/22.1   0/0.8
                     λ = 1     0/2.5      0/0     0/0.7      0/0.8
    p = 20  ε = 0.1  λ = 0.1   0/3.6      0/0     20.0/1.2   0/1.0
                     λ = 1     0/3.6      0/0     0/0.9      0/1.1
            ε = 0.2  λ = 0.1   100/17.6   0/0     100/5.3    0/1.0
                     λ = 1     0/3.0      0/0     0/0.9      18.5/1.2
            ε = 0.3  λ = 0.1   100/38.2   0/0     100/12.9   11.0/5.6
                     λ = 1     76.0/2.9   0/0     29.5/1.1   0/1.2
            ε = 0.4  λ = 0.1   100/76.0   0/0     100/34.5   0/1.0
                     λ = 1     96.0/3.5   0/0     99.1/1.6   0/1.0
The data in Table 3.2 were generated from multivariate Normal distributions.
An interesting question not usually addressed in the literature is how
these methods would perform with non-Normal data. We therefore carried out
a similar experiment, but this time we generated the data from a mixture of
uncorrelated χ² distributions: the primary distribution (describing the inliers) is
a χ² distribution, with the outliers drawn from a χ² distribution with larger degrees
of freedom. These results can be seen in Table 3.3, also based on the mean over 10
iterations.
Several differences from the multivariate Normal results are apparent. The
most noticeable is BACON's significant inability to detect the χ² outliers for p = 5
and at the higher contamination levels for p = 10. MCD continues to produce relatively
high numbers of false positives, while MULTOUT and Kurtosis1 produce
better overall results.
3.2 The Development of a New Method
3.2.1 Obtaining Robust Estimates
Since it was our purpose to develop an outlier detection method suitable for
large data sets, we considered computational speed a very important criterion.
We therefore chose not to pursue projection pursuit type algorithms, such as
Kurtosis1, but to focus on distance-based methods instead. A key ingredient of
a successful distance-based method is a robust estimate of the covariance matrix.
Not only is the covariance matrix more sensitive to outliers (i.e., more difficult to
estimate robustly) than the location estimate, but the Mahalanobis distance is also
extremely sensitive to an incorrect scale estimate and could be rendered incapable of separating the
Table 3.3: Preliminary simulations with χ² outliers

                      MCD      BACON      MULTOUT   Kurtosis1
    p = 5   ε = 5%    0/12.2   46.6/0.01  0/6.2     0.2/6.9
            ε = 10%   0/12.4   91.4/0.03  0/6.56    0.2/6.65
            ε = 20%   0/10.3   97.0/0.01  0/5.2     0.2/6.7
            ε = 30%   0/9.2    99.3/0     1.0/4.0   5.5/5.6
            ε = 40%   0/6.3    99.5/0     80.1/1.0  17.2/3.6
    p = 10  ε = 5%    0/13.7   0.8/0.2    0/7.4     0/7.9
            ε = 10%   0/12.6   0.4/0.2    0/7.5     0/7.6
            ε = 20%   0/11.6   10.3/0.2   0/5.7     0/8.0
            ε = 30%   0/9.5    39.3/0.2   0/5.7     0/7.3
            ε = 40%   0/7.9    79.1/0.1   0/4.9     5.6
    p = 20  ε = 5%    0/15.4   0/0.2      0/8.0     0/7.6
            ε = 10%   0/13.8   0/0.3      0/7.3     0/8.3
            ε = 20%   0/12.4   0/0.3      0/6.3     0/7.6
            ε = 30%   0/9.5    0/0.4      0/6.0     0/6.5
            ε = 40%   0/9.3    0/0.2      0/5.0     0/5.6
outliers and inliers. We therefore devoted considerable time to obtaining a fast,
robust estimate of the covariance matrix.
Of the four estimators shown in Figure 2.2, we prefer Andrews' sine function
and Tukey's biweight function, because they are smoother than the other two
estimators. The biweight has been successfully used in a variety of situations, so
we decided to use it. For p = 2, Mosteller and Tukey [57, ch. 10] calculate a
highly robust covariance matrix based on biweight robust regression. To achieve
this, however, they sacrifice component invariance, i.e., the results from calculating
var(x) followed by var(y) are different than first calculating var(y) followed
by var(x), and hence cov(x, y) also differs from cov(y, x). Mosteller and Tukey
[57] provide an extension for p = 3 and higher, but the computational burden
increases dramatically because estimation of the off-diagonal elements involves
multiple regression procedures. This procedure would not be feasible for high
dimensions. We felt that the price of lost invariance and the additional computational
burden were too high for large data sets, so we did not pursue this direction further.
Analyzing each component separately, we subsequently investigated the
weights assigned to observations by the biweight function. Let the univariate
observations be x_1, ..., x_n with corresponding weights w_1, ..., w_n; let x_bw be the
location estimate, which we initially take to be the median; let S be a scale estimate
based on the median absolute deviation from x_bw, S = median{|x_i - x_bw|}; and
let c be a tuning constant. The weights w_i are then iteratively calculated according to
    w_i = (1 - ((x_i - x_bw)/(cS))²)²   when |x_i - x_bw| ≤ cS,
    w_i = 0                             otherwise.
Mosteller and Tukey [57] recommend c = 6 or c = 9; Hoaglin et al. [38]
examine the biweight's robustness properties, such as gross error sensitivity and
performance (efficiency), for these and other values. The biweight has a higher
efficiency at c = 9 because less data is rejected, but better (i.e., lower) gross
error sensitivity at c = 6. Kafadar [45] also conducted a study investigating the
biweight's efficiency for different values of c, sample sizes and distributions, and
found that its efficiency increases with both sample size and tuning constant c at
the Gaussian. At other distributions such as One-Wild and Slash, its efficiency
increases with sample size but decreases with c. Note that c = ∞ corresponds
to the sample mean, because then all the observations have weight equal to one.
For instance, at c = 6, Hoaglin et al. [38] show that the asymptotic variance of
the biweight at the standard Gaussian is 1.094, so that its asymptotic efficiency
relative to the mean (on Gaussian data) is e = 91.4%, and the gross error
sensitivity is γ* = 1.68. For comparison, recall that the mean has γ* = ∞ and
the median has γ* = 1.25. The asymptotic efficiency at c = 9 is e = 98.2%,
with γ* = 2.05. Mosteller and Tukey [57, p. 206] note that the performance
of an estimator with efficiency above 90% is indistinguishable in practice from
one with an efficiency of 100%, so from a theoretical viewpoint, c = 6 seems
preferable when dealing with outliers. Since the expected value of the MAD
equals 0.6745σ at the Gaussian, the rejection point cS at c = 6 equals roughly 4σ. We found
that c = 6 works better than c = 9 because it limits potential outliers from being
included in the initial location and scale estimates; it is very important that
these estimates accurately describe only the inliers. We did not investigate c = 3;
this would yield a rejection point of 0.6745 · 3σ ≈ 2σ, which would reject too many
inliers and underestimate the scale.
Another computational matter is convergence. Kafadar [45] found that three
to six iterations were sufficient for convergence to within 0.0005 of the biweight scale
statistic. For our intended applications, namely large data sets, computational
speed is very important, so we used 3 iterations. We do not store the value of the
biweight location or scale, and will shortly discuss how we found better results
when reassigning the biweight weights to either zero or one, so the only critical
aspect is whether the weights are above or below the reassignment boundary.
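A minimal NumPy sketch of this per-component weighting, with the iterative update and the 0/1 reassignment, is given below. The function name is ours, and updating the location by the weighted mean at each pass is our reading of the iteration; the text fixes only the initial location (median) and scale (MAD).

    import numpy as np

    def biweight_01_weights(x, c=6.0, cutoff=0.3, n_iter=3):
        """Biweight weights for one component, reassigned to 0 or 1 at 'cutoff'.
        Illustrative sketch of the procedure described in the text."""
        x = np.asarray(x, dtype=float)
        loc = np.median(x)                          # initial location: the median
        for _ in range(n_iter):
            S = np.median(np.abs(x - loc))          # MAD scale about current location
            u = (x - loc) / (c * S)
            w = np.where(np.abs(u) < 1, (1 - u**2)**2, 0.0)
            loc = np.sum(w * x) / np.sum(w)         # weighted location for the next pass
        return (w >= cutoff).astype(float)          # reassign the weights to 0 or 1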
After fitting the biweight to each of the p components, we used the weights
thus obtained to calculate a weighted mean and covariance matrix. For the
covariance, we used the product of the weights for the different coordinates as
follows:

    C(i, j) = Σ_{k=1}^{n} w_k^i w_k^j (x_k^i - x̄^i)(x_k^j - x̄^j) / Σ_{k=1}^{n} w_k^i w_k^j,

where the superscript indicates the dimension number (1...p) and the subscript
the observation number (1...n). Investigation into the covariance matrices
obtained in this manner revealed that they were frequently too small. This
occurred because most of the inliers received weights in the range 0.90-0.99, so
that their full contributions to the sample covariance were not counted. This
was significant because points near
the edge of the primary distribution received reduced weights, yet their full
weights were necessary for an accurate estimate. To correct this problem, we
focused attention on producing an initial, smaller set of outlier-free data from
which we could calculate an unweighted sample mean and covariance. This is
a similar idea to BACON, but we achieve the result in a very different manner.
To this end, we reassign a weight of 0 to all points with biweight weight
of less than 0.3, and a weight of 1 to those points with biweight weight of 0.3
and greater. This value of 0.3 was initially arbitrarily chosen; we selected a
relatively small value so as not to discard too much data. We subsequently refer
to the value where this separation is made as the cutoff value; determining a
suitable cutoff value for a wide variety of data situations is an important part
of this investigation.
We illustrate the superiority of this modification with a tricky set of correlated
Poisson data, following a method described in [42, ch. 37]. Consider
a data set with n = 10,000 and p = 5. Generate the first 8,000 observations
from a Poisson(10) distribution, and the first component of the next 2,000 observations
from a Poisson(40) distribution. The remaining 4 components of the
last 2,000 observations are again generated from a Poisson(10) distribution. To
obtain correlated data, we generate another 10,000 x 1 vector Y from a Poisson(10)
distribution, and add this vector to each component of the data matrix.
Let Y_1, ..., Y_5 denote the columns of the data matrix before addition of Y, and
X_1, ..., X_5 the columns after addition. It is straightforward to compute the
theoretical covariance matrix: cov(X_1, X_2) = cov(Y_1 + Y, Y_2 + Y) = var(Y) = 10
and var(X_1) = var(Y_1 + Y). For the set of inliers (the first 8,000 observations),
the variance of each column X_k is 10 + 10 = 20, k = 1, ..., 5. Using
the original set of biweight weights, the following sample covariance matrix was
obtained:
15.29 6.49 6.47 6.57 6.68
6.49 13.35 6.01 6.28 6.36
6.47 6.01 12.49 6.08 6.24
6.57 6.29 6.08 12.97 6.47
6.68 6.36 6.24 6.47 13.42
Using a cutoff value of 0.3, we obtained the following sample covariance matrix:
20.19 8.91 8.99 9.16 9.38
8.91 19.31 8.77 9.21 9.33
8.99 8.77 18.35 8.95 9.10
9.16 9.21 8.95 19.14 9.48
9.38 9.33 9.10 9.48 19.45
which is a considerable improvement and only slightly smaller than the theo-
retical inlier covariance matrix: 20s on the diagonal and 10s everywhere else.
Considering the fact that there is considerable overlap between the outliers and
inliers, and also that the data are not Normally distributed, the latter covariance
estimate is quite satisfactory. It leads to adequate separation between inlier and
outlier Mahalanobis distances, so that for this particular data set and cutoff
value, the outlier identification error is virtually zero. For comparison, we also
examine the results obtained by some of the other outlier algorithms. For the
same data set, BACON did not detect any outliers and both its initial and final
covariance estimates were very similar, the final estimate being
169.68 9.54 9.52 9.69 9.42
9.54 20.33 9.82 10.20 10.23
9.52 9.82 19.81 10.14 10.07
9.69 10.20 10.14 20.28 10.46
9.42 10.23 10.07 10.46 20.34
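For reference, a short NumPy sketch that generates the correlated Poisson test set described above follows; the function name, the seed, and the 0/1 labeling are ours.

    import numpy as np

    def correlated_poisson_data(seed=0):
        """10,000 x 5 correlated Poisson data set from the text:
        8,000 inliers; 2,000 outliers shifted in the first component only."""
        rng = np.random.default_rng(seed)
        n, p, n_out = 10_000, 5, 2_000
        Y5 = rng.poisson(10, size=(n, p))                  # base columns Y_1,...,Y_5
        Y5[n - n_out:, 0] = rng.poisson(40, size=n_out)    # first component of the outliers
        Y = rng.poisson(10, size=(n, 1))                   # common vector added to every column
        X = Y5 + Y                                         # induces cov(X_i, X_j) = var(Y) = 10
        labels = np.r_[np.zeros(n - n_out, int), np.ones(n_out, int)]
        return X, labels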
3.2.2 Determining the Rejection Boundary
An important issue is how to actually reject the outliers. Ideally, our robust
estimates of location and scale will lead to robust Mahalanobis distances demonstrating
a good separation between outliers and inliers. Johnson and Wichern
[43, ch. 5] state that for multivariate normal data, the classical Mahalanobis
distance based on the sample mean and covariance matrix follows an F distribution.
This implies that an appropriate boundary value could be found by
selecting a suitable F percentile. Our study showed that boundary values calculated
in this manner were frequently suboptimal. BACON uses a modified
χ²_p percentile, as described on page 33; we found that the rejection values it obtained
were often too high. Plotting the robust Mahalanobis distances would
easily allow perfect classification through visual inspection, but the computed
rejection point led to unnecessary misclassification errors. We therefore decided
to design a new criterion that doesn't require any assumptions and would work
for robust Mahalanobis distances computed from a wide variety of data types.
Intuitively, if robust location and scale estimates have been computed that
accurately describe the inliers and reveal the unexpectedly large displacement of
the outliers, the robust Mahalanobis distances will separate into two categories:
a group with smaller values (inliers) and a group with larger values (outliers).
For sufficient data, if we were to calculate the empirical density of these
distances, we would expect to see a peak describing the inliers, after which the density
would return close to zero, and could either reveal a smaller second peak (if the
outliers' Mahalanobis distances were sufficiently concentrated) or simply remain
close to zero. In practice, we found that even with 40% outliers, the first peak
was very well defined but the second peak was virtually nonexistent because, in
contrast to the inliers, the outliers' distances were widely dispersed.
The value of Mahalanobis distance at which the density first approaches
zero could therefore serve as a final rejection boundary. Any observations with
Mahalanobis distance less than this value would be classified as inliers, and
otherwise as outliers. This rejection point takes into account the unique characteristics
of each data set and doesn't require any assumptions other than robust
location/scale estimates that accurately describe the inliers. When the peak
flattens out and the density returns to zero, the data (inliers) upon which the
peak is calculated have vanished. We illustrate this concept in Figure 3.1. The
robust Mahalanobis distances calculated by our method, using a cutoff of 0.3,
are shown in the first graph, with the corresponding density in the second. (We
attribute the sharp rise in density for robust Mahalanobis distance greater than
100 to be a result of numerical instabilities of the density algorithm at this endpoint.)
The data for these figures were generated with n = 10,000 and p = 10;
the first 6,000 points were drawn from an uncorrelated χ²_5 and the remaining 4,000 points
from an uncorrelated χ²_13. Both graphs also show the rejection value, 18.975,
yielding 2.45% outlier error and 1.08% inlier error. For comparison, BACON
failed to detect nearly all of the outliers in the same data set (98.5% error among
outliers, no error among inliers). When BACON was modified to compute a rejection
point based upon the density method thus described, its outlier error
rate dropped to 77.4%. (That is, we inserted our density estimation routine into
the BACON program and used this instead of BACON's modified χ²_p percentile
to determine the final rejection boundary.) An even more impressive improvement
occurred when the outliers were drawn from χ²_15, for 20% contamination
in p = 5 dimensions. When computing the rejection point according to its usual
modified χ²_p percentiles, BACON produced outlier and inlier error rates of 98.1%
and 0.01%, but when modified to use the density method, these respective error
rates became 38.9% and 3.9%. (The fact that vastly different error rates
were produced by our method and BACON, utilizing the same outlier rejection
procedure, points to the fact that robust location/scale estimates are crucial
to overall performance, leading to good separation between outlier and inlier
Mahalanobis distances.)
In practice, rather than requiring the estimated density to actually reach
zero, it is better to determine the rejection point as the first time that the
slope of this density enters a small interval containing zero. As the density
Figure 3.1: Robust Mahalanobis distances with estimated density and slope


decreases, the slope is negative and increasing (approaching zero). Using the
slope gives better results, particularly in the situation where there is very
little separation between the two groups and the density peak returns to zero
very gradually. This can be clearly seen in Figure 3.1, where even among highly
contaminated data, imposing a smallness condition on the slope results in a very
accurate classification scheme. We estimate the slope through first differences,
which we have usually found to be sufficiently smooth for our purposes, but
we acknowledge that smoothing the slope estimate could improve performance.
Occasionally, we did experience some difficulty with the slope being zero before
the peak had ended, but this was actually due to non-smoothness of the estimated
density curve itself. This problem was solved by imposing the additional
requirement that the density curve be less than a certain percentage of its peak
value; we found good results with 0.3. An alternative solution (which we didn't
investigate) would be to smooth these first differences. Examination of several
empirical density curves revealed that departures from monotonicity following
the density's peak were not uncommon, and were sometimes longer than just
a few points. We leave the investigation of smoothing the density's slope for
future work, however.
Computing a confidence interval on the slope requires knowledge of the true
density, and, as previously stated, we cannot assume a χ² distribution for the
Mahalanobis distances. We therefore decided to empirically compare results
from four half-interval lengths, centered around zero: (a) 0.001, (b) 0.0005,
(c) 0.0001, and (d) 0.00005. The first value results in a higher sensitivity to
outliers, but also more false positives, than the other values, and conversely for
the fourth value. With the small number of simulations performed up to this
point, it was impossible to tell which of these four values yielded the best overall
success rates, so throughout most of our investigation we examined results from
all four values. Only near the end of our study, when we had run thousands of
simulations, did we feel comfortable in selecting (c) 0.0001 as yielding the best
overall success rates in a wide variety of data situations. The first value resulted
in an unacceptably high number of false positives in many situations, with the
opposite problem for the fourth value. It is worth noting that these interval lengths
are not dependent on the scale of the data, because the Mahalanobis distance
utilizes a covariance estimate and is accordingly scale invariant. This is one
reason why it is essential to obtain an accurate covariance estimate. The best
interval length will, however, depend on the acceptable error rates as described
above.
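A minimal sketch of this rejection-point search is given below, taking an estimated density on a grid (as produced by the kernel density step described in the next section). The slope is approximated by plain first differences, which assumes a roughly evenly spaced grid, and the constants correspond to the half-interval length and the 0.3-of-peak requirement discussed in the text; the exact tolerance may need to be matched to the grid resolution. The function name is ours.

    import numpy as np

    def rejection_point(grid, dens, half_width=1e-4, peak_frac=0.3):
        """Rejection boundary from the estimated density of Mahalanobis distances.
        grid, dens: evaluation points and density values from the KDE step."""
        slope = np.diff(dens)                        # first-difference slope estimate
        peak = np.argmax(dens)
        for j in range(peak, len(slope)):
            # first point past the peak where the slope is near zero
            # and the density has fallen below peak_frac of its maximum
            if abs(slope[j]) < half_width and dens[j] < peak_frac * dens[peak]:
                return grid[j]
        return grid[-1]                              # fallback: no flat region found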
3.2.3 Computing the Sample Density
There are several methods of computing the sample density. We followed
the ideas of Silverman [77], and especially his use of the kernel density estimate.
Let f̂(x) be the estimated density of f at x. We can calculate f̂(x) as a sum of
weight functions,

    f̂(x) = (1/n) Σ_{i=1}^{n} w(x_i, x),

where w(x_i, x) is the weight contributed by the observation x_i at the point x,
and w(x_i, x) satisfies ∫ w(x_i, x) dx = 1 and w(x_i, x) ≥ 0 for all x. It is also clear that
the smoothness properties of f̂ will depend heavily upon this weighting function.
The kernel density is obtained by letting the kernel function K replace
the weighting function and scaling appropriately, so that the kernel density estimate
is given by

    f̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x - x_i)/h).

The required conditions on K are always satisfied if K is chosen to be
a probability density, as is usually done. Specifically, these conditions are
∫ K(x) dx = 1 and K(x) ≥ 0 for all x. (A symmetric K(x) is very useful, but not
absolutely essential.) If K(x) is a probability density, so will f̂(x) be, and moreover,
f̂(x) will inherit all the continuity and differentiability properties of K(x).
Epanechnikov [22] showed that the choice of kernel is not crucial to the mean
integrated square error of the estimated density, so computational efficiency is
usually the deciding factor. The normal distribution is a common choice for
K(x), which we utilize. Among other results, the normal density will ensure
that K(x) is a smooth curve possessing derivatives of all orders. The window
width h determines the degree to which far-away observations influence the density
estimate at a particular point. It also helps to determine the smoothness of
the resulting f̂. An overly small value of h will result in a non-smooth density
estimate with individual bumps around each observation, tending towards a collection
of Dirac delta functions as h → 0, while if h is too large, the resulting f̂
will contain only one very smooth peak over the mean of the data. In our context,
choosing h too small will cause the density estimate to be overly influenced
by random fluctuations and lead to a non-smooth slope estimate. Letting h be
too large will cause the outliers to affect the density peak intended to describe
the inliers, and will extend the density's return to zero, which is critical in small-separation
situations. For data containing outliers, Silverman [77] recommends
a robust window width,

    h = 0.9 A n^{-1/5},

where A = min{standard deviation, interquartile range/1.34}. Silverman [77]
also discusses an adaptive window width which is narrower where the data are
dense and wider otherwise, but we found satisfactory results with the fixed-width
kernel, so we did not investigate this further.
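A short sketch of this fixed-width Gaussian kernel estimate with the robust window width is given below (the function name and grid size are ours, not part of the original implementation). The grid and density it returns feed directly into the rejection-point sketch shown earlier.

    import numpy as np

    def gaussian_kde_robust(d, n_grid=512):
        """Fixed-width Gaussian kernel density estimate of the distances d,
        using Silverman's robust window width h = 0.9 * A * n^(-1/5)."""
        d = np.asarray(d, dtype=float)
        n = d.size
        iqr = np.subtract(*np.percentile(d, [75, 25]))
        A = min(d.std(ddof=1), iqr / 1.34)
        h = 0.9 * A * n ** (-0.2)
        grid = np.linspace(d.min(), d.max(), n_grid)
        # f_hat(x) = (1/(n h)) sum_i K((x - d_i)/h), with K the standard normal density
        u = (grid[:, None] - d[None, :]) / h
        dens = np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
        return grid, dens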
To our knowledge, the estimated density of robust Mahalanobis distances
has not been previously used in outlier identification, and we consider this to be
an important contribution to the field.
3.2.4 Modifications to the Original Proposal
For most of this investigation, we also computed a similar estimate of location,
namely the coordinate-wise location estimate x_bw provided by the biweight
function. After we were able to efficiently run thousands of simulations, however,
we rejected this variation. The weighted mean gave slightly better results, in
terms of producing more separation between outlier and inlier Mahalanobis distances,
and produced overall error rates approximately 0.5% to 0.75% lower than the
coordinate-wise biweight location estimate.
Another variation that we examined was what we refer to as the MINWT
option. After computing the biweight weights and reassigning them to either
zero or one, MINWT consists of setting all p weights of an observation's components
equal to the minimum weight across these components. That is, if at
least one component receives a weight of zero, the other components of that
observation also receive a weight of zero. The purpose here is to make absolutely
sure that no outliers receive a weight of one and influence the sample
mean/covariance calculation. Since the classical sample mean and covariance
are very sensitive to outliers, they could easily be adversely affected if sufficient
care isn't taken to ensure an outlier-free subset. If a particular observation is an
outlier, some of its components might be slightly inside the region of acceptance
and receive weights above the cutoff value. These weights would be raised to one,
influencing the robust estimates more than was intended. The MINWT option
is certainly not very efficient and would only be feasible when there is sufficient
data to discard questionable points. For certain difficult situations, this option
did in fact lead to significantly better results, notably for heavy contamination
(30% to 40%) and very little outlier separation.
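The MINWT reassignment itself is a one-line operation on the n x p matrix of 0/1 weights; a small sketch (function name ours) is:

    import numpy as np

    def apply_minwt(W):
        """MINWT option: replace each observation's p (0/1) weights with the
        minimum weight across its components, so a single rejected component
        rejects the whole observation. W is an n x p matrix of 0/1 weights."""
        return np.repeat(W.min(axis=1, keepdims=True), W.shape[1], axis=1)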
Initially, our cutoff value of 0.3 was somewhat arbitrarily chosen as a value
that might provide an initial outlier-free subset. Subsequent analysis indicated
that 0.3 might not be the best choice; better results were found using a higher
cutoff. For the location estimate x_bw, define u(x) as a measure of the scaled
distance from x_bw:

    u(x) = (x - x_bw) / (cS).
The biweight weighting scheme can then be written in the form

    w(u) = (1 - u²)².

The u-value corresponding to w(u) = 0.3 is

    u(0.3) = sqrt(1 - √0.3) ≈ 0.6725.

For a cutoff of 0.3, if a point x satisfies |u(x)| > 0.6725, then w(x) = 0; otherwise
w(x) = 1, for |u(x)| ≤ 0.6725. The effect of this reassignment is to change the
smooth biweight weighting function to a step function with discontinuities at
u = ±0.6725. We depict these two functions graphically in Figure 3.2.
It can be seen from this figure that the area beneath the curve of the modified
weighting function is considerably larger than that beneath the biweight.
Intuitively, it seems that this would result in the overall weights being too high,
allowing too many outliers to influence the robust estimates. It might be
good for the area beneath the two curves to be at least reasonably similar, so
that when the weights are changed, certain properties are preserved, such as
the mean weight w̄. It is not hard to calculate the cutoff value at which the
two weighting schemes will have the same area under the curve. For ease of
computation, we consider only u ≥ 0.
Figure 3.2: Weighting function for the biweight and our proposed method
Since ∫_0^1 (1 - u²)² du = 1 - 2/3 + 1/5 = 8/15 ≈ 0.53, if the area under the curve
of the step function is to equal 0.53, the cutoff value needs to be

    w(0.53) = (1 - 0.53²)² ≈ 0.52.
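A two-line numerical check of this equal-area arithmetic (an illustration only):

    area_biweight = 1 - 2/3 + 1/5           # integral of (1 - u^2)^2 over [0, 1] = 8/15
    cutoff = (1 - area_biweight**2)**2       # weight at u = 0.533, approximately 0.52
    print(round(area_biweight, 3), round(cutoff, 3))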
Using a cutoff value of 0.52, we did in fact find greater success at outlier
detection than with 0.3. We also continued the trend of examining higher cutoff
values and found that outlier detection success continued to increase, with a
markedly smaller increase in the inlier error rate. For instance, on a 10,000 x 5
data set with 8,000 observations from an uncorrelated χ²_5 and 2,000 observations
from a second uncorrelated χ² distribution, a cutoff value of 0.3 resulted in an outlier error rate of
7.21% and an inlier error rate of 1.90%. The corresponding figures for a 0.5 cutoff
were 3.41% outlier error and 1.63% inlier error, while for a 0.8 cutoff they were
1.94% and 1.81%, respectively. Using a higher cutoff such as 0.8 leads
to overall lower weights being assigned and more conservative location/scale
estimates, yielding a higher sensitivity to outliers. We surmise that perhaps the
benefits of equal area under the curve are not as important as other aspects
of the algorithm. At this point, we leave the cutoff as a parameter yet to be
ascertained, and return to it in the following chapter.
A higher cutoff will reject more data, and it is certainly possible that the
covariance estimate will be too small to give an accurate description of the
inliers. We therefore investigated Winsorization of the covariance matrix. Win-
sorization is a robust technique most often applied to the mean, and related to
the trimmed mean. Given n ordered observations y_(1), ..., y_(n), the symmetric
Winsorized mean is given by

    ȳ_w = (1/n) (g y_(g+1) + y_(g+1) + y_(g+2) + ... + y_(n-g) + g y_(n-g)).
Winsorization has the property that it greatly lessens the influence of outliers
by effectively moving them to the boundary of acceptability, but does not reject
them completely. Dixon [18] notes that this procedure is called Winsorization in
honor of Charles P. Winsor, who actively promoted its use. Winsorized variances
are not as commonly used, however.
Our situation is not as straightforward as computing the Winsorized variance
on a set of observations {y_i}, but the principles can still be applied. If
an observation x_i is beyond the region of acceptability, it will be given a low
weight w_i regardless of which side of the location estimate it is situated on. An
observation with weight equal to (or very close to) the cutoff point will lie on
the boundary of the region of acceptability; this is the point where we would
like to start Winsorizing. We therefore search for the observation with the
closest possible weight to the cutoff value (but not smaller), and record its value
x_w. When computing the variance summation, we effectively replace all observations
with weight less than the cutoff point with this recorded value x_w.
Assume that the weight of x_w is the kth order statistic among the sequence of
n weights {w_i} for that particular dimension, i.e., the weight of x_w is the kth
smallest for that dimension. The numerator is thus a sum of squared deviations,
k(x_w - x̄)² + Σ(x_i - x̄)², where the sum runs over the observations with weight
greater than that of x_w and x̄ is the location estimate for that
dimension.
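A sketch of this Winsorized variance for a single component follows, using the divisor n adopted below; the function name is ours and the edge cases (for example, an empty initial subset) are ignored for brevity.

    import numpy as np

    def winsorized_variance(x, w, loc, cutoff=0.5):
        """Winsorized variance for one component, as described in the text:
        observations whose biweight weight falls below 'cutoff' are replaced by
        the value x_w whose weight is closest to (but not below) the cutoff."""
        x, w = np.asarray(x, float), np.asarray(w, float)
        keep = w >= cutoff
        x_w = x[keep][np.argmin(w[keep])]       # boundary value: smallest weight >= cutoff
        k = np.count_nonzero(~keep)             # number of replaced observations
        num = k * (x_w - loc) ** 2 + np.sum((x[keep] - loc) ** 2)
        return num / x.size                     # divisor n (see the Table 3.4 discussion)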
Determining a suitable divisor required further investigation. The divisor for
the Winsorized mean is always taken to be n, but for the Winsorized variance
several authors [83], [85] use h = n - 2g, while [37] uses a somewhat different divisor;
[73] utilize h = n - 2g, but also state that using n would result in more conservative
confidence intervals. A larger divisor (such as n) results in a smaller
covariance matrix and increased sensitivity to outliers, a desirable property. We
accordingly conducted a small experiment to examine the effect of this divisor
on outlier and inlier error rates; the results are shown in Table 3.4. For these
comparisons, we ran 1,000 simulations on a 10,000 x 5 data set, with inliers
and 20% outliers drawn from uncorrelated χ² distributions with different degrees of freedom,
and with the MINWT option. Note that for 20% contamination, a divisor of 0.8n
corresponds to n - g. For this particular data situation, the best error rates
are obtained with a divisor larger than n. Employing a type of argument
commonly used against overfitting, however, we are not sure this would be the
best strategy for all data situations: especially for low contamination levels, a
large divisor would create problems and cause too many inliers to be rejected.
A divisor of n seems to be a good compromise; it is reasonably conservative in
not allowing outliers to heavily influence the initial location/scale estimates. We
henceforth use this value during Winsorization.
It is also worth noting that we Winsorize only the diagonal elements of the
covariance matrix, not the off-diagonal elements. Moving outlying observations
an unspecified distance to the boundary of acceptability would clearly distort how
the ith coordinate varies with the jth coordinate, Σ_k (x_k^i - x̄^i)(x_k^j - x̄^j). Thus,
Table 3.4: Examining the effect of the Winsorization divisor

    Divisor   Outlier Error %   Inlier Error %
    0.7n      2.18              1.70
    0.75n     2.13              1.73
    0.8n      2.05              1.78
    0.85n     2.00              1.82
    0.9n      1.94              1.86
    0.95n     1.88              1.90
    1.0n      1.81              1.95
    1.05n     1.77              1.99
    1.1n      1.73              2.02
    1.15n     1.68              2.07
    1.2n      1.64              2.11
    1.25n     1.60              2.15
    1.3n      1.57              2.18
    1.35n     1.53              2.22
    1.4n      1.49              2.27
    1.45n     1.46              2.30
    1.5n      1.42              2.34
    1.6n      1.36              2.42
    1.7n      1.30              2.51
    2.0n      1.15              2.73
the only values we use in computing the off-diagonal elements are those with
weight equal to one.
Generally, we found that Winsorizing the variance gave better results. We
present some examples of covariance matrices obtained with and without Win-
sorization. Consider a 10,000 x 5 data set drawn from a mixture of uncorrelated
χ²_5 and χ²_15, a situation where observations from the two groups overlap. We
consider contamination levels of 5%, 10% and 20%. Without Winsorization, the
covariance matrix of the observations in the initial subset, using a cutoff of 0.7
with 5% contamination, was estimated as
5.48 -0.08 0.09 0.03 0.003
-0.08 5.28 -0.0004 0.01 -0.01
0.09 -0.0004 5.31 -0.05 0.06
0.03 0.01 -0.05 5.52 0.08
0.003 -0.01 0.05 0.08 5.21
Recall that for χ²_5, the theoretical variance is 10. Employing Winsorization, and
with the rest of the parameters kept constant, the diagonal elements were
{8.40 8.19 8.23 8.45 7.96}.
(The off-diagonal elements were the same since we only Winsorize the variance
estimates.) For 10% contamination and without Winsorization, the estimated
diagonal elements were
{6.21 6.13 6.24 6.40 5.88},
68


and with Winsorization
{10.23 10.06 10.11 10.45 9.72}.
Finally, for 20% contamination, the diagonal elements without Winsorization
were estimated to be
{9.31 9.13 9.19 9.34 9.23},
and with Winsorization as
{16.21 16.03 16.06 16.42 15.94}.
Winsorization increases the variance because to some extent, it still considers the
outliers. We computed the error rates from the above situations, based on 1,000
simulations and no MINWT, and present the results in Table 3.5. As before, we
utilize the format out.err%/in.err% to denote the outlier and inlier percentage
errors. It is not possible to draw any conclusions from this table. Based on
the actual covariance estimates, however, it seems likely that Winsorization
could be beneficial for low contamination levels, but as the variance estimate is
inflated with increasing percentage of outliers, augmenting the variance through
Winsorization becomes undesirable.
Another variation which we tried was wpct. Instead of a fixed cutoff, we
select the observations for the initial subset according to the upper wpct percentage
of weights, where wpct is determined in advance. Not surprisingly, this
method produced very good results if wpct was close to the percentage of actual in-
Table 3.5: Examining the effect of Winsorization
Contamination Level Without Winsorization With Winsorization
5% 0.55/4.63 0.75/3.88
10% 0.91/3.77 1.19/3.08
20% 4.55/2.64 4.31/2.00
liers in the data, but faltered if there was a big discrepancy. Sometimes general
characteristics of the type of data being investigated can lead the researcher to
develop an approximate estimate of the contamination level. For instance, many
data sets usually contain less than 5% outliers. Several outlier methods require
the number of outliers to be specified in advance, such as Grubbs' test for k outliers
in a specified tail [29], [37], so employing wpct would put our procedure in
this class of algorithms. wpct is a good alternative to other algorithms in this
class because it does not require the exact number of outliers to be specified in
advance. It will still yield good results as long as the two percentages are fairly
close, and demonstrates a relatively gradual drop in performance with increasing
discrepancy. For an unknown percentage of outliers, however, a fixed cutoff
value is to be preferred.
This algorithm is computationally appealing, as will be analyzed in more
detail on page 141. Since each dimension is analyzed separately, the computational
burden grows relatively slowly with increasing dimension. This is in marked contrast
to most other methods, such as the MCD, Kurtosis1 and even MULTOUT
(which depends on the MCD in its initial stage), where increasing p leads to a sig-
nificant increase in computational time. Fitting the actual biweight function is
not computationally intense, and once the weights have been assigned, calculat-
ing the sample mean and covariance matrix are the fastest possible methods of
obtaining meaningful location and scale estimates. At this stage, estimation of
the sample density is the most time-consuming step of the algorithm, but as we
discuss on page 75, faster methods exist for this step, which lead to considerable
improvement.
3.3 A Summary of Our Proposed Method
Before proceeding to the next chapter, we summarize our proposed method; an illustrative code sketch of these steps is given after the list.
1. Calculate the biweight weights for each data point using the algorithm
described on page 48, fitting each component separately.
2. If an observation's weight is less than the cutoff value, reassign that weight
to zero; otherwise reassign it to one. The optimal cutoff value still needs
to be investigated.
3. Option MINWT: Select the minimum weight from each observation's components
as the weight for the other p - 1 components too. The effectiveness
of this option still needs to be determined. The order of this step can also
be interchanged with the previous step.
4. Compute the weighted mean and covariance of all the observations, which
amounts to the sample mean and covariance of those observations in the
initial subset. The effect of Winsorizing the variance still needs to be
investigated.
5. Calculate a robust Mahalanobis distance for all n observations, using the
mean and covariance matrix computed in the previous step.
6. Calculate the density of these Mahalanobis distances, using the Normal
density as the kernel and window width as described on page 60, and use
first differences to approximate the slope.
7. There will be a peak describing the inliers, where the slope takes a large
positive value, then a large negative value, and then increases towards
zero. Find the value of the Mahalanobis distance where this negative slope
first enters a small interval around zero; a good value for the half-length
of this interval is 0.0001. The value of the Mahalanobis distance where the
slope is sufficiently close to zero (and where the density is also less than
0.3 of its maximum value) is called the rejection point.
8. Classify all points as inliers or outliers based on whether their Mahalanobis
distance is smaller or larger than the rejection point.
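To make the eight steps concrete, the following NumPy sketch strings them
together. It is only an illustration of the flow, not the thesis implementation
(which is written in Fortran 90): the biweight fit of step 1 is replaced by a
simple median/MAD stand-in, the kernel bandwidth is the library default rather
than the window width described earlier, and the componentwise weighting
without MINWT is omitted. All function names and tuning constants here are
our own illustrative choices.

    import numpy as np
    from scipy.stats import gaussian_kde

    def biweight_weights(x, c=6.0):
        # Rough stand-in for the componentwise biweight fit: weights based on
        # the median and MAD of one coordinate (not the fitting algorithm of
        # the thesis).
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        u = (x - med) / (c * mad + 1e-12)
        w = (1.0 - u**2) ** 2
        w[np.abs(u) >= 1.0] = 0.0
        return w

    def outlier_flags(X, ctf=0.8, slope_tol=1e-4, dens_frac=0.3):
        n, p = X.shape
        W = np.column_stack([biweight_weights(X[:, j]) for j in range(p)])  # step 1
        # Steps 2-3 (with MINWT): keep an observation only if every component
        # weight is at least the cutoff.
        keep = (W >= ctf).all(axis=1)
        mu = X[keep].mean(axis=0)                                           # step 4
        S = np.cov(X[keep], rowvar=False)
        Xc = X - mu
        md = np.sqrt(np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(S), Xc))    # step 5
        grid = np.linspace(md.min(), md.max(), 512)
        dens = gaussian_kde(md)(grid)                                       # step 6
        slope = np.diff(dens) / np.diff(grid)                               # step 7
        ok = (np.abs(slope) < slope_tol) \
             & (dens[1:] < dens_frac * dens.max()) \
             & (grid[1:] > 0.5 * md.mean())
        cut = grid[1:][ok][0] if ok.any() else md.max()
        return md > cut                                                     # step 8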
4. Results from the Proposed Method
4.1 Computational Details
Having developed an algorithm for identifying outliers, it is necessary to
evaluate its performance over a wide range of data types. We also need to ex-
amine its computational feasibility on larger data sets, the intended primary
application. So far we have tested our method in Splus, which is very slow,
and particularly inefficient at handling large loops. We therefore decided to imple-
ment it as a Fortran program, which could be used to rapidly run thousands
of simulations for diverse data types and different parameter values. Fortran
was selected because of the abundance of existing mathematical and statistical
subroutines, some of which utilized complex numerical techniques and had been
written to be computationally efficient. The specific version used was Fortran
90, a considerable improvement over Fortran 77. I did not make a full study
of all the improvements offered by F90, and there are almost certainly several
of my routines that could be optimized for F90. I am also aware of the development
of F95, but decided to designate this upgrade as a future project: F95 takes a more
object-oriented approach, and I have had only limited experience with object-oriented
programming.
A very important issue that arises in these types of numerical investigations
is the randomness of the pseudo-random number generator (PRNG). For the
programs presented in this thesis, we utilize the random number generators
described in the second volume of Numerical Recipes for Fortran [62, ch. B7],
updated and improved for F90. The period of this random number generator is
2.3 × 10^18, which more than suffices for our applications. The PRNG presented
in the second volume of Numerical Recipes [62] is a faster and more efficient
version of the authors' ran2 routine in the first volume [61]. In the first volume
[61], the authors claim that within the limits of floating point precision, ran2
provides perfect random numbers. As a practical definition of perfect, they
offered $1,000 to the first reader who could prove that ran2 fails to generate
random numbers, by means of a statistical test or otherwise. Ten years later
when the second volume [62] was published, the authors stated that the award
had not yet been claimed. This PRNG produces sequences of random U(0,1)
numbers, from which appropriate transformations can be made to obtain random
data from other distributions. Some of the transformation routines are also
provided in the first volume of Numerical Recipes [61, ch. 7]. For instance,
to obtain normally distributed random numbers from U(0,1) deviates, we use
the Box-Muller transformation (described in [62], with corresponding Fortran
code). χ²_df deviates can easily be obtained from Gamma deviates, since the chi-
squared distribution is a special case of the Gamma distribution, but in general,
obtaining random Gamma deviates is a nontrivial problem. To generate these
deviates, we utilize Fortran routines posted at ranlib [63], a division of the Netlib
repository, which are based on a modified rejection technique described in [2].
Based on these algorithms, we therefore assume that the data generated for the
purpose of this thesis are sufficiently random.
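The thesis programs rely on the Fortran routines cited above; purely as an
illustration of the same transformations, the NumPy sketch below applies
Box-Muller to U(0,1) deviates and obtains χ² deviates as a special case of the
Gamma distribution. The seed and sample size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(12345)          # arbitrary seed
    n = 100_000

    # Box-Muller: two U(0,1) deviates give two independent N(0,1) deviates.
    u1 = 1.0 - rng.random(n)                    # shift to (0, 1] so log(u1) is finite
    u2 = rng.random(n)
    z1 = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)
    z2 = np.sqrt(-2.0 * np.log(u1)) * np.sin(2.0 * np.pi * u2)

    # Chi-squared via the Gamma distribution: chi^2 with df degrees of freedom
    # is Gamma(shape = df/2, scale = 2).
    df = 5
    chi2 = 2.0 * rng.gamma(shape=df / 2.0, size=n)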
One aspect of our algorithm that necessitated some change was the density
calculation. The method described in the previous chapter utilizes an inefficient,
complete double summation. For each point at which we wish to estimate the
density, we have to compute the influence of all n Mahalanobis distances, even
though many of these distances are too far removed to have a measurable effect
on the density at that particular point. In Applied Statistics algorithm AS 176,
Silverman [76] describes and provides Fortran code for a routine based on the
Fast Fourier Transform which is significantly faster than the complete double
summation. Part of his algorithm involves building a histogram of the data with
2^k cells (we use k = 9), and then applying the FFT to both the data and the
kernel K(x). (Recall that we used a Gaussian kernel.) We found that utilizing
this procedure reduced the computation time by approximately a factor of 10.
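The sketch below illustrates the binned, FFT-based idea in NumPy. It is not
Silverman's AS 176 routine; the padding, the half-cell offset introduced by
binning, and the bandwidth choice are all simplified, and the function name is
ours.

    import numpy as np

    def fft_kde(x, h, k=9):
        # Gaussian-kernel density estimate on a grid of 2**k cells, computed by
        # circular convolution of bin counts with the kernel via the FFT.
        m = 2 ** k
        lo, hi = x.min() - 3.0 * h, x.max() + 3.0 * h      # pad so kernel tails fit
        edges = np.linspace(lo, hi, m + 1)
        delta = edges[1] - edges[0]
        grid = 0.5 * (edges[:-1] + edges[1:])              # bin centres
        counts, _ = np.histogram(x, bins=edges)
        # Kernel values at the circular distances between grid cells.
        offsets = np.arange(m) * delta
        dist = np.minimum(offsets, m * delta - offsets)
        kern = np.exp(-0.5 * (dist / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
        dens = np.fft.irfft(np.fft.rfft(counts) * np.fft.rfft(kern), n=m)
        return grid, np.maximum(dens, 0.0) / len(x)        # clip FFT round-off

    # Example: density of 10,000 standard-normal values with bandwidth 0.2.
    g, f = fft_kde(np.random.default_rng(1).standard_normal(10_000), h=0.2)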
Another issue that arose during the practical implementation of our algo-
rithm was finding the desired rejection point, specifically concerning smooth-
ness of the density estimate. The first point where the slope enters the interval
[-0.0001,0.0001] is not always at the end of the main density peak, due to small
numerical inaccuracies in the estimated density. Sometimes the estimated slope
assumed very small negative values within the first few data points. We could
easily solve this problem by requiring the rejection point to be greater than half
the mean Mahalanobis distance. A second problem arose because the density
peak did not always return to zero in a smooth curve; sometimes there were
short plateaus, or other deviations from what should be a monotonically decreasing
segment. At times, the lengths of these plateaus were such that removing
them through a smoothing process would require the smoothing window to be
sufficiently large as to have other undesirable effects on the estimated density,
such as increase in mean integrated square error. An alternative (and faster)
solution would be to constrain the rejection point to regions where the density
was less than a certain fraction of its maximum peak value. After some inves-
tigation, we settled on a value of 0.3 for this fraction. For the majority of data
situations, the algorithm's performance was not sensitive to this value of 0.3,
but for high contamination (35%–40%) the density curve returned to zero very
slowly and raising this value to 0.4 did produce slightly better results.
A few other miscellaneous notes regarding our program's computational
specifics are in order. We use double precision in all of our programs to guard
against round-off error. To calculate the median (used for both the biweight
algorithm and BACON), we employ a version of Quicksort which sorts only as
much data as necessary to locate the median; specifically, a subroutine written by
Alan Miller [55]. For the interquartile range (to determine the optimal window
width for density estimation), I used the selection routine from Numerical
Recipes [61, p. 334], which calculates the kth smallest value in an array. Finally,
to calculate the inverse of the PDS covariance matrix for Mahalanobis distance
calculation, I used the subroutine dpodi, from the Linpack [49] repository of linear
algebra public domain software, another subdivision of Netlib.
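Only as an illustration of selecting order statistics without a full sort (the
thesis uses Alan Miller's partial Quicksort and the Numerical Recipes selection
routine), a NumPy stand-in might look as follows; np.partition plays the role of
those routines here.

    import numpy as np

    def kth_smallest(x, k):
        # Return the k-th smallest value (1-based) via partial partitioning.
        return np.partition(np.asarray(x, dtype=float), k - 1)[k - 1]

    x = np.random.default_rng(7).random(10_001)
    n = len(x)
    median = kth_smallest(x, (n + 1) // 2)             # middle order statistic (odd n)
    q1 = kth_smallest(x, max(1, int(round(0.25 * n))))
    q3 = kth_smallest(x, max(1, int(round(0.75 * n))))
    iqr = q3 - q1                                      # used for the density window width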
Naturally, we want to compare our results with the best existing outlier
methods. None of the other algorithms we examined up to this point were in a
form suitable for rapid multiple iterations, so we wrote Fortran code for these
algorithms too, in order to run thousands of iterations and effectively compare
performances. BACON has usually produced good results up to this point, so it
was a natural choice for further investigation. The executable version provided
by the author ran on an MS-DOS platform; since it was not possible to view
or modify the source code, I wrote a Fortran program based directly on the
article from Hadi et al. [32], which will henceforth be referred to as BACON*
to distinguish it from the author's original implementation. I again utilized some
of the linear algebra routines from Linpack [49], and additionally also a collection
of routines from cdflib [10] to calculate the inverse χ² percentiles. The latter
routines were written by Brown and Lovato (MD Anderson Cancer Center),
and based on an algorithm by DiDonato and Morris [17] as well as tables from
Abramowitz and Stegun [1].
Another candidate that has performed well up to this point is Kurtosis1. It
does not appear to be perturbed by difficult data types such as concentrated
outlier clusters or non-Normal data. It also is very appealing in that it rep-
resents an innovative class of outlier algorithms, based on improved projection
pursuit methods, and redesigned so as to be significantly more computation-
ally feasible. The original MATLAB code was clearly not suitable for our use
since it restricted the data set to n = 750, while the MATLAB platform was
also very slow and not intended for this type of simulation. I therefore wrote
Fortran code based directly on the MATLAB program, which will henceforth
be referred to as Kurtosis1*. In addition to the routines already mentioned,
Kurtosis1* also necessitated a large number of other linear algebra routines
such as Cholesky decomposition, simultaneous solution of systems of equations
(including also special cases for under/over-determined systems of equations),
eigenvalue/eigenvector calculation (for several different matrix types), QR fac-
torization and LQ factorization. I was able to obtain Fortran subroutines for
these procedures from lapack [47] and eispack [21], both special sections of the
Netlib repository. Although the portion of the code that I wrote was only about
1,500 lines, after adding all the required subroutines, the total program length
was nearly 18,000 lines. I am indeed very grateful to the authors of these routines
for making them available in the public domain.
We did not examine the MULTOUT algorithm during the large-scale sim-
ulation stage. Preliminary results seemed to indicate that it did not perform
as well on non-standard data types such as outlier clusters and possibly also in
higher dimensions such as p = 20. Juan and Prieto [44] conducted a simula-
tion experiment using Normal data with small separations, comparing the MCD,
MULTOUT and other methods, in which MULTOUT demonstrated significantly
decreased performance (100% failure rate in detecting outliers) in higher dimen-
sions (p = 10,20).
We have not examined MULTOUT sufficiently to draw definite conclusions,
but we do note that decreased performance in higher dimensions could be a
result of utilizing the MCD for the initial stage. The MCD does not handle
increased dimension very well; subdividing the data into smaller cells reduces n
but not p. Though it did not influence our decision, we also note that the C
implementation of MULTOUT took a very long time to execute, even longer than our
Fortran version of Kurtosis1* (which took a very long time, as will be discussed
later). The other algorithm we previously examined, the MCD, is for reasons
already discussed not a viable option for use in large data sets, and also fared the
worst of the 4 methods we examined. Thus we decided to compare our method
to only BACON* and Kurtosis1*.
The choice of data is also important. Most outlier algorithms in the liter-
ature are tested on Normally distributed data, even though there is no reason
to believe that the majority of applications and real world data sets satisfy this
assumption. When these algorithms are tested on real data sets, these sets are
typically of small n and p. We felt that the χ² distribution offered a good
alternative to the normal, due to its asymmetry and longer tails. For instance,
when mixing χ² inliers with χ²₁₅ or χ²₁₀ outliers, there is significant overlap
between the two groups and the algorithms could be tested under extreme circumstances.
4.2 Choice of Parameters for our Method
With the limited investigations performed up to this point, it has not been
possible to determine the best value for the cutoff parameter ctf, or the success
of Winsorization and MINWT. We therefore ran 1,000 simulations with cutoff
values from 0.5 to 0.95, and all combinations of Winsorization crossed with
MINWT. We used a data set of size 10,000 × 5, with 8,000 observations drawn
from an uncorrelated χ² distribution and 2,000 from uncorrelated χ²₁₅. This
resulted in some, but not excessive, outlier overlap, and a moderate contamination
level. The results are presented in Table 4.1, utilizing the format
outlier error % / inlier error % discussed in Table 3.1 on p. 43.
Table 4.1: Examining the effect of cutoff, Winsorization and MINWT
ctf     Wins. / MINWT   Wins. / No MINWT   No Wins. / MINWT   No Wins. / No MINWT
0.5     3.34/1.79       8.98/2.24          3.04/2.37          11.78/2.78
0.55    2.96/1.76       7.57/2.16          2.49/2.34          9.45/2.85
0.6     2.68/1.75       6.27/2.11          2.13/2.31          7.58/2.74
0.65    2.50/1.74       5.24/2.03          1.86/2.32          6.00/2.66
0.7     2.30/1.77       4.36/1.97          1.64/2.36          4.63/2.60
0.75    2.13/1.80       3.61/1.92          1.48/2.40          3.49/2.55
0.8     1.98/1.84       2.93/1.90          1.31/2.50          2.54/2.57
0.85    1.81/1.95       2.48/1.92          1.16/2.71          1.92/2.69
0.9     1.79/2.13       2.27/2.04          1.05/3.34          1.46/3.13
0.95    1.74/2.58       1.98/2.49          0.76/10.00         0.89/10.12
We notice from Table 4.1 that as the cutoff increases, the outlier error rate
decreases monotonically while the inlier error rate follows a quadratic curve; we
indicate the minimum of this quadratic in bold type. For increased cutoff, the
variance estimate shrinks because a greater portion of the outlying points are
assigned a weight of zero and are not included in the estimate. A smaller, more
accurate variance estimate causes distant points to receive large robust Maha-
lanobis distances, with a greater possibility of being labelled as outliers. Since
the outliers clearly do lie farther away, a smaller variance estimate results in
more outliers possessing large robust Mahalanobis distances, a greater separa-
tion from the inliers, and an increased likelihood of being correctly labeled as
outliers. Hence the outlier error rate decreases with increasing cutoff.
Investigation of the quadratic trend of the inlier error rates showed that for
lower cutoff values, the off-diagonal elements of the covariance estimate were
considerably larger than zero, e.g. for a cutoff of 0.5, the off-diagonal elements
were on the order of 2.5. This may be due to unintended correlations between
the inliers and outliers. As the cutoff increases, fewer outliers are assigned a
weight of one, and the off-diagonal elements decrease to zero. The improved
estimates of these off-diagonal elements result in lower robust Mahalanobis dis-
tances for the inliers, and thus fewer of them are incorrectly rejected. The error
rate therefore decreases. When these off-diagonal elements are sufficiently close
to zero, however, improving them no longer leads to lower inlier error rates,
which subsequently start increasing. In contrast to the outlier error rate, a
smaller variance estimate leads to higher inlier error rates because some of the
inliers receive large robust Mahalanobis distances, and are incorrectly labelled
as outliers. When we artificially force the off-diagonal covariance elements to
be zero, the inlier error rate increases monotonically with the cutoff. These two
opposite trends highlight the difficulty in obtaining a uniformly best cutoff value.
It appears that the combination of No Winsorization and No MINWT does
not lead to as good results as the other 3 combinations, so we do not consider it
further. A possible explanation for this is that without any kind of adjustment,
20% outliers is a sufficiently heavy contamination level to adversely affect the
covariance estimate, and leads to an inadequate separation between inlier and
outlier Mahalanobis distances. We also selected 2 values of the cutoff to in-
vestigate at each of the remaining three factor combinations, placing somewhat
more weight on lower outlier error, while still maintaining reasonable inlier er-
rors. For Winsorization and No MINWT, we investigate cutoff values of 0.85
and 0.9, while for MINWT, and both levels of Winsorization, we investigate
cutoff values of 0.8 and 0.85.
We examine dimensions p = 5, 10, 20 and contamination levels ε = 5%, 10%,
20%, 30%, 35%, 40%. The level ε = 35% was examined because the algorithm sometimes
demonstrated a marked increase in error rate between ε = 30% and ε = 40%;
adding this value showed that the sudden increase usually occurred between
ε = 35% and ε = 40%. We are not aware of significant outlier research or testing on
non-normal data, so we investigate a mixture of χ² distributions as an example
of highly skewed data. It is still important, however, to verify performance of our
method on normally distributed data, so we also investigate two types of shift
outliers, the most difficult type of Gaussian outliers [Rocke, 1996]. We are also not
aware of significant outlier research on correlated data, so the third category of
outlier distributions we examine is a mixture of correlated multivariate normals.
The multivariate normal distribution was chosen as an example of correlated
data because it will be easy to compare results directly with the uncorrelated
shift outliers.
The computational cost of running these simulations is reasonable, so we
follow a complete factorial experimental design [9]. We ran 1,000 simulations
for our proposed method, as well as for BACON*, but only 10 iterations for
Kurtosis1*, which still took roughly 3–5 times longer than 1,000 iterations with
our method. We will conduct a more complete investigation of computational
time in section 4.7, but at this stage, it is important to note that the reported
results for Kurtosis1* do not have the same accuracy as the other methods.
A simple geometric argument reveals that for a given inlier distribution,
if the same outlier distribution is used for p = 5, p = 10 and p = 20, the
average distance to the outliers will increase with increased dimension, and an
algorithm's success rate will increase with dimension, provided it is capable of
analyzing large dimensions (the MVE and MCD are examples of algorithms
that do not satisfy the latter condition). For instance, the one-dimensional
distance from the origin to a point 5 units away is clearly 5, while the two-
dimensional distance from the origin to (5, 5) is √(5² + 5²) ≈ 7.07. To compare
the effect of dimension on outlier detection success, we therefore need to scale the
distance to the outlier mean so that it remains approximately constant for all
dimensions. The Euclidean distance could be used to calculate the appropriate
scale factors for uncorrelated data with the same variance in each dimension,
but for a general data set, the Mahalanobis distance between two populations
is a more appropriate metric [75, p. 36] that takes into account not only the
variance of the data, but also the correlations between components. For a given
separation between the inlier and outlier distribution means, which we will call
δ, the expected value of the Mahalanobis distance needs to remain constant
regardless of dimension:
E[(μ_in − μ_out)^T Σ^{-1} (μ_in − μ_out)] = δ²

where μ_in is the mean of the inlier distribution, μ_out the mean of the outlier
distribution, and Σ is a convenient scale matrix [51, p. 31], which we take
to be the covariance matrix of the inlier distribution. Setting (d, ..., d)^T =
μ_in − μ_out, i.e. d is the one-dimensional difference between μ_in and μ_out,
we can evaluate the expected squared Mahalanobis distance as

δ² = d² ∑_{i,j} c^{ij},

where the c^{ij} are the elements of the inverse covariance matrix Σ^{-1} and the
sum runs over all pairs (i, j). For a given value of δ², we therefore determine
μ_out such that d = δ / ( ∑_{i,j} c^{ij} )^{1/2}.
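A small numerical sketch of this scaling, with hypothetical values of δ and Σ
(not taken from the simulations), is given below in NumPy.

    import numpy as np

    def shift_for_separation(delta, sigma):
        # Per-coordinate shift d giving Mahalanobis separation delta:
        # delta^2 = d^2 * sum_ij c^{ij}, where C is the inverse of sigma.
        c = np.linalg.inv(sigma)
        return delta / np.sqrt(c.sum())

    # For uncorrelated, unit-variance inliers the sum of the elements of the
    # inverse covariance is p, so d = delta / sqrt(p).
    delta = 5.0
    for p in (5, 10, 20):
        print(p, shift_for_separation(delta, np.eye(p)))   # 2.236, 1.581, 1.118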
For the simulations in the following sections, we will examine several values of δ²,
and among others, note how rapidly success at outlier detection improves with