IDENTIFICATION OF MULTIVARIATE OUTLIERS

IN LARGE DATA SETS

by

Mark Werner

B.Sc., Applied Mathematics and Physics, University of Stellenbosch, 1993

B.Sc.(Hons.), Applied Mathematics, University of Stellenbosch, 1994

M.Sc., Applied Mathematics, University of Stellenbosch, 1996

A thesis submitted to the

University of Colorado at Denver

in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Applied Mathematics

2003

by Mark Werner

All rights reserved.

This thesis for the Doctor of Philosophy

degree by

Mark Werner

has been approved

by

Burt Simon


Date

Werner, Mark (Ph.D., Applied Mathematics)

Identification of Multivariate Outliers

in Large Data Sets

Thesis directed by Professor Karen Kafadar

ABSTRACT

In this thesis a new algorithm is proposed for detecting outliers in large and

very large data sets. This procedure uses Tukey's biweight function to assign

weights to data values in each dimension, then reassigns a weight of one to those

points with weight above a certain cutoff value and zero to those below. The

sample mean and covariance can be efficiently calculated over those observations

with weight equal to one, leading to robust Mahalanobis distances for all the

observations. We estimate the density of these Mahalanobis distances and deter-

mine a rejection point where the slope of the density is sufficiently close to zero.

All observations with Mahalanobis distance greater than this rejection point are

declared outliers. This algorithm demonstrates very good outlier identification

properties, especially on non-normal data, an area where most existing methods

do not perform as well. We successfully model and predict its error rates for a

mixture of chi-squared distributions and various parameters such as amount of

contamination, separation of chi-squared distributions, and data dimension. It is

computationally fast, and is very capable of analyzing data sets containing both

a large number of observations and high dimensions. To analyze the performance of

the robust estimation procedure from a theoretical perspective, we also calculate

several important asymptotic robustness properties at the standard Normal dis-

tribution. Among others, this estimator possesses a low gross error sensitivity,

a high relative efficiency, and a high breakdown point, as well as considerable

flexibility that enables it to adapt to a variety of data situations and contami-

nation levels. Compared to other methods that we have examined, this outlier

algorithm is considerably faster, and with special reference to non-normal data,

offers excellent all-round performance.

This abstract accurately represents the content of the candidate's thesis. I

recommend its publication.

Signed

Karen Kafadar


DEDICATION

This thesis is dedicated to a wonderful Mom and Dad, who lovingly sup-

ported me throughout my academic career: To my Mom, who never faltered in

her belief that I would not only finish, but finish well. And to my Dad, who

regretted not being able to take a course in Statistics, and who told me, "What-

ever else you study at University, be sure to take a course in Statistics; it may

be one of the most useful courses you will ever take." Thanks, Mom and Dad!

ACKNOWLEDGMENT

I would like to sincerely thank and acknowledge the significant role that the

following people undertook in helping this thesis to materialize.

My adviser, Prof. Karen Kafadar, for her insightful suggestions regard-

ing how to approach many of the problems addressed in this thesis. Her

knowledgeable advice frequently led me to fruitful research avenues, and

by example showed me how to conduct high quality statistical research. I

also greatly appreciate her patience in reviewing and providing construc-

tive criticism during the preliminary drafts of this thesis.

My thesis committee for their helpful suggestions, moral support and con-

stant encouragement to finish and "get out of here!"

The administrative staff, Marcia Kelly, Liz Lhotka and Dana Franklin,

for their continual helpfulness, especially regarding administrative matters,

and for bearing with me when I missed the odd deadline.

The faculty of the CU-Denver Mathematics Department for imparting

some of their knowledge to me, always encouraging me to aim higher and

still higher, and for providing me the opportunity to obtain a competitive

degree.

CONTENTS

Figures ............................................................... xi

Tables............................................................... xiii

Chapter

1. Introduction......................................................... 1

1.1 Univariate Outliers................................................ 4

1.2 Multivariate Outliers.............................................. 7

2. Previous Outlier Identification Methods............................. 12

2.1 The Influence Function............................................ 14

2.2 M-estimators...................................................... 19

2.2.1 Monotone M-estimators........................................... 19

2.2.2 Redescending M-estimators....................................... 22

2.2.3 Computational Issues of M-Estimators............................ 25

2.3 S'-Estimators..................................................... 26

2.4 Combinatorial Estimators.......................................... 28

2.5 Compound Estimators............................................... 31

2.6 BACON............................................................. 32

2.7 Non-Distance Based Methods........................................ 34

2.7.1 The Stahel-Donoho Estimator..................................... 35

2.7.2 Kurtosis........................................................ 37

2.8 Types of Outliers................................................. 39


3. The Development of a New Method................................. 42

3.1 Preliminary Investigation..................................... 42

3.2 The Development of a New Method............................... 46

3.2.1 Obtaining Robust Estimates..................................... 46

3.2.2 Determining the Rejection Boundary............................. 53

3.2.3 Computing the Sample Density................................... 58

3.2.4 Modifications to the Original Proposal...................... 60

3.3 A Summary of Our Proposed Method.............................. 71

4. Results from the Proposed Method................................ 73

4.1 Computational Details............................................ 73

4.2 Choice of Parameters for our Method............................ 79

4.3 Skewed Data: A Mixture of χ² Distributions..................... 85

4.3.1 A Mixture of χ²'s with δ² = 12.5............................... 85

4.3.2 A Mixture of χ²'s with δ² = 32 ................................ 90

4.3.3 A Mixture of χ²'s with δ² = 50 ................................ 95

4.3.4 A Mixture of χ²'s with δ² = 112.5.............................. 99

4.3.5 A Mixture of χ²'s with δ² = 200 .............................. 106

4.4 Symmetric Data: A Mixture of Gaussians........................ 111

4.4.1 An Investigation of Shift Outliers.......................... 111

4.4.2 A Further Investigation of Shift Outliers................... 123

4.5 Correlated Data............................................... 125

4.5.1 A Mixture of Correlated Normals.............................. 127

4.5.2 A Mixture of Correlated Poissons............................. 132


4.6 Examination of a Real Data Set.................................. 137

4.7 Computational Time.............................................. 141

4.8 Predicting Our Model's Performance.............................. 147

5. Theoretical Properties and Future Work.......................... 167

5.1 The Influence Function for Our Estimator........................ 168

5.1.1 Examining the Accuracy of our Estimator..................... 180

5.2 The Calibration Stage........................................... 188

5.3 Different Distributions......................................... 189

5.4 An Optimization Approach ..................................... 191

5.5 Conclusions..................................................... 193

Appendix

A. Abbreviated Fortran code for the proposed algorithm............. 195

References......................................................... 219


FIGURES

Figure

1.1 Bivariate outliers that are not univariate outliers............ 9

2.1 ρ- and ψ-functions for the sample median and mean........... 21

2.2 ψ-functions for the biweight and other robust estimators....... 23

3.1 Robust Mahalanobis distances with estimated density and slope . . 56

3.2 Weighting function for the biweight and our proposed method ... 63

4.1 Outliers from χ²_10 with Mahalanobis distances...................... 87

4.2 Outliers from χ²_13 with Mahalanobis distances...................... 92

4.3 Outliers from χ²_15 with Mahalanobis distances...................... 97

4.4 Outliers from χ²_20 with Mahalanobis distances..................... 103

4.5 Outliers from Np(3·1, I) with Mahalanobis distances............. 112

4.6 Outliers from Np(2·1, I) with Mahalanobis distances............. 118

4.7 Pairwise scatter plots of wire elements....................... 138

4.8 Pairwise scatterplots of log(wire elements)................... 139

4.9 Robust Mahalanobis distances for all wire elements.............. 142

4.10 Robust Mahalanobis distances for non-zero wire elements......... 143

4.11 Outlier error rates for δc = 5.............................. 151

4.12 Outlier error rates for δc = 8................................... 152

4.13 Outlier error rates for δc = 10.................................. 153

4.14 Inlier error rates for δc = 5.................................. 157

4.15 Inlier error rates for δc = 8........................................ 158

4.16 Inlier error rates for δc = 10....................................... 159

4.17 Scaled actual versus fitted outlier error rates..................... 162

4.18 Scaled actual versus fitted inlier error rates ..................... 163

4.19 Actual vs. predicted outlier errors ................................ 165

4.20 Actual vs. predicted inlier errors.................................. 166

5.1 ψ-function for our robust estimation procedure .................... 175

5.2 Accuracy of the robust location estimate.............................. 185

5.3 Accuracy of the robust scale estimate................................. 186


TABLES

Table

3.1 Sample 2x2 table illustrating the notation used in this thesis ... 43

3.2 Preliminary simulations with multivariate normal outliers ....... 45

3.3 Preliminary simulations with χ² outliers........................... 47

3.4 Examining the effect of the Winsorization divisor................ 67

3.5 Examining the effect of Winsorization............................ 70

4.1 Examining the effect of cutoff, Winsorization and MINWT.......... 80

4.2 Simulation results for χ²_10 outliers at p = 5, part 1.............. 88

4.3 Simulation results for χ²_10 outliers at p = 5, part 2.............. 88

4.4 Simulation results for χ²_8.54 outliers at p = 10, part 1 .......... 89

4.5 Simulation results for χ²_8.54 outliers at p = 10, part 2 .......... 89

4.6 Simulation results for χ²_7.5 outliers at p = 20, part 1............ 91

4.7 Simulation results for χ²_7.5 outliers at p = 20, part 2............ 91

4.8 Simulation results for χ²_13 outliers at p = 5, part 1.............. 93

4.9 Simulation results for χ²_13 outliers at p = 5, part 2.............. 93

4.10 Simulation results for χ²_10.66 outliers at p = 10, part 1.......... 94

4.11 Simulation results for χ²_10.66 outliers at p = 10, part 2.......... 94

4.12 Simulation results for χ²_9 outliers at p = 20, part 1.............. 96

4.13 Simulation results for χ²_9 outliers at p = 20, part 2.............. 96

4.14 Simulation results for χ²_15 outliers at p = 5, part 1.............. 98

4.15 Simulation results for χ²_15 outliers at p = 5, part 2..................... 98

4.16 Simulation results for χ²_12.07 outliers at p = 10, part 1................ 100

4.17 Simulation results for χ²_12.07 outliers at p = 10, part 2................ 100

4.18 Simulation results for χ²_10 outliers at p = 20, part 1................... 101

4.19 Simulation results for χ²_10 outliers at p = 20, part 2................... 101

4.20 Simulation results for χ²_20 outliers at p = 5, part 1.................... 104

4.21 Simulation results for χ²_20 outliers at p = 5, part 2.................... 104

4.22 Simulation results for χ²_15.61 outliers at p = 10, part 1................ 105

4.23 Simulation results for χ²_15.61 outliers at p = 10, part 2................ 105

4.24 Simulation results for χ²_12.5 outliers at p = 20, part 1 ................ 107

4.25 Simulation results for χ²_12.5 outliers at p = 20, part 2 ................ 107

4.26 Simulation results for χ²_25 outliers at p = 5, part 1.................... 108

4.27 Simulation results for χ²_25 outliers at p = 5, part 2.................... 108

4.28 Simulation results for χ²_25 outliers at p = 10, part 1................... 109

4.29 Simulation results for χ²_25 outliers at p = 10, part 2................... 109

4.30 Simulation results for χ²_25 outliers at p = 20, part 1................... 110

4.31 Simulation results for χ²_25 outliers at p = 20, part 2................... 110

4.32 Simulation results for Np(3·1, I) outliers at p = 5, part 1 ........... 113

4.33 Simulation results for Np(3·1, I) outliers at p = 5, part 2 ........... 113

4.34 Simulation results for Np(2.12·1, I) outliers at p = 10, part 1 . . . 115

4.35 Simulation results for Np(2.12·1, I) outliers at p = 10, part 2 . . . 115

4.36 Simulation results for Np(1.5·1, I) outliers at p = 20, part 1 . . . . 116

4.37 Simulation results for Np(1.5·1, I) outliers at p = 20, part 2 ... . 116


4.38 Simulation results for Np(2·1, I) outliers at p = 5, part 1....... 119

4.39 Simulation results for Np(2·1, I) outliers at p = 5, part 2 ........... 119

4.40 Simulation results for Np(1.41·1, I) outliers at p = 10, part 1 . . . 120

4.41 Simulation results for Np(1.41·1, I) outliers at p = 10, part 2 . . . 120

4.42 Simulation results for Np(1·1, I) outliers at p = 20, part 1.............. 121

4.43 Simulation results for Np(1·1, I) outliers at p = 20, part 2............... 121

4.44 Inlier error rates for shift outliers as described in [44]......... 126

4.45 Simulation results for Np(3.46·1, C5) outliers at p = 5, part 1 . . . 129

4.46 Simulation results for Np(3.46·1, C5) outliers at p = 5, part 2 . . . 129

4.47 Simulation results for Np(3.317·1, C10) outliers at p = 10, part 1 . 130

4.48 Simulation results for Np(3.317·1, C10) outliers at p = 10, part 2 . 130

4.49 Simulation results for Np(3.24·1, C20) outliers at p = 20, part 1 . . 131

4.50 Simulation results for Np(3.24·1, C20) outliers at p = 20, part 2 . . 131

4.51 Simulation results for Poisson(25) outliers at p = 5, part 1....... 133

4.52 Simulation results for Poisson(25) outliers at p = 5, part 2....... 133

4.53 Simulation results for Poisson(24.36) outliers at p = 10, part 1 . . . 135

4.54 Simulation results for Poisson(24.36) outliers at p = 10, part 2 . . . 135

4.55 Simulation results for Poisson(24) outliers at p = 20, part 1 .... 136

4.56 Simulation results for Poisson(24) outliers at p = 20, part 2 .... 136

4.57 Computation times for our algorithm, in seconds ....................... 144

4.58 Computation times for BACON*, in seconds............................... 146

4.59 Computation times for Kurtosisl*, in seconds........................... 146

4.60 Outlier error rates for regression model: ε = 5%, 10%................... 149


4.61 Outlier error rates for regression model: ε = 15%, 20% ........... 150

4.62 Inlier error rates for regression model: ε = 5%, 10%............... 155

4.63 Inlier error rates for regression model: ε = 15%, 20% ............. 156

5.1 Asymptotic Numerical Robustness Properties of Various Estimators 178

5.2 Accuracy of our robust estimator for ε = 5%, 10%................... 183

5.3 Accuracy of our robust estimator for ε = 20%, 30%.................. 184


1. Introduction

Awareness of outliers has existed for at least several hundred years. In 1620,

Francis Bacon [4] wrote,

Whoever knows the ways of Nature will more easily notice her de-

viations; and, on the other hand, whoever knows her deviations will

more accurately describe her ways.

Even as early as the first published work on least squares in 1805, Legendre

[48] explicitly mentioned rejection of outliers to improve accuracy and reduce

error:

If among these errors are some which appear too large to be ad-

missible, then those equations which produced these errors will be

rejected, as coming from too faulty experiments, and the unknowns

will be determined by means of the other equations, which will then

give much smaller errors.

Building on this idea, in 1887 Edgeworth [20] elegantly stated:

The method of Least Squares is seen to be our best course when

we have thrown overboard a certain portion of our data, a sort

of sacrifice which often has to be made by those who sail upon the

stormy seas of Probability.


The outlier challenge is one of the earliest of statistical interests, and since

nearly all data sets contain outliers of varying percentages, it continues to be

one of the most important. Sometimes outliers can grossly distort the statistical

analysis, at other times their influence may not be as noticeable. Statisticians

have accordingly developed numerous algorithms for the detection and treatment

of outliers, but most of these methods were developed for univariate data sets.

This thesis focuses instead on multivariate outlier detection.

Especially when using some of the common summary statistics such as the

sample mean and variance, outliers can cause the analyst to reach a conclusion

totally opposite to the one reached if the outliers weren't present. For example, a hypothesis

might or might not be declared significant due to a handful of extreme outliers.

In fitting a regression line via least squares, outliers can sufficiently alter the

slope so as to induce a sign change. For a graphical illustration of the latter, see

[70, p. 5].

Sometimes outliers can draw attention to important facts, or lead to new

discoveries, e.g. Rayleigh's discovery of argon resulting from unexpected dif-

ferences in measured weights of nitrogen [82, pp. 49-53]. During one of my

consulting projects, I was asked to evaluate patient satisfaction across various

categories (quality of physician care, attentiveness of nurses, etc.) from different

hospitals. One of the hospitals consistently received the highest score, in terms

of patient satisfaction, across all categories except for smoothness of billing pro-

cess. Since it was almost flawless in every other category, I thought there had

to be a mistake in the data. After discussing this peculiarity with the project


coordinator, I learned that this hospital had recently switched to a different ac-

counting package, which led to many incorrect accounts and frustration among

the patients. With reference to the distribution of patient satisfaction indices,

the billing score from this hospital was clearly an outlier: it was an extremely

unusual data point. We certainly did not want to discard this point; on the con-

trary, it highlighted a very important fact: this hospital needed to significantly

improve its billing process, or risk losing business. Often it will be important to

calculate how unlikely a particular observation is, but in this case simple visual

inspection was sufficient. It was certainly beneficial to become aware of this

outlier. The hospital took appropriate action and when I analyzed the following

year's data, their billing score was on par with the rest of their scores.

Another example where outliers themselves are of primary importance in-

volves air safety, as discussed in Kafadar and Morris [46]. On one of the four

hijacked flights on September 11, 5 of the 80 passengers satisfied the follow-

ing criteria: they were not U.S. citizens, they had connections to a particular

foreign country, had purchased one-way tickets at the gate, had paid in cash,

and none had checked luggage. One or two passengers satisfying these criteria

might not be too rare, but five passengers on the same flight, especially with

the benefits of hindsight, would now seem to be highly unusual. We of course

fully recognize the fact that before September 11, these passengers might not

have raised sufficient concern to spur further investigation because this type of

attack had rarely been encountered before. Some level of concern was raised by

an automated screening process, but the recommended action was to re-screen


the checked bags. Both the identification of outliers and the appropriate action

are therefore important. It would be ideal to research and identify criteria that

potentially troublesome passengers might possess; a successful (and fast) outlier

detection algorithm is imperative to correctly identify outliers amongst large

masses of regular passenger data, so that a human agent can be alerted to the

possibility of trouble and investigate the matter in person. Further applications

of outlier identification to homeland security are described in an article by Banks

[5]. Our goal in this thesis is to identify the outliers in a data set as successfully

as possible, after which the analyst can decide what to do with them.

1.1 Univariate Outliers

In univariate data, the concept of outlier seems relatively simple to define.

Outliers are those points located far away from the majority of the data; they

probably do not follow the assumed model. A simple plot of the data, such

as a scatterplot, stem-and-leaf plot, QQ-plot, etc., can often reveal which points

are outliers. This is sometimes called the "interocular test" because it hits you

between the eyes.

Relying solely on non-visual means, however, even univariate outliers might

not be as easy to identify as would appear at first sight. Consider a simple

situation: identifying outliers in a normally distributed data set. A naive outlier

identification rule might be to calculate the sample mean and standard deviation,

and classify as outliers all points further than 2 or 3 standard deviations away

from the mean. For Gaussian data, this would label as outliers those points

far enough from the main data that we think there is only a 4.56% or 0.27%


chance, respectively, that they belong to this distribution. For n truly Gaussian

observations, this would also cause 0.0456n and 0.0027n regular data points to be

labeled as outliers: false positives. Conversely, outliers that an algorithm fails

to detect are referred to as false negatives. It is an unfortunate fact, however,

that the presence of two or more outliers could render some or most of the outliers

invisible to this method. If there is one or more distant outlier and one or more

not so distant outlier in the same direction, the more distant outlier(s) could

significantly shift the mean in that direction, and also increase the standard

deviation, to such an extent that the lesser outlier(s) falls less than 2 or 3

standard deviations from the sample mean, and goes undetected. This is called

the masking effect, and results in this particular method and all related methods

being unsuitable for use as outlier identification techniques. We illustrate with

a simple example, borrowed from Becker and Gather [8].

Consider a data set of 20 observations taken from a N(0,1) distribution:

-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18, 0.30

0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.1, 17.6,

where the latter two observations were originally 0.81 and 1.76, but the decimal

points were entered at the wrong place. It seems clear that these 2 observations

should be labeled as outliers; let us apply the above method. The mean of this

data set is 1.27 while the standard deviation is 4.35. Two standard deviations

from the mean, towards the right, would be 9.97, while three standard deviations

would be 14.32. Both criteria regard the point 8.1 as expected with reasonable

probability and do not consider it an outlier. Additionally, the three standard

deviation boundary for detecting outliers seems rather extreme for a N(0,1)


data set; surely a point wouldn't have to be as large as 14.32 to be classified as an

outlier. The masking effect occurs quite commonly in practice and we conclude

that outlier methods based on classical statistics are unsuitable for general use,

particularly in situations requiring non-visual techniques such as multivariate

data. It is worth noting, however, that if instead of the sample mean and

standard deviation, robust estimates of location and scale were used (such as

the sample median, and median absolute deviation, MAD), both outliers would

be detected without difficulty.
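The computations above are easy to reproduce. The sketch below is our own illustration, not code from the thesis; the cutoff of 4 robust standard deviations and the consistency factor 1.4826 are illustrative choices we supply, not recommendations from the text.

```python
import statistics

# Becker and Gather example: 18 N(0,1) draws plus 8.1 and 17.6,
# which were originally 0.81 and 1.76 before the decimal-point error.
data = [-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18, 0.30,
        0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.1, 17.6]

def classical_outliers(xs, k):
    """Naive rule: flag points more than k sample standard deviations from the mean."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [x for x in xs if abs(x - m) > k * s]

def robust_outliers(xs, k=4.0):
    """Median/MAD rule; 1.4826 makes the MAD consistent for Gaussian data.
    The cutoff k = 4 is an illustrative choice."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    return [x for x in xs if abs(x - med) > k * 1.4826 * mad]

print(classical_outliers(data, 2))  # masking: 8.1 survives even the 2-sd rule
print(classical_outliers(data, 3))  # only 17.6 is flagged
print(robust_outliers(data))        # both entry errors are flagged
```

Running this reproduces the numbers in the text: the sample mean is 1.27 and the standard deviation 4.35, so 8.1 is masked, while the median/MAD rule flags both 8.1 and 17.6.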

If it is known that at most one outlier is present, the naive method would

find it. An improved version of this rule forms the basis of Grubbs' test for a

single outlier [29], which measures how many standard deviations away from the

mean the smallest and largest observations are located, and can be extended to

Grubbs' test for k outliers, where k has to be known in advance. Thode [37]

compares Grubbs' test and a variety of other outlier tests, including those that do

not require k to be specified in advance, but we believe that better methods

exist for both univariate and multivariate outliers.

Related to the masking effect is the concept of breakdown point of an esti-

mator. Intuitively, the breakdown point is the least amount of contamination

that could change the estimator so much that it fails to meaningfully describe

the data set. As originally defined, the breakdown point ε* depends on the

distance between two distributions (e.g. the Prohorov or Kolmogorov distance,

or total variation distance [35, p. 97]), but an empirically more useful version

is the finite-sample breakdown point ε*_n, formally defined with reference to the

estimator T_n = T(x_1, ..., x_n) as

ε*_n(T_n; x_1, ..., x_n) := (1/n) max{ m : max_{i_1,...,i_m} sup_{y_{i_1},...,y_{i_m}} |T_n(z_1, ..., z_n)| < ∞ },

where the sample (z_1, ..., z_n) is obtained by replacing the m data points

x_{i_1}, ..., x_{i_m} with arbitrary values y_{i_1}, ..., y_{i_m}. In most cases, lim_{n→∞} ε*_n = ε*.

The highest possible value of the breakdown point is 50%, because if more than

half the data were outliers, it would be unclear which data were from the main

distribution and which were outliers.

Neither the sample mean nor the sample variance has a high breakdown

point. In fact, since it takes only one extremely distant point to shift or in-

crease these respective estimators so far as to be essentially useless, both have a

breakdown point of 0 for the finite-sample breakdown point). On the other

hand, the median has a breakdown point of 50%, because half of the observa-

tions (strategically selected to be as malicious as possible) need to be moved

arbitrarily far away before the median becomes unusable.

1.2 Multivariate Outliers

Multivariate outliers pose challenges that do not occur with univariate data.

For instance, visual methods simply do not work. Even plotting the data in

bivariate form with a systematic rotation of coordinate pairs will not help. It

is possible (and occurs frequently in practice) that points which are outliers

in bivariate space are not outliers in either of the two univariate subsets;

see Figure 1.1. Generalization to higher dimensions leads to the fact that a


multivariate outlier does not have to be an outlier in any of its univariate or

bivariate coordinates, at least not without some kind of transformation, such as

a 45 rotation of the axes in Figure 1.1. Gnanadesikan and Kettenring [28] state

that visual detection of multivariate outliers is virtually impossible because the

outliers do not "stick out on the end."

A successful method of identifying outliers in all multivariate situations

would be ideal, but is unrealistic. By successful, we mean both highly sensi-

tive, the ability to detect genuine outliers, and highly specific, the ability to not

mistake regular points for outliers. In the words of Gnanadesikan and Ketten-

ring [28], it would be fruitless to search for a truly omnibus outlier detection

procedure. Thus we introduce the problem of multivariate outlier detection.

Intuitively, we label a point an outlier because it is sufficiently far away

from the majority of the data. An important tool for quantifying "far away" is

the Mahalanobis distance, defined as

MD_i = sqrt( (x_i − T(X))^T C(X)^{-1} (x_i − T(X)) )

for each point x_i, where T(X) is the coordinate-wise sample mean of the data

set X, and C(X) is the sample covariance matrix. Clearly, the Mahalanobis

distance relies on classical location and scale estimators. As such, it is subject

to the masking effect, and is not suitable for general use in contaminated data.
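Written out for bivariate data, the definition above is only a few lines of code. The following sketch is our own illustration (using the sample mean and covariance as T(X) and C(X), i.e. the classical, non-robust version); the data are made up to mimic the situation of Figure 1.1.

```python
from math import sqrt

def mahalanobis_2d(points):
    """Classical Mahalanobis distance for bivariate data:
    MD_i = sqrt((x_i - T(X))^T C(X)^{-1} (x_i - T(X))),
    with T(X) the coordinate-wise mean and C(X) the sample covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2          # invert C(X) via the 2x2 cofactor formula
    dists = []
    for (x, y) in points:
        dx, dy = x - mx, y - my
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(sqrt(q))
    return dists

# A point that is unremarkable in each coordinate separately but sits off the
# correlation line (the situation of Figure 1.1) gets the largest distance.
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 6.0), (3.5, 6.0)]
d = mahalanobis_2d(pts)
print(max(range(len(pts)), key=d.__getitem__))  # index of (3.5, 6.0)
```

Note that (3.5, 6.0) has the smallest coordinate-wise extremity of the last three points, yet the largest Mahalanobis distance, because the distance accounts for the correlation structure.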

There are several qualities we would like an estimator to possess. We would

like it to have a high breakdown point, but this could be outweighed by other

criteria. (If we were forced to use an estimator with breakdown point near 50%,


Figure 1.1: Bivariate outliers that are not univariate outliers


we should probably first do something else to clean the data.) For realistic

applications, a breakdown point of 20% is usually satisfactory.

Another desirable property of an estimator is affine equivariance. A location

estimator T_n ∈ ℝ^p is affine equivariant if and only if for any vector b ∈ ℝ^p and

any nonsingular p × p matrix A,

T_n(AX + b) = A T_n(X) + b.

A scale estimator C_n ∈ PDS(p) (the set of positive-definite symmetric p × p

matrices) is affine equivariant if and only if for any vector b ∈ ℝ^p and any

nonsingular p × p matrix A,

C_n(AX + b) = A C_n(X) A^T.

If an estimator is affine equivariant, stretching or rotating the data won't

affect the estimator. Dropping this requirement greatly increases the number of

available estimators, and in many cases, non-affine equivariant estimators have

superior performance to affine equivariant estimators.
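Both identities are easy to check numerically for the sample mean and covariance, which are affine equivariant. The sketch below is our own construction, verifying them in two dimensions for one particular nonsingular A and shift b.

```python
# Verify, for 2-d data, that the sample mean satisfies T(AX + b) = A T(X) + b
# and the sample covariance satisfies C(AX + b) = A C(X) A^T.
X = [(1.0, 2.0), (2.0, 0.5), (4.0, 3.0), (0.5, 1.5), (3.0, 2.5)]
A = [[2.0, 1.0], [0.5, 3.0]]          # any nonsingular 2x2 matrix
b = (1.0, -2.0)

def transform(p):
    return (A[0][0]*p[0] + A[0][1]*p[1] + b[0],
            A[1][0]*p[0] + A[1][1]*p[1] + b[1])

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts)/n, sum(p[1] for p in pts)/n)

def cov(pts):
    n = len(pts)
    mx, my = mean(pts)
    sxx = sum((p[0]-mx)**2 for p in pts)/(n-1)
    syy = sum((p[1]-my)**2 for p in pts)/(n-1)
    sxy = sum((p[0]-mx)*(p[1]-my) for p in pts)/(n-1)
    return [[sxx, sxy], [sxy, syy]]

Y = [transform(p) for p in X]

# Location: T(AX + b) == A T(X) + b
mX, mY = mean(X), mean(Y)
expected = (A[0][0]*mX[0] + A[0][1]*mX[1] + b[0],
            A[1][0]*mX[0] + A[1][1]*mX[1] + b[1])
assert all(abs(mY[i] - expected[i]) < 1e-9 for i in range(2))

# Scale: C(AX + b) == A C(X) A^T  (the shift b drops out entirely)
CX, CY = cov(X), cov(Y)
ACA = [[sum(A[i][k]*CX[k][l]*A[j][l] for k in range(2) for l in range(2))
        for j in range(2)] for i in range(2)]
assert all(abs(CY[i][j] - ACA[i][j]) < 1e-9 for i in range(2) for j in range(2))
```

Both assertions hold for any nonsingular A and any b, not just the particular choices here; that is exactly what the two definitions require.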

A further important and desirable feature of an estimator is a computation-

ally fast algorithm to compute it. It is not uncommon to measure data set sizes

in terabytes or petabytes, and some real time applications require the outliers

to be detected within seconds or at most a few minutes (as in the case of air

passenger check-in). In these situations, slow algorithms such as the Minimum

Volume Ellipsoid (discussed on p. 28) are essentially useless, that is, algorithms

that are theoretically capable of locating the outliers, but would take millennia


to do so.

As previously mentioned, outlier detection in all data situations by the same

algorithm is not possible, but we can still aim for a reasonable diversity. This is

called the breadth of algorithm, and we therefore strive to develop an algorithm

with a large breadth.

No estimator possesses all of these properties: high breakdown point,

affine equivariance, computational speed, breadth. Rocke and Woodruff [68]

stated that it is entirely appropriate to develop special methods to handle special

cases. For large multivariate data sets, computational speed seems to be one of

the most difficult goals to achieve.

In this thesis, I examine some of the leading outlier detection methods, and

also propose a new algorithm, comparing and evaluating its success and speed

in a variety of situations, and investigate some useful numerical properties of its

robust estimator. Specifically, in chapter two, I discuss existing outlier detection

methods, pointing out strengths and weaknesses and identifying areas where

improvements are needed. In chapter three, I propose a new algorithm that has

several significant advantages. In chapter four, I compare the performance of the

proposed algorithm to existing methods in a variety of data situations, and in

chapter five, I perform a theoretical analysis of the robust estimation procedure

and discuss avenues for future research and possible improvement.


2. Previous Outlier Identification Methods

Outlier identification and robust location/scale estimation are closely re-

lated. Recall that the classical Mahalanobis distance has a breakdown point of

0. To increase this breakdown point, Campbell [11] proposed the use of robust

estimators, and specifically, a class called M-estimators, for T(X) and C(X).

This step marked an important improvement in identifying outliers, with the ro-

bust version not suffering from the masking effect like the classical version. Clearly,

the performance of this robust version depends on the specific T(X) and C(X).

As initially introduced, M-estimators were monotone functions and were shown

by Maronna [52] to have a breakdown point no larger than 1/(p + 1). This is ob-

viously an improvement over the classical version, but still quite inadequate

in higher dimensions: with p = 20, the breakdown point is less than 5%.

Later, more sophisticated M-estimators were developed with breakdown point

approaching 50%. Nevertheless, using the Mahalanobis distance with robust

estimators of location and scale was an important conceptual breakthrough and

led to significantly improved versions of the Mahalanobis distance. Obtaining

robust location and scale estimates remains one of the most difficult challenges

in outlier identification; without such estimates, all distance-based methods

(the majority of outlier identification algorithms) would fail.

Using a robust estimate of the location and scale of the primary distribution,

the modified Mahalanobis distance can provide a reasonably good estimate of

how far a particular observation x_i is from the center of this distribution. A suf-

ficiently large value of this distance indicates that the observation x_i is likely an

outlier. In 1996, Rocke and Woodruff [66] stated that all known affine equivari-

ant methods of solving the outlier identification problem consist of the following

two stages:

Phase I: Estimate a location and scale.

Phase II: Calibrate the scale estimate so that it can be used to suggest

which points are far enough from the location estimate to be considered

outliers.
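A minimal sketch of this two-phase structure might look as follows. Phase I here is an intentionally crude robust fit (coordinatewise median and MAD used to trim the data before computing the mean and covariance), standing in for MVE/MCD/S-estimates; Phase II flags points whose squared Mahalanobis distance exceeds a chi-squared-type cutoff. All tuning choices are illustrative, not from the thesis:

```python
import numpy as np

def two_phase_outliers(X, h_frac=0.75, cutoff=7.378):
    """Generic two-phase outlier nomination (illustrative sketch only).

    Phase I:  robust location/scale, seeded by the coordinatewise median.
    Phase II: calibrate -- flag points whose squared Mahalanobis distance
              exceeds a cutoff (7.378 is the 0.975 quantile of chi-squared
              with p = 2 degrees of freedom).
    """
    n, p = X.shape
    center = np.median(X, axis=0)                      # robust start
    spread = 1.483 * np.median(np.abs(X - center), axis=0)
    d0 = np.sum(((X - center) / spread) ** 2, axis=1)  # crude initial distances
    keep = np.argsort(d0)[: int(h_frac * n)]           # trim suspicious points
    T = X[keep].mean(axis=0)                           # Phase I location
    C = np.cov(X[keep], rowvar=False)                  # Phase I scale
    diff = X - T
    md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)
    return md2 > cutoff                                # Phase II decision

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[:5] += 10.0                                          # plant 5 obvious outliers
flags = two_phase_outliers(X)
```

Because the trimmed covariance underestimates the true scatter, this naive version over-flags clean points; the calibration problem of Phase II is exactly what the correction factors discussed later try to solve.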

Several methods yield acceptably accurate estimates of the mean, but per-

formance regarding robust estimation of the covariance matrix varies widely.

Additionally, the Mahalanobis distance is much more sensitive to small changes

(errors) in the covariance estimate C(X) than it is to changes in the mean

estimate T(X).

Quantifying the magnitude of Mahalanobis distance at which an observation

is labeled an outlier is another critical element of these algorithms. In some sit-

uations, such as normally distributed data, the distribution of squared classical

Mahalanobis distances follows an F distribution, as shown in [43, ch. 5] and [51,

pp. 73-75], but not in all situations. Hadi et al. [32] develop robust location

and scale estimates and accordingly robust Mahalanobis distances for all the

observations, classifying those points as outliers which have robust Mahalanobis

distance larger than c_npr χ_{p,α/n}, where c_npr is a correction factor based on n,

p and r (the number of points on which the location and scale estimates were

based). We address this issue in detail in section 3.2.2; our study reveals that


while this correction factor works well for the classic Mahalanobis distance on

normally distributed data, it can be quite inaccurate (too high) for modified

Mahalanobis distances from non-normal data and fails to detect the majority of

outliers even when an apparently clear separation exists between both groups.

2.1 The Influence Function

For a given sample of size n, Hoaglin et al. [38] (and to a lesser extent,

Thode [37]) discuss the sensitivity curve SC(x; T_n), which is computed by cal-

culating the estimator T_n with and without an observation at the point x, and

is scaled in proportion to the size of the sample:

SC(x; T_n) = n (T_n(x_1, ..., x_{n−1}, x) − T_{n−1}(x_1, ..., x_{n−1})).

The sensitivity curve describes the effect of an individual observation on the

estimator Tn for a specific data set. It is not hard to compute empirically, but

is not of much theoretical use. When we let the sample size n tend towards

infinity, the resulting limit is called the influence function. In the absence of

a sample data set, the influence function is defined with reference to a specific

distribution, F.
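A sketch of the sensitivity curve computation (with a hypothetical random sample) makes the contrast between the mean and median concrete:

```python
import numpy as np

def sensitivity_curve(estimator, sample, x):
    """SC(x; T_n) = n * (T_n(x_1,...,x_{n-1}, x) - T_{n-1}(x_1,...,x_{n-1}))."""
    n = len(sample) + 1
    return n * (estimator(np.append(sample, x)) - estimator(sample))

rng = np.random.default_rng(2)
base = rng.normal(size=99)             # a fixed sample of size n - 1 = 99

# For the mean the curve grows without bound in x; for the median it stays
# bounded no matter how far out the added observation lies.
sc_mean_far = sensitivity_curve(np.mean, base, 1000.0)
sc_med_far = sensitivity_curve(np.median, base, 1000.0)
```

A single observation at 1000 moves the scaled mean by roughly its own magnitude, while the median barely notices it; this is the finite-sample picture of the influence function comparison that follows.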

Hampel [33] introduced the concept of the influence function which he and

his collaborators popularized in a book [35]. Although the influence function

approach has been used primarily (but not exclusively) on univariate data, our

experience has been that it can be advantageous from a computational point of

view to examine each component separately. We can then combine the respec-


tive estimates to calculate the Mahalanobis distances and locate the outliers in

multivariate space.

Suppose we have a univariate sample (x_1, ..., x_n) with empirical distribution

F_n. Letting Δ_x be the point mass 1 at x, we can define F_n by F_n = (1/n) Σ_{i=1}^n Δ_{x_i}.

To estimate the location parameter θ of the underlying distribution F, we

consider the statistics T_n = T(x_1, ..., x_n) = T(F_n). We can think of these

as a sequence of statistics {T_n; n ≥ 1}, one for each possible sample size

n. Asymptotically, we consider estimators which are functionals. That is,

lim_{n→∞} T_n(x_1, ..., x_n) = T(F), where the convergence is in probability, and

we say that T(F) is the asymptotic value of {T_n; n ≥ 1} at F. The influence

function IF of the estimator T at the distribution F is given by

IF(x; T, F) = lim_{ε→0} [T((1 − ε)F + ε Δ_x) − T(F)] / ε.

If we replace F by F_{n−1} and put ε = 1/n, we can see that IF measures

approximately n times the change in T caused by an additional observation at

x when T is applied to a large sample of size n − 1. As discussed in Hampel

et al. [35], the importance of the influence function lies in the fact that it

can describe the effect of an infinitesimal contamination at the point x on the

estimate T, standardized by the mass of the contaminant. It gives us a picture

of the asymptotic bias caused by contamination in the data.

The influence function has a number of very appealing properties. Ham-

pel [33] shows that under conditions usually met in practice, T_n is asymptot-

ically normal with asymptotic variance equal to the integral of IF² over F;

Huber [40] also discusses these regularity conditions. That is, the distribution

of √n [T_n − T(F)] tends to N(0, V(T, F)), where V(T, F) is the asymptotic

variance and is equal to

V(T, F) = ∫ IF(x; T, F)² dF(x).

This formula shows that a bounded influence function implies a bounded

asymptotic variance, and cautions against using estimators with unbounded

influence functions (for example, the sample mean).

Another important property of the influence function is the supremum of its

absolute value, from which Hampel et al. [35] define the gross-error sensitivity

γ* of T at F as

γ*(T, F) = sup_x |IF(x; T, F)|,

where the supremum is over all x where IF(x; T, F) exists. The gross-error

sensitivity measures the worst influence which a small amount of contamination

can have on the value of an estimator, and can be regarded as an upper bound

on the asymptotic bias of the estimator T. Ideally, γ*(T, F) is finite. In many

cases, putting a bound on γ*(T, F) is the first step in making T robust.

The influence function can also tell us more about T through the local-shift

sensitivity λ*. Slight changes in some observations, perhaps due to rounding

or grouping, could have a measurable effect on the estimate. The effect of

shifting an observation slightly from the point x to a neighboring point y can

be measured by means of IF(y; T, F) − IF(x; T, F), because an observation is


added at y and removed at x. This can be approximated by the slope of the

influence function. λ* is the worst of these normalized differences:

λ* = sup_{y ≠ x} |IF(y; T, F) − IF(x; T, F)| / |y − x|.

Recall that the earliest treatment of outliers was to simply reject them

entirely, as in Legendre [48] and Edgeworth [20]. In terms of the influence

function, this means that the IF vanishes completely in a certain region (usually

beyond a certain distance from the location estimate). In other words, if the IF

is zero in some region, then all the observations in that region have no effect on

T. For a symmetric F (and putting the center of symmetry at zero), Hampel et

al. [35] define the rejection point p* as

p* = inf{r > 0; IF(x; T, F) = 0 when |x| > r}.

If no such r exists, then p* = ∞ by definition. All observations farther away

than p* are rejected completely.

Figure 2.1 shows a scaled version of the influence function, the ψ-function,

for the sample mean and median. For T equal to the sample mean, it can be

shown that IF(x; T, F) = x, the asymptotic efficiency e = 1 (the reciprocal of

the variance), the maximum possible efficiency, and the local-shift sensitivity

λ* = 1, also the lowest possible value for meaningful estimators. However, both

the gross-error sensitivity γ* = ∞ and the rejection point p* = ∞, which are

very serious shortcomings in the presence of outliers. We deduce that the mean

is least susceptible to rounding errors and is very efficient, but can break down


completely if even just a single far-away outlier is present.

The influence of the sample median has a discontinuity at the origin due to

the fact that it changes discontinuously if one of the two center values is changed,

and is horizontal (but non-zero) everywhere else. The local-shift sensitivity is

λ* = ∞ because of this discontinuity, while at F = N(0,1) the median has

efficiency e = 2/π ≈ 0.637, and for F = N(μ, σ²) the same asymptotic efficiency of

e = 2/π. Hampel et al. [35] show that the gross-error sensitivity of the median

is γ* = 1.253, which is in fact the minimum value at F = N(0,1). Regardless

of the median's other shortcomings, no estimator can beat the median when

it comes to robustness. Finally, the rejection point p* = ∞, which may seem

surprising at first glance; but the median does not reject outliers, it only lessens

their influence, to a considerable degree. It counts rather than measures all

observations except for the middle one or two, respectively, for an odd or

even sample size.

Based on the above comparison, we can easily conclude that the median is

much more robust than the mean, at the cost of efficiency and local-shift sensitiv-

ity. There is still room for improvement over the median, however, particularly

regarding its efficiency e = | = 0.637. Using the IF-based criteria mentioned

thus far, we explicitly state four goals that we would like our estimator to attain:

Possess a low gross-error sensitivity γ*.

Possess a finite rejection point p*.

Possess efficiency e close to 1.


Possess a low local-shift sensitivity λ*.

Having developed these criteria with which we can evaluate the performance

of various estimators, we now examine several candidates. We start with a class

of estimators which, although not too successful themselves, paved the way for

other, modified and improved, estimators.

2.2 M-estimators

2.2.1 Monotone M-estimators

The maximum likelihood estimator (MLE) is defined as the value of T_n =

T(x_1, ..., x_n) that maximizes Π_{i=1}^n f(x_i, T_n), where f is the estimated density of

F and is implicitly dependent on the estimated parameter T_n. In practice, we

usually calculate the MLE as the value of T_n such that

T_n = arg min_T Σ_{i=1}^n [− ln f(x_i, T)].

Huber [39] generalized this minimization problem to

T_n = arg min_T Σ_{i=1}^n ρ(x_i, T).

Suppose that ρ has derivative ψ, ψ(x, θ) = (∂/∂θ) ρ(x, θ). Then the estimate T_n will

also satisfy

Σ_{i=1}^n ψ(x_i, T_n) = 0,


although a Tn can exist as a solution to the latter equation without solving the

minimization problem. An estimator Tn defined by either of these two equations

is called an M-estimator. Huber proposed the name M-estimator because these

are generalized maximum likelihood estimators. Of these two equations, the

ψ-function is more useful in practice. In fact, Hampel et al. [35] show an

important connection between the IF and the ψ-function:

IF(x; T, F) = ψ(x, T(F)) / (− ∫ (∂/∂θ) ψ(y, θ)|_{θ=T(F)} dF(y)).

That is, the ψ-function is simply a scalar multiple of the IF. Within this

framework, we present the ρ- and ψ-functions of the sample mean and median

in Figure 2.1. For the sample mean, ρ(x) = x² and ψ(x) = x (ignoring the

constant 2 as is customary, since it doesn't affect the minimization problem),

while for the sample median, ρ(x) = |x| and ψ(x) = sgn(x).

Recall that the sample median has good robustness properties, but suffers

from low efficiency and infinite local-shift sensitivity λ*, while the sample mean

has the highest possible efficiency at F = N(0,1) and the lowest possible λ*.

Huber [39] combined these properties to produce what he called the Huber esti-

mator:

ψ_b(x) := min{b, max{−b, x}} = x min(1, b/|x|),

for 0 < b < ∞. Huber's estimator has a breakdown point of 50%, just like the

sample median, but has much better local-shift sensitivity. For b → ∞ it

also has the same optimal efficiency as the sample mean. The rejection point

Figure 2.1: ρ- and ψ-functions for the sample median and mean

p*, however, is still unbounded because for |x| > b, the IF resembles that of the

sample median. We present a graphical version in Figure 2.2, together with three

other robust estimators.
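Huber's ψ is straightforward to implement; the tuning constant b = 1.345 below is a conventional choice for high Gaussian efficiency, not a value taken from the text:

```python
import numpy as np

def psi_huber(x, b=1.345):
    """Huber's psi: linear (mean-like) in the middle, clipped (median-like)
    beyond +-b.  b = 1.345 is an illustrative tuning choice."""
    return np.clip(x, -b, b)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
# The two forms in the definition agree: clip(x, -b, b) == x * min(1, b/|x|)
alt = x * np.minimum(1.0, 1.345 / np.maximum(np.abs(x), 1e-12))
assert np.allclose(psi_huber(x), alt)
```

Observations inside [−b, b] are treated exactly as the mean would treat them; observations outside contribute only a bounded amount, but never zero, which is why p* remains unbounded.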

It would be beneficial to take note of Winsor's principle, quoted by Tukey

[81, p. 457], which states that all distributions are approximately normal in

the middle. Applied to M-estimators, this calls for the middle portion of their

ψ-functions to resemble the function that is best/most efficient for Gaussian

data, namely the sample mean. Thus, ψ-functions with optimal efficiency will

be approximately linear near the origin: ψ(u) ≈ u. At this point, we can refine

some of the properties that we would like our idealised estimator to possess:

We would like the IF to be linear in a vicinity of zero; this will provide

good efficiency (close to e = 1) as well as small local-shift sensitivity A* for

those observations near the center (assuming a symmetric distribution).

We would like the IF to be bounded.

We would like the IF to return to zero (and remain at zero) for some dis-

tance sufficiently far away from the origin, resulting in a finite rejection

point p*.

M-estimators that satisfy the latter objective are accordingly

called redescending M-estimators.

2.2.2 Redescending M-estimators

Redescending M-estimators are those M-estimators that are able to reject

extreme outliers completely. That is, there exists an r, 0 < r < oo, such that

ψ(x) = 0 ∀ |x| > r.

Figure 2.2: ψ-functions for Huber's estimator, Hampel's estimator, Andrews' sine function, and the biweight

One of the earliest proposals for a redescending M-estimator was Hampel's

three-part redescending M-estimator, first introduced by F.R. Hampel in the

Princeton Robustness Study [3], an extensive theoretical and Monte Carlo study

of different robust estimators published in 1972. Hampel's estimator is defined

by

ψ_{a,b,r}(x) = x,                                0 ≤ |x| ≤ a

            = a sign(x),                         a ≤ |x| ≤ b

            = a ((r − |x|)/(r − b)) sign(x),     b ≤ |x| ≤ r

            = 0,                                 |x| > r,

where 0 < a ≤ b < r < ∞. This estimator performed well in
the Princeton Robustness Study, and is shown in Figure 2.2.

The lack of differentiability of ψ_{a,b,r} is not ideal, however, and a smooth

ψ-function would be preferred. This led to the development of Andrews' sine

function, also introduced in the Princeton Robustness Study [3] and defined by

ψ_{sin(a)}(x) = sin(x/a) I_{[−πa, πa]}(x),

where I(x) is the indicator function. We graph the ψ-function in Figure 2.2.

In 1974 Beaton and Tukey [7] proposed the biweight function (short for

bisquare weight), a smooth function that has been successfully used in a wide

variety of applications, and depicted in Figure 2.2. In the context of ψ-functions,

it can be defined as

ψ_{bi(r)}(x) = x (r² − x²)² I_{[−r,r]}(x).

The term bisquare refers to the square of the second term, which ensures

continuity for both ψ(x) and ψ'(x) (useful during certain optimization routines).

The biweight has been used in applications as diverse as biostatistics (quantal

bioassay: Miller and Halpern [56]), materials science (regression of rubber

abrasion loss: Cleveland [14]) and economics (computation of Nigerian cur-

rency exchange rate: Okafor [58]). It has also encouraged further research

regarding various properties of the ψ-function and accordingly forms the ba-

sis of several successful algorithms, including the biflat estimate introduced by

Rocke [64].
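The three redescending ψ-functions discussed above can be sketched as follows; the tuning constants a, b, r are illustrative only, not values prescribed by the text:

```python
import numpy as np

def psi_hampel(x, a=2.0, b=4.0, r=8.0):
    """Hampel's three-part redescending psi (tuning constants illustrative)."""
    ax = np.abs(x)
    out = np.where(ax <= a, x, a * np.sign(x))            # linear, then flat
    out = np.where((ax > b) & (ax <= r),                  # linear descent to 0
                   a * (r - ax) / (r - b) * np.sign(x), out)
    return np.where(ax > r, 0.0, out)                     # full rejection

def psi_andrews(x, a=1.339):
    """Andrews' sine function: sin(x/a) on [-pi*a, pi*a], zero outside."""
    return np.where(np.abs(x) <= np.pi * a, np.sin(x / a), 0.0)

def psi_biweight(x, r=4.685):
    """Tukey's biweight: x(r^2 - x^2)^2 on [-r, r], zero outside."""
    return np.where(np.abs(x) <= r, x * (r**2 - x**2) ** 2, 0.0)

# All three reject extreme observations completely: psi(x) = 0 for |x| > r.
for psi in (psi_hampel, psi_andrews, psi_biweight):
    assert psi(100.0) == 0.0 and psi(-100.0) == 0.0
```

Each function is approximately linear near the origin, in line with Winsor's principle, and identically zero beyond its rejection point.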

2.2.3 Computational Issues of M-Estimators

Computing an M-estimator involves a numerical optimization problem

which frequently does not have a unique solution. Upon re-examination of the

defining equation,

Σ_{i=1}^n ψ(x_i, T_n) = 0,

it is clear that for redescending M-estimators, the LHS will vanish for all |x| suf-

ficiently large. It might be possible to find the global minimum of Σ_{i=1}^n ρ(x_i, T_n)

instead, but this is not always computationally feasible. Rocke and Woodruff

[84] develop a compound estimator, formally introduced on page 31, which con-

sists of two stages: the first utilizes a combinatorial estimator such as MVE

or MCD, defined on page 28, to locate a good starting point, which can then


be used for the minimization stage. Hampel et al. [35] propose to either (a)

select the solution which is nearest to the sample median, or (b) use Newton's

method starting with the sample median. Our experience with the biweight has

been that starting from the sample median, convergence to the desired value is

successfully and quickly achieved.

M-estimators, as defined up to this point, are not invariant with respect to

scale. Fortunately, scale invariance can easily be achieved by defining T_n as the

solution to

Σ_{i=1}^n ψ((x_i − T_n)/S_n) = 0,

where S_n is a robust estimate of scale determined beforehand. The median

absolute deviation (MAD) has the maximal breakdown point of 50% and is recommended

for use in M-estimators by Hampel et al. [35, p. 105], and by Huber [40, p.

107], who called the MAD the "single most useful ancillary estimate of scale."

Because the expected value of the MAD is 0.6745σ for a Gaussian distribution,

a robust scale estimate based on the MAD is usually defined as

S_n = 1.483 MAD{x_i} = 1.483 median_i{|x_i − median_j{x_j}|}.

See, for example, Mosteller and Tukey [57, p. 205]. Various software packages

scale the MAD differently; the Splus routine mad defines the MAD as above.
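Putting the pieces together, a scale-invariant M-estimate of location can be sketched as an iteratively reweighted mean, scaled by the MAD and started from the sample median as recommended above. Huber's ψ with b = 1.345 is used here purely for illustration; the thesis itself builds on the biweight:

```python
import numpy as np

def mad_scale(x):
    """S_n = 1.483 * median_i |x_i - median_j x_j| (Gaussian-consistent MAD)."""
    return 1.483 * np.median(np.abs(x - np.median(x)))

def m_location(x, b=1.345, steps=50):
    """Solve sum psi((x_i - T)/S_n) = 0 by fixed-point (IRLS) iteration from
    the sample median.  An illustrative sketch, not the thesis algorithm."""
    s = mad_scale(x)
    t = np.median(x)                       # robust starting point
    for _ in range(steps):
        u = (x - t) / s
        u = np.where(u == 0.0, 1e-12, u)   # avoid 0/0; psi(u)/u -> 1 as u -> 0
        w = np.clip(u, -b, b) / u          # IRLS weights w_i = psi(u_i)/u_i
        t = np.sum(w * x) / np.sum(w)
    return t

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(5.0, 1.0, 95), np.full(5, 50.0)])  # 5% outliers
t_hat = m_location(x)                      # stays near 5 despite the outliers
```

The five gross outliers at 50 receive weights near b/|u| ≈ 0.03, so the estimate stays close to the center of the clean data, whereas the sample mean of x is pulled above 7.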

2.3 S-Estimators

A class of estimators related to M-estimators are the S-estimators, intro-

duced by Rousseeuw and Yohai [72] and so called because they are based upon

minimization of a scale statistic. An important distinction between S- and M-

estimators is that S-estimators provide simultaneous estimates of both location

and scale, while M-estimators produce estimates for only one of these quantities.

For a multivariate data set (x_1, ..., x_n), the S-estimate of location and scale is

defined as the vector T ∈ ℝ^p and matrix C ∈ PDS(p) that jointly solve

min |C| subject to (1/n) Σ_{i=1}^n ρ(√((x_i − T)^T C^{-1} (x_i − T))) = b_0,

for b_0 a tuning constant. The constraint equation is often compactly written as

(1/n) Σ_{i=1}^n ρ(d_i) = b_0,

where d_i is a robust version of the Mahalanobis distance; ρ is usually chosen to

be a scaled version of some base function, such as the biweight. Lopuhaa [50]

showed that the breakdown point is given by the ratio of b0 to the maximum of

ρ. Rocke [64] showed that S-estimators in high dimension can be very sensitive

to outliers, even with high breakdown point, illustrating yet again the difficulty

of obtaining a robust estimate of multivariate scale.

Multivariate S-estimators pose a significant computational challenge. The

S-estimator is defined as the location estimate T and PDS matrix C that achieve

the unique global minimum; it can be very difficult to guarantee finding a global

minimum. In practical settings, the computed S-estimator may differ greatly

from the theoretical S-estimator. This is illustrated with the bushfire data

set. Originally published in a CSIRO report [12] and reprinted in Maronna and

Yohai [53], this data set helped to locate bushfire scars in Australia by satellite

measurements of n = 38 pixels in p = 5 frequency bands. Using the same

theoretical S-estimator, Becker and Gather [8] labelled different observations

as outliers than Maronna and Yohai [53]. Becker and Gather [8] ascribe this

to the fact that different computational algorithms were used to minimize the

objective function, and that due to a different starting point, Maronna and

Yohai's algorithm may have converged to a local minimum instead of the global

minimum. In the words of Rocke and Woodruff [84], "the algorithm is the

estimator."

For difficult minimization problems like S-estimation, a good starting point

is of the utmost importance; it is worth investing considerable time and effort

in pursuit of one. Rocke and Woodruff [84] proposed a procedure called a com-

pound estimator, which obtains initial estimates of location and scale through

one type of robust estimator; these are then used as starting points for the more

powerful S-estimator. The estimator that Rocke and Woodruff use to obtain

their initial estimates is a combinatorial estimator, described in the following

section.

2.4 Combinatorial Estimators

In 1985 Rousseeuw [69] introduced the Minimum Volume Ellipsoid (MVE)

and Minimum Covariance Determinant (MCD). These are both affine equivari-

ant estimators with high breakdown points of (⌊n/2⌋ − p + 1)/n, approaching 1/2 for large

n. The location estimate provided by the MVE is the center of the ellipsoid

of minimal volume containing approximately half of the points in X, while the

covariance estimate is provided by the shape matrix of this ellipse. Lopuhaa and

Rousseeuw [50] add a small modification, which results in a slightly higher break-

down point of ⌊(n − p + 1)/2⌋/n, and which calculates the location estimate T ∈ ℝ^p

and covariance estimate C ∈ PDS(p) according to

min |C|

subject to

#{i : (x_i − T)^T C^{-1} (x_i − T) ≤ c²} ≥ ⌊(n + p + 1)/2⌋,

where # denotes cardinality. For Gaussian data, Lopuhaa and Rousseeuw [50]

show that a good value for c² is χ²_{0.5,p}, the value for which P{(X − μ)^T Σ^{-1}(X −

μ) ≤ c²} = 0.5. Rousseeuw and Leroy [70, p. 259] give an algorithm for the MVE.

The MCD is similar to the MVE and has the same objective function. The

difference is that in the constraint, the MCD requires only that the estimate be

formed using half the points, instead of the ellipse containing half the points. It

follows that both the MVE and MCD are too cumbersome to compute precisely,

except in small data sets. Several different algorithms have been proposed,

but of necessity they can be only approximations. These are usually based on

various optimization techniques applied to smaller subsamples of the data, and

as such can never guarantee the desired result. Standard optimization techniques

cannot be applied in the usual sense because the criteria are nonconvex and


nondifferentiable. There has been some innovative research done in this regard,

with optimization methods ranging from Glover's tabu search [26], [27], applied

to the MCD by Rocke and Woodruff [84], to a steepest descent algorithm with

random restarts proposed by Hawkins [36]. This is another case where the

algorithm is the estimator.
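To make the approximation issue concrete, here is a deliberately naive random-subset approximation to the MCD (a sketch, not any of the published algorithms): draw small elemental subsets, expand each to the h points closest in the induced Mahalanobis metric, and keep the candidate whose half-sample covariance has the smallest determinant:

```python
import numpy as np

def mcd_approx(X, n_trials=100, seed=0):
    """Naive random elemental-subset approximation to the MCD (illustration
    only; real algorithms add refinement steps and clever searching)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = (n + p + 1) // 2                              # half-sample size
    best_det, best_T, best_C = np.inf, None, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p + 1, replace=False)  # elemental subset
        T = X[idx].mean(axis=0)
        C = np.cov(X[idx], rowvar=False) + 1e-8 * np.eye(p)  # ridge for safety
        diff = X - T
        md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)
        half = np.argsort(md2)[:h]                    # h most concordant points
        C_h = np.cov(X[half], rowvar=False)
        d = np.linalg.det(C_h)                        # MCD objective |C|
        if d < best_det:
            best_det, best_T, best_C = d, X[half].mean(axis=0), C_h
    return best_T, best_C

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(20, 1, (10, 2))])
T, C = mcd_approx(X)                                  # location stays near 0
```

Because the search is random, two runs with different seeds can return different "MCD" fits; this is exactly the sense in which the algorithm is the estimator.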

It later became apparent that the MCD was vastly superior. Rocke and

Woodruff [84] found that with p = 25 and 15% contamination, the MCD took 32

seconds to produce good enough starting values for the S-estimator to converge,

whereas after 8,192 seconds the MVE had still not converged. For the 20 x 5

wood specific gravity data set [70, p. 243], Cook and Hawkins [15] had to

examine 57,000 cases (ellipses) before finding the MVE. Since this was still

not guaranteed to be the correct answer, they allowed the algorithm to run to

completion, i.e., full enumeration, which took 50 minutes on an IBM PC/AT.

Even accounting for this outdated computer, 50 minutes for a mere 20 × 5 data

set leaves much to be desired.

It appears that an inherent weakness of both the MVE and MCD is that the

computation time grows too fast to be useful in moderate and large data sets.

Hawkins [36] proposed an efficient version of the MVE where the search time

grows with pn. This leads to what Rocke and Woodruff [84] call the "perverse

situation," in which the analysis can actually be made worse by the collection

of more data: as the size of the data set grows, so too does the deterioration in

the algorithm's performance.


2.5 Compound Estimators

In 1994, Rocke and Woodruff [84] proposed a new class of estimators called

compound estimators, which utilize the complementary strengths of combina-

torial and S-estimators. To remove the paradox of additional data degrading

the estimator's performance, they randomly partition the data into ⌊n/γ(p)⌋ cells,

where γ(p) is the minimum number of observations required in each cell. They

then apply the MCD algorithm to each cell, constrained by a predetermined

time limit. If the researcher can afford to use t seconds for this stage of the

algorithm, the MCD algorithm is only allowed to spend t/⌊n/γ(p)⌋ seconds searching

each cell. This will result in ⌊n/γ(p)⌋ different location and scale estimates; the best

of these is chosen as the starting point for the S-estimator algorithm, calculated

on the unpartitioned data set. The final solution is given by the S-estimator;

that is, the location and scale estimates producing the smallest determinant of

the covariance matrix, |C|.
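The partitioning stage can be sketched as follows. The per-cell fit here is a cheap median-based stand-in for the time-limited MCD search, and γ = 20 is an arbitrary illustrative cell size; candidates are compared by the determinant of their covariance, as in the compound estimator:

```python
import numpy as np

def compound_start(X, gamma=20, seed=0):
    """Partitioning stage of a compound estimator (illustrative sketch).
    The data are split into floor(n/gamma) random cells; a cheap robust fit
    per cell (coordinatewise median plus the covariance of the cell's better
    half) yields candidate starts, and the candidate with the smallest |C|
    would seed the S-estimation stage."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(1, n // gamma)                 # number of cells
    cells = np.array_split(rng.permutation(n), k)
    best_det, best = np.inf, None
    for cell in cells:
        Z = X[cell]
        T0 = np.median(Z, axis=0)          # cheap robust center for this cell
        d2 = np.sum((Z - T0) ** 2, axis=1)
        half = cell[np.argsort(d2)[: len(cell) // 2]]
        C = np.cov(X[half], rowvar=False)
        d = np.linalg.det(C)
        if d < best_det:                   # keep the tightest candidate
            best_det, best = d, (X[half].mean(axis=0), C)
    return best

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(20, 1, (10, 2))])
T, C = compound_start(X)                   # a clean-looking start near 0
```

With 10% contamination and five cells, at least one cell is very likely to be nearly outlier free, which is all the method needs; this is the sense in which more data helps rather than hinders.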

Successful execution of this method requires coming fairly close to the de-

sired location and shape in the first stage (MCD applied to each data cell), which

would lead the S-estimation algorithm to converge to the correct root. More

data would help, not hinder, this estimator, as there is an increased chance that

one of the initial ⌊n/γ(p)⌋ cells would contain fewer outliers, provide a good starting

point and aid convergence to the desired minimum. (Only one such lucky cell

is required.) Rocke and Woodruff distribute C code for this algorithm, which


they call MULTOUT.

2.6 BACON

Hadi et al. [32] describe a fast, robust method called BACON (Blocked

Adaptive Computationally-efficient Outlier Nominator). BACON is an iterative

method, the core of which consists of identifying a "basic subset" presumed to

be outlier free, from which the sample mean and covariance matrix can be

estimated, followed by robust Mahalanobis distances.

To construct the first basic subset, Hadi et al. [32] provide two options. The

first simply computes the sample mean and covariance of all n observations, fol-

lowed by the Mahalanobis distances (or discrepancies d_i, as Hadi et al. [32]

call them). Those observations with the cp smallest d_i's form the initial basic

subset, where c = 5 is typical and not very influential in the algorithm's conver-

gence. This procedure is affine equivariant but not at all robust; our experience

has been that in data sets containing a substantial number of outliers, the algo-

rithm falsely converges and fails to detect them. The second option computes

the coordinate-wise median m, and thereafter ||x_i − m|| for all n observations.

Those cp observations with the least distance to m form the first basic subset.

This start is robust, but not affine equivariant. Subsequent iterations are affine

equivariant, however, so the overall algorithm is nearly affine equivariant. We

list the steps required for BACON as follows:

1. Select an initial basic subset according to either of the two options, and

calculate the sample mean T_b and covariance C_b of this subset.

2. Compute the discrepancies d_i, according to

d_i(T_b, C_b) = √((x_i − T_b)^T C_b^{-1} (x_i − T_b)).

3. Set the new basic subset to contain all points with discrepancy d_i less

than c_npr χ_{p,α/n}, where χ_{p,α/n} is the (1 − α/n) percentile of the chi

distribution with p degrees of freedom, and c_npr = c_np + c_hr is a correction factor; h =

⌊(n + p + 1)/2⌋, r is the size of the current basic subset, c_hr = max{0, (h − r)/(h + r)}, and

c_np = 1 + (p + 1)/(n − p) + 2/(n − 1 − 3p) = 1 + (p + 1)/(n − p) + 1/(n − h − p), assuming n + p + 1 is even.

4. Repeat steps 2 and 3 until the size of the basic subset remains constant.

5. Nominate the observations excluded by the final basic subset as outliers.
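The steps above can be sketched as follows; for simplicity this version uses the robust (median-based) start and replaces the adaptive threshold c_npr χ_{p,α/n} of step 3 with a fixed multiple of the median discrepancy, so it illustrates the block-iteration idea rather than the exact algorithm:

```python
import numpy as np

def bacon_sketch(X, c=5, cutoff=3.5, max_iter=50):
    """Simplified BACON-style iteration (illustration, not the published
    algorithm): grow a basic subset in blocks until its size stabilizes,
    then nominate everything outside it as an outlier."""
    n, p = X.shape
    m = np.median(X, axis=0)                        # robust start (option 2)
    basic = np.argsort(np.linalg.norm(X - m, axis=1))[: c * p]
    for _ in range(max_iter):
        T = X[basic].mean(axis=0)                   # mean of basic subset
        C = np.cov(X[basic], rowvar=False)          # covariance of basic subset
        diff = X - T
        d = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff))
        new = np.flatnonzero(d < cutoff * np.median(d))   # block update
        if len(new) == len(basic):                  # size stabilized: stop
            break
        basic = new
    return np.flatnonzero(d >= cutoff * np.median(d))     # nominated outliers

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
X[:5] += 12.0                                       # plant 5 outliers
out = bacon_sketch(X)
```

Because entire blocks of observations enter or leave the basic subset at once, only a handful of iterations are needed even for large n, which is the source of BACON's speed.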

Hadi et al. [32] add the correction factor in step 3 because when the basic subset

contains only a fraction of the sample, the elements of the covariance matrix tend

to be smaller than they should be.

This method is a faster version of a forward search algorithm proposed by

Hadi in [30] and modified in [31]. These methods are similar to BACON but

in step 3 add only the single point with the least d_i (provided it is less than a

certain threshold value). Several variations of forward search algorithms have

been examined in the literature, e.g. in [37] and [6], but they usually converge

very slowly for large n. Similarly, backward search algorithms (also discussed

in [37] and [6]) have been proposed which start with all n observations, and

at each step discard the single observation with the furthest distance, provided

it is above a certain minimum threshold. BACON is a significant computa-

tional improvement because it can move large blocks of observations in a single

step. Our study of BACON indicates that it is indeed computationally fast, and

demonstrates very good success amongst normally distributed data, but does


not perform as well on non-normal data. We will analyze its performance in

greater detail in chapter 4.

2.7 Non-Distance Based Methods

All of the methods mentioned so far base their outlier identification proce-

dures on a robust version of the Mahalanobis distance. While this is certainly the

most popular approach, it is not the only one. In 1974 Friedman and Tukey [25]

introduced a novel procedure which they called projection pursuit. Their pur-

pose was to find interesting structure in the data or unexpected characteristics

that might not be readily apparent. In addition to several other applications not

originally envisioned by Friedman and Tukey, including even multiple regression

[24], projection pursuit also lends itself very well to outlier identification.

For data sets following an elliptically symmetric distribution like the multi-

variate normal, the usual summary statistics such as sample mean, covariance,

etc. provide enough information to understand them in general terms. With other data

sets, however, important information or structures are frequently not adequately

captured by these statistics. The goal is thus to pursue projections in which the

data reveal interesting structures when viewed by the researcher in low dimen-

sion, relying on what Friedman and Tukey [25] call the human gift for pattern

recognition. The projection pursuit algorithm assigns a numerical index, called

the projection index, to certain one- or two-dimensional projections of the data

characterizing the amount of structure present in that projection, as can be

calculated through the data density variation. This projection index is then

numerically maximized with respect to the parameters defining it to obtain the


most interesting projection. The projection index can, and should, according

to Huber [41], be affine equivariant. After a certain projection has been viewed,

the particular structure that it revealed is removed, so that other interesting

projections can be displayed, similar to principal components analysis, which is

actually a special case of projection pursuit.

In a comprehensive invited article, Huber [41] states that projection pursuit

methods are able to ignore noisy and information-poor representations. This

gives them a distinct advantage over methods based on inter-point distances

such as multidimensional scaling and clustering techniques, which have both

been used for outlier identification; see [78], [59] and [67]. Friedman and Tukey

[25] originally decided to define data as uninteresting if it appeared to have a

normal distribution; Huber [41] gave heuristic arguments to support this idea.

To highlight interesting features of the data, they therefore define the projection

index to measure, and subsequently maximize, departure from normality. Due to

a renewal of interest in projection pursuit in the 1980s, Friedman [23] updated

their original algorithm. According to Huber [41], projection pursuit methods

have primarily one serious drawback: their high demand on computational
time. This demand will subsequently be shown to be quite severe indeed.

2.7.1 The Stahel-Donoho Estimator

The most well-known outlier identification method based upon the projec-

tion pursuit concept, is the Stahel-Donoho estimator, an affine equivariant esti-

mator with a breakdown point approaching 50%. This estimator was developed

by Stahel and Donoho in [79] and [19], and became better known when Maronna


and Yohai [53] published an analysis of it in 1995.

The Stahel-Donoho estimator is defined as a weighted mean and a weighted

covariance matrix, where each observation's weight is inversely proportional to

how outlying it is. This outlyingness measure is based upon the projection

pursuit idea that if a point is a multivariate outlier, there must be some one-

dimensional projection of the data in which this point is a univariate outlier.

Using a particular observation as a reference point, the Stahel-Donoho algorithm

determines which directions have optimal values for a pair of robust, univariate
location/scale estimators, then uses these estimators to assign weights to

the other points. Employing the median and the MAD (for a breakdown point

approaching 50%), necessitates solving a global optimization problem with dis-

continuous derivatives, for each observation. Selecting smooth estimators does

not help very much because this results in the objective function possessing

several local maxima, so that gradient search methods are rendered unsuitable.

Maronna and Yohai [53] infer that the reason not much attention has been paid

to the Stahel-Donoho estimator is due to other more tractable estimators such

as the MVE and A-estimators. Maronna and Yohai [53] present an algorithm

for this estimator, which works very well for p = 2, and state further that no

known feasible method exists for large p. A simulation study by Juan and Prieto

[44] with n = 100 and p = 5,10,20 comparing the MCD, MULTOUT, and the

Stahel-Donoho estimator, showed the latter outperforming the other two esti-

mators, although they all approached 100% failure rate at contamination level

ε = 10% for p = 20, at ε = 15% for p > 10, and at ε = 20% for p > 5. We


conclude that this method is not viable for large data sets, but it has given rise

to at least one other successful method of outlier identification.

2.7.2 Kurtosis

One way of reducing the computational cost of the Stahel-Donoho estima-

tor is to reduce the number of projections that need to be examined. Pena and

Prieto [60] propose a method which they call Kurtosis1. Kurtosis1 involves projecting
the data onto a set of 2p directions, where these directions are chosen to

maximize and minimize the kurtosis coefficient of the data along them. Kurtosis

is a measure of how peaked or flat the distribution is. Data sets with high kur-

tosis tend to have a sharp peak near the mean, decline rather rapidly, and have

heavy tails, while data sets with low kurtosis tend to have a flattened peak near

the mean; the uniform distribution is an extreme example. The kurtosis coeffi-

cient is also inversely related to bimodality. A small number of outliers would

thus cause heavy tails and a larger kurtosis coefficient, while a large number

of outliers would start introducing bimodality and the kurtosis would decrease.

Viewing the data along those projections that have the maximum and minimum

kurtosis values would display the outliers in a more recognizable representation.
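The projection index used here is simply the classical kurtosis coefficient of the projected data. As a small illustration (a sketch with helper names of our own, not Pena and Prieto's code), the coefficient is the fourth standardized moment:

```python
import numpy as np

def kurtosis_coefficient(z):
    """Fourth standardized moment of a univariate sample.

    Roughly 3 for normal data; smaller for flat or bimodal data
    (uniform gives 1.8); larger for peaked, heavy-tailed data.
    """
    z = np.asarray(z, dtype=float)
    m = z.mean()
    s2 = ((z - m) ** 2).mean()
    return ((z - m) ** 4).mean() / s2 ** 2

def projected_kurtosis(X, d):
    """Kurtosis coefficient of the rows of X projected onto direction d."""
    d = np.asarray(d, dtype=float)
    return kurtosis_coefficient(X @ (d / np.linalg.norm(d)))
```

A small fraction of outliers along a direction inflates the kurtosis of that projection, while a large fraction pushes the projection toward bimodality and deflates it, which is why both maximizing and minimizing directions are examined.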

For large data sets, the Stahel-Donoho algorithm requires projecting the

data onto randomly generated projections because full enumeration is infeasible,

and requires investigation along a very large number of projections in order to

be successful. By drastically reducing these directions, Kurtosis1 aims to significantly
improve computational speed without sacrificing accuracy. Successfully

solving the optimization routine to find the maximum or minimum kurtosis val-


ues necessitates reaching the global maximum or minimum. Since this is not

efficient, we have to settle for a set of p local maximizers and p local minimizers.

Pena and Prieto [60] show how computing a local maximizer/minimizer would

correspond to finding either (a) the direction from the center of the data straight

to the outliers, which is exactly what we are looking for, or else (b) a direction

orthogonal to it. They then project the data onto a subspace orthogonal to

the computed directions and rerun the optimization routine. This process gets

repeated p times, so in total 2p directions are examined, and we expect with

high probability that the outliers lie along at least one of these directions.

For each of these 2p directions, Pena and Prieto [60] use a backward search

algorithm based on the univariate median and MAD to find potential outliers.

Based upon the sample mean and covariance of all points not labeled as potential
outliers, robust Mahalanobis distances MD*_i are calculated for all the
observations. Those points with MD*_i below the χ²_{p,0.99} cutoff (the 0.99
quantile of the χ² distribution with p degrees of freedom) are not considered
outliers and are put back in the set of inliers. This process is repeated until
convergence (usually only a few iterations).
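The backward search details are in [60]; the final re-inclusion step, however, is easy to sketch. The following is a minimal illustration under our own naming (the cutoff is supplied by the caller, e.g. the χ²_{p,0.99} percentile), not the authors' implementation:

```python
import numpy as np

def mahalanobis_sq(X, mean, cov):
    """Squared Mahalanobis distances of the rows of X from (mean, cov)."""
    diff = X - mean
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

def reinclusion_pass(X, outlier_mask, cutoff):
    """One iteration: recompute mean/cov from the current inliers, then
    return to the inlier set any point whose squared distance falls below
    `cutoff`.  Returns the updated outlier mask."""
    inliers = X[~outlier_mask]
    d2 = mahalanobis_sq(X, inliers.mean(axis=0),
                        np.cov(inliers, rowvar=False))
    return d2 >= cutoff
```

Iterating until the mask stops changing reproduces the convergence loop described above.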

Our study of this method shows that it does indeed do a good job of detecting

outliers, for a wide variety of outlier types and data situations, although it

tends to produce a relatively large number of false positives. An interesting

feature is that its success rate remains approximately the same for outliers both

near and far, whereas distance-based methods demonstrate a marked increase

in performance for distant outliers. Although it is a significant improvement

over the Stahel-Donoho estimator, computational speed still remains a serious


weakness, especially with regard to p.

2.8 Types of Outliers

Certain types of outliers are easier to find than others. Some algorithms

generally succeed at detecting a certain type of outlier, but fail miserably with

another type of outlier. Clearly there are an infinite number of outlier types, but

some are more common than others. It is common practice when generating data

for the purpose of testing outlier algorithms, to assume the primary distribution

is an uncorrelated multivariate normal, and that the outliers come from another

multivariate normal with mean μ and covariance matrix Σ. If the estimator is

affine equivariant, we can assume without loss of generality that the primary

distribution has mean 0 and covariance I. Rocke and Woodruff [65] mention a

useful framework to consider multivariate normal outlier types when the primary

distribution is N_p(0, I):

Shift Outliers: The outliers have distribution N_p(μ, Σ).

Point Outliers: These outliers form a densely concentrated cluster, tending

towards a point mass.

Symmetric Linear Outliers: Given a unit vector u, the outliers have distribution
u N_p(0, Σ), for arbitrary Σ.

Asymmetric Linear Outliers: Given a unit vector u, and its coordinate-wise
absolute value |u|, the outliers have distribution |u| N_p(0, Σ).

Radial Outliers: The outliers have distribution N_p(0, kI), where k > 1.
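For concreteness, the shift, point, and radial types in this taxonomy can be generated as follows (a sketch of our own; the shift distance mu and inflation factor k are illustrative choices, not values from [65]):

```python
import numpy as np

def make_outliers(kind, n_out, p, mu=5.0, k=5.0, rng=None):
    """Generate n_out outliers of a given type, relative to an
    N_p(0, I) primary distribution."""
    rng = np.random.default_rng(rng)
    if kind == 'shift':       # N_p(mu * e, I): same shape, shifted mean
        return mu + rng.standard_normal((n_out, p))
    if kind == 'point':       # near point mass: a tight cluster at mu * e
        return mu + 0.01 * rng.standard_normal((n_out, p))
    if kind == 'radial':      # N_p(0, k I) with k > 1: inflated scale
        return np.sqrt(k) * rng.standard_normal((n_out, p))
    raise ValueError(f'unknown outlier type: {kind}')
```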


Linear outliers are a commonly used outlier type. Radial outliers are too
easy to be useful for revealing differences between outlier algorithms (all of the outliers
will usually be found by all of the algorithms). Rocke and Woodruff [66] show

that for distance-based methods, point outliers and shift outliers are the most

difficult to detect. For shift outliers, the expectation of the Mahalanobis distance

between the outlier and primary distribution means is least when the covariance

matrices are similar.

Point outliers are also difficult for most distance-based methods, and these

outliers are perhaps best detected with algorithms specifically designed to detect

such densely concentrated outlier clusters. Standard clustering techniques as

described in, for example, Johnson and Wichern [43, ch. 12], do not appear to

perform very well in detecting clusters of this type. In order to detect densely

concentrated clusters, several authors have developed special techniques for this

purpose, such as Pena and Prieto [59] and Rocke and Woodruff [67].

We briefly comment on the near exclusive use of the multivariate normal

distribution when generating sample data sets for outlier algorithms. Our study

shows that an algorithm's performance can change drastically when switching
to a different type of distribution. For example, BACON demonstrates virtually
perfect performance on a variety of normally distributed outlier types, but when
analyzing data from a mixture of χ² distributions, shows a near 100% failure rate

in many situations. There is no guarantee that observed data follows a normal

distribution in practice, and we feel that in general, more testing of algorithms

on non-normal data should be conducted.


We thus conclude our tour of past and current outlier methods and, in the
following chapter, propose a new algorithm for outlier detection.


3. The Development of a New Method

3.1 Preliminary Investigation

As a preliminary step to developing a new outlier detection method, we

briefly examined and compared existing methods to determine possible areas of

improvement. We selected four algorithms that appeared to have good potential
for finding outliers: MCD (using an improved version called fastMCD [71],
implemented as cov.mcd in Splus), BACON (a DOS executable version of the
code was obtained from the author, Prof. A.S. Hadi), MULTOUT, the hybrid
algorithm of Rocke and Woodruff (a C version of the code was obtained from the
authors' website), and Kurtosis1 (a MATLAB version of the code was obtained
from the authors). In accordance with most other authors, such as Rocke and

Woodruff [66], we use a multivariate version of Tukey's contaminated Normal
model [81], (1 − ε) N(0, I) + ε N(δe, λI), where e = (1, ..., 1)^T and δ = 10.
We considered p = 5, 10, 20 and ε = 10%, 20%, 30%, 40%. The values of λ we
consider are λ = 0.1 and λ = 1.0, which correspond to a concentrated cluster of
outliers (point outliers) and shift outliers, respectively. Radial outliers, λ = 5.0,
are not very interesting or informative since all the methods yielded 2% error

or less. None of the algorithms except cov.mcd were suitable for repeated
simulations in the given format, so for each parameter combination we carried out
10 simulations with n = 1,000 points, generated by the Splus random number
generator. The exception was Kurtosis1, which due to MATLAB's restriction


on large data sets, was designed for only n = 750; these points were randomly

drawn from the original set of 1,000 points.
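Data from this contaminated Normal model are straightforward to generate; a minimal sketch (a helper of our own, returning the outlier labels for later scoring):

```python
import numpy as np

def contaminated_normal(n, p, eps, delta=10.0, lam=1.0, rng=None):
    """Sample n points in p dimensions from Tukey's contaminated Normal
    model (1 - eps) N(0, I) + eps N(delta * e, lam * I)."""
    rng = np.random.default_rng(rng)
    n_out = int(round(eps * n))
    inliers = rng.standard_normal((n - n_out, p))
    outliers = delta + np.sqrt(lam) * rng.standard_normal((n_out, p))
    labels = np.r_[np.zeros(n - n_out, dtype=int), np.ones(n_out, dtype=int)]
    return np.vstack([inliers, outliers]), labels
```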

Frequently, results of this type are presented in a 2 x 2 table showing success

and failure in identifying outliers and non-outliers, which we henceforth call in-

liers. It is also possible to calculate various indices such as sensitivity, specificity

and positive/negative predictive power, terms which are commonly used in the

medical literature. Throughout this thesis, we present the percentage false neg-

atives followed by the percentage false positives in the same cell; we henceforth

refer to these respective values as the outlier error rate and inlier error rate.

We propose these names because the outlier error rate specifies the percentage

errors recorded within the group of true outliers, and similarly for the inlier

error rate. Note that these respective error rates correspond to the complement

of the sensitivity and the complement of the specificity. This is more compact

than a series of 2 x 2 tables and allows us to easily focus on bringing the error

rates down to zero. Nevertheless, 2x2 tables are very descriptive and frequently

used in existing literature, so we illustrate the connection between our notation
and a standard 2 x 2 table in Table 3.1.

Table 3.1: Sample 2 x 2 table illustrating the notation used in this thesis

                 Predicted Outliers   Predicted Inliers
True Outliers            a                    b
True Inliers             c                    d

Outlier Error Rate = b/(a + b).   Inlier Error Rate = c/(c + d).
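In code, the two error rates are one-line computations from the table's counts; a trivial helper of our own for scoring simulation output:

```python
def error_rates(a, b, c, d):
    """Outlier and inlier error rates from the 2 x 2 counts of Table 3.1.

    a: true outliers predicted outliers    b: true outliers predicted inliers
    c: true inliers predicted outliers     d: true inliers predicted inliers
    """
    outlier_error = b / (a + b)   # complement of the sensitivity
    inlier_error = c / (c + d)    # complement of the specificity
    return outlier_error, inlier_error
```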

Results from these simulations are shown in Table 3.2. The values 50.0/10.3

in the upper left cell of this table mean that for this simulation, MCD failed

to identify 50% of the outliers, and mistakenly identified 10.3% of the regular

points [inliers] as outliers. This corresponds to sensitivity 50.0%, specificity

89.7%, positive predictive value 67.6% and negative predictive value 80.7%.

Computational time was not a big issue for these simulations due to the

small data size. For BACON, we used the second option of obtaining an initial

subset (coordinate-wise medians) because performance with the other option

was markedly inferior. MULTOUT's C code took slightly longer to execute
than Kurtosis1's MATLAB code (on the order of one minute), which might

be a source of concern for larger data sets, but did not pose any difficulties for

n = 1,000. Examination of Table 3.2 reveals that both MCD and MULTOUT

are incapable of detecting concentrated outlier clusters. (This could be due to

the algorithms using an initial value originating from a cell consisting solely of

outliers, which would have a significantly smaller covariance than a cell consisting
solely of inliers, followed by convergence to the wrong solution.) BACON

demonstrates a perfect score, while Kurtosis1 is usually quite accurate with a
few exceptions. The exceptions for Kurtosis1 occur when it missed all or nearly
all of the outliers on one or two of the ten runs. For δ = 10, however, near
perfect performance should be expected as there is virtually zero overlap between
the inlier and outlier groups.


Table 3.2: Preliminary simulations with multivariate normal outliers

                              MCD       BACON   MULTOUT   Kurtosis1
p = 5    ε = 0.3   λ = 0.1   50.0/10.3   0/0     0/0.7      0/0.8
                   λ = 1      0/2.7      0/0     0/0.8      0/0.5
         ε = 0.4   λ = 0.1   100/35.6    0/0    100/16.9    0/0.9
                   λ = 1      0/2.5      0/0     0/0.8     10.0/0.7
p = 10   ε = 0.2   λ = 0.1   50.0/7.6    0/0     0/1.0      0/1.0
                   λ = 1      0/2.7      0/0     0/0.8      0/1.0
         ε = 0.3   λ = 0.1   100/22.7    0/0    70.0/5.8    0/0.9
                   λ = 1      0/2.8      0/0     0/1.0      0/1.0
         ε = 0.4   λ = 0.1   100/49.1    0/0    100/22.1    0/0.8
                   λ = 1      0/2.5      0/0     0/0.7      0/0.8
p = 20   ε = 0.1   λ = 0.1    0/3.6      0/0    20.0/1.2    0/1.0
                   λ = 1      0/3.6      0/0     0/0.9      0/1.1
         ε = 0.2   λ = 0.1   100/17.6    0/0    100/5.3     0/1.0
                   λ = 1      0/3.0      0/0     0/0.9     18.5/1.2
         ε = 0.3   λ = 0.1   100/38.2    0/0    100/12.9   11.0/5.6
                   λ = 1     76.0/2.9    0/0    29.5/1.1    0/1.2
         ε = 0.4   λ = 0.1   100/76.0    0/0    100/34.5    0/1.0
                   λ = 1     96.0/3.5    0/0    99.1/1.6    0/1.0

The data in Table 3.2 were generated from multivariate Normal distributions.
An interesting question not usually addressed in the literature is how
these methods would perform with non-Normal data. We therefore carried out
a similar experiment, but this time we generated the data from a mixture of
uncorrelated χ² distributions: the outliers are drawn from a χ² distribution with
larger degrees of freedom than the primary distribution (which describes the
inliers). These results can be seen in Table 3.3, also based on the mean over
10 iterations.

Several differences from the multivariate Normal data are apparent. The
most noticeable is BACON's significant inability to detect χ² outliers for p = 5
and at the higher contamination levels for p = 10. MCD continues to produce
relatively high numbers of false positives, while MULTOUT and Kurtosis1 produce
better overall results.

3.2 The Development of a New Method

3.2.1 Obtaining Robust Estimates

Since it was our purpose to develop an outlier detection method suitable for

large data sets, we considered computational speed a very important criterion.

We therefore chose not to pursue the projection pursuit type algorithms, such as
Kurtosis1, but to focus on distance-based methods instead. A key ingredient of
a successful distance-based method is a robust estimate of the covariance matrix.
The covariance matrix is more sensitive to outliers (i.e., more difficult to estimate
robustly) than the location estimate, and the Mahalanobis distance is in turn
extremely sensitive to an incorrect scale estimate: a poor estimate could render
it incapable of separating the


Table 3.3: Preliminary simulations with χ² outliers

                    MCD       BACON      MULTOUT   Kurtosis1
p = 5    ε = 5%    0/12.2    46.6/0.01    0/6.2     0.2/6.9
         ε = 10%   0/12.4    91.4/0.03    0/6.56    0.2/6.65
         ε = 20%   0/10.3    97.0/0.01    0/5.2     0.2/6.7
         ε = 30%   0/9.2     99.3/0       1.0/4.0   5.5/5.6
         ε = 40%   0/6.3     99.5/0      80.1/1.0  17.2/3.6
p = 10   ε = 5%    0/13.7     0.8/0.2     0/7.4     0/7.9
         ε = 10%   0/12.6     0.4/0.2     0/7.5     0/7.6
         ε = 20%   0/11.6    10.3/0.2     0/5.7     0/8.0
         ε = 30%   0/9.5     39.3/0.2     0/5.7     0/7.3
         ε = 40%   0/7.9     79.1/0.1     0/4.9     5.6
p = 20   ε = 5%    0/15.4     0/0.2       0/8.0     0/7.6
         ε = 10%   0/13.8     0/0.3       0/7.3     0/8.3
         ε = 20%   0/12.4     0/0.3       0/6.3     0/7.6
         ε = 30%   0/9.5      0/0.4       0/6.0     0/6.5
         ε = 40%   0/9.3      0/0.2       0/5.0     0/5.6

outliers and inliers. We therefore devoted considerable time to obtaining a fast,

robust estimate of the covariance matrix.

Of the four estimators shown in Figure 2.2, we prefer Andrews' sine function
and Tukey's biweight function, because they are smoother than the other two
estimators. The biweight has been successfully used in a variety of situations, so
we also decided to use it. For p = 2, Mosteller and Tukey [57, ch. 10] calculate a

highly robust covariance matrix based on biweight robust regression. To achieve

this, however, they sacrifice component invariance, i.e., the results from calculating
var(x) followed by var(y) differ from those of first calculating var(y) followed
by var(x), and hence cov(x,y) also differs from cov(y,x). Mosteller and Tukey
[57] provide an extension for p = 3 and higher, but the computational burden
increases dramatically because estimation of the off-diagonal elements involves

multiple regression procedures. This procedure would not be feasible for high

dimensions. We felt that the price of invariance and additional computational

burden were too high for large data sets, so did not pursue this direction further.

Analyzing each component separately, we subsequently investigated the

weights assigned to observations by the biweight function. Let the univariate
observations be x_1, ..., x_n with corresponding weights w_1, ..., w_n; let x_bw
denote the location estimate, which we initially take to be the median; S a scale
estimate based on the median absolute deviation from x_bw: S = median{|x_i − x_bw|};
and c a tuning constant. The weights w_i are then iteratively calculated according to


w_i = (1 − ((x_i − x_bw) / (cS))²)²,   when |x_i − x_bw| < cS,

w_i = 0,                               otherwise.
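This iteration is easy to express in code. A minimal sketch (a function of our own with the median/MAD start described above, not the thesis implementation):

```python
import numpy as np

def biweight_weights(x, c=6.0, iters=3):
    """Iterative Tukey biweight weights for a univariate sample.

    Starts from the median and the MAD-based scale S; each pass
    recomputes the weights and updates the location estimate.
    """
    x = np.asarray(x, dtype=float)
    loc = np.median(x)
    w = np.ones_like(x)
    for _ in range(iters):
        S = np.median(np.abs(x - loc))        # MAD about the current location
        u = (x - loc) / (c * S)
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        loc = np.sum(w * x) / np.sum(w)       # biweight location update
    return w
```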

Mosteller and Tukey [57] recommend c = 6 or c = 9; Hoaglin et al. [38]
examine the biweight's robustness properties, such as gross error sensitivity, and
performance (efficiency) for these and other values. The biweight has a higher
efficiency at c = 9 because less data is rejected, but better (i.e., lower) gross
error sensitivity at c = 6. Kafadar [45] also conducted a study investigating the
biweight's efficiency for different values of c, sample sizes and distributions and
found that its efficiency increases with both sample size and tuning constant c at
the Gaussian. At other distributions such as One-Wild and Slash, its efficiency
increases with sample size but decreases with c. Note that c = ∞ corresponds
to the sample mean, because then all the observations have weight equal to one.
For instance, at c = 6, Hoaglin et al. [38] show that the asymptotic variance of
the biweight at the standard Gaussian is 1.094, so that its asymptotic efficiency
relative to the mean (on Gaussian data) is e = 91.4%, and the gross error
sensitivity is γ* = 1.68. For comparison, recall that the mean has γ* = ∞ and
for the median, γ* = 1.25. The asymptotic efficiency at c = 9 is e = 98.2%,
with γ* = 2.05. Mosteller and Tukey [57, p. 206] note that the performance
of an estimator with efficiency above 90% is indistinguishable in practice from
one with an efficiency of 100%, so from a theoretical viewpoint, c = 6 seems
preferable when dealing with outliers. Since the expected value of the MAD
equals 0.6745σ at the Gaussian, the rejection point cS at c = 6
equals roughly 4σ. We found that c = 6 works better than c = 9 because it
limits potential outliers from being included in the initial location and scale
estimates; it is very important that these estimates accurately describe only
the inliers. We did not investigate c = 3; this would yield a rejection point of
0.6745 · 3σ ≈ 2σ, which would reject too many inliers and underestimate the
scale.

Another computational matter is convergence. Kafadar [45] found that three

to six iterations were sufficient for convergence to 0.0005 of the biweight scale

statistic. For our intended applications, namely large data sets, computational

speed is very important, so we used 3 iterations. We do not store the value of the

biweight location or scale, and will shortly discuss how we found better results

when reassigning the biweight weights to either zero or one, so the only critical

aspect is whether the weights are above or below the reassignment boundary.

After fitting the biweight to each of the p components, we used the weights

thus obtained to calculate a weighted mean and covariance matrix. For the

covariance, we used the product of the weights for the different coordinates as

J2k=i_wk^k(xk~x )(xk~xl) ^ where the superscript indicates the

2^k=i wkwk

dimension number (1.. .p) and the subscript the observation number (1... n).

Investigation into the covariance matrices obtained in this manner revealed that

they were frequently too small. This occurred because most of the inliers re-

ceived weights in the range 0.90-0.99, so that their full contributions to the

sample covariance were not counted. This was significant because points near

follows: C(i,j)

50

the edge of the primary distribution received reduced weights, yet their full

weights were necessary for an accurate estimate. To correct this problem, we

focused attention on producing an initial, smaller set of outlier-free data from

which we could calculate an unweighted sample mean and covariance. This is

a similar idea to BACON, but we achieve the result in a very different man-

ner. To this end, we reassign a weight of 0 to all points with biweight-weight

of less than 0.3, and a weight of 1 to those points with biweight-weight of 0.3

and greater. This value of 0.3 was initially arbitrarily chosen; we selected a

relatively small value so as not to discard too much data. We subsequently refer
to the value where this separation is made as the cutoff value; determining a
suitable cutoff value for a wide variety of data situations is an important part
of this investigation.
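Given the n x p array W of per-coordinate biweight weights, the reassignment and the resulting unweighted estimates can be sketched as follows (a helper of our own; note that with 0/1 weights the product-weight covariance above reduces to the plain sample covariance of the retained rows):

```python
import numpy as np

def cutoff_mean_cov(X, W, cutoff=0.3):
    """Reassign weights to 0/1 at `cutoff`, then compute the unweighted
    sample mean and covariance over the fully retained observations."""
    keep = np.all(W >= cutoff, axis=1)   # every coordinate must clear the cutoff
    Xk = X[keep]
    return Xk.mean(axis=0), np.cov(Xk, rowvar=False), keep
```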

We illustrate the superiority of this modification with a tricky set of correlated
Poisson data, following a method described in [42, ch. 37]. Consider
a data set with n = 10,000 and p = 5. Generate the first 8,000 observations
from a Poisson(10) distribution, and the first component of the next 2,000
observations from a Poisson(40) distribution. The remaining 4 components of the
last 2,000 observations are again generated from a Poisson(10) distribution. To
obtain correlated data, we generate another 10,000 x 1 vector Y from a
Poisson(10) distribution, and add this vector to each component of the data matrix.
Let Y_1, ..., Y_5 denote the columns of the data matrix before addition of Y, and
X_1, ..., X_5 the columns after addition. It is straightforward to compute the
theoretical covariance matrix: cov(X_1, X_2) = cov(Y_1 + Y, Y_2 + Y) = var(Y) = 10,
and var(X_1) = var(Y_1 + Y). For the set of inliers (the first 8,000 observations),
the variance of each column X_k is 10 + 10 = 20, k = 1, ..., 5. Using

the original set of biweight weights, the following sample covariance matrix was

obtained:

15.29 6.49 6.47 6.57 6.68

6.49 13.35 6.01 6.28 6.36

6.47 6.01 12.49 6.08 6.24

6.57 6.29 6.08 12.97 6.47

6.68 6.36 6.24 6.47 13.42

Using a cutoff value of 0.3, we obtained the following sample covariance matrix:

20.19 8.91 8.99 9.16 9.38

8.91 19.31 8.77 9.21 9.33

8.99 8.77 18.35 8.95 9.10

9.16 9.21 8.95 19.14 9.48

9.38 9.33 9.10 9.48 19.45

which is a considerable improvement and only slightly smaller than the theo-

retical inlier covariance matrix: 20s on the diagonal and 10s everywhere else.

Considering the fact that there is considerable overlap between the outliers and

inliers, and also that the data are not Normally distributed, the latter covariance

estimate is quite satisfactory. It leads to adequate separation between inlier and

outlier Mahalanobis distances, so that for this particular data set and cutoff

value, the outlier identification error is virtually zero. For comparison, we also


examine the results obtained by some of the other outlier algorithms. For the

same data set, BACON did not detect any outliers and both its initial and final

covariance estimates were very similar, the final estimate being

169.68 9.54 9.52 9.69 9.42

9.54 20.33 9.82 10.20 10.23

9.52 9.82 19.81 10.14 10.07

9.69 10.20 10.14 20.28 10.46

9.42 10.23 10.07 10.46 20.34
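The construction of this test set is short enough to reproduce; a sketch of our own code following the recipe above:

```python
import numpy as np

def correlated_poisson_data(rng=None):
    """n = 10,000, p = 5: 8,000 inliers, 2,000 outliers shifted in the
    first coordinate, with a shared Poisson(10) vector Y added to every
    column to induce a covariance of var(Y) = 10 between columns."""
    rng = np.random.default_rng(rng)
    base = rng.poisson(10, size=(10000, 5)).astype(float)
    base[8000:, 0] = rng.poisson(40, size=2000)   # outliers' first component
    Y = rng.poisson(10, size=(10000, 1)).astype(float)
    return base + Y                               # X_k = Y_k + Y
```

The sample covariance of the first 8,000 rows should then sit near the theoretical inlier matrix: 20 on the diagonal and 10 off it.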

3.2.2 Determining the Rejection Boundary

An important issue is how to actually reject the outliers. Ideally, our robust

estimates of location and scale will lead to robust Mahalanobis distances demon-

strating a good separation between outliers and inliers. Johnson and Wichern

[43, ch. 5] state that for multivariate normal data, the classical Mahalanobis

distance based on the sample mean and covariance matrix follows an F
distribution. This implies that an appropriate boundary value could be found by
selecting a suitable F percentile. Our study showed that boundary values
calculated in this manner were frequently suboptimal. BACON uses a modified
χ²_p percentile as described on page 33; we found that the rejection values it
obtained were often too high. Plotting the robust Mahalanobis distances would

easily allow perfect classification through visual inspection, but the computed

rejection point led to unnecessary misclassification errors. We therefore decided

to design a new criterion that didn't require any assumptions and would work


for robust Mahalanobis distances computed from a wide variety of data types.

Intuitively, if robust location and scale estimates have been computed that

accurately describe the inliers and reveal the unexpectedly large displacement of

the outliers, the robust Mahalanobis distances will separate into two categories:
a group with smaller values (inliers) and a group with larger values
(outliers). For sufficient data, if we were to calculate the empirical density of these
distances, we expect to see a peak describing the inliers, after which the density
would return close to zero, and could either reveal a smaller second peak (if the
outliers' Mahalanobis distances were sufficiently concentrated) or simply remain
close to zero. In practice, we found that even with 40% outliers, the first peak
was very well defined but the second peak was virtually nonexistent because, in
contrast to the inliers, the outliers' distances were widely dispersed.

The value of Mahalanobis distance at which the density first approaches

zero could therefore serve as a final rejection boundary. Any observations with

Mahalanobis distance less than this value would be classified as inliers, and

otherwise as outliers. This rejection point takes into account the unique
characteristics of each data set and doesn't require any assumptions other than robust

location/scale estimates that accurately describe the inliers. When the peak

flattens out and the density returns to zero, the data (inliers) upon which the

peak is calculated have vanished. We illustrate this concept in Figure 3.1. The

robust Mahalanobis distances calculated by our method, using a cutoff of 0.3,

are shown in the first graph, with the corresponding density in the second. (We

attribute the sharp rise in density for robust Mahalanobis distances greater than
100 to numerical instabilities of the density algorithm at this endpoint.)
The data for these figures were generated with n = 10,000 and p = 10;
the first 6,000 points were drawn from one uncorrelated χ² distribution and the
remaining 4,000 points from an uncorrelated χ²_13. Both graphs also show the
rejection value, 18.975,

yielding 2.45% outlier error and 1.08% inlier error. For comparison, BACON

failed to detect nearly all of the outliers in the same data set (98.5% error among

outliers, no error among inliers). When BACON was modified to compute a re-

jection point based upon the density method thus described, its outlier error

rate dropped to 77.4%. (That is, we inserted our density estimation routine into

the BACON program and used this instead of BACONs modified xl percentile

to determine the final rejection boundary.) An even more impressive improve-

ment occurred when the outliers were drawn from %f5, for 20% contamination

in p = 5 dimensions. When computing the rejection point according to its usual

modified xl percentiles, BACON produced outlier and inlier error rates of 98.1%

and 0.01%, but when modified to use the density method, these respective er-

ror rates became 38.9% and 3.9%. (The fact that vastly different error rates

were produced by our method and BACON, utilizing the same outlier rejection

procedure, points to the fact that robust location/scale estimates are crucial

to overall performance, leading to good separation between outlier and inlier

Mahalanobis distances.)

In practice, rather than requiring the estimated density to actually reach

zero, it is better to determine the rejection point as the first time that the

slope of this density enters a small interval containing zero. As the density


Figure 3.1: Robust Mahalanobis distances with estimated density and slope


decreases, the slope is negative and increasing (approaching zero). Using the

slope would give better results particularly for the situation where there is very

little separation between the two groups, and the density peak returns to zero

very gradually. This can be clearly seen in Figure 3.1, where even among highly

contaminated data, imposing a smallness condition on the slope results in a very

accurate classification scheme. We estimate the slope through first differences,

which we have usually found to be sufficiently smooth for our purposes, but

we acknowledge that smoothing the slope estimate could improve performance.

Occasionally, we did experience some difficulty with the slope being zero before

the peak had ended, but this was actually due to non-smoothness of the esti-

mated density curve itself. This problem was solved by imposing the additional

requirement that the density curve be less than a certain percentage of its peak

value; we found good results with 0.3. An alternative solution (which we didn't
investigate) would be to smooth these first differences. Examination of several
empirical density curves revealed that departures from monotonicity following
the density's peak were not uncommon, and were sometimes longer than just
a few points. We leave the investigation of smoothing the density's slope for

future work, however.
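Putting the pieces together, the rejection rule can be sketched as follows (a minimal version of our own: a Gaussian kernel density on a grid, first differences for the slope, and the two conditions described above; the bandwidth rule and grid size are illustrative assumptions, not the thesis settings):

```python
import numpy as np

def rejection_point(md, tol=1e-4, frac=0.3, gridsize=512):
    """First grid point past the density peak where the slope is within
    `tol` of zero and the density has dropped below `frac` of its peak."""
    md = np.asarray(md, dtype=float)
    h = 1.06 * md.std() * len(md) ** (-0.2)        # rule-of-thumb bandwidth
    grid = np.linspace(md.min(), md.max(), gridsize)
    kern = np.exp(-0.5 * ((grid[:, None] - md[None, :]) / h) ** 2)
    dens = kern.mean(axis=1) / (h * np.sqrt(2.0 * np.pi))
    slope = np.diff(dens) / np.diff(grid)          # first differences
    peak = int(dens.argmax())
    for i in range(peak + 1, gridsize - 1):
        if abs(slope[i]) < tol and dens[i] < frac * dens[peak]:
            return grid[i]
    return grid[-1]                                # fallback: reject nothing
```

Observations with a robust Mahalanobis distance above the returned value would then be declared outliers.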

Computing a confidence interval on the slope requires knowledge of the true

density, and, as previously stated, we cannot assume a $\chi^2$ distribution for the

Mahalanobis distances. We therefore decided to empirically compare results

from four half-interval lengths, centered around zero: (a) 0.001, (b) 0.0005,

(c) 0.0001, and (d) 0.00005. The first value results in a higher sensitivity to

57

outliers, but also more false negatives, than the other values, and conversely for

the fourth value. With the small number of simulations performed up to this

point, it was impossible to tell which of these four values yielded the best overall

success rates, so throughout most of our investigation we examined results from

all four values. Only near the end of our study, when we had run thousands of

simulations, did we feel comfortable in selecting (c) 0.0001 as yielding the overall

best success rates in a wide variety of data situations. The first value resulted

in an unacceptably high number of false negatives in many situations, with

opposite results for the fourth value. It is worth noting that these cutoff values

are not dependent on the scale of the data, because the Mahalanobis distance

utilizes a covariance estimate and is accordingly scale invariant. This is one

reason why it is essential to obtain an accurate covariance estimate. The best

interval length will, however, depend on the acceptable error rates as described

above.
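As a concrete illustration, the rejection-point search described above can be sketched in a few lines (illustrative Python, not the thesis's Fortran; the function name and the evenly spaced density grid are assumptions):

```python
def rejection_point(grid, density, half_width=0.0001, peak_frac=0.3):
    """Return the first grid value past the density peak where the
    first-difference slope lies in [-half_width, half_width] and the
    density has dropped below peak_frac of its maximum."""
    peak = max(density)
    start = density.index(peak)          # index of the main peak
    step = grid[1] - grid[0]             # evenly spaced grid assumed
    for i in range(start, len(density) - 1):
        slope = (density[i + 1] - density[i]) / step   # first difference
        if abs(slope) <= half_width and density[i] < peak_frac * peak:
            return grid[i]
    return grid[-1]                      # fallback: declare no outliers
```

All observations whose Mahalanobis distance exceeds the returned value would then be declared outliers.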

3.2.3 Computing the Sample Density

There are several methods of computing the sample density. We followed

the ideas of Silverman [77], and especially his use of the kernel density estimate.

Let $\hat{f}(x)$ be the estimated density of $f$ at $x$. We can calculate $\hat{f}(x)$ by
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} w(x_i, x),$$
where $w(x_i, x)$ is the weight contributed by the observation $x_i$ at the point $x$, and $w(x_i, x)$ satisfies $\int_{-\infty}^{\infty} w(x_i, x)\,dx = 1$ and $w(x_i, x) \ge 0$. It is also clear that

58

the smoothness properties of $\hat{f}$ will depend heavily upon this weighting function. The kernel density is obtained by letting the kernel function $K(x, x_i, h)$ replace the weighting function and scaling appropriately:
$$w(x_i, x) = \frac{1}{h} K\!\left(\frac{x - x_i}{h}\right),$$
so that the kernel density estimate is given by
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right).$$
The required conditions on $K$ are always satisfied if $K$ is chosen to be a probability density, as is usually done. Specifically, these conditions are $\int_{-\infty}^{\infty} K(x)\,dx = 1$ and $K(x) \ge 0$. (A symmetric $K(x)$ is very useful, but not absolutely essential.) If $K(x)$ is a probability density, then so is $\hat{f}(x)$, and moreover $\hat{f}(x)$ will inherit all the continuity and differentiability properties of $K(x)$. Epanechnikov [22] showed that the choice of kernel is not crucial to the mean integrated square error of the estimated density, so computational efficiency is usually the deciding factor. The normal distribution is a common choice for $K(x)$, which we utilize. Among other benefits, the normal density will ensure that $\hat{f}(x)$ is a smooth curve possessing derivatives of all orders. The window width $h$ determines the degree to which far-away observations influence the density estimate at a particular point. It also helps to determine the smoothness of the resulting $\hat{f}$. An overly small value of $h$ will result in a non-smooth density estimate with individual bumps around each observation, tending towards a collection of Dirac delta functions as $h \to 0$, while if $h$ is too large, the resulting $\hat{f}$

59

will contain only one very smooth peak over the mean of the data. In our con-

text, choosing h too small will cause the density estimate to be overly influenced

by random fluctuations and lead to a non-smooth slope estimate. Letting h be

too large will cause the outliers to affect the density peak intended to describe the inliers, and extend the density's return to zero, which is critical in small

separation situations. For data containing outliers, Silverman [77] recommends

a robust window width,

$$h = 0.9\,A\,n^{-1/5},$$

where A = min{standard deviation, interquartile range/1.34}. Silverman [77]

also discusses an adaptive window width which is narrower when the data are

dense and wider otherwise, but we found satisfactory results with the fixed width

kernel so we did not investigate this further.
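The fixed-width estimate with Silverman's robust window can be sketched as follows (illustrative Python with hypothetical names; the thesis's own implementation is in Fortran):

```python
import numpy as np

def robust_width(x):
    # Silverman's robust rule: h = 0.9 * A * n^(-1/5),
    # A = min(standard deviation, interquartile range / 1.34)
    x = np.asarray(x, dtype=float)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    a = min(x.std(ddof=1), iqr / 1.34)
    return 0.9 * a * len(x) ** (-0.2)

def kde(x, grid, h=None):
    # f_hat(t) = (1 / (n h)) * sum_i K((t - x_i) / h), Gaussian K
    x = np.asarray(x, dtype=float)
    h = robust_width(x) if h is None else h
    u = (grid[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=1) / (len(x) * h)
```

The estimate integrates to one over a grid wide enough to cover the data, which is a quick sanity check on any implementation.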

To our knowledge, the estimated density of robust Mahalanobis distances

has not been previously used in outlier identification, and we consider this to be

an important contribution to outlier identification.

3.2.4 Modifications to the Original Proposal

For most of this investigation, we also computed a similar estimate of loca-

tion, namely the coordinate-wise location estimate Xbw provided by the biweight

function. After we were able to efficiently run thousands of simulations, however,

we rejected this variation. The weighted mean gave slightly better results, in

terms of producing more separation of outlier and inlier Mahalanobis distances,

and produced overall error rates approximately 0.5% to 0.75% lower than the

60

coordinate-wise biweight location estimate.

Another variation that we examined was what we refer to as the MINWT

option. After computing the biweight weights and reassigning them to either

zero or one, MINWT consists of setting all p weights of an observation's components equal to the minimum weight across these components. That is, if at

least one component receives a weight of zero, the other components of that

observation also receive a weight of zero. The purpose here is to make abso-

lutely sure that no outliers receive a weight of one and influence the sample

mean/covariance calculation. Since the classical sample mean and covariance

are very sensitive to outliers, they could easily be adversely affected if sufficient

care isn't taken to ensure an outlier-free subset. If a particular observation is an

outlier, some of its components might be slightly inside the region of acceptance,

and receive weights above the cutoff value. These weights would be raised to one,

influencing the robust estimates more than was intended. The MINWT option

is certainly not very efficient and would only be feasible when there is sufficient

data to discard questionable points. For certain difficult situations, this option

did in fact lead to significantly better results, notably for heavy contamination (30%–40%) and very little outlier separation.
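A minimal sketch of the MINWT reassignment (illustrative Python; the function name is mine):

```python
import numpy as np

def minwt(weights01):
    """Propagate the minimum 0/1 weight across each observation's p
    components: if any component has weight zero, all p weights of
    that observation become zero."""
    w = np.asarray(weights01)
    return np.repeat(w.min(axis=1)[:, None], w.shape[1], axis=1)
```

For example, a row of weights [1, 0, 1] becomes [0, 0, 0], while [1, 1, 1] is unchanged.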

Initially, our cutoff value of 0.3 was somewhat arbitrarily chosen as a value

that might provide an initial outlier-free subset. Subsequent analysis indicated

that 0.3 might not be the best choice; better results were found using a higher

cutoff. For the location estimate $\bar{X}_{bw}$, define $u(x)$ as a measure of the scaled distance from $\bar{X}_{bw}$:
$$u(x) = \frac{x - \bar{X}_{bw}}{c\,S},$$
where $c$ is the biweight tuning constant and $S$ is a resistant scale estimate such as the MAD.

61

The biweight weighting scheme can then be written in the form

$$w(u) = (1 - u^2)^2.$$

The $u$-value corresponding to $w(u) = 0.3$ is
$$u(0.3) = \sqrt{1 - \sqrt{0.3}} \approx 0.6725.$$

For a cutoff of 0.3, if a point $x$ satisfies $|u(x)| > 0.6725$ then $w(x) = 0$; otherwise $w(x) = 1$. The effect of this reassignment is to change the smooth biweight weighting function into a step function with discontinuities at $u = \pm 0.6725$. We depict these two functions graphically in Figure 3.2.

It can be seen from this figure that the area beneath the curve of the modified weighting function is considerably larger than that beneath the biweight. Intuitively, it seems that this would result in the overall weights being too high, allowing too many outliers to influence the robust estimates. It would be preferable for the areas beneath the two curves to be at least reasonably similar, so that when the weights are changed, certain properties are preserved, such as the mean weight $\bar{w}$. It is not hard to calculate the cutoff value at which the two weighting schemes will have the same area under the curve. For ease of computation, we consider only $u \ge 0$. The area under the biweight curve is then
$$\int_0^1 (1 - u^2)^2\,du = 1 - \tfrac{2}{3} + \tfrac{1}{5} = \tfrac{8}{15} \approx 0.53,$$

62

Figure 3.2: Weighting function for the biweight and our proposed method

63

so that if the area under the curve for the step function is to equal 0.53, the

cutoff value needs to be

$$w(0.53) = (1 - 0.53^2)^2 \approx 0.52.$$

Using a cutoff value of 0.52, we did in fact find increased success at outlier

detection than with 0.3. We also continued the trend of examining higher cutoff

values and found that outlier detection success continued to increase, with a

markedly smaller increase in the inlier error rate. For instance, on a 10,000 × 5 data set with 8,000 observations from uncorrelated $\chi^2_5$ and 2,000 observations from uncorrelated $\chi^2_{15}$, a cutoff value of 0.3 resulted in an outlier error rate of

7.21% and an inlier error rate of 1.90%. The corresponding figures for 0.5 cutoff

were 3.41% outlier error and 1.63% inlier error, while for 0.8 cutoff they were

1.94% and 1.81%, respectively. Using a higher cutoff such as 0.8 would lead

to overall lower weights being assigned, and more conservative location/scale

estimates, yielding a higher sensitivity to outliers. We surmise that perhaps the

benefits of the same area under the curve are not as important as other aspects

of the algorithm. At this point, we leave the cutoff as a parameter yet to be

ascertained, and return to it in the following chapter.
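The equal-area arithmetic above is easy to verify numerically (a short sketch; the variable names are mine):

```python
# Area under the biweight weighting curve w(u) = (1 - u^2)^2 over [0, 1]:
# the integral is 1 - 2/3 + 1/5 = 8/15, roughly 0.533.
area = 1 - 2 / 3 + 1 / 5

# A 0/1 step function of height 1 on [0, u*] has area u*, so matching
# areas gives u* of about 0.53; the biweight weight at that point is the
# equal-area cutoff quoted in the text.
u_star = round(area, 2)            # 0.53
cutoff = (1 - u_star ** 2) ** 2    # about 0.517, i.e. 0.52 to two decimals
```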

A higher cutoff will reject more data, and it is certainly possible that the

covariance estimate will be too small to give an accurate description of the

inliers. We therefore investigated Winsorization of the covariance matrix. Win-

sorization is a robust technique most often applied to the mean, and related to

the trimmed mean. Given $n$ ordered observations $y_{(1)}, \ldots, y_{(n)}$, the symmetric

64

Winsorized mean is given by

$$\bar{y}_w = \frac{1}{n}\left( g\,y_{(g+1)} + y_{(g+1)} + y_{(g+2)} + \cdots + y_{(n-g)} + g\,y_{(n-g)} \right).$$

Winsorization has the property that it greatly lessens the influence of outliers

by effectively moving them to the boundary of acceptability, but does not reject

them completely. Dixon [18] notes that this procedure is called Winsorization in

honor of Charles P. Winsor, who actively promoted its use. Winsorized variances

are not as commonly used, however.

Our situation is not as straightforward as computing the Winsorized vari-

ance on a set of observations {yi}, but the principles can still be applied. If

an observation Xi is beyond the region of acceptability, it will be given a low

weight Wi regardless of which side of the location estimate it is situated on. An

observation with weight equal to (or very close to) the cutoff point will lie on

the boundary of the region of acceptability; this is the point where we would

like to start Winsorizing. We therefore search for the observation with the

closest possible weight to the cutoff value (but not smaller), and record its value

xw. When computing the variance summation, we effectively replace all ob-

servations with weight less than the cutoff point, with this recorded value xw.

Assume that the weight of xw is the kth order statistic among the sequence of

n weights $\{w_j\}$ for that particular dimension, i.e. the weight of $x_w$ is the $k$th smallest for that dimension. The numerator is thus a sum of squared deviations, $k\,(x_w - \bar{x})^2 + \sum_{i=k+1}^{n} (x_{(i)} - \bar{x})^2$, where $\bar{x}$ is the location estimate for that dimension.
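The per-dimension Winsorized variance just described can be sketched as follows (illustrative Python; the function and argument names are mine, and the divisor is left as a parameter pending the discussion below):

```python
def winsorized_variance(x, weights, cutoff, loc, divisor=None):
    """Replace every observation whose weight falls below the cutoff by
    x_w, the value whose weight is closest to (but not below) the cutoff,
    then form the usual sum of squared deviations from loc."""
    # boundary value: the smallest weight still at or above the cutoff
    eligible = [(w, v) for w, v in zip(weights, x) if w >= cutoff]
    x_w = min(eligible)[1]
    num = 0.0
    for w, v in zip(weights, x):
        d = (x_w if w < cutoff else v) - loc
        num += d * d
    return num / (len(x) if divisor is None else divisor)
```

With a far outlier present, the outlying value is pulled in to the boundary value rather than discarded outright.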

65

Determining a suitable divisor required further investigation. The divisor for

the Winsorized mean is always taken to be n, but for the Winsorized variance,

several authors [83], [85] use $h = n - 2g$, while [37] uses $n(1 - \frac{2g}{n})^2$. [73] utilize

h = n 2g, but also state that using n would result in more conservative

confidence intervals. A larger divisor (such as n) would result in a smaller

covariance matrix, and increased sensitivity to outliers, a desirable property. We

accordingly conducted a small experiment to examine the effect of this divisor

on outlier and inlier error rates; the results are shown in Table 3.4. For these

comparisons, we ran 1,000 simulations on a 10,000 x 5 data set, with inliers

drawn from uncorrelated $\chi^2_5$ and 20% outliers from uncorrelated $\chi^2_{25}$, as well

as the MINWT option. Note that for 20% contamination, a divisor of 0.8n

corresponds to $n - g$. For this particular data situation, the best error rates

can be obtained with a divisor larger than n. Employing a type of argument

commonly used against overfitting, however, we are not sure this would be the best strategy for all data situations: especially for low contamination levels, a

large divisor would create problems and cause too many inliers to be rejected.

A divisor of n seems to be a good compromise; it is reasonably conservative by

not allowing outliers to heavily influence the initial location/scale estimates. We

henceforth use this value during Winsorization.

It is also worth noting that we Winsorize only the diagonal elements of the

covariance matrix, not the off-diagonal elements. Moving outlying observations

an unspecified distance to the boundary of acceptability will clearly affect how

the $j$th coordinate varies with the $k$th coordinate, $\sum_{i} (x_i^{(j)} - \bar{x}^{(j)})(x_i^{(k)} - \bar{x}^{(k)})$. Thus,

66

Table 3.4: Examining the effect of the Winsorization divisor

Divisor Outlier Error % Inlier Error %

0.70n 2.18 1.70
0.75n 2.13 1.73
0.80n 2.05 1.78
0.85n 2.00 1.82
0.90n 1.94 1.86
0.95n 1.88 1.90
1.00n 1.81 1.95
1.05n 1.77 1.99
1.10n 1.73 2.02
1.15n 1.68 2.07
1.20n 1.64 2.11
1.25n 1.60 2.15
1.30n 1.57 2.18
1.35n 1.53 2.22
1.40n 1.49 2.27
1.45n 1.46 2.30
1.50n 1.42 2.34
1.60n 1.36 2.42
1.70n 1.30 2.51
2.00n 1.15 2.73

67

the only values we use in computing the off-diagonal elements are those with

weight equal to one.

Generally, we found that Winsorizing the variance gave better results. We

present some examples of covariance matrices obtained with and without Win-

sorization. Consider a 10,000 × 5 data set drawn from a mixture of uncorrelated $\chi^2_5$ and $\chi^2_{15}$, a situation where observations from the two groups overlap. We

consider contamination levels of 5%, 10% and 20%. Without Winsorization, the

covariance matrix of the observations in the initial subset, using a cutoff of 0.7

with 5% contamination, was estimated as

$$\begin{pmatrix} 5.48 & -0.08 & 0.09 & 0.03 & 0.003 \\ -0.08 & 5.28 & -0.0004 & 0.01 & -0.01 \\ 0.09 & -0.0004 & 5.31 & -0.05 & 0.06 \\ 0.03 & 0.01 & -0.05 & 5.52 & 0.08 \\ 0.003 & -0.01 & 0.05 & 0.08 & 5.21 \end{pmatrix}$$

Recall that for $\chi^2_5$, the theoretical variance is 10. Employing Winsorization and

with the rest of the parameters kept constant, the diagonal elements were

{8.40 8.19 8.23 8.45 7.96}.

(The off-diagonal elements were the same since we only Winsorize the variance

estimates.) For 10% contamination and without Winsorization, the estimated

diagonal elements were

{6.21 6.13 6.24 6.40 5.88},

68

and with Winsorization

{10.23 10.06 10.11 10.45 9.72}.

Finally, for 20% contamination, the diagonal elements without Winsorization

were estimated to be

{9.31 9.13 9.19 9.34 9.23},

and with Winsorization as

{16.21 16.03 16.06 16.42 15.94}.

Winsorization increases the variance because to some extent, it still considers the

outliers. We computed the error rates from the above situations, based on 1,000

simulations and no MINWT, and present the results in Table 3.5. As before, we

utilize the format out.err%/in.err% to denote the outlier and inlier percentage

errors. It is not possible to draw any conclusions from this table. Based on

the actual covariance estimates, however, it seems likely that Winsorization

could be beneficial for low contamination levels, but as the variance estimate is

inflated with increasing percentage of outliers, augmenting the variance through

Winsorization becomes undesirable.

Another variation that we tried was wpct. Instead of a fixed cutoff,

select the observations for the initial subset according to the upper wpct per-

centage of weights, where wpct is determined in advance. Not surprisingly, this

method produced great results if wpct was close to the percentage of actual in-

69

Table 3.5: Examining the effect of Winsorization

Contamination Level Without Winsorization With Winsorization

5% 0.55/4.63 0.75/3.88

10% 0.91/3.77 1.19/3.08

20% 4.55/2.64 4.31/2.00

liers in the data, but faltered if there was a big discrepancy. Sometimes general

characteristics of the type of data being investigated can lead the researcher to

develop an approximate estimate of the contamination level. For instance, many data sets contain less than 5% outliers. Several outlier methods require

the number of outliers to be specified in advance, such as Grubbs' test for k outliers in a specified tail [29], [37], so employing wpct would put our procedure in this class of algorithms. wpct is a good alternative to other algorithms in this class because it does not require the exact number of outliers to be specified in

advance. It will still yield good results as long as the two percentages are fairly

close, and demonstrates a relatively gradual drop in performance with increas-

ing discrepancy. For an unknown percentage of outliers, however, a fixed cutoff

value is to be preferred.
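The wpct selection itself is simple (an illustrative Python sketch; the names are mine):

```python
def wpct_subset(weights, wpct):
    """Return 0/1 indicator weights keeping the observations whose
    biweight weights fall in the upper wpct percent."""
    n = len(weights)
    keep = max(1, int(round(n * wpct / 100.0)))
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    out = [0] * n
    for i in order[:keep]:
        out[i] = 1
    return out
```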

This algorithm is computationally appealing, as will be analyzed in more

detail on page 141. Since each dimension is analyzed separately, computational

burden grows relatively slowly with increasing dimension. This is in marked con-

trast to most other methods, such as the MCD, Kurtosis1 and even MULTOUT

(which depends on the MCD in its initial stage), where increasing p leads to sig-

70

nificant increase in computational time. Fitting the actual biweight function is

not computationally intense, and once the weights have been assigned, calculat-

ing the sample mean and covariance matrix are the fastest possible methods of

obtaining meaningful location and scale estimates. At this stage, estimation of

the sample density is the most time-consuming step of the algorithm, but as we

discuss on page 75, faster methods exist for this step, which lead to considerable

improvement.

3.3 A Summary of Our Proposed Method

Before proceeding to the next chapter, we summarize our proposed method.

1. Calculate the biweight weights for each data point using the algorithm

described on page 48, fitting each component separately.

2. If an observation's weight is less than the cutoff value, reassign that weight

to zero, otherwise reassign it to one. The optimal cutoff value still needs

to be investigated.

3. Option MINWT: Select the minimum weight from each observation's components as the weight for the other p − 1 components too. The effectiveness

of this option still needs to be determined. The order of this step can also

be interchanged with the previous step.

4. Compute the weighted mean and covariance of all the observations, which

amounts to the sample mean and covariance of those observations in the

initial subset. The effect of Winsorizing the variance still needs to be

investigated.

71

5. Calculate a robust Mahalanobis distance for all n observations, using the

mean and covariance matrix computed in the previous step.

6. Calculate the density of these Mahalanobis distances, using the Normal

density as the kernel and window width as described on page 60, and use

first differences to approximate the slope.

7. There will be a peak describing the inliers, where the slope has a large

positive value followed by a large negative value and increases towards

zero. Find the value of Mahalanobis distance where this negative slope

first enters a confidence interval containing zero. A good value for the

half-length of this interval is 0.0001. Let the value of Mahalanobis distance

where the slope is sufficiently close to zero (and where the density is also

less than 0.3 of its maximum value), be called the rejection point.

8. Classify all points as inliers or outliers based on whether their Mahalanobis

distance is smaller or larger than the rejection point.
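Steps 4 and 5 above can be sketched compactly (illustrative Python, not the thesis's Fortran; the names are mine):

```python
import numpy as np

def robust_mahalanobis(X, weights01):
    """Sample mean and covariance over the weight-one subset, then
    Mahalanobis distances for all n observations."""
    X = np.asarray(X, dtype=float)
    keep = np.asarray(weights01).astype(bool)
    mu = X[keep].mean(axis=0)
    cov = np.cov(X[keep], rowvar=False)
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.sqrt(d2)
```

An observation excluded from the initial subset still receives a distance, so a point wrongly down-weighted in step 2 can end up classified as an inlier in step 8.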

72

4. Results from the Proposed Method

4.1 Computational Details

Having developed an algorithm for identifying outliers, it is necessary to

evaluate its performance over a wide range of data types. We also need to ex-

amine its computational feasibility on larger data sets, the intended primary

application. So far we have tested our method in Splus, which is very slow,

and particularly inept at handling large loops. We therefore decided to imple-

ment it as a Fortran program, which could be used to rapidly run thousands

of simulations for diverse data types and different parameter values. Fortran

was selected because of the abundance of existing mathematical and statistical

subroutines, some of which utilized complex numerical techniques and had been

written to be computationally efficient. The specific version used was Fortran

90, a considerable improvement over Fortran 77. I did not make a full study

of all the improvements offered by F90, and there are almost certainly several

of my routines that could be optimized for F90. I am also aware of the devel-

opment of F95, but decided to designate this upgrade as a future project. F95

is an object oriented language and I have had only limited experience with the

object oriented approach.

A very important issue that arises in these types of numerical investigations

is the randomness of the pseudo-random number generator (PRNG). For the

programs presented in this thesis, we utilize the random number generators

73

described in the second volume of Numerical Recipes for Fortran [62, ch. B7],

updated and improved for F90. The period of this random number generator is

2.3 × 10^18, which more than suffices for our applications. The PRNG presented

in the second volume of Numerical Recipes [62] is a faster and more efficient

version of the authors' ran2 routine in the first volume [61]. In the first volume

[61], the authors claim that, within the limits of floating point precision, ran2 provides "perfect" random numbers. As a practical definition of "perfect", they

offered $1,000 to the first reader who could prove that ran2 fails to generate

random numbers, by means of a statistical test or otherwise. Ten years later

when the second volume [62] was published, the authors stated that the award

had not yet been claimed. This PRNG produces sequences of random U(0,1)

numbers, from which appropriate transformations can be made to obtain random

data from other distributions. Some of the transformation routines are also

provided in the first volume of Numerical Recipes [61, ch. 7]. For instance,

to obtain normally distributed random numbers from U(0,1) deviates, we use

the Box-Muller transformation (described in [62], with corresponding Fortran

code). $\chi^2_{df}$ deviates can be easily obtained from Gamma deviates, since the Chi-squared distribution is a special case of the Gamma distribution, but in general,

obtaining random Gamma deviates is a nontrivial problem. To generate these

deviates, we utilize Fortran routines posted at ranlib [63], a division of the Netlib

repository, which are based on a modified rejection technique described in [2].

Based on these algorithms, we therefore assume that the data generated for the

purpose of this thesis are sufficiently random.
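For illustration, the Box-Muller map from U(0,1) pairs to normal deviates, and a chi-squared deviate built from them, can be sketched as follows (for integer df one may sum df squared normals; the thesis instead transforms Gamma deviates, which also covers the general case):

```python
import math, random

def box_muller(u1, u2):
    # two independent U(0,1) deviates -> two independent N(0,1) deviates
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def chi_squared(df, rng=random):
    # for integer df, a chi-squared deviate is a sum of df squared normals
    total = 0.0
    for _ in range(df):
        # 1 - U maps [0, 1) to (0, 1], avoiding log(0)
        z, _unused = box_muller(1.0 - rng.random(), rng.random())
        total += z * z
    return total
```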

74

One aspect of our algorithm that necessitated some change was the density calculation. The method described in the previous chapter utilizes an inefficient,

complete double summation. For each point at which we wish to estimate the

density, we have to compute the influence of all n Mahalanobis distances, even

though many of these distances are too far removed to have a measurable effect on the density at that particular point. In Applied Statistics algorithm AS 176,

Silverman [76] describes and provides Fortran code for a routine based on the

Fast Fourier Transform which is significantly faster than the complete double

summation. Part of his algorithm involves building a histogram of the data with

$2^k$ cells (we use k = 9), and then applying the FFT to both the data and the

kernel K(x). (Recall that we used a Gaussian kernel.) We found that utilizing

this procedure reduced the computation time by approximately a factor of 10.
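The binned FFT idea can be sketched as follows (illustrative Python/NumPy; this mimics the spirit of the binned estimator, not Silverman's published routine):

```python
import numpy as np

def kde_fft(x, h, k=9):
    """Histogram the data into 2^k cells, then convolve the bin counts
    with a grid-sampled Gaussian kernel via the FFT."""
    x = np.asarray(x, dtype=float)
    m = 2 ** k
    lo, hi = x.min() - 4 * h, x.max() + 4 * h      # pad so kernel tails fit
    edges = np.linspace(lo, hi, m + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    counts, _ = np.histogram(x, bins=edges)
    offsets = centers - centers[m // 2]            # signed grid offsets
    kern = np.exp(-0.5 * (offsets / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    # circular convolution of counts with the kernel re-centered at index 0
    conv = np.fft.irfft(np.fft.rfft(counts) * np.fft.rfft(np.fft.ifftshift(kern)), n=m)
    return centers, conv / len(x)
```

The cost drops from O(n · m) for the full double summation to O(n + m log m) for binning plus the transforms.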

Another issue that arose during the practical implementation of our algo-

rithm was finding the desired rejection point, specifically concerning smooth-

ness of the density estimate. The first point where the slope enters the interval

[-0.0001,0.0001] is not always at the end of the main density peak, due to small

numerical inaccuracies in the estimated density. Sometimes the estimated slope

assumed very small negative values within the first few data points. We could

easily solve this problem by requiring the rejection point to be greater than half

the mean Mahalanobis distance. A second problem arose because the density

peak did not always return to zero in a smooth curve; sometimes there were short plateaus, or other deviations from what should be a monotonically decreasing segment. At times, the lengths of these plateaus were such that removing

75

them through a smoothing process would require the smoothing window to be

sufficiently large as to have other undesirable effects on the estimated density,

such as increase in mean integrated square error. An alternative (and faster)

solution would be to constrain the rejection point to regions where the density

was less than a certain fraction of its maximum peak value. After some inves-

tigation, we settled on a value of 0.3 for this fraction. For the majority of data

situations, the algorithms performance was not sensitive to this value of 0.3,

but for high contamination (35%–40%) the density curve returned to zero very

slowly and raising this value to 0.4 did produce slightly better results.

A few other miscellaneous notes regarding our program's computational

specifics are in order. We use double precision in all of our programs to guard

against round-off error. To calculate the median (used for both the biweight

algorithm and BACON), we employ a version of Quicksort which sorts only as

much data as necessary to locate the median; specifically, a subroutine written by

Alan Miller [55]. For the interquartile range (to determine the optimal window

width for density estimation), I used the function selection from Numerical

Recipes [61, p. 334], which calculates the kth smallest value in an array. Finally,

to calculate the inverse of the PDS covariance matrix for Mahalanobis distance

calculation, I used the subroutine dpodi, from the Linpack [49] repository of linear

algebra public domain software, another subdivision of Netlib.
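The partial-sorting idea behind those median and order-statistic routines can be illustrated with a generic quickselect (a sketch; this is not Miller's or the Numerical Recipes code):

```python
import random

def kth_smallest(data, k):
    """Partition-based selection of the kth smallest value (1-indexed),
    ordering only as much of the array as needed."""
    a = list(data)
    lo, hi, target = 0, len(a) - 1, k - 1
    while True:
        if lo == hi:
            return a[lo]
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                      # Hoare-style partition
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if target <= j:
            hi = j                         # answer is in the left part
        elif target >= i:
            lo = i                         # answer is in the right part
        else:
            return a[target]               # pivot region holds the answer
```

The median of n values (n odd) is then kth_smallest(x, (n + 1) // 2), with expected linear rather than O(n log n) cost.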

Naturally, we want to compare our results with the best existing outlier

methods. None of the other algorithms we examined up to this point were in a

form suitable for rapid multiple iterations, so we wrote Fortran code for these

76

algorithms too, in order to run thousands of iterations and effectively compare

performances. BACON has usually produced good results up to this point, so it

was a natural choice for further investigation. The executable version provided

by the author ran on an MS-DOS platform; since it was not possible to view

or modify the source code, I wrote a Fortran program based directly on the

article by Hadi et al. [32], which will henceforth be referred to as BACON* to distinguish it from the authors' original implementation. I again utilized some of the linear algebra routines from Linpack [49], and additionally a collection

of routines from cdflib [10] to calculate the inverse x2 percentiles. The latter

routines were written by Brown and Lovato (MD Anderson Cancer Center),

and based on an algorithm by DiDonato and Morris [17] as well as tables from

Abramowitz and Stegun [1].

Another candidate that has performed well up to this point is Kurtosis1. It

does not appear to be perturbed by difficult data types such as concentrated

outlier clusters or non-Normal data. It is also very appealing in that it represents an innovative class of outlier algorithms, based on improved projection

pursuit methods, and redesigned so as to be significantly more computation-

ally feasible. The original MATLAB code was clearly not suitable for our use

since it restricted the data set to n = 750, while the MATLAB platform was

also very slow and not intended for these types of simulations. I therefore wrote

Fortran code based directly on the MATLAB program, which will henceforth

be referred to as Kurtosis1*. In addition to the routines already mentioned,

Kurtosis1* also necessitated a large number of other linear algebra routines

77

such as Cholesky decomposition, simultaneous solution of systems of equations

(including also special cases for under/over-determined systems of equations),

eigenvalue/eigenvector calculation (for several different matrix types), QR fac-

torization and LQ factorization. I was able to obtain Fortran subroutines for

these procedures from lapack [47] and eispack [21], both special sections of the

Netlib repository. Although the portion of the code that I wrote was only about

1,500 lines, after adding all the required subroutines, the total program length

was nearly 18,000 lines. I am indeed very grateful to the authors of these routines

for making them available in the public domain.

We did not examine the MULTOUT algorithm during the large-scale sim-

ulation stage. Preliminary results seemed to indicate that it did not perform

as well on non-standard data types such as outlier clusters and possibly also in

higher dimensions such as p = 20. Juan and Prieto [44] conducted a simula-

tion experiment using Normal data with small separations, comparing the MCD,

MULTOUT and other methods, in which MULTOUT demonstrated significantly

decreased performance (100% failure rate in detecting outliers) in higher dimen-

sions (p = 10,20).

We have not examined MULTOUT sufficiently to draw definite conclusions,

but we do note that decreased performance in higher dimensions could be a

result of utilizing the MCD for the initial stage. The MCD does not handle

increased dimension very well; subdividing the data into smaller cells reduces n

but not p. Though it did not influence our decision, we also note that the C

implementation of MULTOUT took very long to execute, even longer than our

78

Fortran version of Kurtosis1* (which took a very long time, as will be discussed

later). The other algorithm we previously examined, the MCD, is for reasons

already discussed not a viable option for use in large data sets, and also fared the

worst of the 4 methods we examined. Thus we decided to compare our method

to only BACON* and Kurtosis1*.

The choice of data is also important. Most outlier algorithms in the liter-

ature are tested on Normally distributed data, even though there is no reason

to believe that the majority of applications and real world data sets satisfy this

assumption. When these algorithms are tested on real data sets, these sets are

typically of small n and p. We felt that the $\chi^2$ distribution offered a good alternative to the normal, due to its asymmetry and longer tails. For instance, when mixing $\chi^2_5$ with $\chi^2_{15}$ or $\chi^2_{10}$, there is significant overlap between the two groups and the algorithms could be tested under extreme circumstances.

4.2 Choice of Parameters for our Method

With the limited investigations performed up to this point, it has not been

possible to determine the best value for the cutoff parameter ctf, or the success

of Winsorization and MINWT. We therefore ran 1,000 simulations with cutoff

values from 0.5 to 0.95, and Winsorization crossed with MINWT. We used a data set of size 10,000 × 5, with 8,000 observations drawn from uncorrelated $\chi^2_5$ and 2,000 from uncorrelated $\chi^2_{15}$. This resulted in some,

but not excessive, outlier overlap, and a moderate contamination level. The re-

sults are presented in Table 4.1, utilizing the format outlier error %/inlier error %

79

Table 4.1: Examining the effect of cutoff, Winsorization and MINWT

Wins. Wins. No Wins. No Wins.

ctf MINWT No MINWT MINWT No MINWT

0.5 3.34/1.79 8.98/2.24 3.04/2.37 11.78/2.78

0.55 2.96/1.76 7.57/2.16 2.49/2.34 9.45/2.85

0.6 2.68/1.75 6.27/2.11 2.13/2.31 7.58/2.74

0.65 2.50/1.74 5.24/2.03 1.86/2.32 6.00/2.66

0.7 2.30/1.77 4.36/1.97 1.64/2.36 4.63/2.60

0.75 2.13/1.80 3.61/1.92 1.48/2.40 3.49/2.55

0.8 1.98/1.84 2.93/1.90 1.31/2.50 2.54/2.57

0.85 1.81/1.95 2.48/1.92 1.16/2.71 1.92/2.69

0.9 1.79/2.13 2.27/2.04 1.05/3.34 1.46/3.13

0.95 1.74/2.58 1.98/2.49 0.76/10.00 0.89/10.12

80

discussed in Table 3.1 on p. 43.
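The simulation data set just described can be generated along the following lines (a sketch with NumPy; variable names are ours, while the sizes and degrees of freedom follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, p = 8000, 2000, 5  # 20% contamination, as in the text

# Inliers: uncorrelated chi-squared with 5 degrees of freedom per coordinate.
inliers = rng.chisquare(df=5, size=(n_in, p))

# Outliers: uncorrelated chi-squared with 15 degrees of freedom per coordinate,
# giving some, but not excessive, overlap with the inliers.
outliers = rng.chisquare(df=15, size=(n_out, p))

data = np.vstack([inliers, outliers])
labels = np.r_[np.zeros(n_in), np.ones(n_out)]  # 1 marks a true outlier
```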

We notice from Table 4.1 that as the cutoff increases, the outlier error rate

decreases monotonically while the inlier error rate follows a quadratic curve,

attaining its minimum at an intermediate cutoff in each column. As the cutoff

increases, the variance estimate shrinks because a greater portion of the outlying

points are assigned a weight of zero and are excluded from the estimate. A

smaller, more accurate variance estimate causes distant points to receive large

robust Mahalanobis distances, with a greater possibility of being labeled as

outliers. Since the outliers do in fact lie farther away, a smaller variance estimate

results in more outliers possessing large robust Mahalanobis distances, greater

separation from the inliers, and an increased likelihood of being correctly labeled

as outliers. Hence the outlier error rate decreases with increasing cutoff.

Investigation of the quadratic trend of the inlier error rates showed that for

lower cutoff values, the off-diagonal elements of the covariance estimate were

considerably larger than zero, e.g. for a cutoff of 0.5, the off-diagonal elements

were on the order of 2.5. This may be due to unintended correlations between

the inliers and outliers. As the cutoff increases, fewer outliers are assigned a

weight of one, and the off-diagonal elements decrease to zero. The improved

estimates of these off-diagonal elements result in lower robust Mahalanobis dis-

tances for the inliers, and thus fewer of them are incorrectly rejected. The error

rate therefore decreases. When these off-diagonal elements are sufficiently close

to zero, however, improving them no longer leads to lower inlier error rates,

which subsequently start increasing. In contrast to the outlier error rate, a


smaller variance estimate leads to higher inlier error rates because some of the

inliers receive large robust Mahalanobis distances and are incorrectly labeled

as outliers. When we artificially force the off-diagonal covariance elements to

zero, the inlier error rate increases monotonically with the cutoff. These two

opposite trends highlight the difficulty in obtaining a uniformly best cutoff value.
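The mechanism under discussion, biweight weights in each dimension, hard-thresholded at the cutoff ctf, followed by mean and covariance estimation over the retained points and robust Mahalanobis distances for all observations, can be sketched as follows. This is only an illustration: the median/MAD starting estimates, the tuning constant c, and taking the minimum weight across dimensions are our assumptions, not specifications from the text.

```python
import numpy as np

def hard_weight_mahalanobis(X, ctf=0.8, c=6.0):
    """Sketch: per-dimension Tukey biweight weights, thresholded at the
    cutoff ctf; then mean/covariance over the retained points and robust
    Mahalanobis distances for every observation."""
    # Initial robust location and scale per dimension (our assumption).
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) / 0.6745
    u = (X - med) / (c * mad)

    # Tukey's biweight: (1 - u^2)^2 for |u| < 1, zero otherwise.
    w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)

    # Hard weights: one where the combined weight exceeds the cutoff,
    # zero below (combining by the minimum across dimensions is our choice).
    keep = w.min(axis=1) > ctf

    mean = X[keep].mean(axis=0)
    cov = np.cov(X[keep], rowvar=False)
    diff = X - mean
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.sqrt(d2), keep
```

On a contaminated sample, the points drawn from the outlying distribution should on average receive larger robust distances than the inliers, which is exactly the separation the cutoff discussion above is trying to maximize.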

It appears that the combination of No Winsorization and No MINWT does

not produce results as good as the other three combinations, so we do not

consider it further. A possible explanation is that without any kind of adjust-

ment, 20% outliers is a sufficiently heavy contamination level to adversely affect

the covariance estimate, leading to an inadequate separation between inlier and

outlier Mahalanobis distances. We also selected two cutoff values to investigate

at each of the remaining three factor combinations, placing somewhat more

weight on lower outlier error while still maintaining reasonable inlier errors.

For Winsorization and No MINWT, we investigate cutoff values of 0.85 and

0.9, while for MINWT, at both levels of Winsorization, we investigate cutoff

values of 0.8 and 0.85.

We examine dimensions p = 5, 10, 20 and contamination levels ε = 5%, 10%,

20%, 30%, 35%, 40%. The level ε = 35% was included because the algorithm

sometimes demonstrated a marked increase in error rate between ε = 30% and

ε = 40%; adding this value showed that the sudden increase usually occurred

between ε = 35% and ε = 40%. We are not aware of significant outlier research

or testing on non-normal data, so we investigate a mixture of χ² distributions

as an example of highly skewed data. It is still important, however, to verify

the performance of our


method on normally distributed data, so we also investigate two types of shift

outliers, the most difficult type of Gaussian outliers (Rocke and Woodruff,

1996). We are also not aware of significant outlier research on correlated data,

so the third category of outlier distributions we examine is a mixture of corre-

lated multivariate normals. The multivariate normal distribution was chosen as

an example of correlated data because it allows results to be compared directly

with the uncorrelated shift outliers.
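The two Gaussian outlier configurations just described could be generated along these lines. This is an illustrative sketch only: the shift d, the equicorrelation value rho, and the choice to put the correlation in the outlier population are our assumptions, not the thesis's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, p = 900, 100, 5

inliers = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n_in)

# Uncorrelated shift outliers: identity covariance, mean shifted by d
# in every coordinate (d = 2.0 is an arbitrary illustrative value).
d = 2.0
shift_out = rng.multivariate_normal(np.full(p, d), np.eye(p), size=n_out)

# Correlated multivariate normal outliers: equicorrelated covariance
# (rho = 0.5 is our choice), with the same mean shift.
rho = 0.5
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
corr_out = rng.multivariate_normal(np.full(p, d), cov, size=n_out)

shift_data = np.vstack([inliers, shift_out])
corr_data = np.vstack([inliers, corr_out])
```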

The computational cost of running these simulations is reasonable, so we

follow a complete factorial experimental design [9]. We ran 1,000 simulations

for our proposed method, as well as for BACON*, but only 10 iterations for

Kurtosis1*, which still took roughly 3-5 times longer than 1,000 iterations with

our method. We will conduct a more complete investigation of computational

time in Section 4.7, but at this stage it is important to note that the reported

results for Kurtosis1* do not have the same accuracy as those for the other

methods.

A simple geometric argument reveals that for a given inlier distribution,

if the same outlier distribution is used for p = 5, p = 10 and p = 20, the

average distance to the outliers will increase with dimension, and an

algorithm's success rate will increase with dimension, provided it is capable of

analyzing large dimensions (the MVE and MCD are examples of algorithms

that do not satisfy the latter condition). For instance, the one-dimensional

distance from the origin to a point 5 units away is clearly 5, while the two-

dimensional distance from the origin to (5, 5) is √(5² + 5²) ≈ 7.07. To compare

the effect of dimension on outlier detection success, we therefore need to scale the


distance to the outlier mean so that it remains approximately constant for all

dimensions. The Euclidean distance could be used to calculate the appropriate

scale factors for uncorrelated data with the same variance in each dimension,

but for a general data set, the Mahalanobis distance between two populations

is a more appropriate metric [75, p. 36] that takes into account not only the

variance of the data, but also the correlations between components. For a given

separation between the inlier and outlier distribution means, which we will call

δ, the expected value of the squared Mahalanobis distance needs to remain

constant regardless of dimension:

E[(μ_in − μ_out)^T Σ^{-1} (μ_in − μ_out)] = δ²

where μ_in is the mean of the inlier distribution, μ_out the mean of the outlier

distribution, and Σ is a convenient scale matrix [51, p. 31], which we take

to be the covariance matrix of the inlier distribution. Setting (d, ..., d)^T =

E[μ_in − μ_out], i.e., d is the one-dimensional difference between μ_in and μ_out,

we can evaluate the expected value of the squared Mahalanobis distance as

E[(μ_in − μ_out)^T Σ^{-1} (μ_in − μ_out)] = d² ∑_i ∑_j c^{ij},

where the c^{ij} are the elements of the inverse covariance matrix Σ^{-1}. For

a given value of δ², we therefore determine μ_out such that d = δ / √(∑_i ∑_j c^{ij}).

For the simulations in the following sections, we will examine several values of δ²

and, among other things, note how rapidly success at outlier detection improves with
