
Citation 
 Permanent Link:
 http://digital.auraria.edu/AA00007284/00001
Material Information
 Title:
 New spatiotemporal methods for prospective surveillance of disease outbreaks
 Creator:
 Dassanayake, Sesha Kalinda
 Place of Publication:
 Denver, CO
 Publisher:
 University of Colorado Denver
 Publication Date:
 2016
 Language:
 English
Thesis/Dissertation Information
 Degree:
 Doctorate (Doctor of Philosophy)
 Degree Grantor:
 University of Colorado Denver
 Degree Divisions:
 Department of Mathematical and Statistical Sciences, CU Denver
 Degree Disciplines:
 Applied mathematics
 Committee Chair:
 Hendricks, Audrey
 Committee Members:
 French, Joshua P.
Santorico, Stephanie; Cobb, Loren; Juarez-Colunga, Elizabeth
Notes
 Abstract:
 Three new procedures are presented for detecting disease outbreaks using disease counts in multiple geographic regions. Popular disease surveillance methods are extended and combined with recent methodological advances from other disciplines to design powerful but straightforward testing procedures. Two popular methods for identifying temporal clusters in statistical process control (SPC), the Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA) procedures, are extended for detecting space-time clusters using accumulated disease counts from neighboring regions. In contrast to controlling the conventional average run length error criterion employed in SPC, the false discovery rate is used, which is more suitable in a disease surveillance setting. Instead of using the traditional control limits of SPC methods, p-values are used for making statistical decisions. The use of the false discovery rate and p-values enables the utilization of popular multiple comparison methods, improving the power of our testing procedures. The gain in power is verified using numerous simulation studies. The rapid detection ability of the methods is illustrated with the 2011 Salmonella Newport case study from Germany.
Record Information
 Source Institution:
 University of Colorado Denver
 Holding Location:
 Auraria Library
 Rights Management:
 Copyright Sesha Kalinda Dassanayake. Permission granted to University of Colorado Denver to digitize and display this item for nonprofit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

NEW SPATIOTEMPORAL METHODS FOR
PROSPECTIVE SURVEILLANCE OF DISEASE OUTBREAKS
by
SESHA KALINDA DASSANAYAKE
B.A., Wabash College, 1998
M.S., University of Minnesota, 2001
M.S., University of Colorado Denver, 2012
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy Applied Mathematics
2016
© 2016
SESHA KALINDA DASSANAYAKE ALL RIGHTS RESERVED
ii
This thesis for the Doctor of Philosophy degree by
Sesha Kalinda Dassanayake has been approved for the Applied Mathematics Program by
Audrey Hendricks, Chair
Joshua French, Advisor
Stephanie Santorico
Loren Cobb
Elizabeth Juarez-Colunga
iii
July 30, 2016
Dassanayake, Sesha Kalinda (Ph.D., Applied Mathematics)
New Spatiotemporal Methods for Prospective Surveillance of Disease Outbreaks Thesis directed by Assistant Professor Joshua French
ABSTRACT
Three new procedures are presented for detecting disease outbreaks using disease counts in multiple geographic regions. Popular disease surveillance methods are extended and combined with recent methodological advances from other disciplines to design powerful but straightforward testing procedures. Two popular methods for identifying temporal clusters in statistical process control (SPC), the Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA) procedures, are extended for detecting space-time clusters using accumulated disease counts from neighboring regions. In contrast to controlling the conventional average run length error criterion employed in SPC, the false discovery rate is used, which is more suitable in a disease surveillance setting. Instead of using the traditional control limits of SPC methods, p-values are used for making statistical decisions. The use of the false discovery rate and p-values enables the utilization of popular multiple comparison methods, improving the power of our testing procedures. The gain in power is verified using numerous simulation studies. The rapid detection ability of the methods is illustrated with the 2011 Salmonella Newport case study from Germany.
The form and content of this abstract are approved. I recommend its publication.
Approved: Joshua French
iv
To my parents, Daisy and Chandrasena Dassanayake
TABLE OF CONTENTS
CHAPTER
I. INTRODUCTION...............................................................1
II. AN IMPROVED CUSUM-BASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE
REGIONS....................................................................6
III. AN IMPROVED EWMA-BASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE
REGIONS...................................................................39
IV. AN EWMA AND MORAN'S I COMBINED APPROACH FOR PROSPECTIVE
DETECTION OF SPATIOTEMPORAL CLUSTERS IN MULTIPLE GEOGRAPHIC REGIONS...................................................................78
V. CONCLUSIONS...............................................................106
REFERENCES
APPENDIX
A. Details of the Storey-Tibshirani Procedure and the Estimation of the Proportion of True Null
Hypotheses π0............................................................113
B. The Bootstrap Procedure Used in Chapters II, III, and IV.................115
C. Additional CED and FPR Plots for Pooled EWMA and CUSUM Models............116
D. CED and FPR Plots for Pooled EWMA and Spatially-adjusted EWMA Models.....128
E. CED and FPR Plots for Pooled EWMA and Spatially-adjusted EWMA Models.....135
F. Observed FDR as a Function of Length of Outbreak for the Pooled EWMA and
SEWMA....................................................................137
G. Implementing the Storey-Tibshirani Procedure Using the 'fdrtool' Package.138
vi
CHAPTER I
INTRODUCTION
Emerging disease outbreaks must be detected as early as possible for public health authorities to take necessary preventive measures. Most surveillance systems operating in the United States and Europe record both the time and the location of newly found cases [1]. The data collected by these surveillance systems can be analyzed using either a retrospective analysis or a prospective analysis. In a retrospective study, the entire data set is available, and the analysis is carried out over the entire data set once. A common example is classical hypothesis testing. In a prospective analysis, however, disease counts are collected sequentially over time, and repeated analysis is performed on the data collected so far to decide whether an increase in disease incidence has occurred. This is a more complicated problem than retrospective analysis, and techniques such as traditional hypothesis testing cannot be applied. Furthermore, in prospective surveillance, quick detection is critical to allow adequate time for public health interventions.
Unkel et al. [2] broadly classify prospective disease surveillance methods as time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods. During the past decade, SPC methods that originated in industrial process control have been successfully utilized for disease surveillance purposes. The Cumulative Sum method (CUSUM) is a popular SPC method that is utilized in BioSense, developed by the Centers for Disease Control and Prevention (CDC), and in ESSENCE (Electronic Surveillance System for the Early Notification of Community-based Epidemics), developed by the Department of Defense and the Johns Hopkins University [3]. Another popular SPC method, the Exponentially Weighted Moving Average method (EWMA), is used in the
1
Early Aberration Reporting System (EARS) developed by the Centers for Disease Control and Prevention (CDC) and in ESSENCE [4].
The CUSUM method adds the difference between an observed value and a reference value to the CUSUM statistic from the previous time step. The new CUSUM statistic at a given time interval is taken to be the maximum of zero and this cumulated value. An alarm is signaled when the CUSUM statistic exceeds a predefined control limit. When the process is in control (i.e., when there is no outbreak), we do not want to sound an alarm. However, when the process is out of control (i.e., when there is an outbreak), we want to signal an alarm. The original CUSUM method was proposed in an industrial process control setting by Page [5] for normally distributed data; Lucas [6] extended the CUSUM method to count data and introduced the Poisson CUSUM method.
The EWMA chart is also a popular SPC method that is easy to implement and interpret. The EWMA is a moving average in which weights decrease geometrically, with more recent observations receiving more weight and distant observations receiving less. The starting value is often set to the target value to be monitored; in a disease surveillance setting, the target value is the expected disease count when there is no outbreak. Similar to the CUSUM chart, an alarm is signaled when the EWMA statistic exceeds a predefined threshold. The original EWMA chart was proposed for normally distributed data by Roberts [7] and was extended to Poisson counts by Borror [8].
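For illustration only, the EWMA recursion just described can be sketched in Python. The smoothing weight of 0.2 and the names `weight` and `lam0` are illustrative assumptions, not values taken from the text; they are not part of the dissertation's methodology.

```python
# Illustrative sketch of the EWMA recursion: weights decrease geometrically,
# so recent counts receive more weight. The weight 0.2 is an assumed value.
def ewma(counts, lam0, weight=0.2):
    """Return the EWMA statistic at each time step, started at the
    in-control target lam0 (the expected no-outbreak disease count)."""
    z = lam0  # starting value set to the target, as described above
    stats = []
    for y in counts:
        z = weight * y + (1 - weight) * z  # geometric down-weighting of the past
        stats.append(z)
    return stats
```

With a target of 3 cases per period, `ewma([2, 3], 3.0)` yields 2.8 and then 2.84; an alarm would be signaled once a value exceeds a predefined threshold.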
Both the original CUSUM and EWMA charts are purely temporal. Raubertas [9] proposed a method to extend the purely temporal CUSUM chart to a spatiotemporal setting by pooling counts of spatially contiguous regions to form regional neighborhoods. This way, spatial information about what is happening in the nearby regions is
2
captured in the resulting pooled CUSUM statistic. As before, an alarm is signaled whenever the pooled CUSUM statistic exceeds a predefined control limit. Subsequently, Rogerson and Yamada [10] extended the purely temporal CUSUM chart in two ways: first, they allowed the expected counts to vary over time; second, they suggested forming regional neighborhoods along the lines proposed by Raubertas [9]. The multiple testing problem that arises when testing multiple regional statistics was handled by controlling the familywise error rate (FWER) using a Bonferroni correction. In conventional multiple CUSUM charts, where a separate CUSUM chart is used to monitor each attribute, similar Bonferroni methods are used to control the familywise error rate.
FWER-based procedures aim to control the probability of even one false positive. In conventional SPC charts, the false alarm rate is defined using the in-control average run length, ARL0, which is the expected time until a false alarm. However, when testing a large number of hypotheses simultaneously, controlling the FWER makes it very difficult to reject the null hypothesis, meaning the power of the tests is very low. A better alternative is to control the false discovery rate (FDR), which is the expected proportion of false discoveries. FDR procedures have more statistical power at the expense of increased false alarms. Benjamini and Hochberg [11] proposed a method to control the FDR when the test statistics are independent. Subsequently, Benjamini and Yekutieli [12] extended the method to dependent statistics. FDR-controlling methods have lately been adopted for handling the multiple testing problem in multiple process control charts [13].
In conventional SPC charts, an alarm is signaled when the charting statistic exceeds a predefined control limit or threshold. Li et al. [14] propose a general framework for SPC charts in which p-values are used instead of critical values to signal alarms.
3
When the in-control distribution (i.e., the distribution of the charting statistic when there is no outbreak) is known, Li et al. [14] suggest using Monte Carlo simulations to simulate the in-control distribution and estimate the p-values. In the most general case, when the in-control distribution is unknown, Li et al. [14] recommend using bootstrap methods to draw bootstrap samples from the in-control (no-outbreak) period to estimate the p-values.
Li and Tsung [15] proposed a method for controlling the FDR for multiple CUSUM charts in an industrial process control setting. At each time step, a CUSUM statistic is computed for each attribute. For each CUSUM statistic, a p-value is calculated using a random walk based approximation. When the statistics are independent, Li and Tsung [15] recommend using the Benjamini-Hochberg [11] procedure; when the statistics are dependent, they suggest using the Benjamini-Yekutieli [12] procedure.
The three proposed methods described in the rest of this dissertation combine several of these approaches in novel ways, incorporating new methods to capture spatial information and adopting innovative approaches to add more statistical power. In two of the methods, spatial information in the nearby regions is captured by pooling counts to form regional neighborhoods, as suggested by Raubertas [9]. The third method captures spatial information by using a popular spatial index to summarize the local spatial association in a given locality [16]. In addition, p-values are used to signal alarms, instead of the conventional control limits, as suggested by Li et al. [14]. Furthermore, FDR error control is used to address the multiple testing problem that arises when comparing multiple control chart statistics, instead of the traditional FWER-based procedures. Moreover, the power of the procedures is further increased using the more recent Storey-Tibshirani method [17] for multiple testing, in addition to the commonly used Benjamini-Yekutieli [12] procedure.
4
The proposed three methods make use of a common underlying algorithm. The first step of the algorithm is to set a tolerable overall false discovery rate. Next, at each time step, a control chart statistic is calculated for each regional neighborhood, either by pooling neighborhood counts or by using a spatial index to model the spatial similarity in the neighborhood. Using past observations, an in-control period with no outbreaks is identified. Bootstrap samples are drawn from this in-control period, and control chart statistics are computed to estimate the in-control distribution for each region. At each time step, p-values are calculated by comparing the control chart statistic at that time point with the sampling distribution of the same statistic for bootstrap samples from the no-outbreak time period. The multiple testing problem that arises when performing multiple statistical tests simultaneously is handled using FDR-based procedures.
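One time step of this common algorithm can be sketched in Python. This is a minimal illustration, not the dissertation's implementation: the null distributions are assumed to have been obtained already (by bootstrapping the in-control period), and the step-up rule shown is the independent-statistics version for simplicity; the function name `surveillance_step` is hypothetical.

```python
# One surveillance time step: bootstrap-based p-values followed by an
# FDR step-up rule. region_stats[i] is the charting statistic for
# neighborhood i; null_stats[i] is its bootstrap null distribution.
def surveillance_step(region_stats, null_stats, fdr_level=0.05):
    m = len(region_stats)
    # p-value: proportion of bootstrap statistics at least as large as observed
    pvals = [sum(1 for x in null_stats[i] if x >= region_stats[i]) / len(null_stats[i])
             for i in range(m)]
    # step-up FDR rule (independence version, for illustration only)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= fdr_level * rank / m:
            cutoff = rank
    return {order[i] for i in range(cutoff)}  # indices of regions to flag
```

For example, a region whose statistic exceeds everything in its null distribution gets a p-value of zero and is flagged, while a region whose statistic is typical of the no-outbreak period is not.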
This document is structured as follows: Chapter 2 discusses the first method, in which the CUSUM statistic is extended to a spatiotemporal setting using the proposed algorithm, with spatial information captured by pooling neighborhood counts. Chapter 3 describes the second method, in which the EWMA statistic is extended to a spatiotemporal setting using the proposed algorithm, with spatial information again summarized by pooling neighborhood counts. Chapter 4 introduces the third method, in which the EWMA statistic is extended to a spatiotemporal setting using the new algorithm, with the spread of the disease in a locality summarized using a local index of spatial association. Chapter 5 compares the strengths of the three methods and provides directions for future research.
5
CHAPTER II
AN IMPROVED CUSUM-BASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE REGIONS
1 Introduction
Emerging disease clusters must be detected in a timely manner so that necessary remedial action can be taken to prevent the spread of an outbreak. Consequently, prospective disease surveillance has gained prominence as a rapidly growing area of research during the past decade [2]. In prospective surveillance, a decision is made at regular time intervals about the incidence of an outbreak, based on the data available up to that time. This is different from the traditional retrospective analysis, where the entire data set is available for analysis.
Unkel et al. [2] broadly classify statistical methods proposed for prospective surveillance into time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods. Of these, SPC methods have a long history of application in public health surveillance [18]. As the name suggests, many SPC methods originated in industrial process control but have lately been adapted for use in public health surveillance [1]. A key example is the Cumulative Sum (CUSUM) method, which is utilized in BioSense, developed by the Centers for Disease Control and Prevention, and ESSENCE, developed by the Department of Defense and Johns Hopkins University [3].
The CUSUM method cumulates the previous CUSUM statistic and the difference between an observed value and a reference value. The new statistic is the maximum of 0 and the cumulated value. When the process is "in-control" (i.e., no outbreak), we do not want to sound
*This chapter has been accepted for publication in Statistics in Medicine and is included with the permission of the copyright holder.
6
an alarm. When the process is "out-of-control" (i.e., there is an outbreak), we want to sound an alarm. The method was originally developed for normally distributed data by Page [5] and was later extended to Poisson data by Lucas [6].
Raubertas [9] extended the purely temporal Poisson CUSUM method to a spatiotemporal setting. Disease counts from neighboring regions are pooled together to form regional neighborhoods. In pooling counts, a weighted average is calculated using the distance between regions as a weight. A CUSUM statistic is formed for each regional neighborhood, and an alarm is signaled when the CUSUM statistic exceeds the appropriate threshold. Later, Rogerson and Yamada [10] extended Raubertas's [9] method in two ways: first, they extended the method so that the expected counts can vary over time; second, they adopted a method that permits monitoring of regional neighborhood counts for subregions and their surrounding neighbors.
The multiple testing problem that arises when testing multiple regional statistics simultaneously was handled by controlling the familywise error rate (FWER) using the popular Bonferroni correction. This approach is similar to the method used in traditional multiple Poisson CUSUM charts in industrial process control, where the multiple testing problem is likewise handled with a Bonferroni correction [15].
The objective of FWER procedures is to control the probability of even one false discovery, which exerts a stringent control when a large number of tests are performed simultaneously. Benjamini and Hochberg [11] popularized a less stringent error criterion for false discoveries, namely the expected proportion of false discoveries, or the False Discovery Rate (FDR).
Compared to FWER-controlling methods, FDR-controlling methods have more power at the expense of additional false alarms. Benjamini and Hochberg [11] proposed a procedure for controlling the FDR in the context of independent test statistics. The procedure was extended to
7
dependent statistics by Benjamini and Yekutieli [12]. FDR-controlling procedures have recently been adopted to handle multiple testing problems in SPC-based detection of out-of-control processes [13].
Most CUSUM-based methods signal an alarm once the CUSUM statistic exceeds the appropriate threshold or critical value. In contrast, Li et al. [14] proposed a procedure using p-values, instead of critical values, to determine whether an alarm should be signaled. When the in-control distribution is known, Li et al. [14] suggest using Monte Carlo simulations to simulate the in-control distribution and estimate p-values. When the in-control distribution is unknown, bootstrap methods can be utilized to determine the empirical in-control distribution and estimate the p-values.
In the context of industrial process control, Li et al. [15] proposed a new CUSUM-based method that controls the FDR when multiple processes are being monitored. One calculates a CUSUM statistic at each time step for each of the processes. The corresponding p-value for each process at each time step is calculated using a random-walk based approximation. Li et al. [15] provide the option of using the procedure of Benjamini and Hochberg [11] if the multiple statistics are assumed to be independent and the Benjamini-Yekutieli extension [12] if the statistics are assumed to be dependent. The proposed method is a combination of several of these approaches: as proposed by Raubertas [9], the neighborhood disease counts are pooled in calculating the Poisson CUSUM statistic for each regional neighborhood; following the guidelines of Li et al. [14], these CUSUM statistics and bootstrap methods are used to compute p-values from which alarms are signaled, instead of using critical values. Adapting the Li et al. [15] method, FDR procedures are used to handle the multiple testing problem, as opposed to FWER-based techniques. We further improve the power of the procedure by utilizing an
8
FDR-controlling technique proposed by Storey and Tibshirani [17] in addition to the method proposed by Benjamini and Yekutieli [12].
One begins the method by first setting an overall FDR level that the procedure is supposed to control. Then, at each time step, the disease counts of immediate neighbors are pooled together to compute a CUSUM statistic for each neighborhood. Next, Monte Carlo simulations or bootstrap methods can be used to approximate the in-control (no-outbreak) distribution of the neighborhood CUSUM statistics, from which the corresponding p-values can be calculated. As multiple dependent statistics are tested simultaneously, two popular FDR-based methods are utilized to handle the multiple testing problem.
Section 2 provides additional details of the proposed method, along with specifics of the Poisson CUSUM method and details of the popular FDR-based multiple testing procedures used in the algorithm. Section 3 provides results of a simulation study using the proposed methodology, and Section 4 illustrates the results of applying the method to detect a Salmonella Newport outbreak in Germany; these results are compared with those from traditional multiple CUSUM charts. Finally, Section 5 summarizes the strengths of the proposed method and provides some potential future directions for research.
2 Methods
2.1 Standard CUSUM
The CUSUM method was proposed by Page [5] to detect small persistent changes in the mean of a process. The initial formulation of the CUSUM method assumed the responses were uncorrelated and normally distributed. Later, Lucas [6] extended the CUSUM method to uncorrelated count data generated from a Poisson distribution. We describe the CUSUM method in this setting. Let Y1, Y2, ..., Yn be the response values at times t = 1, 2, ..., n. The goal is to
9
sound an alarm when the response mean shifts from the "in-control process" mean of λ0 to the "out-of-control process" mean of λ1. The monitoring statistic used for the Poisson CUSUM method is
Ct = max(0, Ct-1 + Yt - k), (1)
where Yt is the count observed at time t, Ct is the Poisson CUSUM statistic at time t, k is a value chosen to minimize the detection time after a mean shift, and C0 is defined to be 0. The CUSUM statistic is the larger of 0 and the sum of the CUSUM statistic for the previous time step with the difference between the observed value Yt and the reference value k. The value k in equation (1) is chosen to minimize the time to detect a mean change from λ0 to λ1 (where λ1 > λ0) and is determined by the formula

k = (λ1 - λ0) / (ln λ1 - ln λ0). (2)
An alarm is signaled when Ct > h, where h is a threshold or critical value chosen to control the error rate. The threshold h is a function of the parameter k and the desired in-control average run length, ARL0, and is calculated using either Monte Carlo simulations or statistical tables. The ARL0 is defined as the desired average time between two false alarms when the process is in control. Robertson [19] points out that ARL0 "can be difficult to specify." In practice, approximations are used to estimate the value of h for a chosen ARL0 [20], though this remains a key issue in CUSUM methods. Moustakides [21] proved that the CUSUM is the optimal procedure for detecting the mean shift, in the sense that among all procedures with the same ARL0, it has the smallest time until it signals a change once the process shifts to the out-of-control state. The popularity of the Poisson CUSUM method over other SPC methods is perhaps due to this theoretical optimality property.
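Equations (1) and (2) can be sketched in Python as follows. This is an illustrative transcription of the standard Poisson CUSUM, not the author's code; the threshold h is taken as given here, whereas in practice it would be tuned to a desired ARL0.

```python
import math

def cusum_k(lam0, lam1):
    """Reference value k of equation (2) for in-control mean lam0
    and out-of-control mean lam1 (lam1 > lam0)."""
    return (lam1 - lam0) / (math.log(lam1) - math.log(lam0))

def poisson_cusum(counts, lam0, lam1, h):
    """Run the recursion of equation (1); return the statistics C_t
    and the first time t with C_t > h (None if no alarm)."""
    k = cusum_k(lam0, lam1)
    c, stats, alarm = 0.0, [], None
    for t, y in enumerate(counts, start=1):
        c = max(0.0, c + y - k)  # equation (1), with C_0 = 0
        stats.append(c)
        if alarm is None and c > h:
            alarm = t
    return stats, alarm
```

For example, with λ0 = 2 and λ1 = 4, k = 2/ln 2 ≈ 2.89; a run of counts well above k quickly accumulates and crosses the threshold.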
10
2.2 Extending Poisson CUSUM to a Spatial Setting
The original CUSUM method and its immediate extensions are univariate and purely temporal. Several spatial extensions of the CUSUM method have been proposed for spatiotemporal surveillance by Raubertas [9], Rogerson [22], and Rogerson and Yamada [10].
Raubertas [9] pooled the data in neighborhoods using distance-based weights to form regional neighborhoods. Suppose we have m spatial regions where disease counts are observed, and let Ylt be the disease count in region l at time t. Since larger populations are generally expected to have greater counts of disease incidence, an adjustment must be made before computing the value of k, previously described in (2). Let λ0l be the expected number of cases in region l, nl be the at-risk population in region l, and γ0 be a baseline disease rate common to all regions. Then we have λ0l = nl γ0. If we desire to detect a disease outbreak when the disease rate shifts to γ1, then λ1l = nl γ1. Using λ0l and λ1l, the kl required for calculating the CUSUM statistic can then be computed using equation (2). Although λ0l, λ1l, and the corresponding kl can vary with changing population sizes nl, we considered the case of a constant fixed population in our simulation experiments, meaning kl is fixed for the observed time period. The pooled count of unit i at time t is

Y'it = Σ (l = 1 to m) wil Ylt,

where wil is a measure of closeness between regions i and l. Similar pooling defines λ'0i, λ'1i, and k'i. For each regional neighborhood, a pooled CUSUM statistic is calculated, and an alarm is triggered if the statistic exceeds a threshold. Extensions of this method allow time-varying parameters for k and λ [9]. Alternative approaches are given by Purdy et al. [23]. Our proposed
11
methodology maintains the general framework of Raubertas [9], with important changes described below.
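The pooling step above amounts to a weighted sum over regions. A minimal sketch in Python, assuming for illustration a binary adjacency weight matrix (1 for a region and its immediate neighbors, 0 otherwise); any distance-based closeness measure could be substituted for `w`:

```python
# Raubertas-style pooled neighborhood counts: the pooled count of unit i
# is the w-weighted sum of all regional counts at one time step.
def pooled_counts(counts, w):
    """counts[l]: disease count in region l; w[i][l]: closeness weight
    between regions i and l. Returns the pooled count for each region."""
    m = len(counts)
    return [sum(w[i][l] * counts[l] for l in range(m)) for i in range(m)]
```

For three regions arranged in a line (region 1 adjacent to regions 0 and 2), binary weights give pooled counts that mix each region with its neighbors, so an elevated count in one region also raises the pooled statistic nearby.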
We also note that another possible approach for extending the Poisson CUSUM method to a spatial setting is to calculate a univariate CUSUM statistic for each region, and then sound the alarm when the statistic exceeds a threshold adjusted to account for the multiple testing problem (e.g., using the Bonferroni correction). This is a method that is used in traditional multivariate process control [15]. Rogerson and Yamada [10] outline a similar method that can be applied for disease surveillance over multiple geographic regions. According to this method, the threshold h for each regional CUSUM chart is calculated as follows. First, the familywise Type I error rate, α′, is decided upon (for example, a common choice is α′ = 0.05). Rogerson and Yamada [10] model run lengths as having an exponential distribution and desire the following relationship to hold: P(run length
2.3 Modifications for a more Powerful Procedure
Instead of using the traditional critical-value-based testing method in the CUSUM procedure, we utilize p-values, as described by Li et al. [14]. In the first setting considered by Li et al. [14], when the in-control process distribution is known, p-values
12
may be estimated using Monte Carlo simulations. More specifically, their algorithm starts by collecting in-control observations Y1, Y2, ..., Yt at a given time point t. With these observations, the corresponding CUSUM statistics C1, C2, ..., Ct are computed. Next, B data sets from the in-control distribution are simulated, and for each of these data sets the corresponding CUSUM statistics are computed. Using these statistics at each time point t, the empirical distribution of Ct is determined. At each time step, using Ct and the corresponding empirical distribution, the p-value is estimated as the proportion of samples whose associated CUSUM statistic is at least as large as the observed test statistic. When the in-control distribution is unknown, p-values may be estimated using bootstrap methods [14].
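The bootstrap variant of this p-value estimation can be sketched in Python. This is an illustrative sketch only: `stat_fn` stands in for whatever charting statistic is being monitored, and resampling counts with replacement from the in-control period is one simple bootstrap scheme among several possible choices.

```python
import random

def bootstrap_pvalue(observed_stat, in_control_counts, stat_fn, B=1000, seed=0):
    """Estimate a p-value as the proportion of B bootstrap series, drawn
    from the in-control (no-outbreak) counts, whose charting statistic is
    at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(in_control_counts)
    hits = 0
    for _ in range(B):
        resample = [rng.choice(in_control_counts) for _ in range(n)]
        if stat_fn(resample) >= observed_stat:
            hits += 1
    return hits / B
```

With `stat_fn` set to, say, the final CUSUM value of a series, a large observed statistic relative to the bootstrap distribution yields a small p-value and hence an alarm.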
In addition to using p-values in our proposed procedure, we also utilize the FDR for error control. Because we will be monitoring multiple spatial regions, we clearly have a multiple testing problem. Li et al. [15] utilized the FDR to manage this problem in the context of traditional SPC methods using critical values; we follow the same pattern using p-values instead of critical values.
Benjamini and Hochberg [11] popularized the concept of FDR to handle the multiple testing problem. The false discovery rate is defined as the expected proportion of false discoveries among all discoveries:

FDR = E[V/R],

where V is the number of true null hypotheses that are declared significant and R is the total number of hypotheses that are declared significant out of m tests.
13
Numerous procedures have been proposed to control the FDR of multiple tests:
• In the simplest case, it is assumed that the m test statistics are independent. In that case, Benjamini and Hochberg [11] showed that the following procedure controls the FDR at level α. Consider testing hypotheses H1, ..., Hm and let P(1) ≤ P(2) ≤ ... ≤ P(m) be the ordered observed p-values, with H(i) denoting the corresponding hypothesis. Define j = max{i : P(i) ≤ αi/m}, and reject H(1), ..., H(j). If no such i exists, reject no hypotheses. We will refer to this procedure as the Benjamini-Hochberg (BH) procedure.
• Benjamini and Yekutieli [12] proposed a modification of the BH procedure for when the test statistics of the m tests are dependent. Specifically, if the αi/m utilized in the BH procedure is replaced with αi/(m Σ (l = 1 to m) 1/l), then the FDR will still be controlled at level α. We will refer to this as the Benjamini-Yekutieli (BY) procedure. Both the BH and BY procedures were used by Li et al. [15] to control the FDR. A more powerful and up-to-date alternative is the Storey-Tibshirani (ST) procedure, which we describe in more detail below.
• Storey and Tibshirani [17] pointed out that the original FDR method proposed by Benjamini and Hochberg [11] is too conservative and results in a substantial loss of power. To gain more power, Storey and Tibshirani [17] introduced the q-value as an FDR-based measure of significance, in place of the p-value, which is a FWER-based measure of significance. Whereas the p-value of a test measures the minimum false positive rate incurred when calling the test significant, the q-value measures the minimum false discovery rate incurred when calling the test significant. We describe this method in greater detail below.
14
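The BH and BY step-up rules described in the first two bullets above can be sketched together in Python, since BY only rescales the BH thresholds by the harmonic-sum constant. This is an illustrative transcription, not the authors' code.

```python
# Step-up FDR rules: reject the hypotheses with the j smallest p-values,
# where j is the largest rank whose p-value sits under its threshold.
def fdr_reject(pvals, alpha=0.05, dependent=False):
    """Return indices of rejected hypotheses under BH (independent
    statistics) or, if dependent=True, under the BY correction."""
    m = len(pvals)
    # BY replaces alpha with alpha / (sum of 1/l for l = 1..m)
    c = sum(1.0 / l for l in range(1, m + 1)) if dependent else 1.0
    order = sorted(range(m), key=lambda i: pvals[i])
    j_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / (m * c):
            j_max = rank
    return [order[i] for i in range(j_max)]
```

Because the BY constant exceeds 1, the BY rule rejects a subset of what BH rejects on the same p-values, reflecting its more conservative control under dependence.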
An alternative definition of the q-value for a particular set of tests is that it is the expected proportion of false positives incurred when calling a test significant. Using this definition, Storey and Tibshirani [17] derived an expression for the rate of false positives when calling all tests significant whose p-value is less than or equal to some threshold t, where 0 < t < 1. The expected proportion of false positives is approximated, for large m, by dividing the expected number of false positives by the expected total number of significant tests. The denominator is easily estimated using the total number of observed p-values that are less than or equal to t. The numerator, the expected number of false positives, is much trickier and is obtained by multiplying the probability of a true null hypothesis being rejected by the total number of true null hypotheses, m0. In finding the probability of a true null hypothesis being rejected, we use the result that p-values corresponding to true null hypotheses are uniformly distributed. As a result, the probability that a null p-value is less than or equal to t is simply t, so the expected number of false positives is m0t. However, since m0 is unknown, the proportion of true null hypotheses, π0 = m0/m, has to be estimated.
The key step in the method is the approximation of π0 (see Appendix A for the approximation of π0). In estimating π0, Storey and Tibshirani use the fact that p-values from true null hypotheses follow a uniform distribution, while p-values from true alternative hypotheses concentrate near zero. They estimate π0 from the height of the uniformly distributed (flat) portion of the density histogram of the p-values.
Putting all of these components together, and using the estimated proportion of true null hypotheses, π̂0, the q-value for a given p-value pi is computed as

q̂(pi) = min over t ≥ pi of [ π̂0·m·t / #{pj ≤ t; j = 1, ..., m} ],

where taking the minimum over thresholds t ≥ pi ensures the q-values are monotone in the p-values.
Likewise, a q-value can be calculated for each test, and the resulting q-values can be ordered at each time step from smallest to largest. Once a significance threshold α is decided upon, a set of significant tests is identified in which a proportion α of the tests is expected to be false positives.
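As a sketch of the computation just described, the following Python snippet (an illustration, not the authors' code) computes q-values from a vector of p-values; the simple π̂0 estimator with a fixed tuning parameter λ = 0.5 is an assumption made for brevity:

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Estimate the proportion of true nulls from the flat part of the
    p-value histogram: null p-values are uniform, so the fraction of
    p-values above lam, rescaled by 1/(1 - lam), estimates pi0."""
    return min(1.0, float(np.mean(np.asarray(pvals) > lam)) / (1.0 - lam))

def qvalues(pvals):
    """q-value of p_(i): min over t >= p_(i) of pi0*m*t / #{p_j <= t}."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    pi0 = estimate_pi0(p)
    order = np.argsort(p)             # indices of sorted p-values
    q = np.empty(m)
    running_min = 1.0
    for rank in range(m, 0, -1):      # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p[i] / rank)
        q[i] = running_min
    return q
```

For the p-values (0.01, 0.02, 0.5, 0.9) this gives π̂0 = 0.5; tests are then called significant when their q-value falls below the chosen FDR level α.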
It is important to make a distinction between the FDR and an alternative quantity called the pFDR, under which the q-value is technically defined. The FDR can be loosely defined as FDR = E(V/R), while the pFDR is defined as pFDR = E(V/R | R > 0) to avoid V/R being undefined when R = 0 [17]. Just as a p-value is defined as the minimum false positive rate at which the test is called significant, using the pFDR, a q-value can be defined as the minimum pFDR at which the test is called significant [24]. For large m, as when performing surveillance over a large number of small geographic units (such as statewide or nationwide surveillance), pFDR ≈ FDR ≈ E[V]/E[R].
As pointed out by Storey and Tibshirani [17], their procedure can be used for models with dependent statistics when the "weak dependence" criterion is satisfied. According to them, "as a rule of thumb, the more local the dependence is, the more likely it is to meet the weak dependence criterion." The local dependence in the model generated by regional neighborhoods satisfies the weak dependence criterion. (For a formal mathematical definition of "weak dependence", see Remark D in [17].)
2.4 Detailed Description of the Proposed Methodology
First, the desired FDR of all charts, α, is decided. At each time t, disease counts for each of the m regions, Y1t, Y2t, ..., Ymt, are collected. Disease counts of immediate neighbors are pooled to form regional neighborhoods with counts Y'1t, Y'2t, ..., Y'mt. Then, the corresponding CUSUM statistics for the regional neighborhoods, C'1t, C'2t, ..., C'mt, are calculated. For each region, B data sets are simulated based on the assumed known in-control distribution. Using these, for each regional neighborhood, an empirical distribution is determined from which the corresponding p-values p'1t, p'2t, ..., p'mt are calculated. Lastly, an appropriate FDR-controlling procedure is used to decide whether an alarm should be signaled.
Alternatively, bootstrap methods can be used to calculate the p-values. As before, after defining the overall FDR level, at each time point t, disease counts for each of the m regions, Y1t, Y2t, ..., Ymt, are collected. The disease counts of immediate neighbors are then pooled to form regional neighborhood counts Y'1t, Y'2t, ..., Y'mt, and the corresponding CUSUM statistics C'1t, C'2t, ..., C'mt are calculated. However, since the in-control distribution is unknown in this case, bootstrap methods are used to determine the empirical in-control distribution. Fixing the time period over which the counts follow the in-control distribution, B bootstrap samples of regional counts are drawn from the in-control time period with replacement across time, while preserving the spatial relationships between the counts (see Appendix B for details of the bootstrap procedure). Specifically, instead of sampling the counts of individual regions, we sample B time points from the in-control period; the entire set of regional counts at each sampled time point counts as a single set of bootstrapped data. The corresponding CUSUM statistics are computed for the bootstrap samples to determine the empirical in-control distribution. Using this empirical in-control distribution and the CUSUM statistics for the regional neighborhoods C'1t, C'2t, ..., C'mt, the corresponding p-values for each regional neighborhood at time t, p'1t, p'2t, ..., p'mt, are calculated. Finally, since multiple p-values are compared simultaneously, an FDR-controlling procedure is used to determine the alarms, as before.
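The bootstrap step can be sketched in Python as follows (an illustrative implementation, not the authors' code; the reference value k and the array shapes are assumptions for the example). Whole time slices of the in-control data are resampled so that the spatial relationships among regional counts are preserved:

```python
import numpy as np

def poisson_cusum_final(series, k):
    """Final value of the one-sided CUSUM recursion S_t = max(0, S_{t-1} + y_t - k)."""
    s = 0.0
    for y in series:
        s = max(0.0, s + y - k)
    return s

def bootstrap_pvalues(in_control, observed_stats, k, B=1000, seed=0):
    """in_control: (T, m) counts for m regional neighborhoods during the
    in-control period; observed_stats: length-m current CUSUM values.
    Whole rows (time slices) are resampled with replacement, so the
    spatial relationships between regional counts are preserved."""
    rng = np.random.default_rng(seed)
    T, m = in_control.shape
    null_stats = np.empty((B, m))
    for b in range(B):
        rows = rng.integers(0, T, size=T)   # resample T time points
        boot = in_control[rows, :]          # one bootstrapped data set
        for j in range(m):
            null_stats[b, j] = poisson_cusum_final(boot[:, j], k)
    # empirical p-value: fraction of null statistics at least as extreme
    return (null_stats >= observed_stats).mean(axis=0)
```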
We will compare several FDR-controlling procedures in Section 3. The BH procedure will be used as a baseline. Next, the BY procedure will be used to see how accounting for dependence between test statistics improves performance. Lastly, the ST procedure will be used to highlight the additional gain in power.
2.5 Contrast with Existing Methods
The proposed method is better suited for disease surveillance than conventional methods for several reasons: (1) the use of p-values in hypothesis testing is typically preferred over critical values, (2) the use of the FDR for error control is better suited to a public health setting than the standard ARL0 used in SPC, and (3) the multiple testing problem can be easily handled using a variety of off-the-shelf procedures. These points are elaborated upon below.
Conventional SPC methods were designed using thresholds or critical values. The use of p-values has many advantages over the conventional approach:
1. P-values are used across most disciplines that use statistics, whereas critical values are used only in certain contexts.
2. When a signal of a distributional shift is delivered, p-values can be used to quantify the strength of evidence for an outbreak, ranging from no evidence through weak, moderate, and strong evidence. With the conventional critical value approach, it is difficult to quantify the strength of evidence because the statistics are not on a standard scale from one setting to another.
The FDR is more appropriate than the ARL0 for handling the multiple testing problem in surveillance scenarios. The concept of the ARL0 originated in industrial process control. When the process is in operation, an alarm is signaled when the CUSUM statistic crosses a predefined threshold, and the process is stopped. Sometimes, even when the process is in control, the CUSUM will signal an alarm; this is a false alarm and is analogous to a Type I error. The ARL0 is defined as the expected time until a false alarm.
The appropriateness of using the FDR in a disease surveillance setting, rather than traditional ARL0-based methods, is elaborated upon below:
1. The FDR in a process-monitoring scheme may be regarded as the expected proportion of out-of-control signals that turn out to be false alarms. The false alarm rate, on the other hand, is the probability of concluding that an in-control process is out of control. The FDR therefore has a more meaningful interpretation in the surveillance setting: if the FDR control level is 0.01, we expect 1 out of 100 alarms to be false, whereas a false alarm rate of 0.01 simply means that we would sound an alarm for an in-control process 1 out of 100 times. We are more interested in controlling the error rate for out-of-control processes than for in-control processes.
2. The FDR is directly related to the predictive value positive (PVP), one of the seven attributes used by the CDC for evaluating surveillance systems, since FDR = 1 − PVP.
3. The ARL0 is meaningful in an SPC context but does not make sense in a health context. A run length is defined as the number of observations from the starting point up to the point where the statistic crosses a predefined threshold. When such a crossing occurs, an industrial process can be stopped and restarted. In contrast, an emerging disease outbreak cannot be stopped, so the concept of a run length is less appropriate in a health context.
Lastly, using the FDR provides a host of up-to-date methods for handling the multiple testing problem, such as the Benjamini-Hochberg [11], Benjamini-Yekutieli [12], and Storey-Tibshirani [17] procedures. These methods can achieve more power than techniques controlling the FWER, such as the Bonferroni correction.
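For concreteness, the BH and BY decision rules can be sketched in Python as follows (a minimal illustration, not tied to any particular package):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg: find the largest i with p_(i) <= alpha*i/m and
    reject the i smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = int(np.max(np.nonzero(below)[0])) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

def by_reject(pvals, alpha=0.05):
    """Benjamini-Yekutieli: BH with alpha shrunk by c(m) = sum_{j=1}^m 1/j,
    which guarantees FDR control under arbitrary dependence."""
    m = len(pvals)
    c_m = float(np.sum(1.0 / np.arange(1, m + 1)))
    return bh_reject(pvals, alpha / c_m)
```

For the p-values (0.001, 0.013, 0.04, 0.5) at α = 0.05, BH rejects the two smallest, while the more conservative BY rejects only the smallest.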
3 Simulation Experiment
3.1 Experiment Setup
Our simulation experiment consists of a study area comprising 25 regions on a five-by-five grid. The time period considered was 100 time points, or "days". No outbreak occurred during the first fifty days (days 1-50); disease counts for each of the 25 regions in the first half of the simulation were generated as independent Poisson random variables with a constant mean of λ0i = 4, i = 1, ..., 25. For the second half of the simulation (days 51-100), an outbreak was simulated in which the Poisson means peak in the center of the grid and taper towards the edges. The out-of-control means are denoted by λ1i, i = 1, ..., 25.
Specifically, as illustrated in Figure 1, the four corner regions (1, 5, 21, 25) receive mean shifts of 0.2 standard deviations from the original mean, to 4.4. Regions 2, 3, 4, 6, 10, 11, 15, 16, 20, 22, 23, and 24 receive 0.3 standard deviation mean increases, to 4.6. The middle regions 7, 8, 9, 12, 14, 17, 18, and 19 receive 0.75 standard deviation mean increases, to 5.5. Finally, the mean of the center region, 13, increases by 1 standard deviation, to 6. The out-of-control process means are shown in Figure 1.
For this simulation, each of the 25 CUSUM statistics was designed with an in-control Poisson mean λ0 = 4 and an out-of-control Poisson mean λ1 = 6. In other words, since the in-control mean is 4 (standard deviation 2), the CUSUM statistic is built to detect an increase of one standard deviation. Using the first method of Li et al. [14], ten thousand Monte Carlo simulations were used to determine the empirical in-control distribution. The desired level of the FDR was set to α = 0.05.
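The data-generating step of this setup can be sketched in Python (a minimal illustration; the seed and the function name are ours):

```python
import numpy as np

# Out-of-control means for the 5x5 grid (regions numbered row by row),
# matching Figure 1: 4.4 in the corners, 6.0 in the center.
LAMBDA1 = np.array([[4.4, 4.6, 4.6, 4.6, 4.4],
                    [4.6, 5.5, 5.5, 5.5, 4.6],
                    [4.6, 5.5, 6.0, 5.5, 4.6],
                    [4.6, 5.5, 5.5, 5.5, 4.6],
                    [4.4, 4.6, 4.6, 4.6, 4.4]]).ravel()

def simulate_counts(T=100, change=50, lam0=4.0, seed=1):
    """Independent Poisson counts: mean lam0 before the change point,
    the Figure 1 means afterwards. Returns a (T, 25) array."""
    rng = np.random.default_rng(seed)
    counts = np.empty((T, 25), dtype=np.int64)
    counts[:change] = rng.poisson(lam0, size=(change, 25))
    counts[change:] = rng.poisson(LAMBDA1, size=(T - change, 25))
    return counts
```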
  1     2     3     4     5
 4.4   4.6   4.6   4.6   4.4
  6     7     8     9    10
 4.6   5.5   5.5   5.5   4.6
 11    12    13    14    15
 4.6   5.5   6.0   5.5   4.6
 16    17    18    19    20
 4.6   5.5   5.5   5.5   4.6
 21    22    23    24    25
 4.4   4.6   4.6   4.6   4.4
Figure 1. The out-of-control mean disease counts (λ1i, i = 1, 2, ..., 25) of the 25 regions for days 51-100 of the simulation experiment are indicated in the center of the regions. Region numbers are indicated at the top of each region in smaller font.
3.2 Simulation Results
Figures 2 and 3 show the results for a single simulation for regions 1 and 13, respectively. Region 1 has the smallest shift in mean with a 0.2 standard deviation increase while region 13 has a 1 standard deviation increase.
Figure 2 summarizes the results of a single simulation over the 100-day period for region 1, where the process mean was 4 for times 1-50 and 4.4 for times 51-100. Figure 2(a) shows the disease counts for region 1 over the 100-day period. The blue line shows the observed counts in region 1, whereas the red line shows the pooled counts: pooled counts were calculated by totaling the counts of the region of interest, in this case region 1, and its immediate neighbors that share a common boundary (regions 2, 6, and 7). In other words, the region of interest and each of its immediate neighbors get a weight of 1. Figure 2(b) shows the CUSUM statistic for the observed (blue) and pooled (red) counts; note that the CUSUM statistic for the observed counts does not show any large increase in the presence of the outbreak (days 51-100), except for a relatively small spike at day 80. Recall that region 1 gets the smallest increase, a 0.2 standard deviation shift in the mean. The CUSUM statistic using the observed regional counts is designed to respond to a much larger increase of a one standard deviation shift and is not sensitive to this relatively small change. However, the CUSUM statistic for the pooled regional neighborhood counts captures information from neighboring counts as well, and indicates a substantial rise after day 51. In Figure 2(c), the blue line shows the difference (α_BH − p-value) for the independent model, where α_BH is the p-value threshold using the BH procedure; similarly, the red line shows the difference (α_BY − p-value) for the pooled model, where α_BY is the p-value threshold using the BY procedure; finally, the orange line shows the difference (0.05 − q-value) for the pooled model, where 0.05 is the overall FDR using the ST procedure. All three plots are displayed together for ease of comparison. Note that an alarm is signaled at the time at which the difference exceeds 0, shown by the grey horizontal line. The last panel, Figure 2(d), displays these alarms for the three multiple comparison procedures previously discussed: BH, BY, and ST. The BH procedure (applied to the CUSUM statistic from the observed counts) sounds only one alarm, on day 80. The BY procedure (applied to the CUSUM statistic for the pooled counts) starts to sound alarms after day 57, while the ST procedure starts indicating regular alarms much earlier (beginning at day 51), and always sounds an alarm when the BH procedure sounds an alarm.
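The pooling scheme, in which each region and its boundary-sharing neighbors all receive a weight of 1, can be sketched in Python (an illustrative helper, not the authors' code); note that under this adjacency, region 1 of the 5x5 grid is pooled with regions 2, 6, and 7:

```python
import numpy as np

def grid_weights(n=5):
    """Binary connectivity matrix for an n x n grid: w[i, j] = 1 when
    regions i and j share a boundary point (or i == j), else 0."""
    W = np.zeros((n * n, n * n), dtype=int)
    for r in range(n):
        for c in range(n):
            i = r * n + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n:
                        W[i, rr * n + cc] = 1
    return W

def pool_counts(counts, W):
    """counts: (T, m) regional counts; returns (T, m) pooled neighborhood counts."""
    return counts @ W.T
```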
Figure 3 contains the plots for region 13, which has the largest step increase of 1 standard deviation. Note that the CUSUM statistic in this simulation was designed to detect a change of this magnitude. The alarms are more persistent for all three procedures, with the ST procedure (orange line) identifying the outbreak earliest (on day 51, right at the beginning of the outbreak), followed by the BY procedure (red line) on day 53 and the BH procedure (blue line) on day 57.
Figure 2. Summary plots for region 1: (a) disease counts, (b) CUSUM statistics, (c) differences, and (d) alarms. In plot (c) the blue line shows the difference (α_BH − p-value), where α_BH is the threshold using the BH procedure for the independent model. Similarly, the red line shows the difference (α_BY − p-value), where α_BY is the threshold using the BY procedure for the pooled model; the orange line shows the difference (0.05 − q-value), where 0.05 is the overall FDR using the ST procedure for the pooled model. The grey horizontal line indicates a zero difference.
Figure 3. Summary plots for region 13: (a) disease counts, (b) CUSUM statistics, (c) differences, and (d) alarms. In plot (c) the blue line shows the difference (α_BH − p-value), where α_BH is the threshold using the BH procedure for the independent model. Similarly, the red line shows the difference (α_BY − p-value), where α_BY is the threshold using the BY procedure for the pooled model; the orange line shows the difference (0.05 − q-value), where 0.05 is the overall FDR using the ST procedure for the pooled model. The grey horizontal line indicates a zero difference.
To confirm that the three procedures control the FDR at the desired level of α = 0.05, we calculated the observed proportion of false discoveries for each procedure over 100 independent simulations, using the experimental setup of Section 3.1 for each simulation. Boxplots of the results for each procedure are displayed in Figure 4. Clearly, the overall FDR is below the specified 0.05 for all three procedures.
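The observed proportion of false discoveries per simulation (V/R, taken as 0 when there are no discoveries) can be computed with a short helper; this is an illustrative sketch, not the authors' code:

```python
import numpy as np

def false_discovery_proportion(rejected, true_null):
    """V/R: fraction of rejections (alarms) that fall on hypotheses that
    are truly null; defined as 0 when nothing is rejected."""
    rejected = np.asarray(rejected, dtype=bool)
    true_null = np.asarray(true_null, dtype=bool)
    R = int(rejected.sum())
    V = int((rejected & true_null).sum())
    return V / R if R > 0 else 0.0
```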
Figure 4. Boxplots of the proportion of false discoveries for all three procedures over 100 simulations. Results for the BH procedure are shown on the far left, the BY procedure in the middle, and the ST procedure on the far right.
3.3 Additional Investigation of Power
We now discuss the "power" of the three procedures more carefully. To show that these results are consistent, the conditional expected delays (CEDs) and the probabilities of an alarm (PAs) were calculated for the three models over a large number of simulations.
The CED and PFA are commonly used measures of evaluation in surveillance. Frisen [25] discusses these and several other performance measures used in statistical surveillance. The conditional expected delay (CED) is commonly used to compare speed of detection, and the probability of a false alarm (PFA) is used to compare false alarm rates.
Let tA = min{t : S(t) > h} be the alarm time at which a surveillance statistic S(t) crosses the threshold h. The conditional expected delay (CED) is the average delay until an alarm when the change occurs at time point x, and is defined as

CED(x) = E[tA − x | tA > x].

The other measure, the probability of a false alarm, is defined as

PFA = P(tA < x).
Robertson et al. [19] point out that a surveillance system should provide "quick detection and few false alarms." Furthermore, Jiang et al. [26] point out that the "design and evaluation of a surveillance method need to tradeoff between false alarm probabilities and detection delays." Therefore, the PFA and CED are useful measures for evaluating the performance of surveillance systems.
In our setting, we are controlling the FDR. Since the FDR at each time step is held constant across procedures, a more appropriate measure of power in our context is simply the probability of an alarm (PA): the most powerful procedure will be the one that sounds an alarm most frequently. We estimate the PA by dividing the total number of alarms by the total number of tests.
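Given the first-alarm time of each simulated run, these evaluation measures can be estimated as follows (an illustrative sketch; runs that never alarm are assumed to have been excluded beforehand):

```python
import numpy as np

def ced(alarm_times, change_point):
    """Average delay t_A - x among runs whose first alarm comes after the change."""
    delays = [t - change_point for t in alarm_times if t > change_point]
    return float(np.mean(delays)) if delays else float("nan")

def pfa(alarm_times, change_point):
    """Fraction of runs whose first alarm precedes the change point."""
    return float(np.mean([t < change_point for t in alarm_times]))

def pa(alarm_indicators):
    """Probability of an alarm: total alarms divided by total tests."""
    return float(np.asarray(alarm_indicators, dtype=float).mean())
```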
In calculating the CED and PA for the three models, a range of change points (5, 10, 15, ..., 95) was considered (recall that the simulation results presented earlier were for a change point at day 50). For each change point, 100 simulations were performed. For example, 100 data sets were generated where the change point for the outbreak was at day 5, and similarly 100 data sets each for the other change points (days 10, 15, ..., 95). Then, for each change point, the CEDs of the 100 simulated data sets were averaged for each of the three models. To avoid issues in estimating the CED for change points near the upper end of the original simulation time period (1, 2, ..., 100), the time period under consideration was extended to 150 so that a change could be detected for all data sets.
Figure 5 shows the CED plots for the BH procedure (blue), the BY procedure (red), and the ST procedure (orange) for regions 1, 2, 7, and 13. Recall that the increase in the outofcontrol process mean for each region is different, with region 1 having the smallest increase and region 13 having the largest.
The top left plot in Figure 5 shows the average CED against change point time for region 1 for all three models. For all four regions, the ST procedure has the lowest CED, followed by the BY and BH procedures, respectively, across all change points. In general, as evidenced by the four plots in Figure 5, the procedures using the pooled regional neighborhood counts (the models using the BY and ST procedures) signal much faster than the model that only uses observed regional counts (the model using the BH procedure); within the pooled models, the ST procedure detects outbreaks faster than the BY procedure.
Figure 5. Average CED versus change point time for the BH, BY, and ST models for regions 1, 2, 7, and 13.
We now compare the performance of the procedures using the PA. Figure 6 shows the probability of an alarm against the change point for all three procedures in regions 1, 2, 7, and 13. Again, the same set of change points (5, 10, 15, ..., 95) was considered, and for each change point, the PAs from 100 simulations were averaged for all three methods. As expected, in all four regions the ST procedure has the highest PA, followed by the BY procedure and then the BH procedure. However, for larger increases in the mean, the vertical distance between the three lines tends to diminish, indicating that the gains in power tend to be smaller for larger step increases in the mean. Since the FDR is controlled at the 0.05 level, the higher the total number of alarms, the higher the number of true alarms; so when the FDR is controlled, we want to sound the alarm more frequently. As expected, the ST procedure (orange) sounds the most alarms, followed by the BY procedure (red) and finally the BH procedure (blue). The downward trends in the plots arise because, when the outbreak occurs earlier in the 100-day period, the duration of the outbreak is longer, allowing more chances to detect the out-of-control process.
Figure 6. Average probability of an alarm versus change point time for the BH, BY, and ST models for regions 1, 2, 7, and 13.
4 Application to Salmonella Data
The proposed method was applied to a data set of Salmonella Newport cases reported weekly from the 16 German federal states between 2004 and 2014. The data were obtained from the Robert Koch Institute in Germany [27]. Since Salmonella is not a contagious disease, without trend or seasonality, the counts were assumed to be independent. The 16 German federal states are displayed in Figure 7.
Figure 7. Map of the sixteen federal states of Germany
The first two years of data (2004-2005) were used to estimate the in-control distribution in each state, since no unusually high disease counts were reported from any of the states during this period. A Poisson dispersion test [28] was used to ensure that the Poisson counts were not overdispersed; no overdispersion was detected in the first two years of data at a Type I error rate of 0.01. It is also reasonable to assume that the data are spatially independent over the first two years, which we use as the in-control time period.
The proposed surveillance model was applied to data from 2006-2013. Since states with larger populations are likely to have larger disease counts, the method proposed by Raubertas [9], detailed in Section 2.2, was adopted to address this issue. For the pooled model, disease counts from each state were pooled together with those of its immediate neighbors, using a binary connectivity matrix similar to the one used in the simulation experiment. For the pooled counts, the CUSUM statistic was computed, the p-values for each region were estimated, and alarms were signaled using an FDR-controlling procedure. P-values were estimated using the bootstrap method since the in-control distribution was unknown. The BH procedure was used to handle the multiple testing problem for the independent counts model, while the ST procedure was used for the pooled model; for both models the overall FDR was set to 0.05 to minimize false alarms.
Plots of the results for both the independent counts and pooled counts models are shown in Figures 8 and 9 for the states of Bavaria and Saxony-Anhalt, respectively.
The data set contained counts for 528 weeks, from the first week of January 2004 to the second week of February 2014. The time axis shows the number of weeks, starting from the first week of January 2004 (week 1). Since the Salmonella Newport outbreak actually occurred in 2011 [29], the results are plotted from the last week of September 2009 (week 300) to the second week of February 2014 (week 528). The results indicate that the pooled model using the ST procedure successfully detected the Salmonella outbreak in the first week of November 2011 (week 410), and this was consistent for all 15 states. (The pooled model using the BY procedure also gave alarms at week 410 in all 15 states, as the rise in disease counts was fairly large.) In contrast, the model using independent state counts was rather inconsistent in signaling an alarm around the period of the outbreak: it did not detect an outbreak in Bremen, which has the smallest population; in Bavaria, the outbreak was detected 53 weeks later; in Rhineland-Palatinate and Saxony-Anhalt it was detected two weeks later; and in Baden-Wuerttemberg and Schleswig-Holstein it was detected a week later.
These results using the proposed method employing FDR techniques can be compared with the performance of traditional multiple Poisson CUSUM methods using FWER-based multiple testing procedures. The traditional method using 15 univariate Poisson CUSUM charts with a Bonferroni correction produced rather inconsistent results in signaling the alarm: in Bremen and Saxony, no change was detected; in Bavaria, the outbreak was detected 53 weeks later; in Baden-Wuerttemberg it was detected 4 weeks later; in Rhineland-Palatinate, Schleswig-Holstein, and Thuringia the outbreak was detected two weeks later; and in Mecklenburg-Vorpommern it was detected a week later.
The alarm plot in Figure 8(c) shows the results for Bavaria: using the pooled model with the ST procedure, the outbreak is detected in week 410, consistent with the results of the other states, but there is a much longer delay (53 weeks) in detection using the independent counts model. Since the increase in disease counts is relatively large in week 410, both the ST and BY procedures detect the outbreak in week 410. However, note that there is another, relatively smaller increase in counts after week 450. This smaller shift is detected first by the more powerful ST procedure in week 452, followed by the BY procedure in week 456, and then the BH procedure in week 459. Although the ST and BY procedures are comparable in detection speed for large increases, the ST procedure is quicker in detecting relatively smaller changes, consistent with the simulation results.
Figure 9(c) shows the results for Saxony-Anhalt: using the pooled model with the ST procedure, the outbreak is again detected in week 410, but there is a two-week delay in detection using the independent model. The pooled model sounds a steady alarm for a longer period after week 410, whereas the independent model sounds an alarm irregularly and for a relatively shorter period. In summary, the pooled model using the Storey-Tibshirani procedure successfully detected the outbreak simultaneously throughout the country, as opposed to the independent model, which signaled alarms inconsistently and after considerable delays.
Figure 8. Pooled disease counts (a), CUSUM statistics (b), and alarm plots (c) for Bavaria. The blue lines in (a), (b), and (c) represent the independent-count model and the red lines represent the pooled-count model.
Figure 9. Pooled disease counts (a), CUSUM statistics (b), and alarm plots (c) for Saxony-Anhalt. The blue lines in (a), (b), and (c) represent the independent-count model and the red lines represent the pooled-count model.
5 Discussion
In the proposed new procedure, regional disease counts from multiple regions are aggregated to compute an alarm statistic using the popular Poisson CUSUM method. Two novel aspects of the proposed method are particularly advantageous in a disease surveillance setting. First, the use of p-values (instead of the commonly used critical values) enables us to evaluate the strength of the out-of-control signal. Second, the use of the FDR for error control, instead of the standard FWER or ARL0, allows the use of powerful tools to handle the multiple comparison problem. The Storey-Tibshirani [17] procedure was the most powerful in our tests, in comparison with the more conservative Benjamini-Hochberg [11] and Benjamini-Yekutieli [12] procedures. The simplicity of the algorithm and the improved speed of detection make the proposed method useful in practical disease surveillance, as illustrated by the Salmonella Newport example from Germany.
We emphasize that the proposed procedure controls the FDR at each time step. If one wanted to control FDR over all time steps simultaneously, the BY and ST procedures would still apply, though the procedure would no longer be prospective. As evident from simulation results, pooling regional neighborhood counts can increase the speed of detection in comparison to individual regional counts, when the shift, the deviation from the mean, occurs in several units that make up the regional neighborhood (this was previously emphasized by Raubertas [9]). However, pooling may result in a delay in detection if the shift occurs in only one or two regions that make up a neighborhood because the effect of the shift will be diluted. Additionally, pooling may result in an increase in false alarms in regions sharing a common boundary.
Another issue related to the use of the CUSUM statistic is identifying the time at which an outbreak ends. The CUSUM statistic may not decrease quickly after an outbreak has ended, leading to incorrect decisions once the outbreak is over. This issue is addressed by Gandy and Lau [30].
The proposed methodology is applicable to the detection of foodborne outbreaks, a class of outbreaks with no trend or seasonality. A future research direction is to extend the method to a broader class of settings, encompassing outbreaks with trend and seasonality. Additionally, we have assumed spatial and temporal independence of the observed counts (though the pooled CUSUM statistics are correlated). It is certainly possible that the counts are dependent, even after adjusting for a nonstationary mean structure. This has been addressed by Rogerson and Yamada [10], among others. We are actively working on extending the proposed methodology to the setting where counts are dependent in time and/or space.
The R package 'fdrtool' [31] was used to calculate the q-values for the simulations and the case study (see Appendix G for details of the 'fdrtool' package). We would like to thank several anonymous referees and an associate editor for their very helpful suggestions toward improving this manuscript.
CHAPTER III
AN IMPROVED EWMABASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE REGIONS
1 Introduction
Quick detection and prevention of emerging outbreaks is an important problem in public health surveillance. Large amounts of data on a variety of diseases are collected in the United States and Europe to monitor and control the spread of disease outbreaks [1]. Both prospective and retrospective methods can be used to analyze the collected data and answer various public health questions. During the past decade, there has been increased interest in using prospective methods for surveillance of disease outbreaks. In prospective disease surveillance, the numbers of reported cases are collected sequentially over time, and analysis is performed on the most up-to-date data to decide whether there has been an increase in the disease incidence rate. This differs from retrospective analysis, where the entire data set is available and analyzed at the same time, as in traditional hypothesis testing. In prospective surveillance, it is important to detect an increase in disease incidence as quickly as possible and report it to public health authorities so that timely action can be taken.
Prospective surveillance methods are often broadly classified as time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods [2]. In recent years, SPC methods that originated in industrial process control have been successfully utilized for surveillance of disease outbreaks [3]. Two SPC methods that have been successfully extended for use in disease surveillance are the Cumulative Sum (CUSUM) method and the Exponentially Weighted Moving Average (EWMA) method. Both have gained prominence in practical disease surveillance: CUSUM is used in BioSense [4], while the EWMA method is used in the Early Aberration Reporting System (EARS) developed by the Centers for Disease Control and Prevention (CDC) and in ESSENCE (Electronic Surveillance System for the Early Notification of Community-based Epidemics), developed by the Department of Defense and Johns Hopkins University [4].
The EWMA chart is easy to implement and interpret, similar to the CUSUM chart, the other popular SPC method that is also used in practical disease surveillance [32]. The EWMA control chart is a moving average of current and past observations, where the weights of the past observations decrease as in a geometric series [32]. In an industrial process control setting, the goal of the EWMA chart is to quickly detect shifts in the process mean. The starting value is often set to the process mean, which is the target value to be monitored; in disease surveillance this value is the mean number of cases when there is no outbreak. Similar to the CUSUM control chart, the EWMA chart has a lower control limit and an upper control limit. The process is considered to be out of control when the EWMA statistic crosses a control limit. When the process is "in-control" (i.e., when there is no outbreak) we do not want to sound an alarm. However, when the process is "out-of-control" (i.e., when there is an outbreak), we want to sound an alarm. The method was originally proposed by Roberts [7] for normally distributed responses and was later extended to Poisson data by Borror [8].
The EWMA chart is a purely temporal control chart. Other purely temporal control charts have been extended to spatiotemporal settings in the past: namely, Raubertas [9] extended the purely temporal Poisson CUSUM method to a spatiotemporal setting. Raubertas [9] formed neighborhoods by pooling counts of regions that share a common boundary. In other words, counts are pooled using closeness as a weight, where the weights are binary. This way,
overlapping neighborhoods are defined with each region having its own neighborhood. The pooled data within the neighborhood is used to calculate a CUSUM statistic for each neighborhood. An alarm is signaled when the neighborhood CUSUM statistic exceeds its control limit.
Rogerson and Yamada [10] extended Raubertas's method for spatial surveillance by forming neighborhoods consisting of contiguous regional units, where neighborhoods can extend beyond regions sharing a common boundary. Instead of using binary weights, Rogerson and Yamada used weights that decrease with increasing distance from the region. The local neighborhood statistics formed this way are monitored to signal alarms. The resulting multiple testing problem that is encountered when testing multiple neighborhood statistics simultaneously is handled by controlling the familywise error rate (FWER) using the popular Bonferroni correction. Bonferroni correction methods are also employed in traditional multiple control charts utilized in industrial process control to handle the associated multiple testing problem [15].
FWER is the probability of making one or more false discoveries among all hypotheses tested. Therefore, FWER procedures can be too strict when a large number of tests are performed simultaneously. In such situations, the False Discovery Rate (FDR), originally proposed by Benjamini and Hochberg [11], is a better alternative. FDR is the expected proportion of false discoveries among all discoveries. FDR-controlling procedures exert less stringent control than the traditional FWER procedures. As a result, FDR procedures have more power at the expense of increased false alarms. Benjamini and Hochberg [11] proposed an FDR-based method to handle the multiple testing problem for independent statistics. The method was
subsequently extended to dependent statistics by Benjamini and Yekutieli [12]. FDR methods have recently been employed to address multiple testing problems in SPC charts [13].
In conventional SPC charts, control limits are used to signal alarms. Li et al. [14] proposed a general algorithm that uses p-values, instead of critical values, to signal alarms in a univariate setting. When the in-control distribution is known, they proposed using Monte Carlo simulation of the in-control distribution to calculate the p-values. When the in-control distribution is unknown, they proposed using bootstrap methods to generate an empirical in-control distribution from which to estimate the p-values.
For multiple control charts, Li and Tsung [15] proposed an FDR-based method instead of the conventional FWER-based methods. In their method, the multiple chart statistics are first calculated at each time step. Then, the corresponding p-values are calculated at each time step using a random-walk-based approximation. When the multiple statistics are independent, Li and Tsung [15] use the Benjamini-Hochberg procedure [11], and when the statistics are dependent they use the Benjamini-Yekutieli procedure [12].
Several of these approaches are combined in the proposed method. Regional neighborhoods are formed, as proposed by Raubertas [9], by pooling counts of regions sharing a common boundary. Then EWMA statistics are computed for these neighborhoods, using p-values instead of critical values as suggested by Li et al. [14]; further, adopting their method, bootstrap methods are used to estimate the empirical in-control distribution. The multiple testing problem that arises when testing multiple neighborhood statistics simultaneously is handled by using FDR procedures, as opposed to FWER-based techniques, adapting the Li and Tsung [15] method. The power of the proposed method was further improved by utilizing techniques
proposed by Storey and Tibshirani [17], in addition to using the FDR-controlling technique proposed by Benjamini and Yekutieli [12].
The first step of the method is to decide on an acceptable FDR level for all charts. Then, disease counts of regions sharing a common boundary are pooled together at each time step to compute neighborhood EWMA statistics. The in-control (no outbreak) distributions of the neighborhood EWMA statistics are approximated using bootstrap methods, since the in-control distributions are not known. Using the neighborhood EWMA statistics at each time step and their in-control distributions, p-values are estimated. Since multiple dependent statistics are tested simultaneously, both the Benjamini-Yekutieli [12] and Storey-Tibshirani [17] procedures are used to handle the multiple testing problem.
In what follows, we extend the EWMA method to the setting where the data are counts, while incorporating nearby spatial information and minimizing the assumptions made about the disease-generating process when there is not an outbreak. Furthermore, we increase the power of the method by controlling the False Discovery Rate (FDR) instead of the average run length criterion typically employed in SPC. Section 2 provides the necessary background information and details of the extended methodology. Section 3 provides results of a simulation study comparing the extended EWMA method to a similarly constructed CUSUM method [33]. In Section 4 the extended EWMA method is applied to the detection of a Salmonella Newport outbreak in Germany and the results are compared with the results from the CUSUM method. Section 5 highlights the strengths of the proposed method and discusses potential research directions.
2 Methods
2.1 Standard EWMA
The original EWMA chart was proposed by Roberts [7] for normally distributed responses. This method was extended to Poisson data by Borror [8].
Let Y_1, Y_2, ..., Y_n be the response values at times t = 1, 2, ..., n. Borror [8] assumed that the responses are uncorrelated and identically distributed Poisson random variables with mean μ_0. The mean μ_0 is often referred to as the in-control mean. The objective of the EWMA method is to detect an upward shift in the mean from μ_0 to a prespecified upper limit. The test statistic for the Poisson EWMA method is

E_t = max[μ_0, λY_t + (1 − λ)E_{t−1}],    (1)

where Y_t is the count observed at time t, E_t is the Poisson EWMA statistic at time t, and λ is a weighting constant such that 0 < λ < 1. E_0 is usually defined to be μ_0. With this method, a weight of λ is given to the current observation at time t, and a weight of (1 − λ) is given to the EWMA statistic from the previous time step.
The expected value and the variance of the Poisson EWMA statistic are E(E_t) = μ_0 and

Var(E_t) = λ[1 − (1 − λ)^{2t}]μ_0 / (2 − λ),

respectively [8]. Therefore, for large values of t, the asymptotic variance is given by

Var(E_t) = λμ_0 / (2 − λ).
An alarm is signaled whenever E_t > h, where the critical value h is determined by the formula

h = μ_0 + L √(λμ_0 / (2 − λ)),    (2)

where the out-of-control mean we would like to detect is L standard deviations above μ_0; note that the asymptotic standard deviation √(λμ_0 / (2 − λ)) is used in this formula.
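The recursion in (1) and the control limit in (2) are straightforward to compute. The following Python sketch is a minimal illustration of both (the function names are ours, not the dissertation's code):

```python
import numpy as np

def poisson_ewma(counts, mu0, lam):
    """Poisson EWMA statistics E_t = max(mu0, lam*Y_t + (1 - lam)*E_{t-1}),
    with starting value E_0 = mu0, as in equation (1)."""
    stats = []
    prev = mu0
    for y in counts:
        prev = max(mu0, lam * y + (1 - lam) * prev)
        stats.append(prev)
    return np.array(stats)

def control_limit(mu0, lam, L):
    """Upper control limit h = mu0 + L*sqrt(lam*mu0/(2 - lam)) from
    equation (2), using the asymptotic standard deviation."""
    return mu0 + L * np.sqrt(lam * mu0 / (2 - lam))

# e.g. with mu0 = 4, lam = 0.2, and L = 3, control_limit gives h = 6.0
```

Because of the max(μ_0, ·) floor, the statistic never drifts below the in-control mean, which shortens the time needed to react to a subsequent upward shift.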
2.2 Extending the Purely Temporal EWMA Chart to a Spatiotemporal Setting
The original Poisson EWMA chart proposed by Borror [8] is a purely temporal control chart. Different methods have been proposed to extend the EWMA chart to a spatial setting, for example, the methods proposed by Lin et al. [34] and Sparks and Ellis [35]. Lin et al. [34] proposed a spatial EWMA scan statistic that assigns different weights to data at different radius levels from a specific center of investigation. Sparks and Ellis [35] proposed using the EWMA statistic to build temporal memory into the scan statistic for detecting outbreaks of unknown size and shape. Both of these methods attempt to extend a spatial statistical method, the scan statistic, to a spatiotemporal setting. However, it is also possible to extend temporal statistical methods, such as SPC charts, to a spatiotemporal setting. The method proposed by Raubertas [9] for extending the Poisson CUSUM chart to a spatiotemporal setting is often highlighted in the literature on extending purely temporal SPC charts to a spatiotemporal setting. We apply this method in extending the EWMA chart to a spatiotemporal setting.
Raubertas [9] proposes pooling counts from adjacent regions to form neighborhoods according to a spatial proximity matrix W. If we have m spatial regions, the geographic relationship among the m regions is often quantified through the use of an m × m spatial proximity matrix W = {w_il}, with w_il denoting the measure of closeness between regions i and l [16]. Common weighting schemes include binary weights (w_il = 1 if regions i and l share a common boundary and w_il = 0 otherwise) or inverse distance weights (w_il = d_il^{−θ} for some power θ > 0, where d_il is the distance between the centroids of regions i and l, and w_il = 0 otherwise). Let Y_it be the disease count in region i at time t and let w_il be a measure of closeness between regions i and l. Then the pooled count for region i at time t is

Y′_it = Σ_{l=1}^{m} w_il Y_lt.    (3)

Note that we set w_ii = 1, so that a weight of 1 is given to the counts of region i and its immediately adjacent neighbors when pooling counts. If an increase in disease counts occurs over an area that consists of more than one region, then pooling counts increases the sensitivity for detecting a shift in the mean disease counts. However, pooling may have a negative effect if the increase in counts occurs in only one region, because of dilution.

Since regions with larger populations are likely to have larger disease counts, a correction should be made in calculating the EWMA statistic for each neighborhood. Letting n_i denote the population at risk in region i and γ_0 be a baseline disease rate common to all regions, the expected number of cases in region i, μ_0i, is given by the relationship μ_0i = n_i γ_0. In adapting Raubertas's [9] method, which was originally proposed for the CUSUM chart, to the EWMA chart, we used the same procedure for pooling counts within neighborhoods and performing the population correction as proposed by Raubertas [9]. For each regional neighborhood consisting of region i and its immediate neighbors, a pooled EWMA statistic E′_it was calculated at time t.

In calculating the pooled EWMA statistic E′_it, the pooled expected number of cases for regional neighborhood i at time t, μ′_i0, was calculated by pooling the in-control means of region i and its immediately adjacent neighbors, just as when pooling counts. Both the pooled regional neighborhood counts Y′_it and the pooled expected cases μ′_i0 were used in equation (1) to calculate the pooled EWMA statistic E′_it for regional neighborhood i.
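With binary weights, the pooling in (3) is just a matrix-vector product of the proximity matrix with the vector of regional counts. A small sketch (our own illustration; the three-region layout is hypothetical):

```python
import numpy as np

def pool_counts(W, y):
    """Pooled neighborhood counts Y'_it = sum_l w_il * Y_lt, equation (3)."""
    return W @ y

# Binary proximity matrix for three regions in a row (w_ii = 1, and
# w_il = 1 when regions i and l share a common boundary):
W = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
y = np.array([2, 5, 3])        # counts in regions 1, 2, 3 at one time step
pooled = pool_counts(W, y)     # array([ 7, 10,  8])
```

The same product applied to the vector of regional in-control means gives the pooled expected counts used for the population correction.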
2.3 Improving the Power of the Procedure
In conventional EWMA charts, control limits are used to signal alarms. Instead, we propose using p-values to signal alarms. In the context of SPC methods, this appears to have been first proposed by Li et al. [14], and the approach was utilized in the disease surveillance setting by Dassanayake and French [33]. Li et al. [14] estimated the p-value in several settings by drawing samples from an in-control distribution. In the most general case, when the in-control distribution is unknown, Li et al. [14] proposed using the bootstrap method to estimate the in-control distribution. This idea was extended to the prospective disease surveillance setting in [33], and we adopt a similar procedure here. First, a set of in-control times is specified based on prior knowledge. Next, B bootstrap samples are collected from the set of in-control data (see Appendix B for details of the bootstrap procedure). In a typical bootstrap setting, the responses are treated as independent and identically distributed, which is not the case in our setting. However, the collection of responses at each time step of the in-control time period has the same joint distribution, and the responses at one time are assumed to be independent of the responses at other times. In order to preserve the spatial structure of the data, each bootstrap sample corresponds to the complete set of responses observed at a particular in-control time. If the set of in-control times is 1, 2, ..., t_0, then the bootstrap samples effectively correspond to sampling B times with replacement from 1, 2, ..., t_0, and then using the collection of responses at each sampled time as a single bootstrap sample. For each bootstrap sample, the corresponding EWMA statistics are computed for each of the m regions. Using these statistics, the empirical in-control distribution of the EWMA statistic for each region is estimated.
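One way to carry out this resampling is sketched below, under our reading of the procedure (Appendix B gives the exact algorithm): whole in-control time indices are resampled, so the spatial dependence among the m neighborhoods at a given time is preserved, and the EWMA recursion is run along the resampled sequence.

```python
import numpy as np

def bootstrap_null(pooled_counts, mu0, lam, B, seed=0):
    """Empirical in-control distribution of the pooled EWMA statistics.

    pooled_counts: (t0, m) array of pooled neighborhood counts over the
    in-control times 1..t0; each bootstrap draw resamples one whole time
    index so all m neighborhoods are drawn together.
    Returns a (B, m) array of EWMA statistics under the null.
    """
    rng = np.random.default_rng(seed)
    t0, m = pooled_counts.shape
    times = rng.integers(0, t0, size=B)     # B draws with replacement
    E = np.full(m, float(mu0))              # E_0 = mu0 in every neighborhood
    null = np.empty((B, m))
    for b, t in enumerate(times):
        E = np.maximum(mu0, lam * pooled_counts[t] + (1 - lam) * E)
        null[b] = E
    return null
```

Here mu0 is shown as a scalar for simplicity; with the population correction it would be the vector of pooled expected counts μ′_i0.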
At each new time point t, observations Y_1t, Y_2t, ..., Y_mt are collected for each region and regional neighborhoods are formed by pooling counts of each region and its immediate neighbors. Then, the corresponding pooled EWMA statistics for each regional neighborhood, E′_1t, E′_2t, ..., E′_mt, are computed. Afterwards, each of the pooled EWMA statistics E′_it is compared with the empirical in-control distribution for the corresponding regional neighborhood to calculate a p-value for each regional neighborhood. Using the resulting set of p-values p_1t, p_2t, ..., p_mt, alarms are signaled using an appropriate FDR-controlling multiple testing procedure. In order to improve the power of our testing algorithm, we utilize the FDR [15] for error control. An FDR-controlling procedure for the CUSUM method was previously proposed by Li and Tsung [15] using a random-walk-based method for approximating the p-values. We employ a similar idea for the EWMA method, but instead utilize the p-values estimated using the bootstrap procedure so that numerous FDR-controlling procedures are available to us.
The FDR is the expected proportion of false discoveries among all discoveries:

FDR = E[V/R],

where V is the total number of true null hypotheses that are rejected (i.e., false alarms) and R is the total number of hypotheses, both true and false, that are rejected (i.e., all alarms) out of a total of m tests. FDR error control methods have greater power at the expense of increased false alarms.
We briefly mention three FDR-controlling multiple testing procedures that are relevant to our setting. When m independent test statistics are compared simultaneously (in our setting, assuming that no pooling is done), Benjamini and Hochberg [11] proposed a method that can be used to control the FDR at level α. Suppose that m hypotheses H_1, ..., H_m are to be tested, with corresponding p-values p_1, ..., p_m. Let p_(1), ..., p_(m) denote the p-values ordered in ascending order, with corresponding ordered hypotheses H_(1), ..., H_(m). Next, find the largest j such that p_(j) ≤ αj/m. Finally, reject the hypotheses H_(i) for i = 1, ..., j. We will refer to this method as the BH procedure.
The EWMA statistics calculated at each time step will be correlated across space because of the pooling procedure. Consequently, the BH procedure is not the most appropriate procedure. Two popular FDR-controlling procedures for correlated statistics are the Benjamini-Yekutieli procedure and the Storey-Tibshirani procedure. The Benjamini-Yekutieli procedure [12] modifies the original BH procedure and rejects H_(1), ..., H_(j) for the largest j such that

p_(j) ≤ αj / (m c(m)),  where  c(m) = Σ_{i=1}^{m} 1/i.

We will refer to this method as the BY procedure.
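The two step-up procedures differ only in the constant applied to α, so both can be sketched in one helper (our own illustration, not code from the dissertation):

```python
import numpy as np

def fdr_reject(pvals, alpha=0.05, dependent=False):
    """BH step-up procedure; with dependent=True, the BY variant divides
    alpha by c(m) = sum_{i=1}^m 1/i to allow arbitrary dependence."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    if dependent:
        alpha = alpha / np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        j = np.nonzero(passed)[0].max()   # largest j with p_(j) <= alpha*j/m
        reject[order[:j + 1]] = True      # reject H_(1), ..., H_(j)
    return reject

pvals = [0.001, 0.020, 0.040, 0.900]
fdr_reject(pvals)                   # BH rejects the two smallest p-values
fdr_reject(pvals, dependent=True)   # BY is stricter: only the smallest
```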
Storey and Tibshirani [17] argued that the Benjamini and Hochberg [11] procedure is too conservative and proposed a new FDR-based procedure to gain more power. They proposed using the q-value, which is an FDR-based measure of significance, instead of the p-value. The q-value is the minimum false discovery rate that is incurred when calling a test significant, whereas the p-value is the minimum false positive rate that is incurred when calling a test significant. The computation of the q-value is fairly complicated, but is simplified by the use of the R package 'fdrtool' [31] (see Appendix G for details of the 'fdrtool' package). Details of the q-value computation are provided in [33]. In summary, however, the FDR will be controlled at level α when null hypotheses are rejected whenever the associated q-value is less than α. We will refer to this procedure as the ST procedure.
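A conservative special case of the q-value can be computed in a few lines. The sketch below fixes the null proportion at π_0 = 1 (the full Storey-Tibshirani procedure, as implemented in 'fdrtool', instead estimates π_0 from the p-value distribution, which is what yields the extra power):

```python
import numpy as np

def qvalues(pvals, pi0=1.0):
    """Monotone q-value estimates q_(i) = min_{j >= i} pi0 * m * p_(j) / j.
    With pi0 = 1 this reduces to the BH-adjusted p-values; the ST
    procedure gains power by estimating pi0 < 1 from the data."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    raw = pi0 * m * p[order] / np.arange(1, m + 1)
    q_sorted = np.minimum.accumulate(raw[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

# Rejecting the hypotheses with q-value < alpha controls the FDR at alpha.
```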
2.4 Summarization of Testing Procedure
The first step of the proposed method is to define an allowable FDR level α for all m regions. Then, at each time step t, disease counts Y_1t, Y_2t, ..., Y_mt are collected in each of the m regions. Next, the regional neighborhoods are formed by pooling the counts of each region and its adjoining neighbors using the Raubertas method in (3). Using these regional neighborhood counts Y′_1t, Y′_2t, ..., Y′_mt, the corresponding EWMA statistics E′_1t, E′_2t, ..., E′_mt are computed for each regional neighborhood by replacing Y_it with Y′_it in equation (1).
Bootstrap methods are used to approximate the in-control distributions (null distributions) of the pooled EWMA statistics for each regional neighborhood. From the in-control period (no-outbreak period), B bootstrap samples are drawn with replacement; each bootstrap sample is a vector of pooled neighborhood counts from each of the m neighborhoods at a given time in the in-control period. For each bootstrap sample, pooled EWMA statistics are calculated for each of the m regions. The B bootstrap samples are used to construct a bootstrap distribution for each region, and the p-value for each regional neighborhood is then computed by comparing the observed EWMA statistic with this bootstrap distribution. The BY or ST procedure is then used to determine the regions for which an alarm should be signaled.
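Given the empirical null distribution for a neighborhood, the p-value can be estimated as the proportion of bootstrap statistics at least as large as the observed one. A sketch (the +1 finite-sample correction is our choice here, so the estimate is never exactly zero):

```python
import numpy as np

def empirical_pvalue(observed, null_stats):
    """Estimated p-value: proportion of bootstrap null statistics that
    are greater than or equal to the observed EWMA statistic."""
    null_stats = np.asarray(null_stats)
    return (1 + np.sum(null_stats >= observed)) / (1 + len(null_stats))

empirical_pvalue(6.0, [4.1, 4.4, 5.0, 6.2])   # -> 0.4
```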
In Section 3, three different FDR-controlling procedures will be used. For the baseline model using independent regional counts, the BH procedure will be used. For the proposed pooled count model, two different FDR-controlling procedures will be used. To show how pooling improves performance, first, the BY procedure will be used for the dependent regional neighborhood statistics. Second, to gain more statistical power, the ST procedure will be used.
2.5 Comparison of the Proposed Method with Conventional SPC Methods
The proposed method is more appropriate for disease surveillance than the conventional methods for the following reasons: (i) the method uses p-values as opposed to critical values as in conventional SPC methods; (ii) FDR error control, which is more appropriate for disease surveillance, is used with the new method in contrast to ARL_0, which is the standard measure in SPC; (iii) the use of FDR facilitates the adoption of more powerful multiple testing procedures, such as the BH, BY, and ST procedures, compared to more conservative FWER-based procedures such as the Bonferroni correction.
With the conventional SPC methods, critical values are used to signal alarms; however, the proposed method uses p-values, which are better suited for disease surveillance for the following reasons:
1. A p-value is a common statistical measure and is used in a variety of applications that make use of hypothesis testing. However, critical values are not as commonly used.
2. With p-values, one can easily quantify the strength of evidence for an outbreak, for example, no evidence, weak evidence, moderate evidence, or strong evidence. However, with the conventional critical value approach it is not straightforward to quantify the outbreak signal.
The use of FDR is more fitting for disease surveillance than ARL_0. The concept of ARL_0 originated in an industrial process control setting. When a process is in operation, the period from the start of the process to the point where the control chart crosses a threshold is considered a run. Sometimes, even when the process is in-control, the control chart can cross the threshold, signaling a false alarm. This is similar to a Type I error in classical hypothesis testing. ARL_0 is defined as the average time between two false alarms. ARL_0 and the Type I error rate are directly related: the reciprocal of ARL_0 is the Type I error rate [2].
Employing FDR as an error control measure in disease surveillance is more appropriate than the use of ARL0 for the following reasons:
1. FDR has a more meaningful interpretation in disease surveillance than the Type I error rate, the reciprocal of ARL_0. FDR is the proportion of false alarms among all alarms, whereas the Type I error rate is the proportion of false alarms among all hypothesis tests performed. For example, an FDR of 0.05 would mean that, on average, if there are 100 alarms, then 5 of those alarms are false alarms. However, if the Type I error rate is 0.05, that would mean that out of 100 tests (this is the same as 100 time intervals, as a test is performed at each time interval) we would expect 5 false alarms. Clearly, it is more useful to know the proportion of false alarms among all alarms than among all tests.
2. Predictive Value Positive (PVP) is one of the seven attributes used by the Centers for Disease Control and Prevention to evaluate a new surveillance system. PVP is the proportion of true alarms among all alarms, so FDR is the complement of PVP (i.e., FDR = 1 − PVP).
3. As described above, ARL_0 is related to the concept of a run in industrial process control: a run is defined as the time from the start of the process to the point when an alarm is signaled and the process is stopped. However, the concept of a run is not meaningful in disease surveillance, as we do not have control over an outbreak and cannot terminate it, unlike an industrial process line.
With conventional multiple SPC charts, FWER-based control procedures, such as the Bonferroni correction, are used to address the multiple testing issue. These methods are known to be conservative. With FDR error control, there is a host of more powerful multiple testing procedures available, such as the BH, BY, and ST procedures.
2.6 Comparison of the Proposed EWMA-based Method with the Poisson CUSUM Method
Dassanayake and French [33] proposed a method for the Poisson CUSUM chart that is similar to the one here. That method requires that the disease data follow a Poisson model. In the design of a Poisson CUSUM chart, a reference value is specified that minimizes the time to detect a specific shift in the mean. This reference value is determined assuming a Poisson count model. Hawkins and Olwell [28] emphasize that "departures from the Poisson model affect the CUSUM." They recommend checking for conformity to the Poisson model using a Poisson dispersion test, since the CUSUM chart is sensitive to departures from the assumed Poisson model. Although the Poisson model is reasonable for count data, there is no guarantee that disease counts follow the desired Poisson model. Therefore, a better alternative is the EWMA chart, which does not require the counts to follow a specific distribution. With less stringent assumptions, the EWMA chart is more flexible for disease surveillance than the Poisson CUSUM chart.
It is important to point out that the EWMA chart is an application of simple exponential smoothing. Therefore, the input to the chart does not have to be a discrete count; it can be continuous as well. Thus the proposed method can also be used for identifying space-time clusters of continuous measures. For example, the method can be used to detect unusual buildups in pollution levels over multiple geographic regions. High pollution levels affect the health of the populations living in those areas. Therefore, the use of the EWMA chart extends the method to other health-monitoring applications as well.
3 Simulation Study
3.1 Experiment Setup
For our spatial domain we considered a five-by-five grid consisting of 25 regions. Disease counts were simulated in each of the twenty-five regions for a 100-day period. No outbreak was simulated for days 1-50, which is the in-control period. For this period, independent Poisson counts were generated in each region with a constant in-control mean of λ_0i = 4, i = 1, ..., 25. For days 51-100 an outbreak was simulated with an increase in the mean in each region. These out-of-control means, denoted by λ_1i, i = 1, ..., 25, peaked in the center region and tapered towards the edges, as illustrated in Figure 10. Regions 1, 5, 21, and 25 received a 0.2 standard deviation increase in the mean (to a mean of 4.4). Regions 2, 3, 4, 6, 10, 11, 15, 16, 20, 22, 23, and 24 were given a slightly higher 0.3 standard deviation increase in the mean (to a mean of 4.6). The eight regions adjacent to the center region, regions 7, 8, 9, 12, 14, 17, 18, and 19, were given a higher 0.75 standard deviation increase (to a mean of 5.5). The center region, region 13, was given the highest, a one standard deviation increase (to a mean of 6).
The FDR level at each test time was set to 0.05. An EWMA statistic was designed for each region with weighting constant λ = 0.2 and an in-control process mean of μ_0 = 4 (Montgomery [36] recommends selecting a λ value with 0.05 ≤ λ ≤ 0.25). To generate the empirical in-control (no-outbreak) distributions, B = 10,000 bootstrap samples were drawn with replacement from the no-outbreak period (days 1-50).
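The simulated data can be generated in a few lines. The sketch below is our own reconstruction of the setup (the random seed is arbitrary); it produces the 100 × 25 count matrix with the out-of-control means of Figure 10:

```python
import numpy as np

rng = np.random.default_rng(2016)  # arbitrary seed

# Out-of-control means for days 51-100 (Figure 10), peaking in the center:
mu1 = np.array([[4.4, 4.6, 4.6, 4.6, 4.4],
                [4.6, 5.5, 5.5, 5.5, 4.6],
                [4.6, 5.5, 6.0, 5.5, 4.6],
                [4.6, 5.5, 5.5, 5.5, 4.6],
                [4.4, 4.6, 4.6, 4.6, 4.4]]).ravel()

counts = np.vstack([
    rng.poisson(4.0, size=(50, 25)),   # days 1-50: in-control mean 4
    rng.poisson(mu1, size=(50, 25)),   # days 51-100: outbreak means
])
```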
Region:   1    2    3    4    5
Mean:    4.4  4.6  4.6  4.6  4.4
Region:   6    7    8    9   10
Mean:    4.6  5.5  5.5  5.5  4.6
Region:  11   12   13   14   15
Mean:    4.6  5.5  6.0  5.5  4.6
Region:  16   17   18   19   20
Mean:    4.6  5.5  5.5  5.5  4.6
Region:  21   22   23   24   25
Mean:    4.4  4.6  4.6  4.6  4.4
Figure 10. A heat map of the mean of each region during the outbreak period (days 51-100). The out-of-control mean for each region is indicated in black at the center of each region, while the region numbers are shown at the top-center in blue.
3.2 Simulation Results
Figure 11 and Figure 12 show the results of a single simulation over a 100-day period for two specific regions in the 25-region area. Figure 11 shows the results for region 1, which received the smallest shift in the mean, a 0.2 standard deviation increase. Figure 12 shows the results for region 13, which received the largest shift in the mean, a one standard deviation
increase. The two figures can be compared to see how the magnitude of the mean shift affects the speed of outbreak detection. The performance of three models was compared in terms of speed of detection and false alarms. The first model is the baseline model using the BH procedure with the independent counts of each region. The second and third models are the proposed models pooling counts of regional neighborhoods, using the BY and ST procedures, respectively.
Figure 11 shows the results of a single simulation for region 1. Recall that for all regions, during the no-outbreak period from days 1-50, the in-control mean was set to 4; region 1 received the smallest shift in the mean, a 0.2 standard deviation increase, so during the outbreak period from days 51-100 the mean shifted up from 4 to 4.4 in region 1. Figure 11(a) shows the simulated disease counts for region 1 (blue) and the pooled counts (red) for region 1 and its adjacent neighbors, regions 2, 6, and 7. Figure 11(b) shows the EWMA statistic for the independent model using only the counts of region 1 (blue) and for the pooled model using the counts of regions 1, 2, 6, and 7 (red). Note the gap between the independent and pooled EWMA statistics, which simply reflects the fact that the pooled statistic is necessarily larger since it adds the counts from neighboring regions. Also note that there is not much variation in the EWMA statistic for the independent model (blue); however, for the pooled model there is a noticeable increase in the EWMA statistic during the outbreak period from days 51-100 (red). Figure 11(c) shows the difference between the p-values or q-values and their corresponding thresholds. Specifically, the blue line shows the difference (α_BH − p-value), where α_BH is the threshold for the independent model for region 1 using the BH procedure; the red line shows the difference (α_BY − p-value), where α_BY is the threshold for the pooled model using the BY procedure; the yellow line shows the difference (0.05 − q-value), where 0.05 is the overall FDR for the pooled model using the ST procedure. The grey horizontal line indicates a zero difference. An alarm is signaled whenever a difference curve exceeds the grey line. Figure 11(d) displays whether an alarm is sounded at each time step (1 = alarm, 0 = no alarm).
Although the outbreak starts on day 51, the independent model (blue) detects it only on day 80, after a 29-day delay. The pooled model is much faster in signaling alarms: the pooled model using the BY procedure (red) detects the outbreak on day 53, just two days after it begins. This model produces more persistent but discontinuous alarms over the outbreak period. The pooled model using the more powerful ST procedure detects the outbreak as it happens on day 51 and continues to sound a steady alarm during the outbreak period from days 51-100. It is important to emphasize the sensitivity of the pooled model using the ST procedure: even a slight 0.2 standard deviation increase in the mean can be continuously detected with this model.
Figure 11. Summary plots for region 1 over time: (a) disease counts, (b) EWMA statistics, (c) differences, and (d) alarms. In plot (c), the blue line indicates the difference (α_BH − p-value), where α_BH is the threshold using the BH procedure for the independent model; the red line shows the difference (α_BY − p-value), where α_BY is the threshold for the pooled model using the BY procedure; the yellow line displays the difference (0.05 − q-value), where 0.05 is the FDR for all 25 regions for the pooled model using the ST procedure. The grey horizontal line indicates a difference of zero.
Figure 12 provides the same information as Figure 11, but for region 13, which received the largest shift in the mean (a one standard deviation increase). For region 13, the mean during the no-outbreak period is 4, as for all other regions. In Figure 12(a) the blue line shows the disease counts for region 13 and the red line shows the pooled counts for region 13 and its adjacent neighbors, regions 7, 8, 9, 12, 14, 17, 18, and 19. Figure 12(b) shows the EWMA statistics for both the independent (blue) and pooled (red) models. Again, notice that there is a clear increase in the EWMA statistic for the pooled model during the outbreak period from days 51-100, while the increase is less noticeable for the independent EWMA statistic.
Compared to the alarm plots for region 1 shown in Figure 11(d), the alarm plots for region 13 shown in Figure 12(d) display noteworthy differences. The independent model (blue) starts to signal alarms much earlier, starting at day 57. As the mean increase during the outbreak is larger in region 13 than in region 1, there are more alarms even with the independent model during the outbreak period, although they are irregular. The pooled model using the BY procedure (red) starts to signal on day 53 and sounds continuous alarms from days 55-100, unlike the intermittent alarms in region 1 during the outbreak period. The ST procedure again detects the outbreak at the very beginning, on day 51, and sounds steady alarms throughout the outbreak period from days 51-100.
Figure 12. Summary plots for region 13: (a) disease counts, (b) EWMA statistics, (c) differences, and (d) alarms. In plot (c), the blue line indicates the difference (αBH − p-value), where αBH is the threshold using the BH procedure for the independent model; the red line shows the difference (αBY − p-value), where αBY is the threshold for the pooled model using the BY procedure; the yellow line displays the difference (0.05 − q-value), where 0.05 is the FDR level for all 25 regions for the pooled model using the ST procedure. The grey horizontal line indicates a difference of zero.
To verify that the three procedures control the FDR at the specified α = 0.05 level, 100 independent simulations were carried out for each model using the experimental procedure outlined in Section 3.1. For each simulation, the proportion of false discoveries was calculated for each procedure. The resulting boxplots of these proportions are displayed in Figure 13. The ST procedure, the most sensitive of the three, shows the most variation in the proportion of false discoveries, though the proportions were generally below the specified α = 0.05 level. The BY procedure appears to be very conservative compared to the ST procedure. The BH procedure produces false discovery proportions similar to the ST procedure, but it is less powerful because it does not pool counts.
Figure 13. The box plots of the proportion of false discoveries in each simulation for all three procedures over 100 simulations. The far left plot is for the independent model using the BH procedure, the middle plot is for the pooled model using the BY procedure, and the far right plot is for the pooled model using the ST procedure.
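The per-simulation false discovery proportion summarized in Figure 13 can be computed as in the following sketch; the array layout (time × regions) and the convention of returning 0 when there are no discoveries are illustrative assumptions:

```python
import numpy as np

def false_discovery_proportion(alarms, true_outbreak):
    """Proportion of discoveries (alarms) that are false, over one simulation.

    alarms, true_outbreak: boolean arrays of shape (time, regions), where
    true_outbreak marks the region-weeks in which an outbreak is actually
    present. Returns 0 when there are no discoveries (a common convention).
    """
    alarms = np.asarray(alarms, dtype=bool)
    truth = np.asarray(true_outbreak, dtype=bool)
    n_discoveries = alarms.sum()
    if n_discoveries == 0:
        return 0.0
    # A false discovery is an alarm raised where no outbreak is present.
    return float((alarms & ~truth).sum() / n_discoveries)
```

Averaging this quantity over the 100 simulations estimates the realized FDR for each procedure.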
3.3 Simulation Comparing EWMA and CUSUM Charts
Next, a simulation was conducted to compare the proposed EWMA method to the CUSUM method proposed by Dassanayake and French [33]. The standard CUSUM chart has a known drawback: once it starts to signal alarms, it can take considerable time to stop signaling even after the outbreak is over. We wanted to compare the performance of the two methods in this respect. A single 100-week simulation similar to the one described in Section 3.2 was conducted. For the no-outbreak period, weeks 1–50, independent Poisson counts were generated for each region with a constant in-control mean of 4. However, for this simulation, unlike the one described in Section 3.2, an outbreak was simulated only for weeks 51–75, with out-of-control Poisson means as displayed in Figure 10. Since the outbreak ends on week 75, the remaining weeks 76–100 are again a no-outbreak period with an in-control Poisson mean of 4. The resulting alarm plots using the three models for region 13 are displayed in Figure 14.
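The simulation design above can be sketched as follows; the function name and week-indexing convention are illustrative, while the in-control mean of 4 and the shift expressed in in-control standard deviations (√μ0) are taken from the text:

```python
import numpy as np

rng = np.random.default_rng(2016)  # fixed seed for reproducibility

def simulate_region(mu0=4.0, shift_sd=1.0, outbreak=(51, 75), n=100):
    """Weekly Poisson counts for one region, with an outbreak window during
    which the mean is raised by shift_sd in-control standard deviations
    (for a Poisson mean mu0, the in-control sd is sqrt(mu0))."""
    means = np.full(n, mu0)
    lo, hi = outbreak                       # weeks are 1-indexed, inclusive
    means[lo - 1 : hi] += shift_sd * np.sqrt(mu0)
    return rng.poisson(means)
```

Repeating this per region (with region-specific shifts) reproduces the layout of the experiment.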
Figure 14(a) shows the EWMA method using the BH procedure (orange dotted line) signaling an alarm starting from week 59 and continuing to sound alarms until week 75, with a few interruptions in between (the simulated outbreak spans weeks 51–75). The CUSUM statistic using the BH procedure (purple dotted line) starts to sound alarms from week 61 and signals steady alarms until week 80: although the outbreak is over on week 75, it continues to sound alarms until week 80. The CUSUM statistic takes much longer to drop after the outbreak has ended, leading to more false alarms. Notice that in Figures 14(b) and (c), using the BY and ST procedures respectively, the EWMA method detects the outbreak faster than the CUSUM method and ceases to sound alarms almost immediately after the outbreak has concluded, while the CUSUM method continues to sound alarms. With all three procedures,
the EWMA statistic performs better at correctly signaling the end of the outbreak compared to the CUSUM statistic.
Figure 14. Alarm plots for region 13, for a simulated outbreak from weeks 51–75. Figure 14(a) shows the alarm plots for the independent model using the BH procedure for the EWMA (orange) and CUSUM (purple) statistics. Figure 14(b) shows the alarm plots for the pooled model using the BY procedure for the EWMA (orange) and CUSUM (purple) statistics. Figure 14(c) shows the alarm plots for the pooled model using the ST procedure for the EWMA (orange) and CUSUM (purple) statistics.
3.4 Comparing Performance of the Three Procedures
Robertson et al. [19] highlight that a surveillance system should provide "quick detection and few false alarms." Therefore, in comparing the performance of the three procedures, we used two standard measures of surveillance system performance: one to measure detection speed and the other to measure the false alarm rate. The Conditional Expected Delay (CED) is commonly used in surveillance to compare detection speeds; the Probability of a False Alarm (PFA) is a common measure for comparing false alarm rates. Frisen [25] discusses these two measures, along with some others, in the context of surveillance.
The Conditional Expected Delay (CED) is the average delay from a specific change point τ; a change point is the time at which the actual change in the mean occurs. Let tA be the time at which the procedure signals an alarm. Then the CED can be defined as
CED(τ) = E[tA − τ | tA > τ].
Frisen [25] defines the second performance measure, the probability of a false alarm, as
PFA(τ) = P(tA < τ).
According to this definition, PFA is the probability that an alarm occurs before the actual change in the mean at time τ, since persistent outbreaks are assumed. Because we are considering outbreaks of limited duration, we use a similar measure, the false positive rate (FPR): the proportion of false alarms before or after the outbreak, as false alarms can occur both before and after the outbreak.
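A minimal sketch of how the detection delay and FPR might be computed from a single simulated alarm series; the convention of measuring delay to the first alarm at or after the change point is an assumption for illustration:

```python
import numpy as np

def ced_and_fpr(alarms, outbreak_start, outbreak_end):
    """Detection delay and false positive rate from a binary alarm series.

    alarms: 0/1 alarm indicators, one per time step (times are 1-indexed).
    outbreak_start, outbreak_end: inclusive bounds of the true outbreak.
    Returns (delay, fpr); delay is None if no alarm occurs at or after the
    change point.
    """
    alarms = np.asarray(alarms)
    times = np.arange(1, len(alarms) + 1)

    # Delay: first alarm at or after the change point (outbreak start).
    hits = times[(alarms == 1) & (times >= outbreak_start)]
    delay = int(hits[0] - outbreak_start) if hits.size else None

    # FPR: proportion of alarms among no-outbreak time steps, which can
    # occur both before and after the outbreak window.
    null_mask = (times < outbreak_start) | (times > outbreak_end)
    fpr = float(alarms[null_mask].mean())
    return delay, fpr
```

Averaging these quantities over the 100 simulated data sets gives the estimated CED and FPR reported below.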
In calculating the CED and FPR, we considered outbreaks starting on week 51 with durations of 2, 3, 4, 5, 6, 7, 8, 9, and 10 weeks. For example, an outbreak of duration 2 weeks starting at week 51 starts on week 51 and ends on week 52. For each duration, we simulated 100
data sets and calculated the detection delay for each simulation for all three procedures; then we averaged the delays for each model to estimate the CEDs for the three procedures. The estimated CED curves are displayed in Figure 15 (see Appendix C for additional CED plots).
First, consider the EWMA pooled model using the BY (red solid line) and ST (orange solid line) procedures. Recall that a lower CED means better performance, indicating a shorter detection delay. Recall also that region 1 received a 0.2 standard deviation increase, region 2 a 0.3 standard deviation increase, region 7 a 0.75 standard deviation increase, and region 13 a 1.0 standard deviation increase. Regardless of the magnitude of the increase in the means, for all four regions the ST procedure performs consistently better (lower CED) than the BY procedure across all outbreak lengths. The CEDs for the pooled EWMA model using the ST procedure are close to zero for most outbreak lengths, indicating speedy detection of outbreaks of varying magnitudes. Compared to the model using the ST procedure, the model using the BY procedure shows substantial delays, particularly for smaller shifts in the mean: for regions 1 and 2 the CEDs are between 2 and 4 weeks; for regions 7 and 13 the CEDs are between 1 and 2 weeks. As pointed out earlier, speedy detection of an outbreak is critical in disease surveillance. Therefore, the ST procedure is preferred over the BY procedure.
Figure 15. CEDs for both EWMA and CUSUM methods for the pooled model using the BY and ST procedures in regions 1, 2, 7, and 13.
Second, the detection speeds of the pooled EWMA and CUSUM can be compared using Figure 15. The EWMA statistic using the BY procedure seems to perform somewhat better (lower CED) than the CUSUM statistic using the BY procedure for most outbreak lengths in regions 1, 2, and 7. However, for region 13, with the highest shift in the mean, the performance of the two statistics using the BY procedure seems comparable. On the other
hand, the EWMA and CUSUM statistics using the ST procedure perform comparably across all outbreak lengths, for all regions, regardless of the size of the shift in the mean. However, a careful inspection of the plots reveals that the pooled EWMA statistic using the ST procedure is slightly faster (a statistically significant difference) than the pooled CUSUM statistic using the ST procedure for medium to large increases in the mean (regions 2, 7, and 13), across all outbreak lengths considered. Overall, in terms of detection speed, the pooled EWMA statistic using the ST procedure has a slight edge over the pooled CUSUM statistic using the ST procedure for medium to large shifts in the mean.
Next, we considered the second performance measure to decide whether one statistic is better than the other. Figure 16 shows the FPR plots for regions 1, 2, 7, and 13 (see Appendix C for additional FPR plots). Using the same experimental setup as for the CED plots, outbreaks of lengths 2, 3, 4, 5, 6, 7, 8, 9, and 10 weeks starting at week 51 were considered. For each outbreak length, 100 data sets were simulated, and the false positive rates across simulations were averaged to estimate the FPR for each of the three procedures. The proportion of false alarms for each simulation was calculated by dividing the total number of false alarms by the total number of no-outbreak time periods (the total number of time periods when the null hypothesis is true). A lower FPR indicates better performance (other considerations held constant), as false alarms are undesirable.
Figure 16. FPRs for both EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in regions 1, 2, 7, and 13.
First, we compared the performance of the EWMA pooled model using the BY procedure (red solid line) with the EWMA pooled model using the ST procedure (orange solid line). In all four regions, with their different shifts in means, the more sensitive EWMA pooled model using the ST procedure has somewhat more false alarms, as expected, than the EWMA pooled model using the BY procedure. Also, note the slight upward trend in the curves as the duration of the outbreak increases: for longer outbreaks, both the CUSUM and the EWMA
statistics take longer to signal the end of the outbreak, resulting in more false alarms. Compared to the EWMA, the CUSUM takes substantially more time to signal the end of the outbreak, resulting in increased false alarms for longer outbreaks with larger shifts in the mean (as illustrated in the plots for regions 2, 7, and 13). Furthermore, the distance between the red and orange lines is smaller for region 13 than for region 1: with a more prominent shift in the mean, region 13 has fewer false alarms than region 1.
Second, we compared the FPR performance of the pooled EWMA and CUSUM statistics. The red dotted line shows the CUSUM statistic using the BY procedure and the red solid line shows the EWMA statistic using the BY procedure. Clearly, across all regions, the EWMA statistic using the BY procedure performs better (lower FPR) than the CUSUM statistic using the BY procedure. Furthermore, as the magnitude of the outbreak increases from region 1 to region 13, the advantage of the EWMA statistic over the CUSUM statistic under the BY procedure grows.
Next, we compared the pooled EWMA statistic using the ST procedure (orange solid line) with the pooled CUSUM statistic using the ST procedure (orange dotted line). For smaller shifts in the mean, in regions 1 and 2, the two models perform comparably. However, for larger shifts in the mean, in regions 7 and 13, the EWMA statistic using the ST procedure performs better, for most outbreak lengths, than the CUSUM statistic using the ST procedure; noticeably, for larger outbreak lengths, the performance is even better with the ST procedure than the BY procedure. Also, the advantage of the EWMA statistic over the CUSUM statistic under the ST procedure is greater in region 13 than in region 7, as the magnitude of the outbreak is larger with a more prominent shift in region 13.
In summary, although the EWMA and CUSUM models perform comparably in terms of detection speed (CED), the EWMA model outperforms the CUSUM model in terms of the false positive rate (FPR) (see Appendix E for additional plots of the observed FDR as a function of outbreak length for both EWMA and CUSUM models).
4. Salmonella Newport Case Study
Disease counts of weekly Salmonella Newport cases from each of the 16 German federal states were available from the Robert Koch Institute, Germany [27], from the first week of January 2004 (week 1) to the second week of February 2014 (week 528). The proposed method was applied to this data set for outbreak detection. Since Salmonella is a non-contagious disease without any trend or seasonality, independence in disease counts was assumed. A map of the 16 German federal states is displayed in Figure 17.
Since there were no unusually high disease counts during the first two years, 2004–2005 (weeks 1–104), data from this period were used to calculate the in-control mean for each state. It is also reasonable to assume spatial independence of counts during this in-control period, as verified by spatial analysis.
The proposed method was applied to data from 2006–2014. Since the population is unevenly distributed within the country, states with larger populations, such as Bavaria, had high expected disease counts, while states with smaller populations, such as Bremen, had low expected disease counts. To account for this, a population adjustment was performed following the method used by Raubertas [9], which is explained in Section 2.2. Regional neighborhoods were formed by pooling counts of each region and its neighbors sharing a common boundary using a binary weights matrix. The method outlined in Section 2.4 was applied to the Salmonella data to identify outbreaks of Salmonella cases, utilizing 10,000
bootstrap samples and an FDR level of 0.05 at each time step. Both the BY and ST procedures were used to account for the multiple testing problem. As a baseline, the EWMA method was also applied without pooling neighboring counts, using the BH procedure to control the FDR at 0.05.
Figure 17. Map of the 16 German federal states.
Figures 18 and 19 show the results for the states of Bavaria and Rhineland Palatinate.
The x-axis shows the week number of the time period of interest, where the first week of January 2004 is week 1. The actual outbreak occurred in the first week of November 2011 [29] (week 410), so the results are shown from the second week of September 2010 (week 350) to the fourth week of July 2013 (week 500). Of the three procedures, the ST procedure using the pooled
model promptly identified the outbreak as it started on week 410 in all 15 states (two of the 16 states, Saarland and Rhineland Palatinate, were combined). The pooled model using the BY procedure also identified the outbreak on week 410 in all states, as a substantial shift in the counts occurred at the onset of the outbreak. However, the baseline model using only the individual state counts produced delayed alarms: in one state the outbreak was detected two weeks later, and in five other states it was detected a week later.
Figure 18(c) shows the alarms in the state of Bavaria for the three models. The pooled model using the ST procedure (as well as the BY procedure) promptly identified the outbreak on week 410, when the actual outbreak started. However, with the baseline model using only the counts of Bavaria, the outbreak was detected a week later, on week 411. Also, note that there is a relatively smaller peak on week 460. The pooled model using the more sensitive ST procedure starts to signal alarms from week 456, as the counts start to climb, whereas the pooled model using the BY procedure picks up the alarm only on week 460, four weeks later. The independent count model, however, does not detect this peak at all.
Figure 19(c) shows the alarm plots for the state of Rhineland Palatinate. Again, the pooled models using the ST and BY procedures both signal the outbreak promptly on week 410. However, there is a week's delay with the baseline model. Also note that the more powerful ST procedure picks up a small cluster of increased counts around week 450, while the other two models do not signal any alarms.
Figure 18. Disease counts (a), EWMA statistics (b), and alarm plots (c) for the state of Bavaria. The blue lines in plots (a), (b), and (c) represent the independent count model. The red lines in plots (a) and (b) indicate the pooled count model. In plot (c), the red line indicates the alarms for the pooled model using the BY procedure and the orange line shows the alarms for the pooled model using the ST procedure.
Figure 19. Disease counts (a), EWMA statistics (b), and alarm plots (c) for the state of Rhineland Palatinate. The blue lines in plots (a), (b), and (c) represent the independent count model. The red lines in plots (a) and (b) indicate the pooled count model. In plot (c), the red line indicates the alarms for the pooled model using the BY procedure and the orange line shows the alarms for the pooled model using the ST procedure.
Compared to the EWMA statistic, the CUSUM statistic has a major disadvantage: it continues to signal alarms long after the outbreak is over. Figure 20(c) shows the results for Bavaria using the pooled models for both the CUSUM and EWMA statistics with the ST procedure. Both statistics sound alarms at the beginning of the outbreak on week 410.
However, the CUSUM statistic takes longer to signal the end of the outbreak: although the EWMA statistic signals the end of the outbreak on week 425, the CUSUM statistic signals the end of the outbreak on week 434, nine weeks later. Also, note the second peak after week 450. Both the CUSUM and the EWMA statistics start to signal alarms on week 456. However, the EWMA statistic signals the end of the outbreak on week 463, while the CUSUM statistic takes five additional weeks, signaling the end of the outbreak on week 468. As shown in Figure 20(b), the CUSUM statistic builds up more rapidly at the onset of the outbreak compared to the EWMA statistic and takes longer to stabilize. The CUSUM statistic increases rapidly and takes longer to return to zero because the only factor that decreases the statistic is a constant parameter. The EWMA, in contrast, increases much more slowly and stabilizes quickly because "the weights of the past observations fall off exponentially as in a geometric series" [32].
Figure 20. Disease counts (a), EWMA and CUSUM statistics for the pooled model using the ST procedure (b), and alarm plots (c) for the state of Bavaria. The blue and red lines in plot (a) represent the independent counts and the pooled counts, respectively. The orange lines in plots (b) and (c) represent the pooled EWMA model using the ST procedure, while the purple lines depict the pooled CUSUM model using the ST procedure.
5. Discussion
In the proposed procedure, aggregated disease counts in multiple geographic regions are used for surveillance of disease outbreaks using the popular EWMA statistic. Two new features of the method are practically beneficial in a disease surveillance setting. First, the use of p-values, instead of the conventional critical values, makes it possible to quantify the strength of evidence for an outbreak. Second, the use of FDR error control, in place of traditional FWER methods, makes it possible to utilize a variety of more powerful multiple testing procedures such as the BH, BY, and ST procedures. Of the three procedures, the ST procedure is the most powerful and detects outbreaks the fastest.
The use of the EWMA statistic over the Poisson CUSUM statistic has several benefits in disease surveillance. First, the Poisson CUSUM method requires the disease counts to follow a Poisson model and is sensitive to deviations from this assumption, as explained in Section 2.5 [28]. In contrast, the proposed method using the EWMA statistic does not require the counts to follow any specific distribution. Also, as illustrated by the results, compared to the Poisson CUSUM method, the proposed EWMA model has a reduced false alarm rate: the EWMA statistic is much quicker in signaling the end of the outbreak.
It is important to point out that the FDR is controlled at each time step in the proposed methodology. Also, the proposed method pools counts of neighboring units to signal an alarm. As pointed out by Raubertas [9], pooling may increase the speed of detection when the shift in the mean occurs in several units of the regional neighborhood. However, pooling may delay detection if the shift occurs only in one or two regions in the neighborhood. In addition, pooling may result in an increase in false alarms in the neighboring regions.
As mentioned before, the input to the method does not have to be discrete counts; continuous inputs are also valid. Therefore, one possible future research direction would be to study the application of the method to continuous inputs: for example, monitoring high pollution levels in multiple geographic regions. Furthermore, the current model is applicable for detecting foodborne outbreaks, a class of outbreaks without trend or seasonality. Another possible future research direction would be to generalize the method to detect outbreaks having seasonality, and to handle disease counts that are correlated across time and/or space.
CHAPTER IV
AN EWMA AND MORAN'S I COMBINED APPROACH FOR PROSPECTIVE DETECTION OF SPATIOTEMPORAL CLUSTERS IN MULTIPLE GEOGRAPHIC REGIONS
Emerging disease outbreaks must be detected as early as possible for public health authorities to take prompt preventive measures. Disease surveillance is the primary method used by public health practitioners to monitor the spread of an outbreak. Most public health surveillance systems record the time and location of reported cases [1]. Space-time surveillance methods utilize both the time and location information to identify emerging space-time disease clusters. Most of these methods consider an area of study consisting of small geographic units where disease counts in each unit are collected sequentially over time. In a retrospective analysis seeking to identify whether a disease outbreak occurred in the past, a single decision is made for the entire data set, as in classical hypothesis testing. In prospective surveillance, by contrast, decisions are made about whether there is currently a disease outbreak based on the data available so far. Prospective surveillance is also referred to as online surveillance in the literature. The aim of prospective surveillance is to correctly identify an outbreak as quickly as possible to initiate public health interventions.
Prospective surveillance methods are classified by Unkel et al. [2] as time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods. Of these, SPC methods that originated in industrial process control have been successfully adopted for practical disease surveillance purposes [4]. Specifically, the Exponentially Weighted Moving Average (EWMA) method is a popular SPC method that has been used in the Early Aberration Reporting System
(EARS) developed by the Centers for Disease Control and Prevention (CDC) and in ESSENCE (Electronic Surveillance System for the Early Notification of Community-based Epidemics) developed by the Department of Defense and Johns Hopkins University [4].
The EWMA is a statistic widely used to monitor industrial processes, in which more weight is given to recent observations and less weight to past observations. The starting value of the EWMA statistic is a target value to be monitored, often taken to be the process mean in an industrial process control setting; in a disease surveillance setting, this is the mean disease count when no outbreak is present. The center line for the EWMA statistic shows the mean of the in-control process, and control limits are predefined at a fixed number of standard deviations above or below the center line (the standard deviation of the EWMA statistic is also estimated from historical data). An alarm is signaled, indicating the process is out of control, when the EWMA statistic crosses a control limit. When the process is in control, we do not want to sound an alarm; when the process is out of control, an alarm should be signaled. Adopting SPC language for prospective disease surveillance, an out-of-control process corresponds to a disease outbreak and an in-control process to no outbreak. We wish to signal an alarm anytime there is a disease outbreak.
The original EWMA statistic was designed to identify purely temporal clusters. The EWMA method was originally proposed in an industrial process control setting for normally distributed data by Roberts [7] and was subsequently extended to Poisson count data by Borror [8]. Purely temporal SPC charts have also been extended for identifying space-time clusters: namely, the Poisson CUSUM chart was extended for spatiotemporal cluster detection by Raubertas [9]. The method captures local spatial information by combining counts within a regional neighborhood. A regional neighborhood's counts are combined by pooling counts of
each region along with neighboring regions sharing a common boundary. These pooled counts are used to calculate a pooled statistic for each regional neighborhood. An alarm is signaled whenever a regional neighborhood statistic exceeds a predefined threshold, indicating a rise in counts in that regional neighborhood. Subsequently, Rogerson and Yamada [10] extended the purely temporal Poisson CUSUM chart in two ways: first, similar to the method outlined by Raubertas [9], the Poisson CUSUM chart was extended to monitor counts in regional neighborhoods; second, the chart was extended to allow expected counts to vary over time. The multiple testing problem that arises when testing multiple neighborhood statistics simultaneously was handled by controlling the Family-Wise Error Rate (FWER) using the popular Bonferroni correction.
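Pooling a regional neighborhood's counts with a binary adjacency matrix, as described above, can be sketched as follows; the function name and the convention of including each region as its own neighbor are illustrative:

```python
import numpy as np

def pooled_counts(counts, adjacency):
    """Pool each region's count with its first-order neighbors.

    counts: length-m vector of disease counts at one time step.
    adjacency: m x m binary matrix, adjacency[i, j] = 1 if regions i and j
    share a common boundary. Each region is treated as its own neighbor.
    """
    counts = np.asarray(counts)
    w = np.asarray(adjacency) + np.eye(len(counts))  # include the region itself
    return w @ counts
```

Each pooled count can then be fed into the regional neighborhood statistic in place of the raw count.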
Although pooling counts is a simple way of capturing spatial information, there are more sophisticated methods to model spatial autocorrelation in a given neighborhood. Spatial autocorrelation is the correlation "... among the same type of measurements taken at different locations" [16], such as the correlation among disease counts in different regions. Local Indicators of Spatial Association (LISA) [37] provide a local measure of similarity between a region's counts and the counts of nearby regions. One of the most widely used LISAs is a local version of Moran's I [38]. The local Moran's I is positive when the regions in a locality have similar values, indicating spatial clustering of similar counts (positive spatial autocorrelation); it is negative when neighboring regions have dissimilar values, indicating negative spatial autocorrelation. Therefore, the local Moran's I provides an efficient way to assess spatial clustering of a disease in a neighborhood.
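A minimal sketch of the local Moran's I described above; several normalizations of the statistic exist in the literature, and the variance scaling used here is one common choice rather than necessarily the one adopted in this work:

```python
import numpy as np

def local_morans_i(x, w):
    """Local Moran's I for each region: I_i = z_i * sum_j w_ij * z_j / s^2,
    where z is the deviation of x from its overall mean.

    Positive I_i indicates a region similar to its neighbors (local
    clustering of similar values); negative I_i indicates dissimilarity.
    w is a binary (or row-standardized) spatial weights matrix.
    """
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s2 = (z ** 2).sum() / len(x)      # one common variance normalization
    return z * (np.asarray(w) @ z) / s2
```

For example, two adjacent high-count regions next to two adjacent low-count regions all receive positive local Moran's I values, since each resembles its neighbor.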
FWER procedures aim to control the probability of even one false discovery. FWER-controlling procedures can be too stringent when carrying out a large number of tests. In
contrast to FWER procedures, False Discovery Rate (FDR)-controlling procedures control the expected proportion of false discoveries among all discoveries. As a result, FDR-based procedures provide a more lenient alternative when testing a large number of hypotheses simultaneously. FDR procedures offer more power at the expense of an increased number of false alarms. Benjamini and Hochberg [11] proposed a simple and easy-to-implement FDR-controlling method for independent test statistics. Subsequently, the method was extended to dependent statistics by Benjamini and Yekutieli [12]. Procedures have also been proposed for implementing the EWMA chart with FDR error control in the context of industrial process control [13].
In conventional EWMA charts, an alarm is signaled whenever the EWMA statistic exceeds a predefined control limit. Using this method, we would only know whether the process is out of control or not at a given point in time. However, Li et al. [14] proposed a general procedure for control charts in which p-values can be used, instead of control limits, to signal alarms. This way, when a distributional shift occurs, the p-values indicate how stable the process is (the process is stable when it is in control). Li et al. [14] outlined a procedure to estimate the in-control distribution of a control chart statistic by extracting bootstrap samples from the in-control data. At each time point, a p-value is calculated by comparing the control chart statistic at that time with the approximated in-control distribution. An alarm is signaled when the p-value is less than a prespecified significance level.
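The bootstrap p-value idea of Li et al. [14] can be sketched as follows; the resampling scheme, path length, and smoothing constant here are illustrative assumptions rather than the exact design of their procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

def ewma_path(counts, lam, mu0):
    """Poisson EWMA path, reflected at the in-control mean mu0."""
    e, out = mu0, []
    for y in counts:
        e = max(mu0, lam * y + (1 - lam) * e)
        out.append(e)
    return np.array(out)

def bootstrap_null_ewma(in_control, lam, mu0, n_boot=10000, path_len=52):
    """Approximate the in-control distribution of the EWMA statistic by
    resampling historical no-outbreak counts with replacement and recording
    the end-of-path statistic."""
    stats = np.empty(n_boot)
    for b in range(n_boot):
        sample = rng.choice(in_control, size=path_len, replace=True)
        stats[b] = ewma_path(sample, lam, mu0)[-1]
    return stats

def ewma_pvalue(e_t, null_stats):
    """Upper-tail p-value: fraction of bootstrap statistics >= the observed
    statistic (with the usual +1 continuity adjustment)."""
    return (np.sum(null_stats >= e_t) + 1) / (len(null_stats) + 1)
```

An alarm would then be raised whenever `ewma_pvalue` falls below the chosen significance threshold.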
Li and Tsung [15] proposed a methodology to simultaneously monitor multiple attributes of a manufacturing process using multiple CUSUM charts along with FDR error control. At each time step, Li and Tsung [15] compute a CUSUM statistic for each control chart. Using these CUSUM statistics, p-values are calculated at each time step using random-walk-based
approximations. When the multiple CUSUM statistics are assumed to be independent, Li and Tsung [15] suggest using the Benjamini and Hochberg [11] procedure to signal alarms while controlling the FDR; when the CUSUM statistics are known to be dependent, they recommend the Benjamini and Yekutieli [12] method.
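The BH and BY step-up procedures referenced above can be sketched in a single function; the interface is hypothetical, but the thresholds follow the published definitions (BY divides the BH threshold by the harmonic sum c(m)):

```python
import numpy as np

def fdr_alarms(pvals, alpha=0.05, dependent=False):
    """Step-up FDR procedure over one time step's regional p-values:
    Benjamini-Hochberg for independent statistics, Benjamini-Yekutieli
    (extra harmonic-sum penalty) for dependent ones.
    Returns a boolean alarm vector, one entry per region."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    # BY divides alpha by c(m) = 1 + 1/2 + ... + 1/m; BH uses c(m) = 1.
    c_m = np.sum(1.0 / np.arange(1, m + 1)) if dependent else 1.0
    thresh = alpha * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresh
    alarms = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest i with p_(i) <= threshold_i
        alarms[order[: k + 1]] = True     # reject the k+1 smallest p-values
    return alarms
```

The same interface could dispatch to the ST procedure, which additionally estimates the proportion of true nulls to gain power.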
In the proposed method, a new statistic is obtained by extending the purely temporal EWMA statistic to a spatiotemporal setting, modifying a pooled EWMA statistic to allow for spatial correlation between counts using the local Moran's I. As suggested by Li and Tsung [15], p-values are used instead of traditional control limits. FDR-based multiple testing methods are used as opposed to FWER-based methods: FDR controls the rate of false discoveries (falsely rejected null hypotheses) among all discoveries (the total number of rejected hypotheses). Since we are more interested in detecting a true deviation from the null than in avoiding every incorrectly rejected true null hypothesis, controlling the FDR is more appropriate than controlling the FWER. The multiple testing problem that arises when testing a large number of regional statistics simultaneously is handled by employing the more powerful FDR-based testing procedure proposed by Storey and Tibshirani [17].
The paper is organized as follows: Section 2 describes the proposed methodology, with details of the design of the EWMA control chart, the modeling of spatial autocorrelation using the local Moran's I index, and FDR-based multiple testing using the Storey-Tibshirani procedure. Section 3 presents a simulation study along with a performance comparison of the method against a simpler model that does not account for spatial autocorrelation of counts. Section 4 discusses the strengths of the method and future research directions.
2. Methods
2.1 Standard EWMA
The original EWMA control chart was proposed by Roberts [7] for normally distributed data. Subsequently, Borror et al. [18] extended the method for Poisson data.
Borror et al. [18] considered a sequence of counts Y_1, Y_2, ..., Y_n, which are independent and identically distributed Poisson random variables with mean μ, collected at times t = 1, 2, ..., n. When the process is in control, μ = μ_0. The objective is to detect a shift in the mean to some other value. The Poisson EWMA chart is defined as
E_t = max[μ_0, λY_t + (1 - λ)E_{t-1}]
where Y_t is the observed disease count at time t, E_t the Poisson EWMA statistic at time t, and λ a weighting constant with 0 < λ < 1. The starting value of the EWMA statistic is usually the process target, so that E_0 = μ_0.
The parameter λ is the weight given to the most recent observation. Montgomery [36] recommends selecting a λ value with 0.05 < λ < 0.25. The EWMA statistic gives more weight to the most recent observation, and the weights on past observations fall off as in a geometric series [32].
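To make the geometric decay concrete, the short sketch below expands the implicit weights of the EWMA; the value λ = 0.2 is an illustrative choice within the recommended range, not one prescribed by the text:

```python
lam = 0.2  # illustrative weighting constant within the recommended range

# Weight on the observation k steps in the past: lam * (1 - lam)**k
weights = [lam * (1 - lam) ** k for k in range(30)]

print(weights[:3])   # lam, lam*(1-lam), lam*(1-lam)**2: geometric decay
print(sum(weights))  # the weights sum toward 1 as the window grows
```

The most recent observation receives weight λ, and each older observation receives a factor (1 - λ) less, so a larger λ makes the chart react faster but smooth less.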
With this definition of the EWMA statistic, the mean and the variance can be derived as given by Borror et al. [18]:

E(E_t) = μ_0 and Var(E_t) = (λ / (2 - λ)) [1 - (1 - λ)^(2t)] μ_0.

For large values of t, the asymptotic variance is given by

Var(E_t) = (λ / (2 - λ)) μ_0
because (1 - λ)^(2t) → 0 as t → ∞. Using the asymptotic standard deviation, the control limit h of the one-sided control chart is given by the formula

h = μ_0 + L √(λ μ_0 / (2 - λ)),

where the control limit h is L asymptotic standard deviations above the in-control mean μ_0. An alarm is signaled whenever E_t > h.
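As a concrete sketch, the one-sided Poisson EWMA chart just described can be coded in a few lines. The counts, the in-control mean mu0 = 4, and the choices λ = 0.2 and L = 3 below are illustrative assumptions, not values fixed by the text:

```python
import math

def poisson_ewma_alarms(counts, mu0, lam=0.2, L=3.0):
    """One-sided Poisson EWMA chart: E_t = max(mu0, lam*Y_t + (1-lam)*E_{t-1});
    an alarm is signaled whenever E_t exceeds h = mu0 + L*sqrt(lam*mu0/(2-lam))."""
    h = mu0 + L * math.sqrt(lam * mu0 / (2.0 - lam))  # asymptotic control limit
    E, alarms = mu0, []                               # start at the in-control target
    for y in counts:
        E = max(mu0, lam * y + (1.0 - lam) * E)       # reflected EWMA update
        alarms.append(E > h)
    return alarms

# A stable stretch followed by a sustained upward shift in the counts
counts = [4, 5, 3, 4, 4, 4, 5, 3, 12, 13, 11, 14]
print(poisson_ewma_alarms(counts, mu0=4.0))
```

With these values h = 6 exactly, and the chart first alarms on the second elevated count, once the smoothed statistic has accumulated enough evidence of the shift.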
2.2 Extending the Purely Temporal EWMA Chart to a Spatiotemporal Setting: Forming Regional Neighborhoods by Pooling Counts Within Neighborhoods
The original Poisson EWMA statistic proposed by Borror [8] is a purely temporal statistic. Raubertas [9] proposed a simple way to extend the purely temporal Poisson CUSUM chart into a spatiotemporal setting by pooling counts of each region and its neighboring regions to form regional neighborhoods. These pooled counts are used to calculate a pooled Poisson CUSUM statistic for each regional neighborhood. Although Raubertas [9] used the Poisson CUSUM statistic, a similar method can be used to extend the EWMA statistic to a spatiotemporal setting.
If there are m spatial regions, the geographic relationship among the m regions can be summarized by an m × m matrix containing a measure of closeness between each pair of regions. For example, if two regions i and j share a common boundary, their closeness can be defined as w_ij = 1; if they do not share a common boundary, the closeness can be defined as w_ij = 0. In this manner, counts of each region will be pooled with those of its immediately adjacent neighboring regions to form regional neighborhoods; the area under surveillance will consist of overlapping regional neighborhoods. As Raubertas [9] points out, if an increase in counts occurs in more than one region within a neighborhood, pooling will increase the sensitivity of detecting a shift in the mean counts. However, pooling will have a negative effect if the increase happens in only one region, because of dilution. Let Y_it be the disease count in region i at time t and w_ij be a binary measure of closeness between regions i and j. Then the pooled count of region i and its immediately adjacent neighbors at time t will be
Y'_it = Y_it + Σ_j w_ij Y_jt
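A minimal sketch of this pooling step, on a hypothetical 3 × 3 grid with rook (shared-boundary) contiguity; the grid size and the counts are illustrative assumptions:

```python
def rook_adjacency(nrow, ncol):
    """Binary rook-contiguity matrix w for a grid of nrow*ncol regions (w[i][i] = 0)."""
    m = nrow * ncol
    w = [[0] * m for _ in range(m)]
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    w[i][rr * ncol + cc] = 1
    return w

def pooled_counts(y, w):
    """Pooled count for each region: its own count plus those of its neighbors."""
    return [y[i] + sum(w[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]

w = rook_adjacency(3, 3)
y = [4, 4, 4, 4, 9, 4, 4, 4, 4]  # a spike in the center region
print(pooled_counts(y, w))
```

Note how the single spike in the center region inflates the pooled counts of every neighborhood that contains it, which is exactly the dilution/sensitivity trade-off discussed above.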
2.3 Extending the Purely Temporal EWMA Chart to a Spatiotemporal Setting: Modeling for Spatial Autocorrelation in Regional Neighborhoods
Rather than simply pooling counts within regional neighborhoods, a local version of Moran's I index can be used to model the spatial autocorrelation. Just as the global Moran's I index [38] provides a summary of the spatial similarity over an entire study area, the local version of Moran's I provides a summary of the spatial similarity in a local neighborhood.
Waller and Gotway [16] define an adjusted local version of Moran's I statistic, which they use to identify local disease clusters in a retrospective analysis. Let Y_i be the observed disease count of region i, where i = 1, ..., N, following a Poisson distribution, E_i be the expected count in region i, and w_ij be a weight describing the proximity between regions i and j. Then the adjusted local Moran's I statistic is
I_i = ((Y_i - E_i) / √E_i) Σ_{j=1}^{N} w_ij (Y_j - E_j) / √E_j
Thus, I_i is the product of the standardized count in region i and the weighted sum of the standardized counts of the neighbors of region i. If the observed counts of both region i and its immediately adjacent neighbors are similar, for example higher than what is expected, then I_i will be positive. On the other hand, if the observed counts of region i and its neighbors are dissimilar, then I_i will be negative. Therefore, the adjusted local Moran's I statistic can effectively summarize the spatial similarity of observed counts in a local neighborhood.
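The statistic can be sketched directly from the formula above; the three-region chain, the counts, and the expected counts below are illustrative assumptions:

```python
import math

def local_moran(y, e, w):
    """Adjusted local Moran's I:
    I_i = ((Y_i - E_i)/sqrt(E_i)) * sum_j w_ij * (Y_j - E_j)/sqrt(E_j)."""
    z = [(y[i] - e[i]) / math.sqrt(e[i]) for i in range(len(y))]  # standardized counts
    return [z[i] * sum(w[i][j] * z[j] for j in range(len(y))) for i in range(len(y))]

# Three regions in a row; regions 0 and 1 are both elevated, region 2 is at expectation
y, e = [8, 9, 4], [4.0, 4.0, 4.0]
w = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # chain adjacency, w_ii = 0
print(local_moran(y, e, w))  # positive for the similarly elevated pair, 0 for region 2
```

Here regions 0 and 1 each get a positive value because their elevated counts agree, while region 2 contributes 0 since its own standardized count is zero.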
2.4 Modifications to a More Powerful Procedure
The proposed method is better suited for disease surveillance than the traditional SPC methods for the following reasons: p-values are used instead of the traditional control limits; FDR-based multiple testing procedures are used as opposed to conventional FWER-based procedures; and the use of p-values and FDR-based multiple testing makes it possible to use more recently developed testing procedures such as the Storey-Tibshirani procedure. These three points are explained below. The procedure for estimating the p-values used to decide whether an outbreak is in progress is based on the procedure proposed by Li et al. [14] for CUSUM statistics. We assume that the in-control distribution of the data is unknown, so we use the bootstrap method to approximate the null distributions of the EWMA statistics. However, Li et al. [14] did not have to deal with spatial structure in their data. In a typical bootstrap sample, one samples n observations with replacement from the observed data Y_1, Y_2, ..., Y_n. This is repeated a large number (B) of times. In our setting, we must preserve the spatial structure of the data, since the responses are not necessarily independent and identically distributed across space. The statistics of interest are then calculated for each sample. In this case, the EWMA statistics at each time point, E_1, E_2, ..., E_t, were calculated. Then the EWMA statistic at each time point t was compared with the null distribution of the EWMA statistic to compute a p-value at that time point.
Since statistics belonging to multiple regions are tested simultaneously, a multiple testing problem arises. Li and Tsung [15] address the multiple testing problem using an FDR-controlling procedure that uses conventional critical values. We instead use p-values, allowing for the use of many FDR-based testing procedures.
When numerous hypotheses are tested simultaneously, the use of FDR-based testing is often more appropriate than FWER-based methods. FWER is the probability of making at least one false discovery among all hypotheses tested. When many hypotheses are tested, FWER control can be very stringent, resulting in procedures with little testing power. FDR is an alternative error criterion related to the expected proportion of false discoveries.
Formally,
FDR = E[V/R],
where V is the number of true null hypotheses that are declared significant and R is the total number of hypotheses that are declared significant out of the m hypotheses tested.
Using FDR as the error criterion allows us to employ more recently developed multiple testing procedures such as the Storey-Tibshirani [17] procedure. Storey and Tibshirani [17] proposed an FDR-based multiple testing procedure that can be applied to locally dependent statistics, such as the EWMA statistics generated from overlapping neighborhoods in the proposed model. Storey and Tibshirani [17] point out that the original multiple testing procedure proposed by Benjamini and Hochberg [11] for independent statistics is too conservative. They proposed a more powerful method using q-values instead of the conventional p-values: just as a p-value is the minimum false positive rate that is incurred when calling a test significant, a q-value is the minimum false discovery rate that is incurred when calling a test significant. The Storey and Tibshirani [17] method for multiple testing is detailed below. We will refer to this procedure as the ST procedure.
For a given list of p-values, the ST method calculates a list of q-values. A q-value, by definition, is the expected proportion of false positives incurred when calling a test significant. Storey and Tibshirani [17] derived an expression for the false positive rate when calling all tests significant whose p-value is less than or equal to a threshold t, where 0 < t < 1. When a large number of tests m is considered, the expected proportion of false positives can be approximated by dividing the expected number of false positives by the expected number of significant tests. The denominator, the expected number of significant tests, is easily calculated: it is the total number of observed p-values that are less than or equal to the threshold t.
The calculation of the expected number of false positives in the numerator is more complex. The expected number of false positives is the product of the probability of a true null hypothesis being rejected and the total number of true null hypotheses, m_0. In calculating the probability of a true null hypothesis being rejected, we use the fact that the p-values of true null hypotheses follow a uniform distribution. As a result, the probability of a p-value being less than or equal to the threshold is simply t. Therefore, the expected number of false positives is m_0 t. Since m_0 is unknown, the proportion of true null hypotheses π_0 = m_0/m has to be estimated.
A key step in the ST procedure is the estimation of π_0 (see Appendix A for the approximation of π_0). Using the fact that the p-values of true null hypotheses follow a uniform distribution while those of true alternative hypotheses have a distribution concentrated near zero, Storey and Tibshirani [17] estimate the proportion of null hypotheses, π_0, from the density of the uniformly distributed section of the histogram of p-values.
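The building block of that estimate can be sketched as follows. For a fixed tuning threshold (called λ by Storey and Tibshirani, but unrelated to the EWMA weight), the estimate is π̂_0 = #{p_i > threshold} / (m(1 - threshold)); the published procedure smooths this quantity over a grid of thresholds with a spline (see Appendix A), a refinement omitted here. The p-values are made up for illustration:

```python
def pi0_estimate(pvals, thresh=0.5):
    """Fixed-threshold version of the Storey estimate of the null proportion:
    pi0(thresh) = #{p_i > thresh} / (m * (1 - thresh)). The full procedure
    smooths this over a grid of thresholds; this is the basic building block."""
    m = len(pvals)
    return sum(p > thresh for p in pvals) / (m * (1 - thresh))

# Half the p-values cluster near zero (signals); the rest look uniform (nulls)
pvals = [0.001, 0.002, 0.003, 0.004, 0.35, 0.55, 0.75, 0.95]
print(pi0_estimate(pvals))  # 0.75 for this list
```

The intuition: p-values above the threshold should come almost entirely from true nulls, and a uniform density there scales up to an estimate of the overall null fraction.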
Combining all the steps, the q-value for a given p-value t is

q(t) = min_{t' ≥ t} [ π̂_0 m t' / Σ_{i=1}^{m} I(p_i ≤ t') ],

where I(·) is the indicator function. Using this expression, a q-value can be calculated for each of the m p-values by substituting p_i in for t, i = 1, ..., m. Finally, the q-values are ordered in ascending order, and all q-values less than a predefined FDR level α are called significant.
It is important to point out that a q-value is technically defined using a quantity called pFDR. The quantity FDR can be loosely defined by FDR = E(V/R), whereas pFDR is defined as pFDR = E(V/R | R > 0), since there is a nonzero probability that R = 0 [17]. Using pFDR, a q-value can be technically defined as the minimum pFDR incurred when calling a test significant, just as a p-value is defined as the minimum false positive rate when calling a test significant [24]. When a large number of regions, such as counties within a state or constituencies within a country, are simultaneously monitored, then pFDR ≈ FDR = E[V]/E[R].
Storey and Tibshirani [17] remark that their method can be applied to models with dependent statistics when the "weak dependence" criterion is satisfied. They state that "as a rule of thumb, the more local the dependence is, the more likely it is to meet the weak dependence criterion" (see Remark D in [17] for a mathematical definition of "weak dependence").
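The q-value computation itself reduces to a running minimum over the ordered p-values. The sketch below takes the estimate π̂_0 as given (with π̂_0 = 1 it reduces to the Benjamini-Hochberg adjustment); the p-values are made up for illustration:

```python
def st_qvalues(pvals, pi0=1.0):
    """q-values from p-values, given an estimate pi0 of the null proportion:
    q(p_(i)) = min over t >= p_(i) of pi0 * m * t / #{p_j <= t},
    computed as a running minimum from the largest p-value down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q, running = [0.0] * m, 1.0
    for rank in range(m - 1, -1, -1):       # from the largest p-value down
        i = order[rank]
        running = min(running, pi0 * m * pvals[i] / (rank + 1))
        q[i] = running
    return q

pvals = [0.001, 0.009, 0.04, 0.3, 0.5, 0.9]
q = st_qvalues(pvals)
print([p for p, qq in zip(pvals, q) if qq < 0.05])  # tests significant at FDR 0.05
```

The running minimum enforces monotonicity of the q-values in the p-values, so calling every test with q below α significant controls the FDR at level α.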
2.5 Details of the Proposed Methodology
The first step of the method is to define a tolerable overall FDR level α for all m regions. Then, at each time period t, disease counts Y_1t, Y_2t, ..., Y_mt are collected for the m regions. Next, the counts are standardized using the counts of the in-control period (i.e., the no-outbreak period) to estimate the in-control mean μ_0i and the standard deviation for each region i = 1, ..., m. Note that, as counts are assumed to be distributed as Poisson random variables, if the estimated in-control mean for region i is μ̂_0i, then the standard deviation for region i is √μ̂_0i. So the standardized count for region i and time t is

Y_it,std = (Y_it - μ̂_0i) / √μ̂_0i
Thereafter, the standardized counts Y_it,std are used to calculate an adjusted local Moran's I statistic I_it for each region, where

I_it = Y_it,std Σ_{j=1}^{m} w_ij Y_jt,std.

Here w_ij is a spatial proximity matrix. We considered w_ij to be a binary connectivity matrix whose elements are w_ij = 1 if regions i and j share a common boundary and w_ij = 0 if they do not. Note that we omit the spatial similarity of region i with itself and set w_ii = 0 for i = 1, ..., m. The calculated adjusted local Moran's I is thus the product of the standardized count of region i and the sum of the standardized counts among its neighbors. After calculating the local Moran's I index I_it, an adjusted count Y_it,adj for region i at time t is calculated using the centered counts (Y_it - μ̂_0i), i = 1, ..., m.
Again, the binary weights w_ij are defined as before. These adjusted counts Y_it,adj are used to calculate EWMA statistics M_1t, M_2t, ..., M_mt for each of the m regions at time t, with

M_it = max(μ_0i, λ Y_it,adj + (1 - λ) M_i,t-1).    (1)

The same statistics are calculated for each of the B bootstrap samples (see Appendix B for details of the bootstrap procedure), with M_it^(j) denoting the EWMA statistic calculated using the adjusted counts in (1) at location i for time t of bootstrap sample j. This way, there will be B EWMA statistics calculated for each region, from which an empirical in-control distribution for each region can be approximated. Then the EWMA statistic M_it is compared with the empirical in-control distribution for region i to estimate the p-values p_1t, p_2t, ..., p_mt.
Finally, the ST procedure is used to calculate the corresponding q-values q_1t, q_2t, ..., q_mt and to identify significant q-values, sounding alarms for the corresponding regions.
It is important to note that purely temporal SPC methods have previously been extended to spatiotemporal surveillance by pooling counts within regional neighborhoods [9]. However, pooling neighborhood counts can increase false alarms in neighboring regions, as a spike in counts in one neighbor directly affects the pooled counts of the whole regional neighborhood. By modeling the spatial autocorrelation in the regional neighborhood instead, one can lessen the effect of an increase of counts in a single neighbor. Therefore, it is advantageous to summarize the spatial dependence between regions using an index such as the local version of Moran's I rather than simply pooling counts.
3 Simulation Experiment
3.1 Experiment Setup
A study area made up of 25 regions located in a five-by-five grid was considered. Data were simulated for 100 sequential times or "days." No outbreak was present for days 1-50, and independent Poisson counts with an in-control mean of μ_0i = 4, i = 1, ..., 25, were simulated for each region. For days 51-55, an outbreak was simulated in selected regions as shown in Figure 21. Regions 9, 14, 15, 18, 19, 20, 24, and 25 experience an outbreak, and the out-of-control means μ_1i are shown in Figure 21. During the outbreak period, in regions 9, 14, 18, and 19, correlated Poisson counts with an out-of-control mean of 6 were simulated with a correlation coefficient of 0.8; in regions 15, 20, 24, and 25, correlated Poisson counts with an out-of-control mean of 5 were simulated with a slightly lower correlation coefficient of 0.7.
In simulating correlated Poisson counts, the algorithm presented by Barbiero and Ferrari [39] was used. First, 25 variables were drawn from a joint normal distribution with mean zero and the correlation matrix described above. Next, the univariate normal cumulative distribution function (CDF) was applied to each variable to derive probabilities. Finally, the inverse CDF of the Poisson distribution was applied with the mean structure displayed in Figure 21.
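A sketch of this copula-style construction for a single correlated pair follows. The Barbiero-Ferrari algorithm additionally corrects the normal correlation so that the resulting Poisson pair attains the target correlation, a refinement omitted here; the means, correlation, and sample size are illustrative assumptions:

```python
import math
import random

def poisson_ppf(u, mu):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(mu) (inverse CDF)."""
    k, p, cdf = 0, math.exp(-mu), math.exp(-mu)
    while cdf < u:
        k += 1
        p *= mu / k
        cdf += p
        if p == 0.0:  # guard against u in the extreme upper tail
            break
    return k

def correlated_poisson_pair(mu1, mu2, rho, n, seed=7):
    """Correlated normals -> uniforms via the normal CDF -> Poisson counts
    via the inverse CDF (uncorrected NORTA-style sketch)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        u1 = 0.5 * (1 + math.erf(z1 / math.sqrt(2)))  # standard normal CDF
        u2 = 0.5 * (1 + math.erf(z2 / math.sqrt(2)))
        out.append((poisson_ppf(u1, mu1), poisson_ppf(u2, mu2)))
    return out

pairs = correlated_poisson_pair(6.0, 6.0, 0.8, 2000)
mean1 = sum(a for a, _ in pairs) / len(pairs)
print(round(mean1, 1))  # sample mean should be near the target of 6
```

Each marginal remains Poisson because the normal CDF maps each normal draw to a uniform variate, while the shared normal component induces the positive dependence.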
In all other regions, no outbreak was simulated, so independent Poisson counts were simulated with the same in-control mean of 4. Again, for days 56-100, a no-outbreak period was constructed, and independent Poisson counts with a constant in-control mean of μ_0i = 4, i = 1, ..., 25, were simulated in all regions.
The proposed method using the combined EWMA statistic and local Moran's I index was compared to a baseline model using a pooled EWMA statistic (chapter 2): in the pooled model, counts of each region and its contiguous neighbors were pooled together to calculate regional neighborhood statistics. In calculating the standardized counts for each region for the proposed method, an estimate of the in-control mean was calculated by averaging counts over 100 no-outbreak (in-control) days, where the counts were simulated using independent Poisson counts of mean 4 for all 25 regions. This additional 100 days of data constitutes the in-control (no-outbreak) period. These simulated data were used to calibrate the model. A binary connectivity matrix was used to calculate the local Moran's I statistic for each region. The weighting constant used in the EWMA statistic was λ = 0.2. For the baseline model, counts of each region were pooled with those of its immediately adjacent neighbors to create regional neighborhoods.
[Figure 21 shows the 5 × 5 study grid; each cell lists the region number and its mean during days 51-55:
regions 1-5: 4, 4, 4, 4, 4
regions 6-10: 4, 4, 4, 6, 4
regions 11-15: 4, 4, 4, 6, 5
regions 16-20: 4, 4, 6, 6, 5
regions 21-25: 4, 4, 4, 5, 5]

Figure 21. The out-of-control means (μ_1i, i = 1, 2, ..., 25) for the simulated outbreak on days 51-55 are shown. Only regions 9, 14, 15, 18, 19, 20, 24, and 25 experience an outbreak during days 51-55, and the out-of-control means are indicated at the center of each region. Regions 9, 14, 18, and 19 are correlated with a correlation coefficient of 0.8; for regions 15, 20, 24, and 25 the correlation coefficient is 0.7. The other regions do not experience an outbreak, and counts are simulated using an in-control mean of 4. The region numbers are indicated at the top center of each region in blue.
3.2 Simulation Results
Figures 22, 23, and 24 show the results of a single simulation over a 100-day period for regions 4, 14, and 15, respectively. Figure 22 shows the results for region 4, where no outbreak was simulated. Note that region 4 shares a border with region 9, where an outbreak was simulated for days 51-55 with an out-of-control mean of 6. In Figure 22(a), the blue line shows the counts of region 4; the red line shows the counts of region 4 and its immediately adjacent neighbors: regions 3, 5, 8, 9, and 10. On the next plot, in Figure 22(b), the blue line shows the pooled EWMA statistic and the red line shows the combined Moran's I and EWMA statistic. Note that for the pooled EWMA statistic (blue line), the estimated pooled in-control mean is close to 24 (23.32 is the minimum value of the EWMA statistic), as the regional neighborhood centered on region 4 contains 6 regions, each having an in-control mean of 4: 6 × 4 = 24. However, for the proposed statistic this minimum value is 3.54, which is calculated from bootstrap samples extracted from in-control data. Therefore, the EWMA statistic for the proposed model (red) is well below the EWMA statistic for the pooled model (blue). Figure 22(c) shows adjusted q-values for the two models. The red line shows the difference (0.05 - q-value) for the proposed model using the ST procedure, where 0.05 is the overall FDR. The blue line shows the difference (0.05 - q-value) for the baseline model. The grey horizontal line marks a zero difference. An alarm is signaled whenever the difference exceeds the grey horizontal line. The alarm plots are shown in Figure 22(d). The proposed method (red) correctly signals no alarms during the entire 100-day period, as no outbreak was simulated in region 4. However, the more sensitive baseline model (blue) signals sporadic false alarms throughout the interval. It is important to highlight the false alarms on days 52-59. These alarms are a result of the outbreak that occurs in the neighboring region 9 during days 51-55. The spike in counts of the neighboring region 9 increases the pooled neighborhood counts, resulting in false alarms in region 4 although no outbreak occurs there.
NEW SPATIO-TEMPORAL METHODS FOR PROSPECTIVE SURVEILLANCE OF DISEASE OUTBREAKS

by

SESHA KALINDA DASSANAYAKE
B.A., Wabash College, 1998
M.S., University of Minnesota, 2001
M.S., University of Colorado Denver, 2012

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Applied Mathematics, 2016
© 2016 SESHA KALINDA DASSANAYAKE
ALL RIGHTS RESERVED
This thesis for the Doctor of Philosophy degree by Sesha Kalinda Dassanayake has been approved for the Applied Mathematics Program by Audrey Hendricks, Chair; Joshua French, Advisor; Stephanie Santorico; Loren Cobb; Elizabeth Juarez-Colunga. July 30, 2016
Dassanayake, Sesha Kalinda (Ph.D., Applied Mathematics)
New Spatio-temporal Methods for Prospective Surveillance of Disease Outbreaks
Thesis directed by Assistant Professor Joshua French

ABSTRACT

Three new procedures are presented for detecting disease outbreaks using disease counts in multiple geographic regions. Popular disease surveillance methods are extended and combined with recent methodological advances from other disciplines to design powerful, but straightforward, testing procedures. Two popular methods for identifying temporal clusters in statistical process control (SPC), the Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA) procedures, are extended for detecting space-time clusters using accumulated disease counts from neighboring regions. In contrast to controlling the conventional average run length error criterion employed in SPC, the false discovery rate is used, which is more suitable in a disease surveillance setting. Instead of using the traditional control limits of SPC methods, p-values are used for making statistical decisions. The use of the false discovery rate and p-values enables the utilization of popular multiple comparison methods, improving the power of our testing procedures. The gain in power is verified using numerous simulation studies. The rapid detection ability of the methods is illustrated with the 2011 Salmonella Newport case study from Germany.

The form and content of this abstract are approved. I recommend its publication.
Approved: Joshua French
To my parents, Daisy and Chandrasena Dassanayake
TABLE OF CONTENTS

CHAPTER
I. INTRODUCTION .... 1
II. AN IMPROVED CUSUM-BASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE REGIONS .... 6
III. AN IMPROVED EWMA-BASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE REGIONS .... 39
IV. DETECTION OF SPATIO-TEMPORAL CLUSTERS IN MULTIPLE GEOGRAPHIC REGIONS .... 78
V. .... 106

REFERENCES

APPENDIX
A. Details of the Storey-Tibshirani Procedure and the Estimation of the True Null Hypothesis .... 113
B. The Bootstrap Procedure Used in Chapters II, III, and IV
C. Additional CED and FPR Plots for Pooled EWMA and CUSUM Models .... 116
D. CED and FPR Plots for Pooled EWMA and Spatially Adjusted EWMA Models .... 128
E. CED and FPR Plots for Pooled EWMA and Spatially Adjusted EWMA Models
F. Observed FDR as a Function of Length of Outbreak for the Pooled EWMA and SEWMA .... 137
G. Implementing the Storey-Tibshirani Procedure Usi .... 138
CHAPTER I

INTRODUCTION

Emerging disease outbreaks must be detected as early as possible for public health authorities to take necessary preventive measures. Most surveillance systems operating in the United States and Europe record both the time and the location of newly found cases [1]. The data collected by these surveillance systems can be analyzed using either a retrospective analysis or a prospective analysis. In a retrospective study, the entire data set is available, and the analysis is carried out over the entire data set once. A common example is classical hypothesis testing. However, in a prospective analysis, disease counts are collected sequentially over time, and repeated analysis is performed on the data collected so far to decide whether an increase in disease incidence has occurred. This is a more complicated problem than retrospective analysis, and techniques such as traditional hypothesis testing cannot be applied. Furthermore, in prospective surveillance, quick detection is critical to allow adequate time for public health interventions. Unkel et al. [2] broadly classify prospective disease surveillance methods as time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods. During the past decade, SPC methods that originated in industrial process control have been successfully utilized for disease surveillance purposes. The Cumulative Sum method (CUSUM) is a popular SPC method that is utilized in BioSense, developed by the Centers for Disease Control and Prevention (CDC), and in ESSENCE (Electronic Surveillance System for the Early Notification of Community-based Epidemics), developed by the Department of Defense and the Johns Hopkins University [3]. Another popular SPC method, the Exponentially Weighted Moving Average (EWMA) method, is used in the Early Aberration Reporting System (EARS), developed by the Centers for Disease Control and Prevention (CDC), and in ESSENCE [4]. The CUSUM method cumulates the CUSUM statistic at the previous time step and the difference between an observed value and a reference value. The new CUSUM statistic at a given time interval is taken to be the maximum of zero and the cumulated value. An alarm is signaled when the CUSUM statistic exceeds a predefined control limit. When the process is in control (i.e., when there is no outbreak), we do not want to sound an alarm. However, when the process is out of control (i.e., when there is an outbreak), we want to signal an alarm. The original CUSUM method was proposed in an industrial process control setting by Page [5] for normally distributed data; Lucas [6] extended the CUSUM method to count data and introduced the Poisson CUSUM method. The EWMA chart is also a popular SPC method that is easy to implement and interpret. The EWMA is a moving average where weights are given to observations in geometrically decreasing order, with more recent observations getting more weight and distant observations getting less. The starting value is often set to the target value to be monitored; in a disease surveillance setting, the target value is the expected disease count when there are no outbreaks. Similar to the CUSUM chart, an alarm is signaled when the EWMA chart exceeds a predefined threshold. The original EWMA chart was proposed for normally distributed data by Roberts [7] and was extended for Poisson counts by Borror [8]. Both the original CUSUM and EWMA charts are purely temporal. Raubertas [9] proposed a method to extend the purely temporal CUSUM chart to a spatio-temporal setting. Raubertas [9] suggested pooling counts of spatially contiguous regions to form regional neighborhoods. This way, spatial information about what is happening in the nearby regions is
captured in the resulting pooled CUSUM statistic. As before, an alarm is signaled whenever the pooled CUSUM statistic exceeds a predefined control limit. Subsequently, Rogerson and Yamada [10] extended the purely temporal CUSUM chart in two ways: first, they allowed the expected counts to vary over time; second, they suggested forming regional neighborhoods along the lines proposed by Raubertas [9]. The multiple testing problem that arises when testing multiple regional statistics was handled by controlling the family-wise error rate (FWER) using a Bonferroni correction. In conventional multiple CUSUM charts, where a separate CUSUM chart is used to monitor each attribute, similar Bonferroni methods are used to control the family-wise error rate. FWER-based procedures aim to control the probability of even one false positive. In conventional SPC charts, the false alarm rate is defined using the in-control average run length, which is the expected time until a false alarm. However, when testing a large number of hypotheses simultaneously, controlling the FWER makes it very difficult to reject the null hypothesis, meaning the power of the tests is very low. A better alternative is to control the false discovery rate (FDR), which is the expected proportion of false discoveries. FDR procedures have more statistical power at the expense of increased false alarms. Benjamini and Hochberg [11] proposed a method to control FDR when the test statistics are independent. Subsequently, Benjamini and Yekutieli [12] extended the method for dependent statistics [13]. FDR-controlling methods have lately been adopted for handling the multiple testing problem in multiple process control charts [13]. In conventional SPC charts, control limits are used to signal alarms when the charting statistic exceeds a predefined control limit or threshold. Li et al. [14] propose a general framework for SPC charts where p-values are used instead of critical values to signal alarms.
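The Poisson CUSUM recursion described above can be sketched as follows. The reference value k = (μ1 - μ0)/(ln μ1 - ln μ0) is the standard likelihood-ratio choice for detecting a shift from μ0 to μ1 in Lucas's Poisson CUSUM; the counts, means, and control limit h below are illustrative assumptions:

```python
import math

def poisson_cusum_alarms(counts, mu0, mu1, h):
    """One-sided Poisson CUSUM: S_t = max(0, S_{t-1} + Y_t - k), alarm when S_t > h,
    with reference value k = (mu1 - mu0) / (log(mu1) - log(mu0))."""
    k = (mu1 - mu0) / (math.log(mu1) - math.log(mu0))
    s, alarms = 0.0, []
    for y in counts:
        s = max(0.0, s + y - k)  # accumulate evidence above the reference value
        alarms.append(s > h)
    return alarms

# In-control counts near 4, then a shift toward larger means
counts = [4, 3, 5, 4, 4, 9, 8, 10, 9]
print(poisson_cusum_alarms(counts, mu0=4.0, mu1=6.0, h=5.0))
```

Because the statistic is reset at zero, stretches of in-control counts contribute nothing, while sustained elevated counts accumulate until the limit h is crossed.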
4 When the in control distribution (i.e. , the distribution of the charting statistic when there is no outbreak) is known, Li et al . [14] suggest ed using Monte Carlo simulation s to simulat e the in control distributi on in estimating the p values. In the most general case, when the in control distribution is un known, Li et al. [14] recommend using bootstrap methods to extract bootstrap samples from the in control (no outbreak) period to estimate the p values. Li and Tsung [15] proposed a method for controlling FDR for multiple CUSUM charts in an industrial process control setting. At each time step , a CUSUM statistic is computed for each attribute. For each CUSUM statistic , a p value is calculated using a random wa lk based a pproximation. When the statistic s are independent Li and Tsung [15] recommend using the Benjam ini Hochberg [11 ] procedure; wh e n the control statistics are dependent , Li and Ts ung suggest using the Benjamini Yekutieli [12] procedure. The three proposed methods described in the rest of this dissertation combine several of these approaches in novel way s, incorporating new methods to capture spatial information and adopting innovative approaches to add more statistical power. In two of the methods , spatial information in the nearby regions is captured by pooling counts to form regional neighborhoods as suggested by Raubertas [9]. The third method, captures spatial information by using a popular spatial index to summarize the local spatial associ at ion in a given locality [16]. In addition, p values are used to signal alarms, instead of using the conventional control lim its, as su g g ested by Li et al. [14]. Furtherm ore, FDR error control is used to address the multiple testing problem that arises wh en comparing multiple control chart statistics, i nstead of using the traditional FW E R based procedures. 
Moreover, the power of the procedures is further increased using the more recent Storey-Tibshirani method [17] for multiple testing, in addition to the commonly used Benjamini-Yekutieli [12] procedure.
The three proposed methods make use of a common underlying algorithm. The first step of the algorithm is to set a tolerable overall false discovery rate. Next, at each time step, a control chart statistic is calculated for each regional neighborhood, either by pooling neighborhood counts or by using a spatial index to model the spatial similarity in the neighborhood. Using past observations, an in-control period with no outbreaks is identified. Bootstrap samples are drawn from this in-control period, and control chart statistics are computed to estimate the in-control distribution for each region. At each time step, p-values are calculated by comparing the control chart statistic at that time point with the sampling distribution of the same statistic for bootstrap samples from the no-outbreak time period. The multiple testing problem that arises when performing multiple statistical tests simultaneously is handled using FDR-based procedures. This document is structured as follows: Chapter 2 discusses the first method, where the CUSUM statistic is extended to a spatio-temporal setting using the proposed algorithm and spatial information is captured by pooling neighborhood counts. Chapter 3 describes the second method, where the EWMA statistic is extended to a spatio-temporal setting and spatial information is again summarized by pooling neighborhood counts. Chapter 4 introduces the third method, where the EWMA statistic is extended to a spatio-temporal setting using the new algorithm and the spread of the disease in a locality is summarized using a local index of spatial association. Chapter 5 compares the strengths of the three methods and provides directions for future research.
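As a rough sketch, the common algorithm above might look as follows in code. The function and its arguments are illustrative assumptions (a fixed neighborhood weight matrix and a known in-control stretch of data), not the dissertation's actual implementation.

```python
import numpy as np

def pooled_cusum_pvalues(counts, W, k, in_control, n_boot=500, seed=0):
    """Sketch of the common algorithm: pool counts over neighborhoods,
    run a Poisson CUSUM per neighborhood, and estimate a p-value for the
    final statistic from bootstrap resamples of an in-control period.
    counts: (T, m) observed counts; W: (m, m) neighborhood weights with
    W[l, j] the closeness of region j to region l; k: reference value(s);
    in_control: (T0, m) counts from a no-outbreak period."""
    rng = np.random.default_rng(seed)

    def final_stat(x):
        pooled = x @ W.T                      # pooled counts, shape (T, m)
        s = np.zeros(pooled.shape[1])
        for row in pooled:                    # CUSUM recursion over time
            s = np.maximum(0.0, s + row - k)
        return s

    observed = final_stat(counts)
    # Resample whole time slices so spatial dependence is preserved.
    boots = np.empty((n_boot, counts.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, in_control.shape[0], size=counts.shape[0])
        boots[b] = final_stat(in_control[idx])
    # p-value: share of bootstrap statistics at least as large as observed.
    return (1 + (boots >= observed).sum(axis=0)) / (n_boot + 1)
```

The resulting p-values would then be fed to an FDR controlling procedure (BH, BY, or ST) to decide which neighborhoods should alarm.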
*This chapter has been accepted for publication in Statistics in Medicine and is included with the permission of the copyright holder.

CHAPTER II
AN IMPROVED CUSUM BASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE REGIONS

1 Introduction

Emerging disease clusters must be detected in a timely manner so that necessary remedial action can be taken to prevent the spread of an outbreak. Consequently, prospective disease surveillance has gained prominence as a rapidly growing area of research during the past decade [2]. In prospective surveillance, a decision is made at regular time intervals about the incidence of an outbreak, based on the data available up to that time. This is different from the traditional retrospective analysis, where the entire data set is available for analysis. Unkel et al. [2] broadly classify statistical methods proposed for prospective surveillance into time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods. Of these, SPC methods have a long history of application in public health surveillance [18]. As the name suggests, many SPC methods originated in industrial process control but have lately been adapted for use in public health surveillance [1]. A key example is the Cumulative Sum (CUSUM) method, which is utilized in BioSense, developed by the Centers for Disease Control and Prevention, and ESSENCE, developed by the Department of Defense and Johns Hopkins University [3]. The CUSUM method accumulates the previous CUSUM statistic and the difference between an observed value and a reference value. The new statistic is the maximum of 0 and this accumulated sum. When the statistic grows large (evidence of an outbreak), we want to sound an alarm. The method was originally developed for normally distributed data by Page [5] and was later extended to Poisson data by Lucas [6]. Raubertas [9] extended the purely temporal Poisson CUSUM method to a spatio-temporal setting. Disease counts from neighboring regions are pooled together to form regional neighborhoods. In pooling counts, a weighted average is calculated using the distance between regions as a weight. A CUSUM statistic is formed for each regional neighborhood, and an alarm is signaled when the CUSUM statistic exceeds the appropriate threshold. Later, Rogerson and Yamada [10] extended Raubertas' [9] method in two ways: first, they extended the method so that the expected counts can vary over time; second, they adopted a method that permits monitoring of regional neighborhood counts for subregions and their surrounding neighbors. The multiple testing problem that arises when testing multiple regional statistics simultaneously was handled by controlling the familywise error rate (FWER), using the popular Bonferroni correction. This method is somewhat similar to the method used in traditional multiple Poisson CUSUM charts in industrial process control, where the multiple testing problem is handled by using the Bonferroni correction [15]. The objective of FWER procedures is to control the probability of even one false discovery, which exerts a stringent control when a large number of tests are performed simultaneously. Benjamini and Hochberg [11] popularized a less stringent error criterion for false discoveries, namely the expected proportion of false discoveries, or the False Discovery Rate (FDR). Compared to FWER controlling methods, FDR controlling methods have more power at the expense of additional false alarms. Benjamini and Hochberg [11] proposed a procedure for controlling the FDR in the context of independent test statistics. The procedure was extended to
dependent statistics by Benjamini and Yekutieli [12]. FDR controlling procedures have recently been adopted to handle multiple testing problems in SPC-based monitoring of out-of-control processes [13]. Most CUSUM-based methods signal an alarm once the CUSUM statistic exceeds an appropriate threshold or critical value. In contrast, Li et al. [14] proposed a procedure using p-values, instead of critical values, to determine whether an alarm should be signaled. When the in-control distribution is known, Li et al. [14] suggest using Monte Carlo simulations to simulate the in-control distribution and estimate p-values. When the in-control distribution is unknown, bootstrap methods can be utilized to determine the empirical in-control distribution and estimate the p-values. In the context of industrial process control, Li et al. [15] proposed a new CUSUM-based method that controls the FDR when multiple processes are being monitored. One calculates a CUSUM statistic at each time step for each of the processes. The corresponding p-value for each process at each time step is calculated using a random-walk-based approximation. Li et al. [15] provide the option of using the procedure of Benjamini and Hochberg [11] if the multiple statistics are assumed to be independent and the Benjamini-Yekutieli extension [12] if the statistics are assumed to be dependent. The proposed method is a combination of several of these approaches: as proposed by Raubertas [9], the neighborhood disease counts are pooled in calculating the Poisson CUSUM statistic for each regional neighborhood; following the guidelines of Li et al. [14], these CUSUM statistics and bootstrap methods are used to compute p-values from which alarms are signaled, instead of using critical values. Adapting the Li et al. [15] method, FDR procedures are used to handle the multiple testing problem as opposed to FWER-based techniques. We further improve the power of the procedure by utilizing an
FDR controlling technique proposed by Storey and Tibshirani [17], in addition to the method proposed by Benjamini and Yekutieli [12]. One begins the method by first setting an overall FDR level that the procedure is supposed to control. Then, at each time step, the disease counts of immediate neighbors are pooled together to compute a CUSUM statistic for each neighborhood. Next, Monte Carlo simulations or bootstrap methods can be used to approximate the in-control (no-outbreak) distribution of the neighborhood CUSUM statistics, from which the corresponding p-values can be calculated. As multiple dependent statistics are tested simultaneously, two popular FDR-based methods are utilized to handle the multiple testing problem. Section 2 provides additional details of the proposed method, along with specifics of the Poisson CUSUM method and details of the popular FDR-based multiple testing procedures used in the algorithm. Section 3 provides results of a simulation study using the proposed methodology, and Section 4 illustrates the results from applying the method to detect a Salmonella Newport outbreak in Germany; these results are compared with the results from traditional multiple CUSUM charts. Finally, Section 5 summarizes the strengths of the proposed method and provides some potential future directions for research.

2 Methods

2.1 Standard CUSUM

The CUSUM method was proposed by Page [5] to detect small persistent changes in the mean of a process. The initial formulation of the CUSUM method assumed the responses were uncorrelated and normally distributed. Later, Lucas [6] extended the CUSUM method to uncorrelated count data generated from a Poisson distribution. We describe the CUSUM method in this setting. Let Y_1, Y_2, ... be the response values at times t = 1, 2, .... The goal is to
detect a shift of the process mean from its in-control value λ0 to a larger value λ1. The monitoring statistic used for the Poisson CUSUM method is

S_t = max(0, S_{t-1} + Y_t - k),   (1)

where Y_t is the count observed at time t, S_t is the Poisson CUSUM statistic at time t, k is a value chosen to minimize the detection time after a mean shift, and S_0 is defined to be 0. The CUSUM statistic is the larger of 0 and the sum of the CUSUM statistic for the previous time step with the difference between the observed value and the reference value k. The value of k in equation (1) is chosen to minimize the time to detect a mean change from λ0 to λ1 (where λ1 > λ0) and is determined by the formula

k = (λ1 - λ0) / (ln λ1 - ln λ0).   (2)

An alarm is signaled when S_t >= h, where h is a threshold or critical value chosen to control the error rate. The threshold h is a function of the value of the parameter k and the desired in-control average run length, ARL0, and is calculated using either Monte Carlo simulations or statistical tables. The ARL0 is defined as the desired average time between two false alarms when the process is in control. Robertson [19] points out that, in practice, approximations are used to estimate the value of h for a chosen ARL0 [20]. Moustakides [21] proved that the CUSUM is the optimal procedure for detecting the mean shift among all procedures with the same in-control ARL: the optimal procedure has the smallest expected time until it signals a change once the process shifts to the out-of-control distribution. The popularity of the Poisson CUSUM method over other SPC methods is perhaps due to this theoretical optimality property.
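The recursion and reference value described above can be sketched in a few lines of code; this is an illustrative implementation in which the threshold h is assumed to be supplied externally (e.g., from tables or Monte Carlo).

```python
import math

def poisson_cusum(counts, lam0, lam1, h):
    """Poisson CUSUM sketch: the reference value k minimizes detection
    delay for a mean shift lam0 -> lam1, and an alarm is raised when
    S_t >= h. Returns the statistic path and the index of the first
    alarm (None if no alarm is raised)."""
    k = (lam1 - lam0) / (math.log(lam1) - math.log(lam0))  # equation (2)
    s, path, alarm = 0.0, [], None
    for t, y in enumerate(counts):
        s = max(0.0, s + y - k)                            # equation (1)
        path.append(s)
        if alarm is None and s >= h:
            alarm = t
    return path, alarm
```

For the design used later in the chapter (lam0 = 4, lam1 = 6), k works out to 2 / ln(1.5), approximately 4.93.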
2.2 Extending Poisson CUSUM to a Spatial Setting

The original CUSUM method and its immediate extensions are univariate and purely temporal methods. Several spatial extensions of the CUSUM method have been proposed for spatio-temporal surveillance by Raubertas [9], Rogerson [22], and Rogerson and Yamada [10]. Raubertas [9] pooled the data in neighborhoods using distance-based weights to form regional neighborhoods. Suppose we have m spatial regions where disease counts are observed, and let Y_l(t) be the disease count in region l at time t. Since larger populations are generally expected to have greater counts of disease incidence, an adjustment must be made before computing the value of k, previously described in (2). Let λ_{l,0} be the expected number of cases in region l, n_l be the at-risk population in region l, and r_0 be a baseline disease rate common to all regions. Then we have λ_{l,0} = n_l r_0. If we desire to detect a disease outbreak when the disease rate shifts to r_1, then λ_{l,1} = n_l r_1. Using λ_{l,0} and λ_{l,1}, the k_l required for calculating the CUSUM statistic can then be computed using equation (2). Although λ_{l,0}, λ_{l,1}, and the corresponding k_l can vary with changing population sizes n_l, we considered the case of a constant fixed population in our simulation experiments, meaning k_l is fixed for the observed time period. The pooled count of unit l at time t is

C_l(t) = Σ_j w_{lj} Y_j(t),

where w_{lj} is a measure of closeness between regions l and j. Similar pooling defines the neighborhood means and the corresponding reference values. For each regional neighborhood, a pooled CUSUM statistic is calculated, and an alarm is triggered if the statistic exceeds a threshold. Extensions of this method allow time-varying parameters for λ_{l,0} and λ_{l,1} [9]. Alternative approaches are given by Purdy et al. [23]. Our proposed
methodology maintains the general framework of Raubertas [9], with important changes described below. We also note that another possible approach for extending the Poisson CUSUM method to a spatial setting is to calculate a univariate CUSUM statistic for each region and then sound the alarm when the statistic exceeds a threshold adjusted to account for the multiple testing problem (e.g., adjusting using the Bonferroni correction). This is a method that is used in traditional multivariate process control [15]. Rogerson and Yamada [10] outline a similar method that can be applied for disease surveillance over multiple geographic regions. According to this method, the threshold h for each regional CUSUM chart is calculated as follows. First, the familywise Type I error rate, α, is decided upon (for example, a common choice is α = 0.05). Rogerson and Yamada [10] model run lengths as having an exponential distribution and desire the following relationship to hold: P(run length < m) = α, where m is the number of spatial regions. Using the exponential distribution, P(run length < m) = 1 - exp(-λm), where λ is the rate parameter of the exponential distribution and is chosen so that 1 - exp(-λm) = 0.05. This implies an average run length of 1/λ, since the mean of the exponential distribution is the reciprocal of the rate parameter. Using this and the region-specific k, the threshold h for each region can be calculated. This method is applied to the case study presented in Section 4, where the results from this FWER-based method are compared to the results of the proposed FDR-based method.

2.3 Modifications for a More Powerful Procedure

Instead of using the traditional critical-value-based testing method in the CUSUM procedure, we utilize p-values as described by Li et al. [14]. In the first setting considered by Li et al. [14], when the in-control process distribution is known, p-values
may be estimated using Monte Carlo simulations. More specifically, their algorithm starts by collecting in-control observations at a given time point t. With these observations, the corresponding CUSUM statistics are computed. Next, data sets from the in-control distribution are simulated, and for each of these data sets the corresponding CUSUM statistics are computed. Using these statistics at each time point t, the empirical distribution of the CUSUM statistic is determined. At each time step, using the observed statistic and the corresponding empirical distribution, the p-values are estimated as the proportion of samples where the associated CUSUM statistic is at least as large as the observed test statistic. When the in-control distribution is unknown, p-values may be estimated using bootstrap methods [14]. In addition to using p-values in our proposed procedure, we also utilize the FDR for error control. Because we will be monitoring multiple spatial regions, we clearly have a multiple testing problem. Li et al. [15] utilized the FDR to manage this problem in the context of traditional SPC methods using critical values; we follow the same pattern using p-values instead of critical values. Benjamini and Hochberg [11] popularized the concept of FDR to handle the multiple testing problem. The false discovery rate is defined as the expected proportion of false discoveries among all discoveries:

FDR = E[V/R] (with V/R defined to be 0 when R = 0),

where V is the number of true null hypotheses that are declared significant and R is the total number of hypotheses that are declared significant out of m tests.
Numerous procedures have been proposed to control the FDR of multiple tests. In the simplest case, it is assumed that the test statistics are independent. In that case, Benjamini and Hochberg [11] showed that the following procedure controls the FDR at level α. Consider testing m hypotheses and let p_(1) <= p_(2) <= ... <= p_(m) be the ordered observed p-values, with H_(i) denoting the corresponding hypothesis. Define

k = max{ i : p_(i) <= (i/m) α },

and reject H_(1), ..., H_(k). If no such i exists, reject no hypotheses. We will refer to this procedure as the Benjamini-Hochberg (BH) procedure. Benjamini and Yekutieli [12] proposed a modification of the BH procedure for when the test statistics are dependent. Specifically, if the α utilized in the BH procedure is replaced with α / (Σ_{i=1}^m 1/i), then the FDR will still be controlled at level α. We will refer to this as the Benjamini-Yekutieli (BY) procedure. Both the BH and BY procedures were used by Li et al. [15] to control the FDR. A more powerful and up-to-date alternative is the Storey-Tibshirani (ST) procedure, which we describe in more detail below. Storey and Tibshirani [17] pointed out that the original FDR method proposed by Benjamini and Hochberg [11] is too conservative and results in a substantial loss of power. To gain more power, Storey and Tibshirani [17] introduced the q-value as an FDR-based measure of significance, instead of the p-value, which is a false-positive-rate-based measure of significance. Whereas the p-value of a test measures the minimum false positive rate that is incurred when calling the test significant, the q-value of a test measures the minimum false discovery rate that is incurred when calling the test significant. We describe this method in greater detail below.
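The BH step-up rule, with the BY adjustment as an option, can be sketched compactly; the following is an illustrative implementation rather than the dissertation's own code.

```python
import numpy as np

def bh_reject(pvals, alpha=0.05, dependent=False):
    """Benjamini-Hochberg step-up procedure. With dependent=True the
    Benjamini-Yekutieli correction divides alpha by sum(1/i), which is
    valid under arbitrary dependence. Returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    if dependent:
        alpha = alpha / np.sum(1.0 / np.arange(1, m + 1))  # BY adjustment
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        kmax = np.max(np.nonzero(below)[0])  # largest i with p_(i) <= i*alpha/m
        reject[order[: kmax + 1]] = True
    return reject
```

Note that the step-up rule rejects all hypotheses up to the largest index satisfying the inequality, even if some intermediate ordered p-values exceed their thresholds.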
An alternative definition of the q-value for a particular set of tests is that it is the expected proportion of false positives incurred when calling a test significant. Using this definition, Storey and Tibshirani [17] derived an expression for the proportion of false positives when calling all tests significant whose p-value is less than or equal to some threshold t, where 0 < t <= 1. The expected proportion of false positives is approximated by dividing the expected number of false positives by the expected total number of significant tests, for large m. The denominator is easily estimated by using the total number of observed p-values that are less than or equal to t. The numerator, the expected number of false positives, is much trickier and is obtained by multiplying the probability of a true null hypothesis being rejected by the total number of true null hypotheses, m_0. In finding the probability of a true null hypothesis being rejected, we need to use the result that the p-values corresponding to the null hypotheses are uniformly distributed. As a result, the probability of a null p-value being less than or equal to t is simply t, so the expected number of false positives is m_0 t. However, since m_0 is unknown, the proportion of true null hypotheses, π_0 = m_0 / m, has to be estimated. The key step in the method is an approximation of π_0 (see Appendix A for the approximation of π_0). In estimating π_0, Storey and Tibshirani make use of the fact that the p-values of true null hypotheses follow a uniform distribution, while the p-values of truly alternative hypotheses tend to be close to zero. They estimated π_0 by considering the height of the uniformly distributed section of the density histogram of the p-values. Putting all of these components together, using the estimated proportion of true null hypotheses, the formula for computing the q-value for a given p-value p_(i) is:

q(p_(i)) = min_{t >= p_(i)} ( π̂_0 m t / #{ p_j <= t } ).
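The plug-in computation can be sketched as follows, with π0 passed in directly rather than estimated from the p-value histogram (setting pi0 = 1 gives a conservative, BH-like bound). This is a sketch, not the authors' implementation.

```python
import numpy as np

def qvalues(pvals, pi0=1.0):
    """Plug-in q-values: for the i-th smallest p-value,
    q = min over j >= i of pi0 * m * p_(j) / j, i.e. the smallest
    estimated FDR over thresholds at least as large as that p-value."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    raw = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity: take the running minimum from the right.
    q_sorted = np.minimum.accumulate(raw[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)  # q-values never exceed 1
    return q
```

With pi0 = 1 this reproduces the BH adjusted p-values; smaller estimates of π0 shrink every q-value proportionally, which is the source of the ST procedure's extra power.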
Likewise, a q-value for each test can be calculated, and the resulting q-values can be ordered at each time step from smallest to largest. Once a significance threshold of α is decided upon, a set of significant tests is identified in which a proportion of α is expected to be false positives. It is important to make a distinction between the FDR and an alternative quantity called the pFDR, under which the q-value is technically defined. The FDR can be loosely defined by E[V/R], while the pFDR is defined as E[V/R | R > 0], to avoid V/R being undefined when R = 0 [17]. Just as a p-value is defined via the minimum false positive rate when calling the test significant, using the pFDR, a q-value can be defined as the minimum pFDR at which the test is called significant [24]. For large m, as when performing surveillance over a large number of smaller geographic units, such as for statewide or nationwide surveillance, the FDR and pFDR are approximately equal. As pointed out by Storey and Tibshirani [17], their procedure can be used for models with dependent statistics (see Remark D in [17]).

2.4 Detailed Description of the Proposed Methodology

First, the desired FDR of all charts, α, is decided. At each time t, disease counts for each of the m regions are collected. Disease counts of immediate neighbors are pooled to form regional neighborhoods with counts C_l(t). Then, the corresponding CUSUM statistics for the regional neighborhoods are calculated. For each region, data sets are simulated based on the assumed known in-control distribution. Using
these, for each regional neighborhood, an empirical distribution is determined, from which the corresponding p-values are calculated. Lastly, an appropriate FDR controlling procedure is used to make decisions about whether an alarm should be signaled. Alternatively, bootstrap methods can be used instead to calculate the p-values. As before, after defining the overall FDR level, at each time point t, disease counts for each of the m regions are collected. Then the disease counts of the immediate neighbors are pooled to form counts for the regional neighborhoods. Using these counts, the corresponding CUSUM statistics for the regional neighborhoods are calculated. However, since the in-control distribution is unknown in this case, bootstrap methods are used to determine the empirical in-control distribution. Fixing the time period for which the counts follow the in-control distribution, bootstrap samples of regional counts from the in-control time period are drawn with replacement across time while preserving the spatial relationships between the counts (see Appendix B for details of the bootstrap procedure). Specifically, instead of sampling the counts of individual regions, we sample times from the in-control time period. The entire set of regional counts for each sampled time then counts as a single set of bootstrapped data. For the bootstrap samples, the corresponding CUSUM statistics are computed to determine the empirical in-control distribution. Using this empirical in-control distribution and the CUSUM statistics for the regional neighborhoods, the corresponding p-values for each regional neighborhood at each time t are calculated. Finally, an FDR controlling procedure is used to determine the alarms, as before, since multiple p-values are compared simultaneously. We will compare several FDR controlling procedures in Section 3. The BH procedure will be used as a baseline. Next, the BY procedure will be used to see how accounting for
dependence between test statistics improves performance. Lastly, the ST procedure will be used to highlight the additional gain in power.

2.5 Contrast with Existing Methods

The proposed method is better suited for disease surveillance than the conventional methods for several reasons: (1) the use of p-values in hypothesis testing is typically preferred over critical values, (2) the use of FDR for error control is better suited to a public health setting than the ARL criterion standard in SPC, and (3) the multiple testing problem can be easily handled using a variety of off-the-shelf procedures. These points are elaborated upon below. Conventional SPC methods were designed using thresholds or critical values. The use of p-values has many advantages over the conventional approach:

1. P-values are used by most disciplines using statistics, whereas critical values are only used in certain contexts.

2. When a signal of a distributional shift is delivered, p-values can be used to quantify the strength of evidence of an outbreak. Results can indicate a range from no evidence through weak, moderate, and strong evidence. With the conventional critical value approach, it is difficult to quantify the strength of evidence, since the statistics are not on a standard scale from one setting to another.

The FDR is more appropriate for handling the multiple testing problem in surveillance scenarios than the ARL. The concept of the ARL originated in an industrial process control setting. When the process is in operation, an alarm is signaled when the CUSUM statistic crosses a pre-defined threshold, and the process is stopped. Sometimes, even when the process is in control, the CUSUM will signal an alarm. This is a false alarm and is analogous to a Type I error. The in-control ARL is defined as the expected time until a false alarm.
The appropriateness of using the FDR in a disease surveillance setting over the traditional ARL-based methods is elaborated upon below:

1. The FDR in a process monitoring scheme may be regarded as the expected proportion of out-of-control signals that turn out to be false alarms. On the other hand, the false alarm rate is the probability of concluding that an in-control process is out of control when it is actually in control. Therefore, the FDR has a more meaningful interpretation in the surveillance setting than the false alarm rate. For example, if the FDR control level is 0.01, it is expected that 1 out of 100 alarms may be false. However, a false alarm rate of 0.01 simply means that we would sound an alarm for an in-control process 1 out of 100 times. We are more interested in controlling the error rate for out-of-control processes than for in-control processes.

2. The FDR is directly related to the predictive value positive (PVP), one of the seven attributes used by the CDC for evaluating surveillance systems, since FDR = 1 - PVP.

3. The ARL does not make sense in a health context, although it is meaningful in an SPC context. A run length is defined as the number of observations from the starting point up to the point where the statistic crosses a pre-defined threshold. When such a shift occurs, the process might be stopped and restarted. In contrast to an industrial process, an emerging disease outbreak cannot simply be stopped, so the concept of a run length is not as appropriate in a health context.

Lastly, using the FDR provides us with a host of up-to-date methods to handle the multiple testing problem, such as the Benjamini-Hochberg [11], Benjamini-Yekutieli [12], and Storey-Tibshirani [17] procedures. It is possible to gain more power with these methods compared to techniques controlling the FWER, such as the Bonferroni correction.
3 Simulation Experiment

3.1 Experiment Setup

Our simulation experiment consists of a study area comprised of 25 regions on a five-by-five grid. The time period considered by the simulation experiment was 100 time points, or days. Disease counts for each of the 25 regions for the first half of the simulation (days 1-50) were simulated as independent Poisson random variables with a constant mean of λ0 = 4. For the second half of the simulation (days 51-100), an outbreak was simulated in which the Poisson means for the regions peak in the center and taper towards the edges. The out-of-control means are denoted by λ1. Specifically, as illustrated in Figure 1, the four corner regions 1, 5, 21, and 25 get mean shifts of 0.2 standard deviations from the original mean, to 4.4. Regions 2, 3, 4, 6, 10, 11, 15, 16, 20, 22, 23, and 24 get 0.3 standard deviation mean increases, to 4.6. The middle regions 7, 8, 9, 12, 14, 17, 18, and 19 get 0.75 standard deviation mean increases, to 5.5. Finally, the mean of region 13 increases by 1 standard deviation, to 6. The out-of-control process means are shown in Figure 1. For this simulation, each of the 25 CUSUM statistics was designed with an in-control Poisson mean (λ0) of 4 and an out-of-control Poisson mean (λ1) of 6. In other words, since the in-control mean is 4 (standard deviation 2), the CUSUM statistic is built to detect an increase of one standard deviation. Using the first method of Li et al. [14], ten thousand Monte Carlo simulations were used to determine the empirical in-control distribution. The desired level of the FDR was set to α = 0.05.
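The mean surface of Figure 1, and the weight-1 neighborhood pooling used in the simulation, can be reproduced with a short sketch. Region numbering is assumed row-major, and "immediate neighbors" is read as cells sharing an edge or a corner; the function names are illustrative.

```python
import numpy as np

def outbreak_means():
    """Out-of-control Poisson means for the 5x5 grid (days 51-100):
    4.4 at the four corners, 4.6 on the rest of the border ring, 5.5 in
    the inner ring, and 6.0 at the center (in-control mean 4, sd 2)."""
    mu = np.full((5, 5), 4.6)
    for r, c in [(0, 0), (0, 4), (4, 0), (4, 4)]:
        mu[r, c] = 4.4
    mu[1:4, 1:4] = 5.5
    mu[2, 2] = 6.0
    return mu

def pool_neighborhoods(grid):
    """Pool each cell's count with those of its neighbors sharing an
    edge or a corner, all with weight 1 (cells outside the grid add 0)."""
    g = np.asarray(grid, dtype=float)
    padded = np.pad(g, 1)                 # zero border around the grid
    pooled = np.zeros_like(g)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            pooled += padded[1 + dr:1 + dr + g.shape[0],
                             1 + dc:1 + dc + g.shape[1]]
    return pooled
```

Under this layout, the neighborhood of corner region 1 consists of regions 1, 2, 6, and 7, while the neighborhood of the central region 13 contains all nine cells of the inner block.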
Figure 1. The out-of-control mean disease counts (λ1) of the 25 regions for days 51-100 of the simulation experiment are indicated in the center of the regions. Region numbers are indicated at the top of each region in smaller font.

3.2 Simulation Results

Figures 2 and 3 show the results of a single simulation for regions 1 and 13, respectively. Region 1 has the smallest shift in mean, with a 0.2 standard deviation increase, while region 13 has a 1 standard deviation increase. Figure 2 summarizes the results of a single simulation over the 100-day period for region 1, where the process mean was 4 for times 1-50 and 4.4 for times 51-100. Figure 2 (a) shows the disease counts for region 1 over the 100-day period for a single simulation. The blue line shows the observed counts in region 1, whereas the red line shows the pooled counts: pooled counts were calculated by totaling the counts of the region of interest, in this case region 1, and its immediate neighbors that share a common boundary (regions 2, 6, and 7). In other words, the region of interest and its immediate neighbors each get a weight of 1. Figure 2 (b) shows the CUSUM statistic for the observed (blue) and pooled (red) counts; note that the CUSUM statistic for the observed counts does not show any large increase, except for a relatively small spike at
day 80, in the presence of the outbreak (days 51-100). Recall that region 1 gets the smallest increase, a 0.2 standard deviation shift in the mean. The CUSUM statistic using the observed regional counts is designed to respond to a much larger increase of a one standard deviation shift and is not sensitive to this relatively small change. However, the CUSUM statistic for the pooled regional neighborhood counts captures information from neighboring counts as well and indicates a substantial rise after day 51. In Figure 2 (c), the blue line shows the difference, (threshold - p-value), for the independent model, where the threshold is obtained using the BH procedure; similarly, the red line shows the difference, (threshold - p-value), for the pooled model, where the threshold is obtained using the BY procedure; finally, the orange line shows the difference, (0.05 - q-value), for the pooled model, where 0.05 is the overall FDR using the ST procedure. All three plots are displayed together for ease of comparison. Note that the point at which an alarm will be signaled is the time at which the difference exceeds 0, shown by the grey horizontal line. The last panel, Figure 2 (d), displays these alarms for the three multiple comparison procedures previously discussed: BH, BY, and ST. The BH procedure (applied to the CUSUM statistic from the observed counts) sounds only one alarm, on day 80. The BY procedure (applied to the CUSUM statistic for pooled counts) starts to sound alarms after day 57, while the ST procedure starts indicating regular alarms much earlier (beginning at day 51) and always sounds an alarm when the BH procedure sounds an alarm. Figure 3 contains the plots for region 13, with the highest step increase of 1 standard deviation. Note that the CUSUM statistic in this simulation was designed to detect a change of this magnitude. The alarms are more persistent for all three procedures, with the ST procedure (orange line) identifying the outbreak the earliest (on day 51, right at the beginning of the
outbreak), followed by the BY procedure (red line) on day 53, and the BH procedure (blue line) on day 57.

Figure 2. Summary plots for region 1: (a) disease counts, (b) CUSUM statistics, (c) differences, and (d) alarms. In plot (c), the blue line shows the difference, (threshold - p-value), where the threshold is obtained using the BH procedure for the independent model. Similarly, the red line shows the difference, (threshold - p-value), where the threshold is obtained using the BY procedure for the pooled model; the orange line shows the difference, (0.05 - q-value), where 0.05 is the overall FDR using the ST procedure for the pooled model. The grey horizontal line indicates a zero difference.
Figure 3. Summary plots for region 13: (a) disease counts, (b) CUSUM statistics, (c) differences, and (d) alarms. In plot (c) the blue line shows the difference, threshold minus p-value, where the threshold is determined by the BH procedure for the independent model. Similarly, the red line shows the difference, threshold minus p-value, where the threshold is determined by the BY procedure for the pooled model; the orange line shows the difference, 0.05 minus q-value, where 0.05 is the overall FDR level used by the ST procedure for the pooled model. The grey horizontal line indicates a zero difference.
To confirm that the three procedures are controlling the FDR at the desired level of 0.05, we calculated the observed proportion of false discoveries for each procedure over 100 independent simulations, using the same experimental setup as in Section 3.1 for each simulation. Boxplots of the results for each procedure are displayed in Figure 4. Clearly, the overall FDR is below the specified 0.05 for all three procedures.

Figure 4. Box plots of the proportion of false discoveries for all three procedures over 100 simulations. Results for the BH procedure are shown on the far left, the BY procedure in the middle, and the ST procedure on the far right.

3.3 Additional Investigation of Power

To show that these results are consistent, the conditional expected delays (CEDs) and the probabilities of alarms (PAs) for a large number of simulations were calculated for the three models.
CED and PFA are commonly used measures of evaluation in surveillance. Frisén [25] discusses these two measures and several other performance measures used in statistical surveillance. The conditional expected delay (CED) is commonly used to compare speed of detection, and the probability of a false alarm (PFA) is used to compare false alarm rates. Let t_A be the alarm time at which a surveillance statistic crosses its threshold. The conditional expected delay is the average delay until an alarm when the change occurs at time point tau, defined as CED(tau) = E[t_A - tau | t_A >= tau]. The other measure, the probability of a false alarm, is defined as PFA(tau) = P(t_A < tau). Robertson et al. [19] note that an effective surveillance method should detect outbreaks quickly while producing few false alarms, and, as discussed in [26], a surveillance method needs to trade off detection speed against the false alarm rate. Therefore, PFA and CED are useful measures to evaluate the performance of surveillance systems. In our setting, we are controlling the FDR. Since the FDR should be held constant across procedures, a more appropriate measure of power in our context is simply the probability of an alarm (PA). Since the FDR at each time step is held constant across procedures, the most powerful procedure will be the one that sounds an alarm most frequently; we estimate this by dividing the total number of alarms by the total number of tests.

Change points between day 5 and day 95 were considered (recall that the simulation results presented earlier were for the case where the change point was day 50). For each change point, 100 simulations were performed. For
example, 100 data sets were generated where the change point for the outbreak was at day 5, and so on. Then, for each change point, the CEDs of the 100 simulated data sets were averaged for each of the three models. In order to avoid issues in estimating the CED for change points near the upper end of the original simulation time period, the time period under consideration was extended so that a change could be detected for all data sets. Figure 5 shows the CED plots for the BH procedure (blue), the BY procedure (red), and the ST procedure (orange) for regions 1, 2, 7, and 13. Recall that the increase in the out-of-control process mean differs by region, with region 1 having the smallest increase and region 13 having the largest. The top left plot in Figure 5 shows the average CED against change point time for region 1 for all three models. Clearly, for all four regions, the ST procedure has the lowest CED, followed by the BY and BH procedures, respectively, across all change points. In general, as evident from the four plots in Figure 5, the procedures using the pooled regional neighborhood counts (the models using the BY and ST procedures) signal much faster than the model that only uses observed regional counts (the model using the BH procedure); within the pooled models, the ST procedure detects outbreaks faster than the BY procedure.
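The two performance measures used above can be sketched in a few lines. The following is a minimal illustration, not the code used for the dissertation's simulations; the alarm times below are hypothetical, and each run's alarm time is taken to be the first time its chart signals.

```python
import numpy as np

def ced(alarm_times, tau):
    """Conditional expected delay: mean of (t_A - tau) over runs with t_A >= tau."""
    t = np.asarray(alarm_times, dtype=float)
    detected = t[t >= tau]
    return float("nan") if detected.size == 0 else float(np.mean(detected - tau))

def pfa(alarm_times, tau):
    """Probability of a false alarm: fraction of runs alarming before the change."""
    t = np.asarray(alarm_times, dtype=float)
    return float(np.mean(t < tau))

# Hypothetical first-alarm times from six simulated runs, change point tau = 50.
alarms = [52, 55, 49, 51, 60, 47]
delay = ced(alarms, 50)        # average delay among the runs that alarm at or after tau
false_rate = pfa(alarms, 50)   # fraction of runs that alarm before tau
```

Averaging `ced` over 100 simulated data sets for each change point would reproduce the kind of summary plotted in Figure 5.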
Figure 5. Average CED versus change point time for the BH, BY, and ST models for regions 1, 2, 7, and 13.
We now compare the performance of the procedures using PA. Figure 6 shows the probability of an alarm against change point time for all three procedures in regions 1, 2, 7, and 13. The same change points as before were considered. For each change point, the PAs for 100 simulations were averaged for all three methods. As expected, in all four regions the ST procedure has the highest PA, followed by the BY procedure and then the BH procedure. However, for larger increases in the mean, the vertical distance between the three lines tends to diminish, indicating that the gains in power tend to be smaller for larger step increases in the mean. Since the FDR is controlled at the 0.05 level, the higher the total number of alarms, the higher the number of true alarms. So when the FDR is controlled, we want to sound the alarm more frequently. As expected, the ST procedure (orange) sounds the most alarms, followed by the BY procedure (red) and finally the BH procedure (blue). The downward trends in the plots arise because when the outbreak occurs earlier during the 100-day period, the duration of the outbreak is longer, allowing more chances to detect the out-of-control process.
Figure 6. Average probability of an alarm versus change point time for the BH, BY, and ST models for regions 1, 2, 7, and 13.
4 Application to Salmonella Data

The proposed method was applied to a data set of Salmonella Newport cases reported weekly from 16 German federal states between 2004 and 2014. The data were obtained from the Robert Koch Institute in Germany [27]. Since Salmonella is not a contagious disease, without trend or seasonality, the counts were assumed to be independent. The 16 German federal states are displayed in Figure 7.

Figure 7. Map of the sixteen federal states of Germany.
The first two years of data (2004-2005) were used to estimate the in-control distribution in each state, since there were no unusually high disease counts reported from any of the states during this period. A Poisson dispersion test [28] was used to ensure that the Poisson counts were not overdispersed. No overdispersion was detected in the first two years of data at a type I error rate of 0.01. Also, it is reasonable to assume that the data are spatially independent over the first two years, which we use as the in-control time period. The proposed surveillance model was applied to data from 2006-2013. Since states with larger populations are likely to have larger disease counts, the method proposed by Raubertas [9], detailed in Section 2.2, was adopted to address this issue. For the pooled model, disease counts from each state were pooled together with those of the immediate neighbors, using a binary connectivity matrix similar to the one used in the simulation experiment. For the pooled counts, the CUSUM statistic was computed, the p-values for each region were estimated, and an alarm was signaled using an FDR-controlling procedure. P-values were estimated using the bootstrap method since the in-control distribution was unknown. The BH procedure was used to handle the multiple testing problem for the independent counts model, while the ST procedure was used for the pooled model; for both models the overall FDR was set to 0.05 to minimize false alarms. Plots of the results for both the independent counts and pooled counts models are shown in Figures 8 and 9 for the states Bavaria and Saxony-Anhalt, respectively. The data set contained counts for 528 weeks, starting from the first week of January in 2004 to the second week of February in 2014. The time axis shows the number of weeks, starting from the first week of January 2004 (week 1).
Since the Salmonella Newport outbreak actually occurred in 2011 [29], the results are plotted from the last week of September 2009 (week 300) to the second week of February 2014 (week 528). The results indicate that the pooled model using the ST procedure was successful in detecting the Salmonella outbreak in the first week of November 2011 (week 410), and this was consistent for all 15 states (the pooled model using the BY procedure also gave alarms in week 410 in all 15 states, as the rise in disease counts was fairly large). In contrast, the model using independent state counts was rather inconsistent in signaling an alarm around the period of the outbreak: the model did not detect an outbreak in Bremen, which has the smallest population; in Bavaria, the outbreak was detected 53 weeks later; in Rhineland-Palatinate and Saxony-Anhalt it was detected two weeks later; in Baden-Wuerttemberg and Schleswig-Holstein it was detected a week later. These results, using the proposed method employing FDR techniques, can be compared with the performance of traditional multiple Poisson CUSUM methods using FWER-based multiple testing procedures. The traditional method using 15 univariate Poisson CUSUM charts with a Bonferroni correction produced rather inconsistent results in signaling the alarm: in Bremen and Saxony, no change was detected; in Bavaria, the outbreak was detected 53 weeks later; in Baden-Wuerttemberg it was detected 4 weeks later; in Rhineland-Palatinate, Schleswig-Holstein, and Thuringia the outbreak was detected two weeks later; in Mecklenburg-Vorpommern it was detected a week later. The alarm plot in Figure 8(c) shows the results for Bavaria: using the pooled model with the ST procedure, the outbreak is detected in week 410, consistent with the results of the other states, but there is a much longer delay (53 weeks) in detection using the independent counts model. Since the increase in disease counts is relatively large in week 410, both the ST and BY procedures detect the outbreak in week 410.
However, note that there is another relatively smaller increase in counts after week 450. This smaller shift is detected first by the more
powerful ST procedure in week 452, followed by the BY procedure in week 456, and then the BH procedure in week 459. Although the ST and BY procedures are comparable in terms of detection speed when it comes to detecting large increases, the ST procedure is quicker in detecting relatively smaller changes, consistent with the simulation results. Figure 9(c) shows the results for Saxony-Anhalt: using the pooled model with the ST procedure, the outbreak is again detected in week 410, but there is a two-week delay in detection using the independent model. The pooled model sounds a steady alarm after week 410 for a longer period; in contrast, the independent model sounds an alarm irregularly for a relatively shorter period. In summary, the pooled model using the Storey-Tibshirani procedure was successful in detecting the outbreak simultaneously throughout the country, as opposed to the independent model, which signaled alarms inconsistently and after considerable delays.
Figure 8. Pooled disease counts (a), CUSUM statistics (b), and alarm plots (c) for Bavaria. The blue lines in (a), (b), and (c) represent the independent count model and the red lines represent the pooled count model for Bavaria.
Figure 9. Pooled disease counts (a), CUSUM statistics (b), and alarm plots (c) for Saxony-Anhalt. The blue lines in (a), (b), and (c) represent the independent count model and the red lines represent the pooled count model for Saxony-Anhalt.
5 Discussion

In the proposed new procedure, regional disease counts from multiple regions are aggregated to compute an alarm statistic using the popular Poisson CUSUM method. Two novel aspects of the proposed method are particularly advantageous in a disease surveillance setting. First, the use of p-values (instead of the commonly used critical values) enables us to evaluate the strength of the out-of-control signal. Second, the use of the FDR for error control, instead of the standard FWER or average run length criteria, allows the use of powerful tools to handle the multiple comparison problem. The Storey-Tibshirani [17] procedure was the most powerful procedure in our tests, in comparison with the more conservative Benjamini-Hochberg [11] and Benjamini-Yekutieli [12] procedures. The simplicity of the algorithm and the improved speed of detection make the proposed method useful in a practical disease surveillance setting, as illustrated by the Salmonella Newport example in Germany. We emphasize that the proposed procedure controls the FDR at each time step. If one wanted to control the FDR over all time steps simultaneously, the BY and ST procedures would still apply, though the procedure would no longer be prospective. As evident from the simulation results, pooling regional neighborhood counts can increase the speed of detection in comparison to using individual regional counts when the shift (the deviation from the mean) occurs in several units that make up the regional neighborhood (this was previously emphasized by Raubertas [9]). However, pooling may result in a delay in detection if the shift occurs in only one or two regions that make up a neighborhood, because the effect of the shift will be diluted. Additionally, pooling may result in an increase in false alarms in regions sharing a common boundary. Another issue related to the utilization of the CUSUM statistic is identifying the time at which the outbreak ends.
The CUSUM statistic may not decrease quickly after an outbreak has ended,
leading to incorrect decisions when the outbreak is over. This issue is addressed by Gandy and Lau [30]. The proposed methodology is applicable to the detection of foodborne outbreaks, a class of outbreaks with no trend or seasonality. A future research direction is to extend the method to a broader class of settings, encompassing outbreaks with trend and seasonality. Additionally, we have assumed spatial and temporal independence of the observed counts (though the pooled CUSUM statistics are correlated). It is certainly possible that the counts are dependent, even after adjusting for a non-stationary mean structure. This has been addressed by Rogerson and Yamada [10], among others. We are actively working on extending the proposed methodology to the setting where counts are dependent in time and/or space. The qvalue R package [31] was used to calculate the q-values for the simulations and the case study (see Appendix G for details). We would like to thank several anonymous referees and an associate editor for their very helpful suggestions toward improving this manuscript.
CHAPTER III

AN IMPROVED EWMA-BASED PROCEDURE FOR PROSPECTIVE DISEASE SURVEILLANCE FOR COUNT DATA IN MULTIPLE REGIONS

1 Introduction

Quick detection and prevention of emerging outbreaks is an important problem in public health surveillance. Large amounts of data on a variety of diseases are collected in the United States and Europe to monitor and control the spread of disease outbreaks [1]. Both prospective and retrospective methods can be used to analyze the collected data to answer various public health questions. During the past decade, there has been increased interest in using prospective methods for surveillance of disease outbreaks. In prospective disease surveillance, the numbers of reported cases are collected sequentially over time and analysis is performed on the most up-to-date data to decide whether there has been an increase in the disease incidence rate. This differs from retrospective analysis, where the entire data set is available and analyzed at the same time, as in traditional hypothesis testing. In prospective surveillance, it is important to detect an increase in the incidence of the disease as quickly as possible and report it to public health authorities so that timely action can be taken. Prospective surveillance methods are often broadly classified as time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods [2]. In recent years, SPC methods that originated in industrial process control have been successfully utilized for surveillance of disease outbreaks [3]. Two SPC methods that have been successfully extended for use in disease surveillance are the Cumulative Sum (CUSUM) method and the Exponentially Weighted Moving Average (EWMA) method. Both have gained prominence in
practical disease surveillance: CUSUM is used in BIOSENSE [4], while the EWMA method is used in the Early Aberration Reporting System (EARS) developed by the Centers for Disease Control and Prevention (CDC) and in ESSENCE (Electronic Surveillance System for the Early Notification of Community-based Epidemics), developed by the Department of Defense and Johns Hopkins University [4]. The EWMA chart is easy to implement and interpret, similar to the CUSUM chart, the other popular SPC method that is also used in practical disease surveillance [32]. The EWMA control chart is a moving average of current and past observations, where the weights of the past observations decrease as in a geometric series [32]. In an industrial process control setting, the goal of the EWMA chart is to quickly detect shifts in the process mean. The starting value is often set to the process mean, which is the target value to be monitored: in disease surveillance this value is the mean number of cases when there is no outbreak. Similar to the CUSUM control chart, the EWMA chart has a lower control limit and an upper control limit. The process is considered to be out of control when the EWMA statistic crosses the control limits. In disease surveillance, when the process goes out of control (that is, when there is an outbreak), we want to sound an alarm. The method was originally proposed by Roberts [7] for normally distributed responses and was later extended to Poisson data by Borror [8]. The EWMA chart is a purely temporal control chart. Other purely temporal control charts have been extended to spatio-temporal settings in the past: namely, Raubertas [9] extended the purely temporal Poisson CUSUM method to a spatio-temporal setting. Raubertas [9] formed neighborhoods by pooling counts of regions that share a common boundary. In other words, counts are pooled using closeness as a weight, where the weights are binary. This way,
overlapping neighborhoods are defined, with each region having its own neighborhood. The pooled data within a neighborhood is used to calculate a CUSUM statistic for that neighborhood. An alarm is signaled when the neighborhood CUSUM statistic exceeds its control limit. Rogerson and Yamada [10] extended this approach, forming neighborhoods consisting of contiguous regional units, where neighborhoods can extend beyond regions sharing a common boundary. Instead of using binary weights, Rogerson and Yamada used weights that decrease with increasing distance from the region. The local neighborhood statistics formed this way are monitored to signal alarms. The multiple testing problem that is encountered when testing multiple neighborhood statistics simultaneously is handled by controlling the familywise error rate (FWER), using the popular Bonferroni correction. Bonferroni correction methods are also employed in traditional multiple control charts utilized in industrial process control to handle the associated multiple testing problem [15]. The FWER is the probability of making one or more false discoveries among all hypotheses tested. Therefore, FWER procedures can be too strict when a large number of tests are performed simultaneously. In such situations, the False Discovery Rate (FDR), originally proposed by Benjamini and Hochberg [11], is a better alternative. The FDR is the expected proportion of false discoveries among all discoveries. FDR-controlling procedures exert less stringent control than the traditional FWER procedures. As a result, FDR procedures have more power at the expense of increased false alarms. Benjamini and Hochberg [11] proposed an FDR-based method to handle the multiple testing problem for independent statistics. The method was
subsequently extended to dependent statistics by Benjamini and Yekutieli [12]. FDR methods have recently been employed to address multiple testing problems in SPC charts [13]. In conventional SPC charts, control limits are used to signal alarms. Li et al. [14] proposed a general algorithm to use p-values, instead of critical values, to signal alarms in a univariate setting. When the in-control distribution is known, they proposed using Monte Carlo simulations to simulate the in-control distribution to calculate the p-values. When the in-control distribution is unknown, they proposed using bootstrap methods to generate the empirical in-control distribution to estimate the p-values. For multiple control charts, Li and Tsung [15] proposed an FDR-based method, instead of the conventional FWER-based methods. According to their method, first, the multiple chart statistics are calculated at each time step. Then, the corresponding p-values are calculated at each time step using a random walk based approximation. When the multiple statistics are independent, Li and Tsung [15] use the Benjamini-Hochberg procedure [11], and when the statistics are dependent they use the Benjamini-Yekutieli procedure [12]. Several of these approaches are combined in the proposed method. Regional neighborhoods are formed, as proposed by Raubertas [9], by pooling counts of regions sharing a common boundary. Then EWMA statistics are computed for these neighborhoods, using p-values instead of critical values as suggested by Li et al. [14]; further, adopting their method, bootstrap methods are used to estimate the empirical in-control distribution. The multiple testing problem that arises when testing multiple neighborhood statistics simultaneously is handled by using FDR procedures, as opposed to FWER-based techniques, adapting the Li and Tsung [15] method. The power of the proposed method was further improved by utilizing techniques
proposed by Storey and Tibshirani [17], in addition to using the FDR-controlling techniques proposed by Benjamini and Yekutieli [12]. The first step of the method is to decide an acceptable FDR level for all charts. Then, disease counts of regions sharing a common boundary are pooled together at each time step to compute neighborhood EWMA statistics. The in-control (no outbreak) distributions of the neighborhood EWMA statistics are approximated using bootstrap methods, since the in-control distributions are not known. Using the neighborhood EWMA statistics at each time step and the in-control distributions of the neighborhood EWMA statistics, p-values are estimated. Since multiple dependent statistics are tested simultaneously, both the Benjamini-Yekutieli [12] and Storey-Tibshirani [17] procedures are used to handle the multiple testing problem. In what follows, we will extend the EWMA method to the setting where the data are counts, while incorporating nearby spatial information and also minimizing the assumptions made about the disease-generating process when there is not an outbreak. Furthermore, we increase the power of the method by controlling the False Discovery Rate (FDR) instead of the Average Run Length criterion typically employed in SPC. Section 2 provides the necessary background information and details of the extended methodology. Section 3 provides results of a simulation study comparing the extended EWMA method to a similarly constructed CUSUM method [33]. In Section 4 the extended EWMA method is applied to the detection of a Salmonella Newport outbreak in Germany, and the results are compared with those from the CUSUM method. Section 5 highlights the strengths of the proposed method and discusses potential research directions.
2 Methods

2.1 Standard EWMA

The original EWMA chart was proposed by Roberts [7] for normally distributed responses. The method was extended to Poisson data by Borror [8]. Let X_1, X_2, ... be the response values at times t = 1, 2, .... Borror [8] assumed that the responses are uncorrelated and identically distributed Poisson random variables with mean mu_0. The mean mu_0 is often referred to as the in-control mean. The objective of the EWMA method is to detect an upward shift in the mean from mu_0 to a pre-specified upper limit. The test statistic for the Poisson EWMA method is

Z_t = lambda * X_t + (1 - lambda) * Z_{t-1},   (1)

where X_t is the count observed at time t, Z_t is the Poisson EWMA statistic at time t, and lambda is a weighting constant such that 0 < lambda <= 1. Z_0 is usually defined to be mu_0. With this method, a weight of lambda is given to the current observation at time t and a weight of (1 - lambda) is given to the EWMA statistic from the previous time step. The expected value and the variance of the Poisson EWMA statistic are E(Z_t) = mu_0 and

Var(Z_t) = (lambda * mu_0 / (2 - lambda)) * (1 - (1 - lambda)^{2t}),   (2)

respectively [8]. Therefore, for large values of t, the asymptotic variance is given by sigma_Z^2 = lambda * mu_0 / (2 - lambda). An alarm is signaled whenever Z_t > h_U, where the critical value h_U is determined by the formula

h_U = mu_0 + k * sigma_Z,

where the out-of-control mean we would like to detect is k standard deviations above mu_0; note that the asymptotic standard deviation sigma_Z is used in this formula.

2.2 Extending the Purely Temporal EWMA Chart to a Spatio-temporal Setting

The original Poisson EWMA chart proposed by Borror [8] is a purely temporal control chart. Different methods have been proposed to extend the EWMA chart to a spatial setting, for example, the methods proposed by Lin et al. [34] and Sparks and Ellis [35]. Lin et al. [34] proposed a spatial EWMA scan statistic that assigns different weights to data at different radius levels from the specific center of investigation. Sparks and Ellis [35] proposed using the EWMA statistic to build temporal memory into the scan statistic for detecting outbreaks of unknown size and shape. Both of these methods attempt to extend a spatial statistical method, the scan statistic, to a spatio-temporal setting. However, it is also possible to extend temporal statistical methods, such as SPC charts, to a spatio-temporal setting. The method proposed by Raubertas [9] for extending the Poisson CUSUM chart to a spatio-temporal setting is often highlighted in the literature on extending purely temporal SPC charts to a spatio-temporal setting. We apply this method in extending the EWMA chart to a spatio-temporal setting. Raubertas [9] proposes pooling counts from adjacent regions to form neighborhoods according to a spatial proximity matrix. If we have m spatial regions, the geographic relationship among the m regions is often quantified through the use of an m x m spatial proximity matrix W, with w_il denoting the measure of closeness between regions i and l [16]. Common weighting schemes include binary weights (w_il = 1 if regions i and l share a common boundary and w_il = 0 otherwise) or inverse distance weights (w_il = 1/d_il^p for some power p, where d_il is the distance between the centroids of regions i and l; w_il = 0 otherwise). Let C_lt be the disease count in region l at time t, and let w_il be a measure of closeness between regions i and l. Then the pooled count for region i at time t is

C*_it = sum over l of (w_il * C_lt).   (3)

Note that we set w_ii = 1, so the counts of region i are given a weight of 1: with binary weights, a weight of 1 is given to the counts of region i and its immediately adjacent neighbors when pooling counts. If an increase in disease counts occurs over an area that consists of more than one region, then pooling counts increases the sensitivity for detecting a shift in the mean disease counts. However, pooling may have a negative effect if the increase in counts occurs in only one region, because of dilution. Since regions with larger populations are likely to have larger disease counts, a correction should be made in calculating the EWMA statistic for each neighborhood. Letting n_i denote the population at risk in region i and beta be a baseline disease rate common to all regions, the expected number of cases in region i is given by the relationship lambda_i = n_i * beta. In adopting Raubertas's [9] method, which was originally proposed for the CUSUM chart, to the EWMA chart, we used the same procedure for pooling counts within neighborhoods and performing the population correction as proposed by Raubertas [9]. For each regional neighborhood consisting of region i and its immediate neighbors, a pooled EWMA statistic was calculated at time t. In calculating the pooled EWMA statistic, the pooled expected number of cases for regional neighborhood i at time t was calculated by pooling the in-control means of region i and its immediately adjacent neighbors, just as when pooling counts. Both the pooled regional neighborhood counts and the pooled expected cases were used in equation (1) to calculate the pooled EWMA statistic for regional neighborhood i.
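The recursion in (1), the asymptotic control limit, and the neighborhood pooling in (3) can be sketched as follows. This is a minimal illustration, not the dissertation's implementation; the three-region connectivity matrix and the counts are hypothetical, and the EWMA smoothing weight of 0.2 is simply a common textbook default.

```python
import numpy as np

def poisson_ewma(counts, mu0, lam=0.2):
    """Poisson EWMA statistics: Z_t = lam*X_t + (1 - lam)*Z_{t-1}, with Z_0 = mu0."""
    z = mu0
    out = []
    for x in counts:
        z = lam * x + (1 - lam) * z
        out.append(z)
    return np.array(out)

def asymptotic_ucl(mu0, lam=0.2, k=3.0):
    """Upper control limit mu0 + k times the asymptotic standard deviation."""
    return mu0 + k * np.sqrt(lam * mu0 / (2 - lam))

# Binary connectivity matrix W for 3 hypothetical regions (w_ii = 1):
# regions 1-2 and 2-3 share boundaries; regions 1 and 3 do not.
W = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
counts = np.array([[4, 5, 3],   # counts for regions 1..3 at t = 1
                   [6, 7, 5],   # t = 2
                   [9, 8, 7]])  # t = 3
pooled = counts @ W.T           # pooled count for region i sums region i and its neighbors

# EWMA statistics for the pooled counts of region 1's neighborhood (in-control mean 9).
z_pooled = poisson_ewma(pooled[:, 0], mu0=9.0)
alarm = z_pooled > asymptotic_ucl(9.0)
```

With a constant in-control stream the statistic stays at mu0, and pooled counts feed the same recursion (1) with the pooled in-control mean, exactly as described above.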
2.3 Improving the Power of the Procedure

In conventional EWMA charts, control limits are used to signal alarms. Instead, we propose using p-values to signal alarms. In the context of SPC methods, this appears to have been first proposed by Li et al. [14], and the approach was utilized in the disease surveillance setting by Dassanayake and French [33]. Li et al. [14] estimated the p-value in several settings by drawing samples from an in-control distribution. In the most general case, when the in-control distribution is unknown, Li et al. [14] proposed using the bootstrap method to estimate the in-control distribution. This idea was extended to the prospective disease surveillance setting in [33], and we adopt a similar procedure here. First, a set of in-control times is specified based on prior knowledge. Next, B bootstrap samples are collected from the set of in-control data (see Appendix B for details of the bootstrap procedure). In a typical bootstrap setting, the responses are treated as being independent and identically distributed, which we do not have in our setting. However, the collection of responses at each time step of the in-control time period has the same joint distribution, and the responses at one time are assumed to be independent of the responses at other times. In order to preserve the spatial structure of the data, each bootstrap sample will correspond to the complete set of responses observed at a particular in-control time. If T denotes the set of in-control times, then the bootstrap samples effectively correspond to sampling times with replacement from T, and then using the collection of responses at each sampled time as a single bootstrap sample. For each bootstrap sample, the corresponding EWMA statistics are computed for each of the m regions. Using these statistics, the empirical in-control distribution of the EWMA statistic for each region is estimated.
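The bootstrap p-value step can be sketched as below. This is an illustrative stand-in, not the dissertation's code: the in-control EWMA values are simulated rather than computed from resampled counts, and the add-one smoothing in the p-value is a small design choice of this sketch to keep p-values strictly positive.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_pvalue(stat_now, in_control_stats):
    """p-value: proportion of bootstrap in-control statistics >= the observed statistic.
    The +1 smoothing keeps the estimate strictly in (0, 1]."""
    in_control_stats = np.asarray(in_control_stats)
    b = in_control_stats.size
    return (np.sum(in_control_stats >= stat_now) + 1) / (b + 1)

# Hypothetical empirical in-control distribution for one regional neighborhood:
# B EWMA-like values that would, in practice, come from B bootstrap resamples
# of the in-control period (each resample preserving the spatial structure).
B = 1000
boot_stats = rng.poisson(5, size=B) * 0.2 + 5 * 0.8  # stand-in for bootstrap EWMA values

# Compare the current pooled EWMA statistic (here 7.5) with the null distribution.
p = bootstrap_pvalue(7.5, boot_stats)
```

A small p-value indicates the current statistic is extreme relative to the no-outbreak distribution; the resulting set of m p-values then feeds the FDR-controlling procedure.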
At each new time point t, observations are collected for each region, and regional neighborhoods are formed by pooling counts of each region and its immediate
48 neighbors. Then, the corresponding pooled EWMA statistics for each regional neighborhood are computed. Afterward s, each of the pooled EWMA statistics is compared with the empirical in control distribution for the corresponding regional neighborhood to calculate a p value for each regional neighborhood. Using the resulting set of p values , alarms are signaled using an appropriate FDR controlling multiple testing procedure. In order to improve the power of our testing a lgorithm, we utilize the FDR [15 ] for error control. An FDR controlling procedure for the CUSUM metho d was previou sly proposed by Li and Tsung [15 ] using a random walk based method for approximating the p values . We employ a similar idea for the EWMA method, but instead utilize the p values estimated using the bootstrap procedure so that numerous FDR con trolling procedure are available to us. The FDR is the expected proportion of false discoveries among all discoveries: , w here V is the total number of true null hypotheses that are rejected (i.e., false alarms) and R is the total number of hypothesis both true and false that are rejected (i.e., all alarms) out of a total of m tests. FDR error control methods have greater power at the expense of increased false alarms. We briefly mention three FDR controlling multiple testing procedures that are relevant to our setting. When m independent test statistics are compared simultaneously [in our setting, ass uming that no pooling is done], Benjamini and Hochberg [11] proposed a method that can be used to control FDR at level . Suppose that m hypotheses are to be tested, with corresponding p values . Let denote the ordered the p values in ascending order, with corresponding ordered hypotheses . Next, f ind the largest j such that
p_(j) ≤ (j/m) α. Finally, reject all hypotheses H_(i) for i ≤ j. We will refer to this method as the BH procedure.

The EWMA statistics calculated at each time step will be correlated across space because of the pooling procedure. Consequently, the BH procedure is not the most appropriate choice. Two popular FDR-controlling procedures for correlated statistics are the Benjamini-Yekutieli procedure and the Storey-Tibshirani procedure. The Benjamini-Yekutieli procedure [12] modifies the original BH procedure and rejects for the largest j such that p_(j) ≤ (j / (m c(m))) α, where c(m) = Σ_{i=1}^{m} 1/i. We will refer to this method as the BY procedure. Storey and Tibshirani [17] argued that the Benjamini and Hochberg [11] procedure is too conservative and proposed a new FDR-based procedure to gain more power. They proposed using the q-value, an FDR-based measure of significance, instead of the p-value, a FWER-based measure of significance. The q-value is the minimum false discovery rate that is incurred when calling a test significant, whereas the p-value is the minimum false positive rate that is incurred when calling a test significant. The computation of the q-value is fairly complicated, but is simplified in practice (see Appendix G); details of the q-value computation are provided in [33]. In summary, the FDR will be controlled at level α when null hypotheses are rejected whenever the associated q-value is less than α. We will refer to this procedure as the ST procedure.

2.4 Summarization of the Testing Procedure

The first step of the proposed method is to define an allowable FDR level α for all m regions. Then, at each time step t, disease counts are collected in each of the m regions. Next, the regional neighborhoods are formed by pooling the counts of each region and its adjoining neighbors using the Raubertas method in (3). Using these regional neighborhood
counts, the corresponding EWMA statistics are computed for each regional neighborhood, with the individual regional counts replaced by the pooled neighborhood counts. Bootstrap methods are used to approximate the in-control distributions (null distributions) of the pooled EWMA statistics for each regional neighborhood. From the in-control period (the no-outbreak period), B bootstrap samples are drawn with replacement; each bootstrap sample is a vector of pooled neighborhood counts from each of the m neighborhoods at a given time in the in-control period. For each bootstrap sample, pooled EWMA statistics are calculated for each of the m regions. The B bootstrap samples are used to construct a bootstrap distribution for each region, and the p-value for each region is then computed as the proportion of bootstrap EWMA statistics that are at least as large as the observed statistic. The BY or ST procedure is then used to determine the regions for which an alarm should be signaled.

In Section 3, three different FDR-controlling procedures will be used. For the baseline model using independent regional counts, the BH procedure will be used. For the proposed pooled-count model, two different FDR-controlling procedures will be used: first, to show how pooling improves performance, the BY procedure will be applied to the dependent regional neighborhood statistics; second, to gain more statistical power, the ST procedure will be used.

2.5 Comparison of the Proposed Method with Conventional SPC Methods

The proposed method is more appropriate for disease surveillance than the conventional methods for the following reasons: (i) the method uses p-values rather than the critical values of conventional SPC methods; (ii) FDR error control, which is more appropriate for disease surveillance, is used in place of the average run length (ARL), the standard measure in SPC; (iii) the use of FDR facilitates the adoption of more powerful multiple testing procedures, such as the BH, BY, and ST procedures, compared to more conservative FWER-based procedures such as the Bonferroni correction.
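The BH and BY step-up rules described above can be implemented in a few lines. This is an illustrative sketch; the function name and the boolean return format are our own choices.

```python
import numpy as np

def fdr_rejections(pvals, alpha=0.05, dependence=False):
    """Step-up FDR procedure: BH by default; with dependence=True, applies the
    Benjamini-Yekutieli correction c(m) = sum_{i=1}^m 1/i for correlated tests."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1)) if dependence else 1.0
    thresholds = alpha * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        j = np.nonzero(below)[0].max()    # largest j with p_(j) <= threshold
        reject[order[: j + 1]] = True     # reject H_(1), ..., H_(j)
    return reject
```

The ST procedure instead converts the p-values to q-values (software implementations of Storey's estimator exist) and rejects whenever a q-value falls below α.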
With conventional SPC methods, critical values are used to signal alarms; the proposed method instead uses p-values, which are better suited for disease surveillance for the following reasons:

1. A p-value is a common statistical measure used in a wide variety of applications involving hypothesis testing, whereas critical values are not as commonly used.

2. With p-values, one can easily quantify the strength of evidence for an outbreak, for example, as no evidence, weak evidence, moderate evidence, or strong evidence. With the conventional critical-value approach, quantifying the outbreak signal is not straightforward.

The use of FDR is more fitting for disease surveillance than the ARL. The concept of the ARL originated in an industrial process control setting. When a process is in operation, the period from the start of the process to the point where the control chart crosses a threshold is considered a run. Sometimes, even when the process is in control, the control chart can cross the threshold, signaling a false alarm; this is similar to a Type I error in classical hypothesis testing. The in-control ARL is defined as the average time between two false alarms. The in-control ARL and the Type I error rate are directly related: the reciprocal of the in-control ARL is the Type I error rate [2]. Employing FDR as an error-control measure in disease surveillance is more appropriate than using the ARL for the following reasons:

1. FDR has a more meaningful interpretation in disease surveillance than the Type I error rate, the reciprocal of the in-control ARL. FDR is the proportion of false alarms among all alarms, whereas the Type I error rate is a proportion of false alarms among all hypothesis tests performed. For example, an FDR of 0.05 means that, on average, if there are 100 alarms, then 5 of those alarms are false alarms. However, if the Type I error rate is 0.05,
then out of 100 tests (equivalently, 100 time intervals, as a test is performed at each time interval) we would expect 5 false alarms. Clearly, it is more useful to know the proportion of false alarms among all alarms than among all tests.

2. Predictive value positive (PVP) is one of the seven attributes used by the Centers for Disease Control and Prevention to evaluate a new surveillance system. PVP is the proportion of true alarms among all alarms, so FDR is the complement of PVP (i.e., FDR = 1 − PVP).

3. As described above, the ARL is related to the concept of a run in industrial process control: a run is defined as the time from the start of the process to the point when an alarm is signaled and the process is stopped. However, the concept of a run is not meaningful in disease surveillance, as we cannot terminate an outbreak the way an industrial process line can be stopped.

With conventional multiple SPC charts, FWER-based control procedures, such as the Bonferroni correction, are used to address the multiple testing issue. These methods are known to be conservative. With FDR error control, a host of more powerful multiple testing procedures, such as the BH, BY, and ST procedures, becomes available.

2.6 Comparison of the Proposed EWMA-based Method with the Poisson CUSUM Method

Dassanayake and French [33] proposed a method for the Poisson CUSUM chart that is similar to the one here. The method requires that the disease data follow a Poisson model. In the design of a Poisson CUSUM chart, a reference value is specified that minimizes the time to detect a specific shift in the mean. This reference value is determined assuming a Poisson count model. Hawkins and Olwell [28] recommend checking this assumption, for example with a dispersion test, since the CUSUM chart is sensitive to departures from the assumed Poisson
model. Although the Poisson model is reasonable for count data, there is no guarantee that disease counts follow the desired Poisson model. A better alternative is therefore the EWMA chart, which does not require the counts to follow a specific distribution. With less stringent assumptions, the EWMA chart is more flexible for disease surveillance than the Poisson CUSUM chart.

It is important to point out that the EWMA chart is an application of simple exponential smoothing. The input to the chart therefore does not have to be a discrete count; it can just as well be continuous. The proposed method can thus also be used to identify space-time clusters of continuous measures. For example, it can detect unusual buildups in pollution levels over multiple geographic regions; high pollution levels affect the health of the populations living in those areas. The EWMA chart therefore extends the method to other health-monitoring applications as well.

3 Simulation Study

3.1 Experiment Setup

For our spatial domain we considered a five-by-five grid consisting of 25 regions. Disease counts were simulated in each of the twenty-five regions for a 100-day period. No outbreak was simulated for days 1-50, the in-control period; for this period, independent Poisson counts were generated in each region with a constant in-control mean of 4. For days 51-100, an outbreak was simulated by increasing the mean in each region. These out-of-control means peaked in the center region and tapered towards the edges, as illustrated in Figure 10. Regions 1, 5, 21, and 25 received a 0.2 standard deviation increase in the mean (to a mean of 4.4). Regions 2, 3, 4, 6, 10, 11, 15, 16, 20, 22, 23, and 24 were given a slightly higher 0.3 standard deviation increase in the
mean (to a mean of 4.6). The eight regions adjacent to the center region, regions 7, 8, 9, 12, 14, 17, 18, and 19, were given a higher 0.75 standard deviation increase (to a mean of 5.5). The center region, region 13, was given the highest, a 1 standard deviation increase (to a mean of 6). The FDR level at each test time was set to 0.05. An EWMA statistic was designed for each region with weighting constant λ and an in-control process mean of 4 (Montgomery [36] recommends selecting a value of λ in the range 0.05 ≤ λ ≤ 0.25). To generate empirical in-control (no-outbreak) distributions, B = 10,000 bootstrap samples were drawn with replacement from the no-outbreak period (days 1-50).

Figure 10. A heat map of the mean of each region during the outbreak period, days 51-100. The out-of-control mean for each region is indicated in black at the center of each region, while the region numbers are shown at the top center in blue.

3.2 Simulation Results

Figure 11 and Figure 12 show the results of a single simulation over a 100-day period for two specific regions in the 25-region area. Figure 11 shows the results for region 1, which received the smallest shift in the mean, a 0.2 standard deviation increase. Figure 12 shows the results for region 13, which received the largest shift in the mean, a one standard deviation
increase. The two figures can be compared to see how the magnitude of the mean shift affects the speed of outbreak detection. The performance of three models was compared in terms of speed of detection and false alarms. The first model is the baseline model using the BH procedure with the independent counts of each region. The second and third models are the proposed models pooling counts of regional neighborhoods, using the BY and ST procedures, respectively.

Figure 11 shows the results of a single simulation for region 1. Recall that for all regions, during the no-outbreak period from days 1-50, the in-control mean was set to 4; region 1 received the smallest shift in the mean, a 0.2 standard deviation increase, so during the outbreak period from days 51-100 the mean shifted up from 4 to 4.4 in region 1. Figure 11(a) shows the simulated disease counts for region 1 (blue) and the pooled counts (red) for region 1 and its adjacent neighbors, regions 2, 6, and 7. Figure 11(b) shows the EWMA statistic for the independent model, using only the counts of region 1 (blue), and for the pooled model, using the counts of regions 1, 2, 6, and 7 (red). Note the gap between the independent and pooled EWMA statistics, which simply reflects the fact that the pooled statistic is necessarily larger since it adds the counts from neighboring regions. Also note that there is little variation in the EWMA statistic for the independent model (blue), whereas for the pooled model there is a noticeable increase in the EWMA statistic during the outbreak period from days 51-100 (red). Figure 11(c) shows the difference between the p-values or q-values and their corresponding thresholds.
Specifically, the blue line shows the difference between the p-value and its threshold for the independent model for region 1 using the BH procedure; the red line shows the difference between the p-value and its threshold for the pooled model using the BY procedure; the yellow line shows the difference between the q-value and 0.05, the overall FDR
level for the pooled model using the ST procedure. The grey horizontal line indicates a zero difference; an alarm is signaled whenever a difference curve exceeds the grey line. Figure 11(d) displays whether an alarm is sounded at each time step (1 = alarm, 0 = no alarm). Although the outbreak starts on day 51, the independent model (blue) detects it only on day 80, after a 29-day delay. The pooled model is much faster in signaling alarms: the pooled model using the BY procedure (red) detects the outbreak on day 53, just two days after onset, although it produces persistent but discontinuous alarms over the outbreak period. The pooled model using the more powerful ST procedure detects the outbreak as it happens on day 51 and continues to sound a steady alarm during the outbreak period from days 51-100. It is important to emphasize the sensitivity of the pooled model using the ST procedure: even a slight 0.2 standard deviation increase in the mean can be continuously detected with this model.
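The data-generating process of Section 3.1 can be sketched as follows. The helper names and the row-major region numbering (region r at index r − 1) are assumptions of this illustration; a Poisson mean of 4 has standard deviation 2, so shifts of 0.2, 0.3, 0.75, and 1.0 standard deviations give means 4.4, 4.6, 5.5, and 6.0.

```python
import numpy as np

def outbreak_means():
    """Out-of-control means for the 25 regions of the 5x5 grid (row-major).
    Shifts are multiples of the in-control standard deviation, sqrt(4) = 2."""
    mu = np.full(25, 4.0)
    mu[[0, 4, 20, 24]] = 4.4                                # regions 1, 5, 21, 25: +0.2 sd
    mu[[1, 2, 3, 5, 9, 10, 14, 15, 19, 21, 22, 23]] = 4.6   # edge regions: +0.3 sd
    mu[[6, 7, 8, 11, 13, 16, 17, 18]] = 5.5                 # ring around center: +0.75 sd
    mu[12] = 6.0                                            # center region 13: +1.0 sd
    return mu

def simulate_counts(T=100, change=50, seed=0):
    """Independent Poisson counts: mean 4 before the change point, outbreak means after."""
    rng = np.random.default_rng(seed)
    pre = rng.poisson(4.0, size=(change, 25))
    post = rng.poisson(outbreak_means(), size=(T - change, 25))
    return np.vstack([pre, post])
```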
Figure 11. Summary plots for region 1 over time: (a) disease counts, (b) EWMA statistics, (c) differences, and (d) alarms. In plot (c), the blue line indicates the difference between the p-value and its threshold under the BH procedure for the independent model; the red line shows the difference between the p-value and its threshold for the pooled model using the BY procedure; the yellow line displays the difference between the q-value and 0.05, the FDR level for all 25 regions, for the pooled model using the ST procedure. The grey horizontal line indicates a difference of zero.
Figure 12 provides the same information as Figure 11, but for region 13, which received the largest shift in the mean (a one standard deviation increase). For region 13, the mean during the no-outbreak period is 4, as for all other regions. In Figure 12(a), the blue line shows the disease counts for region 13 and the red line shows the pooled counts for region 13 and its adjacent neighbors, regions 7, 8, 9, 12, 14, 17, 18, and 19. Figure 12(b) shows the EWMA statistics for both the independent (blue) and pooled (red) models. Again, notice that there is a clear increase in the EWMA statistic for the pooled model during the outbreak period from days 51-100, while the increase is less noticeable for the independent EWMA statistic. Compared to the alarm plots for region 1 shown in Figure 11(d), the alarm plots for region 13 shown in Figure 12(d) display noteworthy differences. The independent model (blue) starts to signal alarms much earlier, at day 57. As the mean increase during the outbreak is larger in region 13 than in region 1, there are more alarms even with the independent model during the outbreak period, although they are irregular. The pooled model using the BY procedure (red) starts to signal on day 53 and sounds continuous alarms from days 55-100, unlike the intermittent alarms in region 1 during the outbreak period. The ST procedure again detects the outbreak at its very beginning on day 51 and sounds steady alarms throughout the outbreak period from days 51-100.
Figure 12. Summary plots for region 13: (a) disease counts, (b) EWMA statistics, (c) differences, and (d) alarms. In plot (c), the blue line indicates the difference between the p-value and its threshold under the BH procedure for the independent model; the red line shows the difference between the p-value and its threshold for the pooled model using the BY procedure; the yellow line displays the difference between the q-value and 0.05, the FDR level for all 25 regions, for the pooled model using the ST procedure. The grey horizontal line indicates a difference of zero.
To verify that the three procedures control the FDR at the specified level, 100 independent simulations were carried out for each model using the experimental procedure outlined in Section 3.1. For each simulation, the proportion of false discoveries was calculated for each procedure. Boxplots of these proportions are displayed in Figure 13. The ST procedure, the most sensitive of the three, shows the most variation in the proportion of false discoveries, though the proportions were generally below the specified level. The BY procedure appears very conservative compared to the ST procedure. The BH procedure produces false discovery proportions similar to the ST procedure, but it has less power because it does not pool counts.

Figure 13. Boxplots of the proportion of false discoveries in each simulation for all three procedures over 100 simulations. The far-left plot is for the independent model using the BH procedure, the middle plot is for the pooled model using the BY procedure, and the far-right plot is for the pooled model using the ST procedure.
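The per-simulation quantity plotted in Figure 13 is the false discovery proportion V/R, which can be computed directly from the rejection decisions; the function name is our own, and the convention V/R = 0 when nothing is rejected matches the FDR definition used in Section 2.3.

```python
import numpy as np

def false_discovery_proportion(rejected, true_null):
    """V/R for one simulation: V = rejected true nulls (false alarms),
    R = all rejections; defined as 0 when nothing is rejected."""
    rejected = np.asarray(rejected, dtype=bool)
    true_null = np.asarray(true_null, dtype=bool)
    R = int(rejected.sum())
    V = int((rejected & true_null).sum())
    return V / R if R else 0.0
```

Averaging this proportion over independent simulations estimates the realized FDR of a procedure.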
3.3 Simulation Comparing EWMA and CUSUM Charts

Next, a simulation was conducted to compare the proposed EWMA method to the CUSUM method proposed by Dassanayake and French [33]. The standard CUSUM chart has a known drawback: once it starts to signal alarms, it can take considerable time to stop signaling even after the outbreak is over. We wanted to compare the performance of the two methods in this respect. A single 100-week simulation, similar to the one described in Section 3.2, was conducted. For the no-outbreak period, weeks 1-50, independent Poisson counts were generated for each region with a constant in-control mean of 4. However, unlike the simulation described in Section 3.2, an outbreak was simulated only for weeks 51-75, with out-of-control Poisson means as displayed in Figure 10. Since the outbreak ends on week 75, the remaining weeks 76-100 are again a no-outbreak period with an in-control Poisson mean of 4. The resulting alarm plots for the three models for region 13 are displayed in Figure 14. Figure 14(a) shows the EWMA method using the BH procedure (orange dotted line) signaling an alarm starting from week 59 and continuing to sound alarms until week 75, with few interruptions in between (the simulated outbreak spans weeks 51-75). The CUSUM statistic using the BH procedure (purple dotted line) starts to sound alarms on week 61 and signals steadily until week 80: although the outbreak is over on week 75, it continues to sound alarms until week 80. The CUSUM statistic takes much longer to drop after the outbreak has ended, leading to more false alarms. Notice that in Figures 14(b) and (c), using the BY and ST procedures respectively, the EWMA method detects the outbreak faster than the CUSUM method, and it also ceases to sound an alarm almost immediately after the outbreak has concluded, while the CUSUM method continues to sound alarms. With all three procedures,
the EWMA statistic performs better at correctly signaling the end of the outbreak than the CUSUM statistic.

Figure 14. Alarm plots for region 13, for a simulated outbreak spanning weeks 51-75. Figure 14(a) shows the alarm plots for the independent model using the BH procedure for the EWMA (orange) and CUSUM (purple) statistics. Figure 14(b) shows the alarm plots for the pooled model using the BY procedure for the EWMA (orange) and CUSUM (purple) statistics. Figure 14(c) shows the alarm plots for the pooled model using the ST procedure for the EWMA (orange) and CUSUM (purple) statistics.
3.4 Comparing the Performance of the Three Procedures

Following Robertson et al. [19], we used two standard measures to compare the performance of the three procedures in surveillance systems: one measuring detection speed and the other measuring the false alarm rate. The conditional expected delay (CED) is a measure commonly used in surveillance to compare detection speeds; the probability of a false alarm (PFA) is a common measure used to compare false alarm rates. Frisén [25] discusses these two measures, along with some others, in the context of surveillance. The conditional expected delay is the average delay from a specific change point τ, the time at which the actual change in the mean occurs. Let t_A be the time at which the procedure signals an alarm. Then the CED can be defined as CED = E[t_A − τ | t_A ≥ τ]. Frisén [25] defines the second performance measure, the probability of a false alarm, as PFA = P(t_A < τ). According to this definition, PFA is the probability that an alarm occurs before the actual change in the mean at time τ, as persistent outbreaks are considered. Since we are considering outbreaks of limited duration, we use a similar measure, the false positive rate (FPR): the proportion of false alarms before or after the outbreak, as false alarms can occur both before and after the outbreak. In calculating CED and FPR, we considered outbreaks starting on week 51 with durations of 2, 3, 4, 5, 6, 7, 8, 9, and 10 weeks. For example, an outbreak of duration 2 weeks starting at week 51 will start on week 51 and end on week 52. For each duration, we simulated 100
data sets and calculated the detection delay in each simulation for all three procedures; we then averaged the delays for each model to estimate the CEDs of the three procedures. Plots of the estimated CEDs are displayed in Figure 15 (see Appendix C for additional CED plots). First, consider the EWMA pooled model using the BY (red solid line) and ST (orange solid line) procedures. Recall that a lower CED means better performance, indicating shorter delays in detection, and that region 1 received a 0.2 standard deviation increase, region 2 a 0.3 standard deviation increase, region 7 a 0.75 standard deviation increase, and region 13 a 1.0 standard deviation increase. Regardless of the magnitude of the increase in the means, for all four regions the ST procedure performs consistently better (lower CED) than the BY procedure across all outbreak lengths. The CEDs for the pooled EWMA model using the ST procedure are close to zero for most outbreak lengths, indicating speedy detection of outbreaks of varying magnitudes. Compared to the model using the ST procedure, the model using the BY procedure shows substantial delays, particularly for smaller shifts in the mean: for regions 1 and 2 the CEDs are between 2 and 4 weeks; for regions 7 and 13 they are between 1 and 2 weeks. As pointed out earlier, speedy detection of an outbreak is critical in disease surveillance; the ST procedure is therefore preferred over the BY procedure.
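Estimating the CED from the simulations amounts to averaging the delays t_A − τ over the runs that alarm at or after the change point. A minimal sketch (the function name and the NaN convention for runs with no qualifying alarm are our own choices):

```python
import numpy as np

def estimated_ced(first_alarm_times, tau):
    """Estimate CED = E[t_A - tau | t_A >= tau] from the first alarm time of each
    simulated run; runs that alarm only before tau are excluded by the conditioning."""
    t = np.asarray(first_alarm_times, dtype=float)
    detected = t[t >= tau]
    return (detected - tau).mean() if detected.size else float("nan")
```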
Figure 15. CEDs for both the EWMA and CUSUM methods for the pooled model using the BY and ST procedures in regions 1, 2, 7, and 13.

Second, the detection speeds of the pooled EWMA and CUSUM statistics can be compared using Figure 15. The EWMA statistic using the BY procedure appears to perform somewhat better (lower CED) than the CUSUM statistic using the BY procedure for most outbreak lengths in regions 1, 2, and 7. However, for region 13, with the highest shift in the mean, the performance of the two statistics using the BY procedure is comparable. On the other
hand, the EWMA and CUSUM statistics using the ST procedure perform comparably across all outbreak lengths, for all regions, regardless of the size of the shift in the mean. However, a careful inspection of the plots reveals that the pooled EWMA statistic using the ST procedure is slightly faster (a statistically significant difference) than the pooled CUSUM statistic using the ST procedure for medium to large increases in the mean (regions 2, 7, and 13), across all outbreak lengths considered. Overall, in terms of detection speed, the pooled EWMA statistic using the ST procedure has a slight edge over the pooled CUSUM statistic using the ST procedure for medium to large shifts in the mean.

Next, we considered the second performance measure to decide whether one statistic is better than the other. Figure 16 shows the FPR plots for regions 1, 2, 7, and 13 (see Appendix C for additional FPR plots). Using the same experimental setup as for the CED plots, outbreaks of lengths 2 through 10 weeks were considered, starting at week 51. For each outbreak length, 100 data sets were simulated, and the false positive rates of the simulations were averaged to estimate the FPR for each of the three procedures. For example, for outbreak length 2, 100 data sets were simulated and the resulting false positive rates were averaged for each procedure. The proportion of false alarms in each simulation was calculated by dividing the total number of false alarms by the total number of no-outbreak time periods (the number of time periods when the null hypothesis is true). FPRs were calculated in the same way for the other outbreak lengths. A lower FPR indicates better performance (other considerations being held constant), as false alarms are undesirable.
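The per-run FPR calculation described above can be sketched directly from the alarm and outbreak indicator vectors; the function name and boolean-vector interface are assumptions of this illustration.

```python
import numpy as np

def false_positive_rate(alarms, outbreak):
    """alarms and outbreak are boolean vectors over the study period for one run.
    FPR = (false alarms) / (number of no-outbreak time points); alarms both
    before and after the simulated outbreak count as false."""
    alarms = np.asarray(alarms, dtype=bool)
    outbreak = np.asarray(outbreak, dtype=bool)
    return alarms[~outbreak].mean()
```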
Figure 16. FPRs for both the EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in regions 1, 2, 7, and 13.

First, we compared the performance of the EWMA pooled model using the BY procedure (red solid line) with the EWMA pooled model using the ST procedure (orange solid line). In all four regions, with their different shifts in means, the more sensitive EWMA pooled model using the ST procedure has somewhat more false alarms, as expected, than the EWMA pooled model using the BY procedure. Also note the slight upward trend in the curves as the duration of the outbreak increases: for longer outbreaks, both the CUSUM and the EWMA
statistics take longer to signal the end of the outbreak, resulting in more false alarms. Compared to the EWMA, the CUSUM takes substantially more time to signal the end of the outbreak, resulting in increased false alarms for longer outbreaks with larger shifts in the mean (as illustrated in the plots for regions 2, 7, and 13). Furthermore, the distance between the red and orange lines is smaller for region 13 than for region 1: with a more prominent shift in the mean, region 13 has fewer false alarms than region 1.

Second, we compared the performance in terms of FPR for the pooled EWMA and CUSUM statistics. The red dotted line shows the CUSUM statistic using the BY procedure and the red solid line shows the EWMA statistic using the BY procedure. Clearly, across all regions, the EWMA statistic using the BY procedure performs better (lower FPR) than the CUSUM statistic using the BY procedure. Furthermore, as the magnitude of the outbreak increases from region 1 to region 13, the advantage of the EWMA statistic using the BY procedure over the CUSUM statistic grows. Next, we compared the performance of the pooled EWMA statistic using the ST procedure (orange solid line) with the pooled CUSUM statistic using the ST procedure (orange dotted line). For smaller shifts in the mean, in regions 1 and 2, the two models perform comparably. However, for larger shifts in the mean, in regions 7 and 13, the EWMA statistic using the ST procedure performs better than the CUSUM statistic using the ST procedure for most outbreak lengths; noticeably, for larger outbreak lengths, the performance with the ST procedure is even better than with the BY procedure. Also, the advantage of the EWMA statistic over the CUSUM statistic under the ST procedure is greater in region 13 than in region 7, as the magnitude of the outbreak is larger, with a more prominent shift, in region 13.
In summary, although the EWMA and CUSUM models perform comparably in terms of detection speed (CED), the EWMA model outperforms the CUSUM model in terms of the false positive rate (FPR) (see Appendix E for additional plots of the observed FDR as a function of outbreak length for both the EWMA and CUSUM models).

4 Salmonella Newport Case Study

Weekly counts of Salmonella Newport cases in each of the 16 German federal states were available from the Robert Koch Institute, Germany [27], from the first week of January 2004 (week 1) to the second week of February 2014 (week 528). The proposed method was applied to this data set for outbreak detection. Since Salmonella is a non-contagious disease without trend or seasonality, independence of the disease counts was assumed. A map of the 16 German federal states is displayed in Figure 17. Since there were no unusually high disease counts during the first two years, 2004-2005 (weeks 1-104), data from this period were used to calculate the in-control mean for each state. It is also reasonable to assume spatial independence of the counts during this in-control period, as verified by spatial analysis. The proposed method was applied to the data from 2006-2014. Since the population is unevenly distributed within the country, certain states with larger populations, such as Bavaria, had high expected disease counts, while others with smaller populations, such as Bremen, had low expected disease counts. To account for this, a population adjustment was performed following the method of Raubertas [9], explained in Section 2.2. Regional neighborhoods were formed by pooling the counts of each state and its neighbors sharing a common boundary, using a binary weights matrix. The method outlined in Section 2.4 was applied to the Salmonella data to identify outbreaks of Salmonella cases, utilizing 10,000
bootstrap samples and an FDR level of 0.05 at each time step. Both the BY and ST procedures were used to account for the multiple testing problem. As a baseline, the EWMA method was also applied without pooling neighboring counts, using the BH procedure to control the FDR at 0.05.

Figure 17. Map of the 16 German federal states.

Figures 18 and 19 show the results for the states of Bavaria and Rhineland-Palatinate. The x-axis shows the week number of the time period of interest, where the first week of January 2004 is week 1. The actual outbreak occurred in the first week of November 2011 [29] (week 410), so the results are shown from the second week of September 2010 (week 350) to the fourth week of July 2013 (week 500). Of the three procedures, the ST procedure using the pooled
model speedily identified the outbreak at its start on week 410 in all 15 states (two of the 16 states, Saarland and Rhineland-Palatinate, were combined). The pooled model using the BY procedure also identified the outbreak on week 410 in all states, as a substantial shift in the counts occurred at the onset of the outbreak. However, the baseline model using only the individual state counts produced delayed alarms: in one state the outbreak was detected two weeks later, and in five other states it was detected a week later. Figure 18(c) shows the alarms in the state of Bavaria for the three models. The pooled model using the ST procedure (as well as the BY procedure) promptly identified the outbreak on week 410, when the actual outbreak started. With the baseline model using only the counts of Bavaria, however, the outbreak was detected a week later, on week 411. Also note that there is a relatively smaller peak on week 460. The pooled model using the more sensitive ST procedure starts to signal alarms from week 456, as the counts start to climb; the pooled model using the BY procedure picks up the alarm only on week 460, four weeks later. The independent count model, however, does not detect this peak at all. Figure 19(c) shows the alarm plots for the state of Rhineland-Palatinate. Again, both the pooled models using the ST and BY procedures signal the outbreak promptly on week 410. In addition, the pooled model using the ST procedure picks up a small cluster of increased counts around week 450, while the other two models do not signal any alarms.
Figure 18. Disease counts (a), EWMA statistics (b), and alarm plots (c) for the state of Bavaria. The blue lines in plots (a), (b), and (c) represent the independent count model. The red lines in plots (a) and (b) indicate the pooled count model. In plot (c) the red line indicates the alarms for the pooled model using the BY procedure, and the orange line shows the alarms for the pooled model using the ST procedure.
Figure 19. Disease counts (a), EWMA statistics (b), and alarm plots (c) for the state of Rhineland-Palatinate. The blue lines in plots (a), (b), and (c) represent the independent count model. The red lines in plots (a) and (b) indicate the pooled count model. In plot (c) the red line indicates the alarms for the pooled model using the BY procedure, and the orange line shows the alarms for the pooled model using the ST procedure.

Compared to the EWMA statistic, the CUSUM statistic has a major disadvantage: the CUSUM statistic continues to signal alarms long after the outbreak is over. Figure 20(c) shows the results for Bavaria using the pooled models for both the CUSUM and EWMA statistics with the ST procedure. Both statistics sound alarms at the beginning of the outbreak on week 410. However, the CUSUM statistic takes longer to signal the end of the outbreak: while the EWMA statistic signals the end of the outbreak on week 425, the CUSUM statistic signals the end of the outbreak on week 434, nine weeks later. Also, note the second peak after week 450. Both the CUSUM and the EWMA statistics start to signal alarms on week 456. However, the EWMA statistic signals the end of the outbreak on week 463, while the CUSUM statistic takes five additional weeks to signal the end of the outbreak, on week 468. As shown in Figure 20(b), the CUSUM statistic builds up more rapidly at the onset of the outbreak compared to the EWMA statistic and takes longer to stabilize. The CUSUM statistic increases rapidly and takes longer to return to zero because the only factor that decreases the statistic is a constant parameter. The EWMA statistic, in contrast, decays quickly because the weights on past observations fall off exponentially, as in a geometric series [32].
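The contrast between the two statistics' post-outbreak behavior can be reproduced with a small numerical sketch. The parameter values below (weight λ = 0.3, CUSUM reference value k = 5, baseline mean 4, spike mean 12) are arbitrary illustrations, not values from the dissertation:

```python
def ewma_path(counts, lam=0.3, z0=4.0):
    """Poisson EWMA recursion: z_t = lam * x_t + (1 - lam) * z_{t-1}."""
    z, path = z0, []
    for x in counts:
        z = lam * x + (1 - lam) * z
        path.append(z)
    return path

def cusum_path(counts, k=5.0):
    """One-sided CUSUM: s_t = max(0, s_{t-1} + x_t - k); the constant
    reference value k is the only force pulling the statistic down."""
    s, path = 0.0, []
    for x in counts:
        s = max(0.0, s + x - k)
        path.append(s)
    return path

# Baseline mean 4, a 5-step spike at mean 12 (steps 10-14), baseline again.
series = [4] * 10 + [12] * 5 + [4] * 40
ew, cu = ewma_path(series), cusum_path(series)

# Steps after the spike until each statistic returns near its baseline:
ewma_steps = next(t for t in range(15, 55) if ew[t] < 5.0) - 15    # 5 steps
cusum_steps = next(t for t in range(15, 55) if cu[t] == 0.0) - 15  # 34 steps
```

The EWMA falls back within a handful of steps because its memory of the spike decays geometrically, while the CUSUM, drained only by the constant k per step, needs several dozen steps; this mirrors the delayed end-of-outbreak signals of the CUSUM in Figure 20.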
Figure 20. Disease counts (a), EWMA and CUSUM statistics for the pooled model using the ST procedure (b), and alarm plots (c) for the state of Bavaria. The blue and red lines in plot (a) represent the independent counts and the pooled counts, respectively. The orange lines in plots (b) and (c) represent the pooled EWMA model using the ST procedure, while the purple lines depict the pooled CUSUM model using the ST procedure.
5. Discussion

In the proposed new procedure, aggregated disease counts in multiple geographic regions are used for surveillance of disease outbreaks using the popular EWMA statistic. Two new features of the method are practically beneficial in a disease surveillance setting. First, the use of p-values, instead of the conventional critical values, makes it possible to quantify the strength of evidence for an outbreak. Second, the use of FDR error control, in place of the traditional FWER methods, makes it possible to utilize a variety of more powerful multiple testing procedures such as the BH, BY, and ST procedures. Of the three procedures, the ST procedure is the most powerful and detects outbreaks the fastest.

The use of the EWMA statistic over the Poisson CUSUM statistic has several benefits in disease surveillance. First, the Poisson CUSUM method requires the disease counts to follow a Poisson model and is sensitive to deviations from this assumption, as explained in section 2.5 [28]. The proposed method using the EWMA statistic, in contrast, does not require the counts to follow any specific distribution. Also, as illustrated by the results, the proposed EWMA model has a reduced false alarm rate compared to the Poisson CUSUM method: the EWMA statistic is much quicker in signaling the end of the outbreak. It is important to point out that FDR is controlled at each time step in the proposed methodology. Also, the proposed method pools counts of neighboring units to signal an alarm. As pointed out by Raubertas [9], pooling may increase the speed of detection when the shift in the mean occurs in several units of the regional neighborhood. However, pooling may delay detection if the shift occurs in only one or two regions in the neighborhood. In addition, pooling may result in an increase in the false alarms of the neighboring regions.

As mentioned before, the input to the method does not have to be discrete counts; continuous inputs are also valid. Therefore, one possible future research direction would be to study the application of the method to continuous inputs: for example, monitoring high pollution levels in multiple geographic regions. Furthermore, the current model is applicable for detecting food-borne outbreaks, a class of outbreaks without trend or seasonality. Another possible future research direction would be to generalize the method to detect outbreaks having seasonality, and also to handle disease counts that are correlated across time and/or space.
CHAPTER IV
DETECTION OF SPATIO-TEMPORAL CLUSTERS IN MULTIPLE GEOGRAPHIC REGIONS

Emerging disease outbreaks must be detected as early as possible for public health authorities to take prompt preventive measures. Disease surveillance is the primary method used by public health practitioners to monitor the spread of an outbreak. Most public health surveillance systems record the time and location of reported cases [1]. Space-time surveillance methods utilize both the time and location information to identify emerging space-time disease clusters. Most of these methods consider an area of study consisting of small geographic units where disease counts in each unit are collected sequentially over time. In a retrospective analysis seeking to identify whether a disease outbreak occurred in the past, a single decision is made for the entire data set, as in classical hypothesis testing. In prospective surveillance, by contrast, decisions are made about whether there is currently a disease outbreak based on the data available so far. Prospective surveillance is also referred to as on-line surveillance in the literature. The aim of prospective surveillance is to correctly identify an outbreak as quickly as possible to initiate public health interventions.

Prospective surveillance methods are classified by Unkel et al. [2] as time series methods, regression-based methods, statistical process control (SPC) methods, methods incorporating spatial information, and multivariate detection methods. Of these, SPC methods that originated in industrial process control have been successfully adopted for practical disease surveillance purposes [4]. Specifically, the Exponentially Weighted Moving Average (EWMA) method is a popular SPC method that has been used in the Early Aberration Reporting System (EARS) developed by the Centers for Disease Control and Prevention (CDC) and in ESSENCE (Electronic Surveillance System for the Early Notification of Community-based Epidemics) developed by the Department of Defense and Johns Hopkins University [4].

The EWMA is a statistic that is widely used to monitor industrial processes; it gives more weight to recent observations and less weight to past observations. The starting value of the EWMA statistic is a target value to be monitored, which is often taken to be the process mean in an industrial process control setting; in a disease surveillance setting, this is the mean disease count when there is no outbreak present. The center line for the EWMA chart shows the mean for the in-control process, and control limits are pre-defined at a fixed number of standard deviations above or below the center line (the standard deviation of the EWMA statistic is also estimated from historical data). An alarm is signaled, indicating the process is out of control, when the EWMA statistic crosses the control limits. When the process is in control, we do not want to sound an alarm; when the process is out of control, an alarm should be signaled. Adopting SPC language for prospective disease surveillance, an out-of-control process corresponds to a disease outbreak and an in-control process to no outbreak. We wish to signal an alarm anytime there is a disease outbreak.

The original EWMA statistic was designed to identify purely temporal clusters. The EWMA method was originally proposed in an industrial process control setting for normally distributed data by Roberts [7] and was subsequently extended to Poisson count data by Borror [8]. The purely temporal SPC charts have also been extended for identifying space-time clusters: namely, the Poisson CUSUM chart was extended for spatio-temporal cluster detection by Raubertas [9]. The method captures local spatial information by combining counts within
each region and its neighboring regions sharing a common boundary. These pooled counts are used to calculate a pooled statistic for each regional neighborhood. An alarm is signaled whenever a regional neighborhood statistic exceeds a predefined threshold, indicating a rise in counts in that regional neighborhood. Subsequently, Rogerson and Yamada [10] extended the purely temporal Poisson CUSUM chart in two ways: first, similar to the method outlined by Raubertas [9], the Poisson CUSUM chart was extended to monitor counts in regional neighborhoods; second, the Poisson CUSUM chart was extended to allow expected counts to vary over time. The multiple testing problem that arises when testing multiple neighborhood statistics simultaneously was handled by controlling the Family-Wise Error Rate (FWER), using the popular Bonferroni correction.

Although pooling counts is a simple way of capturing spatial information, there are more sophisticated methods to model spatial autocorrelation in a given neighborhood. Spatial autocorrelation refers to the correlation among measurements taken at different locations, such as the correlation among disease counts in different regions. Local Indicators of Spatial Association (LISA) [37] provide a local measure of similarity between a region and its neighbors. One of the most widely used LISAs is a local version of Moran's I. A positive local Moran's I indicates that a region and its neighbors have similar values, which indicates spatial clustering of similar counts (positive spatial autocorrelation). The local Moran's I therefore provides an efficient method to assess spatial clustering of a disease in a neighborhood.

FWER procedures aim to control the probability of even one false discovery. FWER-controlling procedures can be too stringent when testing a large number of tests. In contrast to FWER procedures, False Discovery Rate (FDR) controlling procedures control the expected proportion of false discoveries among all discoveries. As a result, FDR-based procedures provide a more lenient alternative when testing a large number of hypotheses simultaneously. FDR procedures offer more power at the expense of an increased number of false alarms. Benjamini and Hochberg [11] proposed a simple and easy-to-implement FDR-controlling method for independent test statistics. Subsequently, the method was extended to dependent statistics by Benjamini and Yekutieli [12]. Procedures have also been proposed for implementing the EWMA chart with FDR error control in the context of industrial process control [13].

In conventional EWMA charts an alarm is signaled whenever the EWMA statistic exceeds a pre-defined control limit. Using this method, we would only know whether the process is out of control at a given point in time. Li et al. [14], however, proposed a general procedure for control charts in which p-values can be used, instead of control limits, to signal alarms. This way, when a distributional shift occurs, we can tell from the p-values how stable the process is (the process is stable when it is in control). Li et al. [14] outlined a procedure to estimate the in-control distribution of a control chart statistic by extracting bootstrap samples from the in-control data. At each time point, a p-value is calculated by comparing the control chart statistic at that time point with the approximated in-control distribution. An alarm is signaled when the p-value is less than a pre-specified significance level.

Li and Tsung [15] proposed a methodology to simultaneously monitor multiple attributes of a manufacturing process using multiple CUSUM charts along with FDR error control. At each time step, Li and Tsung [15] compute a CUSUM statistic for each control chart. Using these CUSUM statistics, p-values are calculated at each time step using random walk based
approximations. When the multiple CUSUM statistics are assumed to be independent, Li and Tsung [15] suggest using the Benjamini and Hochberg [11] procedure to signal alarms while controlling the FDR; when the CUSUM statistics are known to be dependent, Li and Tsung [15] recommend using the Benjamini and Yekutieli [12] method.

In the proposed method, a new statistic is introduced by extending the purely temporal EWMA statistic to a spatio-temporal setting, modifying a pooled EWMA statistic to allow for spatial autocorrelation among neighboring counts. Following Li and Tsung [15], p-values are used instead of traditional control limits. FDR-based multiple testing methods are used as opposed to FWER-based methods: FDR controls the rate of false discoveries (falsely rejected null hypotheses) among all discoveries (the total number of rejected hypotheses). Since we are more interested in detecting a true deviation from the null than in avoiding every incorrect rejection of a true null hypothesis, controlling FDR is more appropriate than controlling FWER. The multiple testing problem that arises when testing a large number of regional statistics simultaneously is handled by employing the more powerful FDR-based testing procedure proposed by Storey and Tibshirani [17].

The chapter is organized as follows: section 2 describes the proposed methodology, with details of the design of the EWMA control chart, the modeling of spatial autocorrelation using the local Moran's I, and FDR-based multiple testing using the Storey-Tibshirani procedure. Section 3 presents a simulation study along with a performance comparison of the method against a simpler model that does not account for spatial autocorrelation of counts. Section 4 discusses the strengths of the method and future research directions.
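The BH step-up rule [11] referred to throughout can be sketched in a few lines; this is a generic textbook implementation, not code from the dissertation:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up rule: with p-values sorted ascending, find the largest
    rank i such that p_(i) <= (i/m) * alpha, then reject all hypotheses
    with ranks 1..i.  Returns a rejection flag per input p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
rejected = benjamini_hochberg(pvals)  # -> [True, True, False, False, False]
```

The BY variant for dependent statistics [12] is the same rule with alpha replaced by alpha divided by the harmonic sum 1 + 1/2 + ... + 1/m, which is what makes it more conservative than BH.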
2. Methods

2.1 Standard EWMA

The original EWMA control chart was proposed by Roberts [7] for normally distributed data. Subsequently, Borror et al. [18] extended the method to Poisson data. Borror et al. [18] considered a sequence of counts X_1, X_2, ..., collected at times t = 1, 2, ..., which are independent and identically distributed Poisson random variables with mean μ. When the process is in control, μ = μ_0. The objective is to detect a shift in the mean to some other value. The Poisson EWMA chart is defined as

Z_t = λX_t + (1 − λ)Z_{t−1},

where X_t is the observed disease count at time t, Z_t is the Poisson EWMA statistic at time t, and λ is a weighting constant with 0 < λ ≤ 1. The starting value of the EWMA statistic is usually the process target, so that Z_0 = μ_0. The parameter λ is the weight given to the most recent observation. Montgomery [36] recommends selecting a value in the range 0.05 ≤ λ ≤ 0.25. The EWMA statistic gives more weight to the most recent observation, and the weights on past observations fall off as in a geometric series [32]. With this definition of the EWMA statistic, the mean and the variance can be derived as given by Borror et al. [18]: E(Z_t) = μ_0 and Var(Z_t) = (λ/(2 − λ))[1 − (1 − λ)^{2t}]μ_0. For large values of t, the asymptotic variance is λμ_0/(2 − λ), because (1 − λ)^{2t} → 0 as t → ∞. Using the asymptotic standard deviation, the control limit h of the one-sided control chart is given by the formula

h = μ_0 + L sqrt(λμ_0/(2 − λ)),

where the control limit h is L asymptotic standard deviations above the in-control mean μ_0. An alarm is signaled whenever Z_t > h.

2.2 Extending the Purely Temporal EWMA Chart to a Spatio-temporal Setting: Forming Regional Neighborhoods by Pooling Counts Within Neighborhoods

The original Poisson EWMA statistic proposed by Borror [8] is a purely temporal statistic. Raubertas [9] proposed a simple way to extend the purely temporal Poisson CUSUM chart into a spatio-temporal setting by pooling the counts of each region and its neighboring regions to form regional neighborhoods. These pooled counts are used to calculate a pooled Poisson CUSUM statistic for each regional neighborhood. Although Raubertas [9] used the Poisson CUSUM statistic, a similar construction can be used to extend the EWMA statistic to a spatio-temporal setting. If there are m spatial regions, the geographic relationship among the m regions can be summarized by an m × m matrix containing a measure of closeness between each pair of regions. For example, the closeness between two regions i and l can be defined as 1 if they share a common boundary and 0 if they do not. In this manner, the counts of each region are pooled with those of its immediately adjacent neighboring regions to form regional neighborhoods; the area under surveillance then consists of overlapping regional neighborhoods. As Raubertas [9] points out, if an increase in counts occurs in more than one region within a neighborhood, pooling will increase the sensitivity of detecting a shift in the mean counts. However, pooling will have a negative effect if the
increase happens in only one region, because of dilution. Let X_{lt} be the disease count in region l at time t and w_{il} be a binary measure of closeness between regions i and l. Then the pooled count of region i and its immediately adjacent neighbors at time t is

Y_{it} = Σ_l w_{il} X_{lt}.

2.3 Extending the Purely Temporal EWMA Chart to a Spatio-temporal Setting: Modeling for Spatial Autocorrelation in Regional Neighborhoods

Rather than simply pooling counts within regional neighborhoods, a local version of Moran's I can be used to model the spatial autocorrelation. Just as the global Moran's I index [38] provides a summary of the spatial similarity over an entire study area, the local Moran's I of Waller and Gotway [16] can be used to identify local disease clusters in a retrospective analysis. Let O_i be the observed disease count of region i, where i = 1, ..., N, following a Poisson distribution, E_i be the expected count in region i, and w_{ij} be a weight describing the proximity between regions i and j. Then the adjusted local Moran's I for region i is

I_i = ((O_i − E_i)/sqrt(E_i)) Σ_j w_{ij} (O_j − E_j)/sqrt(E_j).

Thus, I_i is the product of the standardized count in region i and the weighted sum of the standardized counts of the neighbors of region i. If the observed counts of both region i and its immediately adjacent neighbors are similar, for example higher than what is expected, then I_i will be positive. On the other hand, if the observed counts of region i and its neighbors are dissimilar,
then I_i will be negative. The local Moran's I therefore provides a way to summarize the spatial similarity of the observed counts in the local neighborhood.

2.4 Modifications to a More Powerful Procedure

The proposed method is better suited for disease surveillance than the traditional SPC methods for the following reasons: with the proposed method, p-values are used instead of the traditional control limits; FDR-based multiple testing procedures are used as opposed to conventional FWER-based procedures; and the use of p-values and FDR-based multiple testing makes it possible to use more recently developed testing procedures such as the Storey-Tibshirani procedure. These three points are explained below.

The procedure for estimating the p-values used to decide whether an outbreak is in progress is based on the procedure proposed by Li et al. [14] for CUSUM statistics. We assume that the in-control distribution of the data is unknown, so we use the bootstrap method to approximate the null distributions of the EWMA statistics. Li et al. [14], however, did not have to deal with spatial structure in their data. In a typical bootstrap sample, one samples observations with replacement from the observed data, and this is repeated a large number of times. In our setting, we must preserve the spatial structure of the data, since the responses are not necessarily independent and identically distributed across space. The statistics of interest are then calculated for each sample; in this case, EWMA statistics were calculated at each time point t. After that, the EWMA statistics at each time point t were compared with the null distribution of the EWMA statistic to compute p-values at each time point t. Since statistics belonging to multiple regions are tested simultaneously, a multiple testing problem arises. Li and Tsung [15] address the multiple testing problem using an FDR
controlling procedure that uses conventional critical values. We instead use p-values, allowing for the use of many FDR-based testing procedures. When numerous hypotheses are tested simultaneously, the use of FDR-based testing is often more appropriate than the FWER-based methods. FWER is the probability of making at least one false discovery among all hypotheses tested. When many hypotheses are tested, FWER control can be very stringent, resulting in procedures with little testing power. FDR is an alternative testing error criterion related to the expected proportion of false discoveries. Formally, FDR = E(V/R), where V/R is defined to be 0 when R = 0, V is the number of true null hypotheses that are declared significant, and R is the total number of hypotheses declared significant out of the m hypotheses tested. Using FDR as the error criterion allows us to employ more recently developed multiple testing procedures such as the Storey-Tibshirani [17] procedure.

Storey and Tibshirani [17] proposed an FDR-based multiple testing procedure that can be applied to locally dependent statistics, such as the EWMA statistics generated from overlapping neighborhoods in the proposed model. Storey and Tibshirani [17] point out that the original multiple testing procedure proposed by Benjamini and Hochberg [11] for independent statistics is too conservative. They proposed a more powerful method using q-values instead of the conventional p-values: just as a p-value is the minimum false positive rate that is incurred when calling a test significant, a q-value is the minimum false discovery rate that is incurred when calling a test significant. The Storey and Tibshirani [17] method for multiple testing is detailed below. We will refer to this procedure as the ST procedure.
For a given list of p-values, the ST method calculates a list of q-values. A q-value, by definition, is the expected proportion of false positives incurred when calling a test significant. Storey and Tibshirani [17] derived an expression for the false discovery rate when calling all tests significant whose p-value is less than or equal to a threshold t, where 0 < t ≤ 1. When a large number of tests m are considered, the expected proportion of false positives can be approximated by dividing the expected number of false positives by the expected number of significant tests. The denominator, the expected number of significant tests, is easily calculated: it is the total number of observed p-values that are less than or equal to the threshold t. The calculation of the expected number of false positives in the numerator is more complex. The expected number of false positives is the product of the probability of a true null hypothesis being rejected and the total number of true null hypotheses, m_0. In calculating the probability of a true null hypothesis being rejected, we use the fact that the p-values of true null hypotheses follow a uniform distribution. As a result, the probability of such a p-value being less than or equal to the threshold t is simply t. Therefore, the expected number of false positives is t m_0. Since m_0 is unknown, the proportion of true null hypotheses π_0 = m_0/m has to be estimated. A key step in the ST procedure is the estimation of π_0 (see Appendix A for the approximation of π_0). Using the fact that p-values of true null hypotheses follow a uniform distribution while those of true alternative hypotheses have a distribution concentrated near zero, Storey and Tibshirani [17] determine an estimate of the proportion of null hypotheses, π̂_0, from the density of the uniformly distributed section of the histogram of p-values. Combining all the steps, the q-value for a given p-value threshold t is

q(t) = min over t′ ≥ t of [ π̂_0 m t′ / Σ_j 1(p_j ≤ t′) ],

where 1(·) is the indicator function. Using this expression, a q-value can be calculated for each of the m p-values by substituting the observed p-values in for t. Finally, the q-values are ordered in ascending order, and all q-values less than a pre-defined FDR level are called significant.

It is important to point out that a q-value is technically defined using a quantity called pFDR. The quantity FDR can be loosely defined by FDR = E(V/R), with V/R taken to be 0 when R = 0, whereas pFDR is defined as pFDR = E(V/R | R > 0), since there is a non-zero probability that R = 0 [17]. Using pFDR, a q-value can be technically defined as the minimum pFDR incurred when calling a test significant, just as a p-value is defined as the minimum false positive rate incurred when calling a test significant [24]. When a large number of regions, such as counties within a state or constituencies within a country, are simultaneously monitored, P(R > 0) ≈ 1 and FDR ≈ pFDR. Storey and Tibshirani [17] remark that their method can be applied to models with weak dependence among the p-values; as a rule of thumb, the more local the dependence, the more likely it is to satisfy the weak dependence criterion (see Remark D in [17]).

2.5 Details of the Proposed Methodology

The first step of the method is to define a tolerable overall FDR level for all m regions. Then, at each time period t, disease counts are collected for each of the m regions. Next, the counts are standardized using the counts of the in-control period (i.e., the no-outbreak period) to estimate the in-control mean and standard deviation for each region. Note that, as the counts are assumed to be distributed as Poisson random variables, if the estimated in-control mean for region i is μ_i, then the standard deviation for region i is sqrt(μ_i). So the standardized count for region i at time t is z_{it} = (x_{it} − μ_i)/sqrt(μ_i).
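The q-value computation described in section 2.4 translates directly into code once an estimate of π_0 is available. The sketch below takes π̂_0 as an input and uses the standard reverse cumulative-minimum formulation over the ordered p-values; the estimation of π̂_0 from the p-value histogram (Appendix A) is omitted, and the function name is illustrative:

```python
def st_qvalues(pvals, pi0):
    """q-values for a given pi0 estimate (proportion of true nulls).
    Over the ascending-ordered p-values p_(1) <= ... <= p_(m), the q-value
    at rank i is  min over j >= i of  pi0 * m * p_(j) / j,  i.e. the
    smallest estimated FDR achievable while still calling test i significant."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvals[i] / rank)
        q[i] = running_min
    return q

qs = st_qvalues([0.01, 0.02, 0.9, 0.5], pi0=0.5)
# q-values are approximately [0.02, 0.02, 0.45, 0.333]
```

Tests whose q-value falls below the chosen overall FDR level are then called significant, exactly as in the ordering-and-thresholding step described above.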
Thereafter, the standardized counts are used to calculate an adjusted local Moran's I statistic for each region,

I_{it} = z_{it} Σ_j w_{ij} z_{jt},

where W = (w_{ij}) is a spatial proximity matrix. We considered W to be a binary connectivity matrix whose elements are w_{ij} = 1 if regions i and j share a common boundary and w_{ij} = 0 if they do not. Note that we omit the spatial similarity of region i with itself and set w_{ii} = 0, so that I_{it} is the product of the standardized count of region i and the sum of the standardized counts among its neighbors. After that, an adjusted count y_{it} for region i at time t is calculated from the centered counts, combining the standardized count of region i with the weighted sum of the standardized counts of its neighbors; again, the binary weights are defined as before. These adjusted counts are used to calculate EWMA statistics for each of the regions at each time t, with

Z_{it} = λ y_{it} + (1 − λ) Z_{i,t−1}.

The same statistics are calculated for each of the B bootstrap samples (see Appendix B for details of the bootstrap procedure), with Z*_{itb} denoting the EWMA statistic calculated using the adjusted counts at location i and time t for bootstrap sample b. This way, there will be B EWMA statistics calculated for each region, from which an empirical in-control distribution for each region can be approximated. Then the EWMA statistic Z_{it} is compared with the empirical in-control distribution for region i to estimate the p-value p_{it}.
Finally, the ST procedure is used to calculate the corresponding q-values and to identify significant q-values to sound alarms for the corresponding regions.

It is important to note that purely temporal SPC methods have previously been extended to spatio-temporal surveillance by pooling the counts of regional neighborhoods [9]. However, pooling neighborhood counts can increase false alarms in neighboring regions, as a spike in counts in a neighbor directly affects the pooled counts of the regional neighborhood. By modeling the spatial autocorrelation in the regional neighborhood, one can lessen the effect of an increase of counts in a neighbor. Therefore, it is advantageous to summarize the spatial dependence between regions using the local Moran's I rather than simply pooling counts.

3 Simulation Experiment

3.1 Experiment Setup

A study area made up of 25 regions located in a five-by-five grid was considered. Data were simulated for a 100-day period. For days 1-50, a no-outbreak period, independent Poisson counts with an in-control mean of 4 were simulated for each region. During days 51-55, an outbreak was simulated in selected regions, as shown in Figure 21. Regions 9, 14, 15, 18, 19, 20, 24, and 25 experience an outbreak, and the out-of-control means are shown in Figure 21. During the outbreak period, correlated Poisson counts with an out-of-control mean of 6 were simulated in regions 9, 14, 18, and 19 with a correlation coefficient of 0.8; in regions 15, 20, 24, and 25, correlated Poisson counts with an out-of-control mean of 5 were simulated with a slightly lower correlation coefficient of 0.7. In simulating correlated Poisson counts, the algorithm presented by Barbiero and Ferrari [39] was used. First, 25 variables were drawn from a joint normal distribution with mean zero and a correlation matrix as described above. Next, the univariate normal cumulative distribution function (CDF) was applied to each variable to derive probabilities. Finally, the inverse CDF of the Poisson distribution was applied with the mean structure displayed in Figure 21. In all other regions, no outbreak was simulated, so independent Poisson counts were simulated with the same in-control mean of 4. Again, for days 56-100, a no-outbreak period was constructed, and independent Poisson counts with a constant in-control mean of 4 were simulated in all regions.

The proposed model was compared to a baseline model using a pooled EWMA statistic (chapter 2): in the pooled model, the counts of each region and its contiguous neighbors were pooled together to calculate regional neighborhood statistics. In calculating the standardized counts for each region for the proposed method, an estimate of the in-control mean was calculated by averaging counts over 100 no-outbreak days (in-control days), where the counts were simulated as independent Poisson counts of mean 4 for all 25 regions. These additional 100 days of data constitute the in-control (no-outbreak) period, and these simulated data were used to calibrate the model. A binary connectivity
matrix was used to define the regional neighborhoods in the EWMA statistic. For the baseline model, the counts of each region were pooled with those of its immediately adjacent neighbors to create regional neighborhoods.

Figure 21. The out-of-control means for the simulated outbreak on days 51-55 are shown. Only regions 9, 14, 15, 18, 19, 20, 24, and 25 experience an outbreak during days 51-55, and the out-of-control means are indicated at the center of each region. Regions 9, 14, 18, and 19 are correlated with a correlation coefficient of 0.8; for regions 15, 20, 24, and 25 the correlation coefficient is 0.7. The other regions do not experience an outbreak, and their counts are simulated using an in-control mean of 4. The region numbers are indicated at the top center of each region in blue.

3.2 Simulation Results

Figures 22, 23, and 24 show the results of a single simulation over a 100-day period for regions 4, 14, and 15, respectively. Figure 22 shows the results for region 4, where no outbreak was simulated. Note that region 4 shares a border with region 9, where an outbreak was simulated for days 51-55 with an out-of-control mean of 6. In Figure 22(a) the blue line shows the counts of region 4; the red line shows the counts of region 4 and its immediately adjacent neighbors: regions 3, 5, 8, 9, and 10. In the next plot, Figure 22(b), the blue line shows the pooled EWMA statistic and the red line shows the EWMA statistic of the proposed model. Note that for the pooled EWMA statistic (blue line), the estimated pooled in-control mean is
PAGE 100
94 close to 24 (23.32 is the minim um value for the EWMA statistic) as the regional neighborhood centering region 4 contains 6 regions, each having an in control mean of 4: so . However, for the proposed statistic this minimum value is 3.54, which is calculated from bootstrap samples extracted from in control data. Therefore, the EWMA statistic for the proposed model (red) is well below the EWMA statistic for the pooled model (blue). Plot 2(c) shows adjusted q values for the two models. The red line shows the difference q va lue) for the proposed model using the ST procedure, where 0.05 is the overall FDR. The blue line shows the difference q value) for the baseline model. The grey horizontal line shows a zero difference. An alarm is signaled whenever the differences exceed the grey horizontal line. The alarm plots are shown in Figure 2 2(d). The proposed method (red) correctly signals no alarms during the entire 100 day period, as no outbreak was simulated in region 4. However, the more sensitive baseline model (bl ue) signals sporadic false alarms throughout the interval. It is important to highlight the false alarms from days 52 59. These alarms are a result of the outbreak that occurs in the neighboring region 9 from days 51 55. The spike in counts of the neigh boring region 9 increases the pooled neighborhood counts resulting in false alarms in the neighboring region 4 although no outbreak occurs in region 4.
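The correlated-count construction described above is essentially a Gaussian copula (sometimes called NORTA). The sketch below illustrates it for a pair of regions using only the Python standard library; the function names are mine, and note that the full Barbiero-Ferrari algorithm additionally adjusts the latent normal correlation so that the resulting Poisson counts attain a target correlation, a refinement omitted here.

```python
import math
import random

def poisson_ppf(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k, pmf = 0, math.exp(-lam)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k
        cdf += pmf
        if pmf == 0.0:        # guard against float underflow in the far tail
            break
    return k

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def correlated_poisson_pair(n_days, mean1, mean2, rho, rng):
    """n_days of correlated Poisson counts for two regions (Gaussian copula).

    Step 1: draw correlated standard normals; step 2: map to probabilities
    with the normal CDF; step 3: apply the Poisson inverse CDF.
    """
    out = []
    for _ in range(n_days):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = g1
        z2 = rho * g1 + math.sqrt(1 - rho * rho) * g2   # Corr(z1, z2) = rho
        out.append((poisson_ppf(norm_cdf(z1), mean1),
                    poisson_ppf(norm_cdf(z2), mean2)))
    return out

rng = random.Random(7)
# 5 outbreak days for two regions with out-of-control mean 6, latent rho = 0.8
counts = correlated_poisson_pair(5, 6.0, 6.0, 0.8, rng)
```

The 25-region version in the text replaces the hand-rolled bivariate draw with a draw from a 25-dimensional normal with the block correlation structure of Figure 2-1; the CDF and inverse-CDF steps are unchanged.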
Figure 2-2. Summary plots for region 4: (a) disease counts, (b) EWMA statistics, (c) differences, and (d) alarms. The red lines in plots (b), (c), and (d) represent the proposed method, while the blue lines show the baseline model. The red line in plot (c) shows the difference (0.05 - q-value) for the proposed method using the ST procedure, where 0.05 is the overall FDR; the blue line in plot (c) shows the difference (0.05 - q-value) for the pooled EWMA model using the ST procedure. The grey horizontal line in plot (c) shows a difference of zero.
Figure 2-3 shows the results for region 14, where an outbreak was simulated from days 51-55 with an out-of-control mean of 6. Region 14 borders four regions (15, 18, 19, and 20) in which the same outbreak occurs during days 51-55. The proposed model starts signaling alarms at the very beginning, on day 51, and continues to signal alarms until day 61. The baseline model starts signaling alarms on day 52, one day later, and continues until day 62, one additional day. Also note the scattered false alarms on days 32, 66, 68, 70, and 81 with the baseline model.

Figure 2-4 shows the results for region 15, where an outbreak was simulated from days 51-55 using an out-of-control mean of 5. Region 15 has four neighboring regions (9, 14, 19, and 20) in which an outbreak occurs during the same days, 51-55. The proposed model starts signaling alarms on day 53 and continues until day 56, with one additional false alarm on day 75. Although the baseline model using the pooled EWMA statistic starts to signal alarms on day 52, a day earlier, its signals continue until day 62, seven days after the end of the outbreak; the baseline model takes much longer to signal the end of the outbreak. Furthermore, the baseline model produces additional false alarms, as evident on days 66-68, 70, 75, and 81.
Figure 2-3. Summary plots for region 14: (a) disease counts, (b) EWMA statistics, (c) differences, and (d) alarms. The red lines in plots (b), (c), and (d) represent the proposed method, while the blue lines show the baseline model. The red line in plot (c) shows the difference (0.05 - q-value) for the proposed method using the ST procedure, where 0.05 is the overall FDR; the blue line in plot (c) shows the difference (0.05 - q-value) for the pooled EWMA model using the ST procedure. The grey horizontal line in plot (c) shows a difference of zero.
Figure 2-4. Summary plots for region 15: (a) disease counts, (b) EWMA statistics, (c) differences, and (d) alarms. The red lines in plots (b), (c), and (d) represent the proposed method, while the blue lines show the baseline model. The red line in plot (c) shows the difference (0.05 - q-value) for the proposed method using the ST procedure, where 0.05 is the overall FDR; the blue line in plot (c) shows the difference (0.05 - q-value) for the pooled EWMA model using the ST procedure. The grey horizontal line in plot (c) shows a difference of zero.
To calculate the observed FDR for the proposed model and the baseline model, we performed 100 independent simulations using the parameter values described in Section 3.2. An outbreak was simulated from days 51-90 using the same out-of-control means as in Section 3.2 (illustrated in Figure 2-1). On days 1-50 and days 91-100, no outbreaks were simulated (in-control periods) and independent Poisson counts were simulated in each region with an in-control mean of 4. False discovery rates were calculated over the combined region experiencing the outbreak (regions 9, 14, 15, 18, 19, 20, 24, and 25) for both models. Although the observed FDR is moderately higher than the specified level of 0.05, as shown in Figure 2-5, the proposed model has a median FDR around 0.11; the pooled EWMA (baseline) model has a higher median, around 0.16, and shows more dispersion than the proposed model.

Figure 2-5. Boxplots of the observed proportion of false discoveries over 100 simulations for the pooled EWMA method (left) and the proposed method (right).
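The per-simulation quantity behind Figure 2-5 is the observed proportion of false discoveries: the fraction of all signaled alarms that fall on truly in-control region-days. A minimal sketch, with hypothetical alarm and outbreak sets (the function name and data layout are mine):

```python
def observed_fdr(alarms, outbreak):
    """Observed false discovery proportion for one simulation.

    alarms, outbreak: sets of (region, day) pairs; an alarm is false
    when its (region, day) is not part of the simulated outbreak.
    """
    if not alarms:
        return 0.0
    false_alarms = len(alarms - outbreak)   # alarms on in-control region-days
    return false_alarms / len(alarms)

# toy example: outbreak in region 9 on days 51-55; alarms on days 52-57
outbreak = {(9, d) for d in range(51, 56)}
alarms = {(9, d) for d in range(52, 58)}
rate = observed_fdr(alarms, outbreak)   # 2 of 6 alarms are false -> 1/3
```

Averaging (or boxplotting) this quantity over the 100 independent simulations gives the distributions summarized in Figure 2-5.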
3.3 Comparing Performance of the Proposed Model with the Baseline Model

To verify that the results obtained in the simulations are consistent, we used two performance comparison measures, the conditional expected delay (CED) and the false positive rate (FPR), to compare the detection delays and false alarm rates of the two models. The CED is a commonly used measure in surveillance for comparing detection delays. Lawson and Kleinman [40] define the CED as the expected delay in time of detection for a specific change point $\tau$, where the change point is the time at which the distributional shift actually occurs:

$$CED(\tau) = E[\,t_A - \tau \mid t_A \ge \tau\,],$$

where $t_A$ is the time of the alarm. Clearly, a lower CED indicates better performance. To compare false alarms, we used the false positive rate (FPR). FPRs are calculated by dividing the total number of false alarms by the total number of no-outbreak periods (the total number of time periods when the null hypothesis is true). For an outbreak of a specific duration, false alarms can occur either before the onset of the outbreak or after its end, so a lower FPR indicates better performance. Robertson et al. [19] call attention to the fact that a surveillance system should provide rapid detection with few false alarms, and that there is a trade-off between the false alarm probability and the speed of detection [19]. Therefore, to compare the performance of the proposed model with the baseline model, both the CED and the FPR were used.

When calculating CEDs and FPRs, outbreak lengths of 2 to 10 days were considered, starting from day 51 (note that the simulation study described in Section 3.2 uses a single outbreak length of 5 days, starting from day 51). The out-of-control means and the correlation structure are the same as for the simulation in Section 3.2. For each outbreak duration, the CEDs were calculated for each of 100 independent data sets and averaged for both the proposed model and the baseline model. For example, 100 independent data sets were generated with an outbreak length of 2 days starting at day 51 (i.e., the outbreak lasted from days 51-52); similarly, 100 independent data sets were generated for each of the other outbreak lengths, and the CEDs were averaged for each model.

Figure 2-6 shows the average CEDs of both models for regions 9, 18, 20, and 24 (see Appendix D for CED plots of regions 14, 19, 15, and 25). As explained in Section 3.2, all four of these regions experience the outbreak. As illustrated in Figure 2-1, region 9 has 2 neighbors experiencing the outbreak, region 18 has 3, region 24 has 4, and region 20 has 5. The red line shows the CEDs for the proposed model using the local Moran's I index (we will refer to this as SEWMA, spatially adjusted EWMA); the blue line shows the CEDs for the baseline model using the pooled EWMA statistic (we will refer to this as PEWMA, pooled EWMA). Recall that a lower CED means better performance. Overall, neither model performs better than the other over all outbreak durations. For shorter outbreak lengths of 2-5 days, we have mixed results, with neither model consistently outperforming the other (the simulation results in Section 3.2 also illustrate that neither model is consistently faster in detecting the 5-day outbreak). The PEWMA model is slightly faster for outbreak lengths over 7 days; however, there is no substantial gain in speed of detection. Furthermore, for regions with more neighbors experiencing outbreaks, the PEWMA model has better performance (lower CEDs), as pooled counts increase when more neighbors have raised counts. This difference in performance is not as pronounced with the proposed SEWMA method, as an index of spatial similarity is used instead to accumulate neighboring counts. In summary, in terms of speed of detection, the two models have comparable performance.

Figure 2-6. Average CED values versus the duration of the outbreak for the proposed SEWMA model and the PEWMA baseline model for regions 9, 18, 20, and 24.

The second performance criterion is the FPR. As in the CED calculation, outbreak lengths of 2 to 10 days were considered when calculating the FPRs. For each outbreak length, the FPRs were calculated for each of the 100 data sets and averaged for each model. Figure 2-7 shows the average FPRs for regions 1, 13, 19, and 24 (see Appendix D for FPR plots of the other regions). Regions 1 and 13 do not experience any outbreaks, whereas regions 19 and 24 do. As shown in Figure 2-1, region 1 does not have any neighbors experiencing outbreaks; region 13 has 4 such neighbors, region 19 has 6, and region 24 has 4. The red line shows the average FPR for the proposed model (SEWMA) and the blue line shows the average FPR for the baseline model (PEWMA). Recall that a lower FPR means better performance. Clearly, the proposed model performs better than the baseline model across all outbreak lengths in all four regions. Region 1 neither experiences an outbreak nor has neighbors experiencing outbreaks; still, some alarms are signaled, though a much smaller proportion for the proposed method. Although region 13 does not experience any outbreaks, it has 4 neighbors (Figure 2-1) that do. The rise in disease counts in the neighbors triggers more alarms with the PEWMA model. In fact, with the PEWMA model there is a general trend for regions with more neighbors experiencing outbreaks to have more false alarms. Although the same effect can be seen with the SEWMA model, where neighborhood counts are accumulated using an index of spatial similarity, it is not as noticeable. Regions 19 and 24 both experience outbreaks. In addition to signaling false alarms during the no-outbreak period, the PEWMA model tends to take longer than the SEWMA model to signal the end of the outbreak, further increasing the number of false alarms (the same effect was seen in the simulation study in Section 3.2). Overall, the proposed model substantially outperforms the baseline model in reducing false alarm rates (see Appendix F for additional plots of the observed FDR as a function of outbreak length for both the PEWMA and SEWMA models).
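The two performance measures defined in Section 3.3 are straightforward to compute for a single simulated run. The sketch below uses hypothetical alarm-day lists (the function names and toy data are mine); in the study, these values are then averaged over the 100 independent data sets per outbreak length.

```python
def ced(alarm_days, tau):
    """Conditional expected delay for one run: delay of the first alarm
    at or after the change point tau; None if the outbreak is never detected."""
    post = [d for d in alarm_days if d >= tau]
    return (min(post) - tau) if post else None

def fpr(alarm_days, outbreak_days, n_days):
    """False positive rate: alarms outside the outbreak window, divided by
    the number of days on which the null hypothesis (no outbreak) is true."""
    false = [d for d in alarm_days if d not in outbreak_days]
    return len(false) / (n_days - len(outbreak_days))

outbreak = set(range(51, 56))            # outbreak on days 51-55
alarms = [32, 52, 53, 54, 55, 56, 70]    # hypothetical alarm days over 100 days
delay = ced(alarms, 51)                  # first alarm on day 52 -> delay of 1
rate = fpr(alarms, outbreak, 100)        # 3 false alarms over 95 in-control days
```

Note the slight asymmetry in the two measures: the first alarm at or after the change point determines the CED, while alarms both before the onset (day 32) and after the end (days 56 and 70) count against the FPR.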
Figure 2-7. Average FPR values versus the duration of the outbreak for the proposed SEWMA model and the PEWMA baseline model for regions 1, 13, 19, and 24.
4 Discussion

A new method is proposed for prospective disease surveillance in which spatio-temporal disease clusters are identified by combining a local index of spatial autocorrelation with an EWMA statistic used for identifying temporal clusters. Three new aspects of the method are beneficial for disease surveillance. First, the use of p-values enables us to quantify the strength of evidence for an outbreak, as opposed to adhering to the conventional control limits commonly used in SPC charts. Second, the use of FDR-based multiple testing allows us to make use of state-of-the-art testing procedures such as the Storey-Tibshirani procedure. Third, the local Moran's I is used to account for the spatial autocorrelation within a local neighborhood, rather than simply pooling the counts.

The proposed SEWMA method overcomes two shortcomings of the baseline PEWMA model. First, when neighborhood counts are pooled, an increase in counts in any of the neighboring regions can trigger a false alarm; when the autocorrelation structure in the neighborhood is instead modeled using the local Moran's I, such false alarms are reduced. Second, the pooled EWMA statistic takes longer to signal the end of an outbreak; comparatively, the proposed model signals the end of the outbreak more promptly.

The current model is applicable for detecting a class of outbreaks with no trend or seasonality. A future research direction would be to extend the model to incorporate trend and seasonality in order to detect a broader range of outbreaks.
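The local index of spatial autocorrelation used here is the local Moran's I [37, 38]. A minimal sketch follows, using Anselin's common LISA formulation with z-scores and row-standardized binary contiguity weights; the exact standardization in the dissertation's statistic may differ, and the toy line-of-regions example and function name are mine.

```python
def local_morans_i(x, neighbors):
    """Local Moran's I for each region.

    x: list of regional counts; neighbors: dict mapping region index to a
    list of contiguous neighbor indices (binary weights, row-standardized
    here by averaging the neighbors' z-scores).
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    z = [(v - mean) / var ** 0.5 for v in x]          # standardized deviations
    out = []
    for i in range(n):
        nbrs = neighbors[i]
        lag = sum(z[j] for j in nbrs) / len(nbrs)     # row-standardized spatial lag
        out.append(z[i] * lag)                        # I_i = z_i * lagged z
    return out

# toy 4-region line: the two elevated end regions match their neighbors (I = 1);
# the middle regions have offsetting high/low neighbors (I = 0)
x = [10, 10, 4, 4]
w = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
I = local_morans_i(x, w)
```

A large positive local I flags a region whose elevated count is shared by its neighbors, which is exactly the situation in which accumulating neighboring counts should strengthen (rather than dilute) the evidence fed into the EWMA statistic.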
CHAPTER V

CONCLUSIONS

Three new procedures have been introduced for prospective surveillance of disease outbreaks, in which regional disease counts are accumulated to calculate alarm statistics using two popular SPC methods, namely the CUSUM and EWMA methods. Three novel aspects of the proposed procedures are advantageous for practical disease surveillance. First, the proposed methods use p-values to signal alarms, as opposed to conventional control limits. The use of p-values provides the ability to measure the strength of evidence for an outbreak, which is not possible with the commonly used control limits. Second, rather than controlling the average run length criterion in the monitoring process, we use the false discovery rate, which is more appropriate in a public health context. Third, the use of p-values and FDR-based multiple testing permits the use of more innovative FDR-based testing procedures. Of these, the Storey-Tibshirani [17] procedure is the most powerful compared to the widely used Benjamini-Hochberg [11] and Benjamini-Yekutieli [12] procedures. The practical utility of the methods is illustrated by the Salmonella Newport example from Germany.

The first method pools the disease counts of contiguous regions and calculates a pooled Poisson CUSUM statistic for each regional neighborhood. This method, which uses p-values to signal alarms along with FDR-based multiple testing procedures, is faster than conventional multiple Poisson CUSUM charts that use control limits and FWER-based multiple testing procedures (which control false alarms through a familywise error rate). In addition, the proposed method of pooling counts detects disease outbreaks faster and produces fewer false alarms than a baseline model using the same algorithm with p-values and FDR-based multiple testing but only individual regional counts.
However, as Raubertas [9] points out, pooling can delay the detection of an outbreak if the increase in disease counts occurs in only one or two regions, because of dilution. Pooling counts can also raise false alarms in neighboring regions. Furthermore, the CUSUM statistic requires considerable time to signal the end of an outbreak. Additionally, the Poisson CUSUM method requires the disease counts to follow the Poisson model and is sensitive to deviations from this assumption.

The proposed second method uses the EWMA statistic, and the counts of immediately adjacent neighbors are pooled to form regional neighborhoods, as in the first method. Again, p-values and FDR-based multiple testing procedures are used. The second model, using the EWMA statistic, overcomes several limitations of the Poisson CUSUM model. Compared to the Poisson CUSUM method, the EWMA model is typically faster in detecting an outbreak. It also produces fewer false alarms, because it signals the end of an outbreak earlier than the Poisson CUSUM model. Additionally, the EWMA model has less stringent distributional assumptions and does not require the disease counts to follow a specific distribution. Therefore, the EWMA model is a better alternative to the Poisson CUSUM model. However, this method has two lingering weaknesses. First, pooling neighborhood counts increases the false alarm rates among the neighboring regions when the outbreak is not present in those regions. Second, although the EWMA model is quicker than the CUSUM model in signaling the end of an outbreak, there can still be a sizable delay.

The third model tackles the above-mentioned issues. Rather than simply pooling neighborhood counts to calculate a pooled EWMA statistic, the third model captures the spatial similarity of counts in contiguous regions using a local index of spatial autocorrelation, which serves as input to the EWMA statistic. This improvement reduces the occurrence of false alarms in the neighboring regions, a problem with both the first and second models, where neighborhood counts were simply pooled together. In terms of detection speed, the third model is comparable to the second model. In addition, the third model is the quickest of all three models to signal the end of an outbreak. Therefore, the third model resolves the shortcomings of the first and second models, and the three proposed models provide progressively improving algorithms for prospective surveillance of disease outbreaks.

Future Research

The Centers for Disease Control and Prevention identifies the investigation of foodborne outbreaks as a critical public health function. The improved methods discussed above are well suited for the investigation of foodborne outbreaks, a class of outbreaks without seasonality or trend; a practical application was illustrated where the method was applied to detect a Salmonella outbreak in Germany. An immediate goal would be to generalize the methodology to a broader class of settings that allows for the detection of diseases with trend and/or seasonality. Elbert and Burkom [41] presented a promising algorithm for detecting outbreaks with trend and seasonality using generalized exponential smoothing, known as the Holt-Winters method; the algorithm was presented in a univariate setting without spatial modeling. The proposed algorithm, based on simple exponential smoothing (the EWMA model), should be generalizable using the Holt-Winters method into a broader procedure capable of handling diseases with trend and seasonality while still utilizing spatial information.
The field of surveillance is shifting from surveillance based on disease count data to surveillance based on syndromic data that provide an earlier outbreak signal, such as over-the-counter drug sales, searches on medical web sites, calls to nurse hotlines, school absenteeism records, and complaints from individuals entering emergency departments [42]. These sources of prediagnostic information provide an early outbreak signal well before diagnostic information is available, allowing more time for public health interventions. Although they do not provide the case counts of any specific disease, they do provide measurable effects of care-seeking behavior well before the onset of a possible outbreak. Shmueli and Burkom [42] highlight surveillance based on syndromic data as an emerging field with a multitude of research opportunities. Therefore, a long-term goal would be to extend the current models to a multivariate setting in order to utilize multiple syndromic data sources. Both the CUSUM and the EWMA charts have been successfully extended to multivariate settings using vector accumulation methods [2]; considering the advantages of the EWMA over the CUSUM, the multivariate EWMA extension looks like a promising line of future research.
REFERENCES

1. Sonesson C, Bock D. A review and discussion of prospective statistical surveillance in public health. Journal of the Royal Statistical Society A 2003; 166:5-21.
2. Unkel S, Farrington CP, Garthwaite PH, Robertson C, Andrews N. Statistical methods for the prospective detection of infectious disease outbreaks: a review. Journal of the Royal Statistical Society A 2012; 175(1):49-82.
3. Tsui KL, Chiu W, Gierlich P, Goldsman D, Liu X, Maschek T. A review of healthcare, public health, and syndromic surveillance. Quality Engineering 2008; 20(4):435-450.
4. Burkom HS. Development, adaptation, and assessment of alerting algorithms for biosurveillance. Johns Hopkins APL Technical Digest 2003; 24(4):335-342.
5. Page ES. Continuous inspection schemes. Biometrika 1954; 41:100-115.
6. Lucas JM. Counted data CUSUM's. Technometrics 1985; 27:129-144.
7. Roberts SW. Control chart tests based on geometric moving averages. Technometrics 1959; 1:239-250.
8. Borror CM, Champ CW, Rigdon SE. Poisson EWMA control charts. Journal of Quality Technology 1998; 30:352-361.
9. Raubertas RF. An analysis of disease surveillance data that uses the geographical locations of the reporting units. Statistics in Medicine 1989; 8:267-271.
10. Rogerson PA, Yamada I. Monitoring change in spatial patterns of disease: comparing univariate and multivariate cumulative sum approaches. Statistics in Medicine 2004; 23:2195-2214.
11. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 1995; 57:289-300.
12. Benjamini Y, Yekutieli D. False discovery rate adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association 2005; 100(469):71-81.
13. Lee SH, Park JH, Jun C. An exponentially weighted moving average chart controlling false discovery rate. Journal of Statistical Computation and Simulation 2014; 84(8):1830-1840.
14. Li Z, Qiu P, Chatterjee S, Wang Z. Using p values to design statistical process control charts. Statistical Papers 2013; 54(2):523-539. DOI: 10.1007/s00362-012-0447-0.
15. Li Y, Tsung F. Multiple attribute control charts with false discovery rate control. Quality and Reliability Engineering International 2012; 28(6):857-871.
16. Waller LA, Gotway CA. Applied Spatial Statistics for Public Health Data. John Wiley & Sons: New York, NY, 2004; 223.
17. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences USA 2003; 100:9440-9445.
18. Woodall WH. The use of control charts in health care and public health surveillance. Journal of Quality Technology 2006; 38:89-104.
19. Robertson C, Nelson TA, MacNab YC, Lawson AB. Review of methods for space-time disease surveillance. Spatial and Spatio-temporal Epidemiology 2010; 1:105-116.
20. Siegmund D. Sequential Analysis: Tests and Confidence Intervals. Springer-Verlag: New York, 1985.
21. Moustakides GV. Optimal stopping times for detecting changes in distributions. Annals of Statistics 1986; 14(4):1379-1387.
22. Rogerson P. Monitoring point patterns for the development of space-time clusters. Journal of the Royal Statistical Society A 2001; 164(1):87-96.
23. Purdy GG, Richards SC, Woodall WH. Surveillance of non-homogeneous Poisson processes. Technometrics 2015; 57(3):388-394.
24. Efron B, Storey JD, Tibshirani R. Technical Report 2001-217. Stanford University, Palo Alto, CA, 2001.
25. Frisén M. Evaluations of methods for statistical surveillance. Statistics in Medicine 1992; 11:1489-1502.
26. Jiang W, Shu L, Zhao H, Tsui KL. CUSUM procedures for health care surveillance. Quality and Reliability Engineering International 2013; 29(6):883-897.
27. The data are queried from the SurvStat@RKI database of the German Robert Koch Institute: http://www3.rki.de/SurvStat/. Accessed March 16, 2015.
28. Hawkins DM, Olwell DH. Cumulative Sum Charts and Charting for Quality Improvement. Springer: New York, NY, 1998; 119-120.
29. Bayer C, Bernard H, Prager R, Rabsch W, Hiller P, Malorny B, Pfefferkorn B, Frank C, de Jong A, Friesema I, Start K, Rosner BM. An outbreak of Salmonella Newport associated with mung bean sprouts in Germany and the Netherlands, October to November 2011. Eurosurveillance 2014; 19(1).
30. Gandy A, Lau FD. Non-restarting cumulative sum charts and control of the false discovery rate. Biometrika 2013; 100:261-268.
31. Strimmer K. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics 2008; 24(12):1461-1462.
32. Lucas JM, Saccucci MS. Exponentially weighted moving average control schemes: properties and enhancements. Technometrics 1990; 32:1-12.
33. Dassanayake S, French JP. An improved CUSUM-based procedure for prospective disease surveillance for count data in multiple regions. Statistics in Medicine, accepted.
34. Lin C, Tsui K, Lin C. A spatial EWMA framework for detecting clustering. Quality and Reliability Engineering International 2014; 30(2):181-189.
35. Sparks R, Ellis P. Detection of multiple outbreaks using spatio-temporal EWMA ordered statistics. Communications in Statistics - Simulation and Computation 2014; 43(10):2678-2701.
36. Montgomery DC. Introduction to Statistical Quality Control. John Wiley & Sons: New York, NY, 2005; 338.
37. Anselin L. Local indicators of spatial association: LISA. Geographical Analysis 1995; 27(2):93-116.
38. Moran PAP. Notes on continuous stochastic phenomena. Biometrika 1950; 37:17-23.
39. Barbiero A, Ferrari PA. Simulation of correlated Poisson variables. Applied Stochastic Models in Business and Industry 2015; 31(5):669-680.
40. Lawson AB, Kleinman K. Spatial and Syndromic Surveillance for Public Health. John Wiley & Sons: West Sussex, England, 2005; 37.
41. Elbert Y, Burkom HS. Development and evaluation of a data-adaptive alerting algorithm for univariate temporal biosurveillance data. Statistics in Medicine 2009; 28:3226-3248.
42. Shmueli G, Burkom H. Statistical challenges facing early outbreak detection in biosurveillance. Technometrics 2010; 52:39-51.
43. Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinformatics 2008; 9:303.
Appendix A

Details of the Storey-Tibshirani Procedure and the Estimation of the Proportion of True Null Hypotheses

The Storey-Tibshirani procedure [17] can be used to estimate q-values for a given set of p-values. Storey and Tibshirani [17] state that a q-value is, by definition, obtained through a process of estimating the FDR. Let $F(t)$ be the number of false positives (out of $m$ tests) and $S(t)$ be the number of significant tests (out of $m$ tests) when using a threshold $t$. We need to estimate

$$FDR(t) = E\!\left[\frac{F(t)}{S(t)}\right].$$

We know that for a large number of tests $m$,

$$FDR(t) \approx \frac{E[F(t)]}{E[S(t)]}.$$

First, consider the estimation of the numerator $E[F(t)]$. Let $m_0$ be the number of tests for which the null hypothesis is true. Using the fact that p-values corresponding to true null hypotheses follow the standard uniform distribution, the probability of a null p-value less than or equal to $t$ is $t$. So an estimate of $E[F(t)]$ is $m_0 t$. Note that $m_0 t$ is equivalent to $\pi_0 m t$, where $\pi_0 = m_0 / m$ is the proportion of tests for which the null hypothesis is true. We describe a process for estimating $\pi_0$ below.

Second, consider the estimation of the denominator $E[S(t)]$. A simple estimate of $E[S(t)]$ is the observed $S(t)$; since $S(t)$ is the number of significant tests out of $m$ tests, $S(t) = \#\{p_i \le t\}$. With the estimates of the numerator and the denominator, the estimated FDR at a threshold $t$ is

$$\widehat{FDR}(t) = \frac{\hat{\pi}_0\, m\, t}{\#\{p_i \le t\}}.$$

The estimation of $\pi_0$ in the above equation proceeds as follows. Storey and Tibshirani [17] use the fact that null p-values are uniformly distributed on $[0, 1]$, while p-values for truly alternative tests tend to be close to zero. Thus, most of the p-values close to 1 will be for tests where the null hypothesis is true. Storey and Tibshirani [17] choose a tuning parameter $\lambda$ so that tests with $p_i > \lambda$ are for hypotheses where (we believe) the null hypothesis is actually true. Therefore, $\#\{p_i > \lambda\}$ approximates the number of true null hypotheses with p-values in $[\lambda, 1]$. However, since the p-values of true null hypotheses are uniformly distributed on $[0, 1]$, there will still be some true null hypotheses with p-values less than $\lambda$; thus $\#\{p_i > \lambda\} \approx m_0 (1 - \lambda)$. Rearranging this expression gives $m_0 \approx \#\{p_i > \lambda\} / (1 - \lambda)$, so $\pi_0$ can be estimated as

$$\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m (1 - \lambda)}.$$

Storey and Tibshirani [17] use an automated, cubic-spline-based procedure to estimate $\pi_0$. First, $\hat{\pi}_0(\lambda)$ is calculated for $\lambda = 0, 0.01, 0.02, \ldots, 0.95$. Then a natural cubic spline with 3 degrees of freedom is fitted to these estimates; the degrees of freedom are set to 3 to limit the curvature to be similar to that of a quadratic function that fits the data well. A final estimate of $\pi_0$ is obtained by evaluating the natural cubic spline at $\lambda = 1$ (since the bias of $\hat{\pi}_0(\lambda)$ decreases with increasing $\lambda$, the bias is smallest as $\lambda$ approaches 1). The q-value is the minimum FDR that can be obtained when using a threshold $t \ge p_i$; therefore, the estimated q-value as a function of the p-value is

$$\hat{q}(p_i) = \min_{t \ge p_i} \widehat{FDR}(t).$$
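The estimator above can be condensed into a short sketch. This is a simplified implementation: $\pi_0$ is estimated at a single $\lambda$ rather than with the natural cubic spline described above, ties among p-values are ignored, and the function name is mine.

```python
def qvalues(pvals, lam=0.5):
    """Storey-Tibshirani q-values with a single-lambda pi0 estimate.

    FDR-hat(t) = pi0 * m * t / #{p_i <= t}; the q-value of each p-value is
    the running minimum of FDR-hat over thresholds t >= p.
    """
    m = len(pvals)
    # pi0-hat(lambda) = #{p_i > lambda} / (m * (1 - lambda)), capped at 1
    pi0 = min(1.0, sum(p > lam for p in pvals) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):              # from the largest p down
        i = order[rank]
        fdr_t = pi0 * pvals[i] * m / (rank + 1)    # #{p_j <= p_i} = rank + 1
        running_min = min(running_min, fdr_t)
        q[i] = running_min
    return q

# toy run: pi0-hat = 2 / (5 * 0.5) = 0.8, so q for p = 0.001 is
# 0.8 * 0.001 * 5 / 1 = 0.004
q = qvalues([0.001, 0.02, 0.8, 0.6, 0.4])
```

The running minimum enforces the monotonicity in the definition $\hat{q}(p_i) = \min_{t \ge p_i} \widehat{FDR}(t)$, so a smaller p-value never receives a larger q-value.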
APPENDIX B

The Bootstrap Procedure Used in Chapters II, III, and IV

In a typical bootstrap, the observations are independent and identically distributed, and we sample $B$ values with replacement. The bootstrap samples for the three procedures proposed in Chapters II, III, and IV are collected as follows. Let $Y_i = (Y_{i1}, \ldots, Y_{im})'$ be the vector of counts collected from each of the $m$ regions at time $i$, and let $i = 1, \ldots, T_0$ index the no-outbreak (in-control) period. We assume $Y_1, \ldots, Y_{T_0}$ to be independent and identically distributed. Then we sample disease count vectors, with replacement, $B$ times from the $T_0 \times m$ matrix whose rows are the disease count vectors for all $m$ regions during the no-outbreak period:

$$Y = \begin{pmatrix} Y_{11} & \cdots & Y_{1m} \\ \vdots & \ddots & \vdots \\ Y_{T_0 1} & \cdots & Y_{T_0 m} \end{pmatrix}.$$
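The resampling step amounts to drawing whole rows of the in-control matrix with replacement. A minimal sketch (function name and toy data are mine):

```python
import random

def bootstrap_count_vectors(in_control, B, rng):
    """Resample whole days (m-region count vectors) with replacement.

    in_control: list of T0 rows, each a tuple of m regional counts from
    the no-outbreak period; rows are treated as i.i.d.
    """
    return [rng.choice(in_control) for _ in range(B)]

rng = random.Random(0)
in_control = [(4, 3, 5), (6, 4, 4), (2, 5, 3)]   # toy: T0 = 3 days, m = 3 regions
boot = bootstrap_count_vectors(in_control, 5, rng)
```

Resampling entire days, rather than individual regional counts, preserves the cross-regional dependence within a day, which matters for the spatially aware statistics built on these samples.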
APPENDIX C

Additional CED and FPR Plots for the Pooled EWMA and CUSUM Models

Additional CED Plots for the Pooled EWMA and CUSUM Models

CEDs for both the EWMA and CUSUM methods for the pooled model using the BY and ST procedures in regions 3, 4, 5, and 6.
CEDs for both the EWMA and CUSUM methods for the pooled model using the BY and ST procedures in regions 8, 9, 10, and 11.
CEDs for both the EWMA and CUSUM methods for the pooled model using the BY and ST procedures in regions 12, 14, 15, and 16.
CEDs for both the EWMA and CUSUM methods for the pooled model using the BY and ST procedures in regions 17, 18, 19, and 20.
CEDs for both the EWMA and CUSUM methods for the pooled model using the BY and ST procedures in regions 21, 22, 23, and 24.
CEDs for both the EWMA and CUSUM methods for the pooled model using the BY and ST procedures in region 25.
Additional FPR Plots for the Pooled EWMA and CUSUM Models

False positive rates (FPRs) for both the EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in regions 3, 4, 5, and 6.
False positive rates (FPRs) for both the EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in regions 8, 9, 10, and 11.
False positive rates (FPRs) for both the EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in regions 12, 14, 15, and 16.
False positive rates (FPRs) for both the EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in regions 17, 18, 19, and 20.
False positive rates (FPRs) for both the EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in regions 21, 22, 23, and 24.
False positive rates (FPRs) for both the EWMA and CUSUM statistics for the pooled model using the BY and ST procedures in region 25.
APPENDIX D

CED and FPR Plots for the Pooled EWMA and Spatially Adjusted EWMA Models

Additional CED Plots for the Pooled EWMA and Spatially Adjusted EWMA Models

Average CED values versus the duration of the outbreak for the proposed EWMA model using the local Moran's I statistic and the pooled EWMA baseline model for regions 14, 19, 15, and 25.
Additional FPR Plots for the Pooled EWMA and Spatially Adjusted EWMA Models

Average FPR values versus the duration of the outbreak for the proposed EWMA model using the local Moran's I statistic and the pooled EWMA baseline model for regions 2, 3, 4, and 5.
Average FPR values versus the duration of the outbreak for the proposed EWMA model using the local Moran's I statistic and the pooled EWMA baseline model for regions 6, 7, 8, and 9.
Average FPR values versus the duration of the outbreak for the proposed EWMA model using the local Moran's I statistic and the pooled EWMA baseline model for regions 10, 11, 12, and 14.
PAGE 138
132 Average FPR values versus the duration of the outbreak for proposed EWMA and WMA baseline model for regions 15, 16, 17, and 18 .
PAGE 139
133 Average FPR values versus the duration of the outbreak for proposed EWMA and WMA baseline model for regions 20, 21, 22, and 23 .
PAGE 140
134 Average FPR values versus the duration of the outbreak for proposed EW MA and WMA baseline model for region 25 .
PAGE 141
APPENDIX E
Observed FDR as a Function of Length of Outbreak for the Pooled CUSUM and EWMA Models

The observed FDR for both pooled CUSUM models using the BY and ST procedures is comparable across all outbreak lengths. The CUSUM models, which have higher rates of false alarms, display higher FDR across most outbreak lengths than the EWMA models, which have relatively lower rates of false alarms. The observed FDR for the more sensitive pooled EWMA model using the ST procedure is consistently higher than that of the model using the BY procedure. Both EWMA curves show a visible downward slope with increasing outbreak length. Note that with increasing length of outbreak there are more opportunities for true alarms in the denominator, which produces the downward trend in FDR as the length of outbreak increases for the EWMA models. However, this downward trend is not as visible for the CUSUM models, because false alarms also increase with longer outbreaks: the CUSUM statistic takes longer to signal the end of an outbreak.
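The denominator effect described above can be illustrated with a toy calculation. The alarm counts below are invented purely for illustration and are not results from the dissertation's simulation studies:

```python
def observed_fdr(fp, tp):
    """Observed FDR = false alarms / total alarms raised."""
    return fp / (fp + tp) if (fp + tp) > 0 else 0.0

lengths = [4, 8, 12, 16]  # outbreak length in time periods

# EWMA-like behaviour: false alarms roughly constant in outbreak length,
# true alarms growing with it, so the observed FDR falls steadily.
ewma_fdr = [observed_fdr(2.0, 0.8 * L) for L in lengths]

# CUSUM-like behaviour: false alarms also grow with outbreak length
# (the statistic is slow to signal the end of the outbreak),
# so the downward trend in FDR is much weaker.
cusum_fdr = [observed_fdr(2.0 + 0.4 * L, 0.8 * L) for L in lengths]
```

With these invented rates the EWMA-like FDR drops from about 0.38 to about 0.14 across the four lengths, while the CUSUM-like FDR only drops from about 0.53 to about 0.40, mirroring the qualitative pattern described above.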
APPENDIX F
Observed FDR as a Function of Length of Outbreak for the Pooled EWMA and SEWMA Models

Both models display comparable observed FDR across all outbreak lengths, and in both models there is a downward slope with increasing outbreak length. Note that with longer outbreaks there are more opportunities for true alarms in the denominator, which results in the downward trends.
APPENDIX G
Implementing the Storey–Tibshirani Procedure

In implementing the Storey–Tibshirani procedure, the approach of [31] is used to calculate q-values. This approach estimates η₀, the proportion of true null hypotheses (the notation η₀ is used in [31]), and employs a different estimate of the FDR from the one used in the original Storey–Tibshirani [17] procedure.

The first step of the method is to determine a suitable truncation point for the p-values (corresponding to the tuning parameter λ in [17]), which separates the null p-values from the non-null p-values. This truncation point is found by first fitting an approximate null model and then minimizing the false non-discovery rate (FNDR) as a function of the truncation point, where the FNDR is defined as #{false negatives}/(#{false negatives} + #{true negatives}). Thus, the FNDR must be known to determine the truncation point; but to compute the FNDR, a truncation point must be specified. To overcome this circular inferential step, a two-step procedure is used. First, an approximately optimal truncation point is obtained by fitting an approximate null model (by matching the median of the null model with the median of the observed p-values) and obtaining an approximate FNDR curve. In the second step, a refined fit of the null model is obtained via a truncated maximum likelihood method (using the approximate truncation point obtained in the first step) to find the value of the truncation point that minimizes the estimated FNDR.

Once the truncation point is calculated, the data are censored to obtain a truncated null density function. The underlying assumption here is that all data points below the truncation point belong to the null portion. Thus, the truncated null density is proportional to the null density f₀ below the truncation point and is zero otherwise (for p-values, the null density f₀ is the uniform distribution on [0, 1]). Afterwards, the corresponding likelihood function is maximized (equivalently, the negative log-likelihood of the truncated density function is minimized) to obtain the estimate η̂₀.

Using the η̂₀ calculated in the previous step, the modified Grenander estimator [43] is employed to estimate the distribution function of the p-values. Essentially, the standard Grenander density estimator is the decreasing piecewise-constant function equal to the slopes of the least concave majorant (LCM) of the empirical cumulative distribution function (ECDF). The LCM can be described as follows: if a density f is monotonically decreasing over the positive half-line, then the corresponding distribution function F is concave over the positive half-line. When nonparametric maximum likelihood is employed to obtain an estimator of f, the corresponding estimator of F turns out to be the LCM of the ECDF. The modified Grenander estimator is the standard Grenander estimator calculated from a modified ECDF of the p-values.

Finally, using the estimated distribution function F̂ of the p-values, the q-values are obtained by calculating the tail-area-based FDR. Generally, for an observed test statistic y (with small values regarded as extreme), the tail-area-based FDR is defined by Fdr(y) = η₀F₀(y)/F(y), where F₀ is the null distribution function; the p-value corresponding to the test statistic y then equals F₀(y). Specifically, when the observed statistic is itself a p-value, the null density of the p-values is the uniform distribution on [0, 1], so F₀(p) = p and Fdr(p) = η̂₀ p/F̂(p). Since a q-value is the minimum FDR that can be obtained when using a threshold t, the estimated q-value as a function of the p-value is q̂(p) = min over t ≥ p of η̂₀ t/F̂(t).
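The pipeline above (estimate η₀, build a Grenander/LCM estimate of F̂, convert tail-area FDR values to q-values) can be sketched in a few dozen lines. This is a simplified illustration, not the dissertation's implementation: η₀ is estimated with a simple Storey-type count at λ = 0.5 rather than the truncated maximum likelihood fit described above, the standard (rather than modified) Grenander construction is used, and the function names are invented for the example.

```python
import random
from bisect import bisect_right

def storey_eta0(pvals, lam=0.5):
    """Crude Storey-type estimate of eta0, the proportion of true nulls.
    (Stands in for the truncated-MLE fit described in the appendix.)"""
    n = len(pvals)
    return min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * n))

def lcm_knots(xs, ys):
    """Indices of the knots of the least concave majorant (LCM) of the
    points (xs, ys), xs strictly increasing. Chord slopes along the LCM
    are non-increasing."""
    hull = []
    for i in range(len(xs)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # Drop b if it lies on or below the chord from a to i,
            # i.e. slope(a, b) <= slope(b, i).
            if (ys[b] - ys[a]) * (xs[i] - xs[b]) <= (ys[i] - ys[b]) * (xs[b] - xs[a]):
                hull.pop()
            else:
                break
        hull.append(i)
    return hull

def grenander_cdf(pvals):
    """Knots of the Grenander-type estimate F-hat: the LCM of the ECDF."""
    n = len(pvals)
    xs, ys = [0.0], [0.0]
    for i, p in enumerate(sorted(pvals)):
        if xs[-1] == p:          # keep one ECDF point per distinct p-value
            ys[-1] = (i + 1) / n
        else:
            xs.append(p)
            ys.append((i + 1) / n)
    if xs[-1] < 1.0:             # ensure F-hat(1) = 1
        xs.append(1.0)
        ys.append(1.0)
    idx = lcm_knots(xs, ys)
    return [xs[i] for i in idx], [ys[i] for i in idx]

def cdf_eval(kx, ky, p):
    """Evaluate the piecewise-linear concave F-hat at p."""
    j = bisect_right(kx, p) - 1
    if j >= len(kx) - 1:
        return ky[-1]
    t = (p - kx[j]) / (kx[j + 1] - kx[j])
    return ky[j] + t * (ky[j + 1] - ky[j])

def qvalues(pvals):
    """Tail-area FDR Fdr(p) = eta0 * p / F-hat(p), then
    q(p) = min over t >= p of Fdr(t), via a running min from the right."""
    eta0 = storey_eta0(pvals)
    kx, ky = grenander_cdf(pvals)
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    fdr = {}
    for i in order:
        F = cdf_eval(kx, ky, pvals[i])
        fdr[i] = min(1.0, eta0 * pvals[i] / F) if F > 0 else 0.0
    q, running = [0.0] * len(pvals), 1.0
    for i in reversed(order):
        running = min(running, fdr[i])
        q[i] = running
    return q

# Demo: 80 null p-values (uniform) mixed with 20 small "signal" p-values.
random.seed(1)
pv = [random.random() for _ in range(80)] + [random.random() * 0.01 for _ in range(20)]
q = qvalues(pv)
```

Because q(p) is a running minimum of Fdr over all thresholds t ≥ p, the resulting q-values are monotone in the p-values, which is what makes "reject every hypothesis with q ≤ α" a coherent decision rule.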

