
Citation 
 Permanent Link:
 http://digital.auraria.edu/AA00007149/00001
Material Information
 Title:
 Prospective disease surveillance with the CUSUM and spatial scan methods
 Creator:
 Hall, Lauren M.
 Place of Publication:
 Denver, CO
 Publisher:
 University of Colorado Denver
 Publication Date:
 2019
 Language:
 English
Thesis/Dissertation Information
 Degree:
 Doctorate (Doctor of Philosophy)
 Degree Grantor:
 University of Colorado Denver
 Degree Divisions:
 Department of Mathematical and Statistical Sciences, CU Denver
 Degree Disciplines:
 Applied mathematics
 Committee Chair:
 Austin, Erin
 Committee Members:
 French, Joshua
 Santorico, Stephanie
 Hartke, Stephen
 Anthamatten, Peter

Full Text 
PROSPECTIVE DISEASE SURVEILLANCE WITH THE CUSUM AND SPATIAL SCAN METHODS
by
LAUREN M. HALL
B.A., Colorado State University, 2012
B.S., University of Colorado Denver, 2014
M.S., University of Colorado Denver, 2017
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Applied Mathematics Program
2019
This thesis for the Doctor of Philosophy degree by Lauren M. Hall has been approved for the Applied Mathematics Program by
Erin Austin, Chair
Joshua French, Advisor
Stephanie Santorico
Stephen Hartke
Peter Anthamatten
Date: May 18, 2019
Hall, Lauren M. (Ph.D., Applied Mathematics)
Prospective Disease Surveillance with the CUSUM and Spatial Scan Methods
Thesis directed by Associate Professor Joshua French
ABSTRACT
The cumulative sum (CUSUM) control chart is a method for detecting whether the mean of a time series process has shifted beyond some tolerance (i.e., is out-of-control). CUSUM control charts have been widely used for prospective surveillance due to their ability to quickly detect both large, sudden increases and small, persistent increases in reported cases of a disease associated with an outbreak. Originally developed in an industrial process control setting, the CUSUM statistic is typically reset to zero once a process is discovered to be out-of-control since the industrial process is then recalibrated to be in-control. In a disease surveillance setting, resetting the CUSUM statistic is unrealistic, and a non-restarting CUSUM chart is used instead. In practice, the non-restarting CUSUM provides more information, but suffers from a high false alarm rate following the end of an outbreak. In this thesis, we propose a modified hypothesis test for use with the non-restarting CUSUM when testing whether a process is out-of-control. By simulating statistics conditional on the presence of an out-of-control process in recent time periods, we are able to retain the CUSUM's power to detect an out-of-control process while controlling the post-out-of-control false alarm rate at the desired level. Additionally, we propose a new method for incorporating spatial information into CUSUM control charts by combining the non-restarting CUSUM with spatial scan methods. We compute the cumulative sum of scan statistics, or CUSCAN statistics, using a nonparametric form of the CUSUM in conjunction with the circular and elliptic scan methods. Using publicly available benchmark data, we demonstrate that the CUSCAN has high power to both quickly detect a new outbreak and identify the spatial regions where the outbreak exists. Lastly, we propose a method for extending the CUSCAN for dynamic scan methods that do not use a fixed set of windows for detecting spatial clusters, using the restricted flexible scan method as an example. By linking overlapping windows across time, we are able to compute the CUSCAN statistic for scan methods with time-varying cluster windows, which allows the CUSCAN to detect clusters of arbitrary shape.
TABLE OF CONTENTS
CHAPTER
I. INTRODUCTION
II. A MODIFIED CUSUM TEST TO CONTROL POST-OUTBREAK FALSE ALARMS
2.1 Introduction
2.2 The Non-Restarting CUSUM False Alarm Problem
2.3 An Adaptive Hypothesis Test for the Non-Restarting CUSUM
2.3.1 A Modified NRCUSUM Test
2.3.2 Simulating Under the Alternative Hypothesis
2.4 Data Demonstration: Conditional Hypothesis Testing on Simulated Data
2.4.1 Results using known $\lambda_a$
2.4.2 Results using $\lambda_1$
2.4.3 Results using $\lambda_m$
2.4.4 Simulation with Bootstrap
2.4.5 Summary of Simulation Results
2.5 Demonstration: Salmonella data
2.6 Discussion
III. CUSCAN: DETECTING EMERGING DISEASE CLUSTERS WITH THE CUMULATIVE SUM OF SCAN STATISTICS
3.1 Introduction
3.2 Review of Methods
3.2.1 Spatial Scan Methods
3.2.2 The Non-Restarting CUSUM
3.3 Proposed Methodology
3.3.1 The CUSCAN Method
3.3.2 Selection of k
3.4 Demonstration: Simulated Data based on New York Leukemia Data
3.5 Demonstration and Power Assessment: New York City Benchmark Data
3.5.1 Data Description
3.5.2 Analysis Design
3.5.3 Results
3.6 Discussion
IV. AN EXTENSION OF THE CUSCAN FOR DYNAMIC SCANNING METHODS: THE RESTRICTED FLEXIBLE SCAN STATISTIC
4.1 Introduction
4.2 Review of Methods
4.2.1 The CUSCAN Method
4.2.2 The Restricted Flexible Scan Method
4.3 CUSCAN with the Restricted Flexible Scan Method
4.4 Demonstration: Simulated Data based on Northeast Benchmark Data
4.5 Discussion
V. CONCLUSION
REFERENCES
CHAPTER I
INTRODUCTION
Disease surveillance is a key tool in public health applications. By monitoring the occurrence and spread of disease, we gain information that allows us to better react to potential threats to public safety. One of the most important features of surveillance is early detection of new outbreaks, as the sooner a health event is identified, the more effective any available intervention will be. Prospective and syndromic surveillance methods assess data in real time as updates become available and search for evidence of any unusual counts or distributions of cases that could indicate a new or ongoing health crisis. In prospective disease surveillance, the data streams monitored are primarily counts of confirmed diagnoses of a disease of interest, whereas syndromic surveillance seeks patterns in data on related variables, such as medication sales or reports of flu-like symptoms that precede formal diagnosis and may provide earlier detection of an emerging outbreak [Das et al., 2005].
In addition to rapid detection of a new disease outbreak, prospective surveillance methods should also be able to accurately identify the beginning and end of an outbreak, i.e., to accurately distinguish outbreak from non-outbreak time periods [Frisen, 1992]. When counts are aggregated within physical regions, such as cities, counties, or census tracts, it is also important that the location of the outbreak be accurately determined. In essence, the performance of prospective surveillance methods is usually measured by (1) time required to detect a new outbreak, (2) sensitivity to determine the times/locations an outbreak is present, and (3) specificity to determine the times/locations with no outbreak.
When testing for the presence of an outbreak in surveillance data, statistical methods often work by comparing observed case counts to a baseline level that is expected in the absence of an outbreak. If observed counts are significantly higher than expected, then we say there is evidence of an outbreak of disease. Different techniques exist to estimate the baseline expected counts depending on the type and quantity of data available. When covariate information such as population demographics exists, regression methods offer a flexible set of tools for computing expected counts. When case counts are indexed over time, time series regression such as autoregressive integrated moving average (ARIMA) models can be used [Diebold, 2007]. When counts are indexed over space, spatial regression techniques such as simultaneous autoregressive (SAR) or conditional autoregressive (CAR) models can be used [Waller and Gotway, 2004]. These methods can be combined for data that is indexed over both time and space, along with other types of regression methods including generalized linear models (GLMs) and Bayesian models. When covariates are available, regression methods can produce informed and more accurate expected counts, improving the ability to identify abnormal counts of disease. Further discussion of these and other regression methods can be found in surveillance method review papers such as Unkel et al. [2012], Robertson et al. [2010], or Tsui et al. [2008].
When population demographics or other covariates are unavailable, other techniques may be used that are designed to work with more sparse information. Another popular approach for prospective surveillance is statistical process control, where control charts are used to monitor the mean of a process over time. Originally developed to monitor industrial processes, control charts are used with time series data to detect the point in time when the mean of a data-generating process changes. In a public health setting, we may think of the spread of disease as a process we wish to monitor. When no outbreak is present, we expect case counts to follow some known distribution. When an outbreak occurs, the mean of the distribution shifts, and the number of reported cases increases. When no outbreak is present, the process is considered in-control, and when an outbreak occurs and the mean shifts, the process is declared to be out-of-control and the control chart will signal an alarm. Since the spread of disease can be simply modeled as a data-generating process, statistical process control methods are a natural choice for monitoring disease cases over time, and many have been adapted for use in a surveillance context [Sonesson and Bock, 2003]. One method commonly used for surveillance is the cumulative sum (CUSUM) control chart, which is currently in use in surveillance systems like the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) and the Centers for Disease Control and Prevention's BioSense system [Tsui et al., 2008]. The CUSUM is designed to detect shifts in the mean of a process by accumulating deviations from expected levels over time. This allows for the detection of both large changes and smaller but sustained changes, making it a powerful tool for identifying outbreaks of varying size. While the CUSUM was originally developed for use with normally distributed data [Page, 1954], the Poisson CUSUM [Lucas, 1985] is the version more commonly used due to the discrete count nature of the data. Other common types of control charts are exponentially weighted moving average (EWMA) charts [Roberts, 1959] and Shewhart charts [Shewhart, 1931], the latter of which have been used in many public health settings, as they can be used to show a proportion of incidents for a fixed period of time [Fasting and Gisvold, 2003, Coory et al., 2007]. Due to their strength and broad applications for use, this thesis will focus primarily on the properties of CUSUM control charts. While demographics or other covariates may be used when estimating the baseline distribution of a disease process, the CUSUM does not require that such data be available and can easily be used to full effect when only the total case counts are available.
The Poisson CUSUM has several user-chosen parameters that directly affect the timeliness and sensitivity of detecting an outbreak. The mean of the disease process in the absence of an outbreak (called the in-control mean) needs to be known or estimated from baseline data, and a potential size for the shifted mean needs to be provided. In general, the value given for the shifted mean is typically the smallest shifted mean that we want the chart to be able to detect. If the value provided is too large and overestimates the size of a future outbreak, it may take much longer for the outbreak to be detected, causing false negatives and low sensitivity. If the value provided is instead too small, normal variation in the in-control process may seem more significant than it is and result in additional false positives and low specificity. Additionally, an important decision to be made when using the CUSUM method is whether to restart the control chart following an alarm. In its traditional use in industrial process control, resetting the chart is standard. Once an industrial process has been determined to be out-of-control, the process is typically shut down and restarted back in an in-control state. Since a disease process cannot be forcibly returned to an in-control state, resetting the chart in a surveillance setting is less appropriate. Gandy and Lau [2012] described a version of the CUSUM that does not reset following an alarm, and this non-restarting CUSUM is a better choice for use with surveillance data. The restarting CUSUM prioritizes type I error control and reduces the total number of alarms, but if the process remains out-of-control following the reset, it may take several time periods before another alarm is sounded, resulting in a high type II error rate. In contrast, the non-restarting CUSUM retains the ability to detect a continuous set of time periods with abnormal disease activity, and thus has higher overall power during an outbreak. In return, the non-restarting CUSUM is known to suffer from high false alarm rates following the end of an outbreak, as it may take several time periods to register as back in control [Gandy and Lau, 2012, Dassanayake and French, 2016].
While the CUSUM has the ability to detect small shifts in a mean, it may take several periods of accumulated data for an outbreak to be recognized. For example, an outbreak that results in a total 5% increase in case counts may be difficult to identify without several days to accumulate evidence. However, if that 5% increase in total cases corresponds to a 20% local increase in cases in a smaller subset of the data, then incorporating that information can increase the chances of early detection. As such, we want to make use of spatial information when it is available. As computing technology improves, surveillance methods using spatial data are becoming more commonplace. For this reason, CUSUM control charts that utilize spatial information are preferred. A prime example of a method that incorporates spatial information into a CUSUM framework is the nearest neighbor CUSUM [Raubertas, 1989], which groups regions with their nearest neighbors and computes a CUSUM of the combined counts. This allows for the isolation of a set of regions affected by an outbreak, making them easier to detect. The previously mentioned Dassanayake and French [2016] also use a nearest-neighbor approach to aggregate case counts for the non-restarting CUSUM. Sonesson [2007] proposed an extension of the nearest neighbor CUSUM called the circular CUSUM, which aggregates counts across all regions within a set of circular windows of various sizes. These circular windows are centered at each region in the study area, and increase in size until the windows contain a prespecified amount of the total population. Once the windows are fitted, the case counts in the regions in each window are aggregated, the nearest-neighbor CUSUM is computed, and the largest statistic is used to determine if and where an outbreak is present.
The circular windows described by the circular CUSUM are derived from the Poisson spatial scan method developed by Kulldorff [1997], which is used to detect clusters in spatial data. In the spatial scan method, the same circular windows are used to aggregate counts from across multiple regions. The aggregate disease incidence rate inside each window is compared with the rate in the regions outside the window using a likelihood ratio statistic, with the maximum over all windows used to test for the presence of a disease cluster. While originally developed for use on retrospective data, Kulldorff [2001] extended the circular scan method to include windows spanning multiple time periods, allowing it to be used as a prospective tool with time series data. Many other spatial scan methods exist, such as the elliptic scan method [Kulldorff et al., 2006], which uses ellipses rather than circles to aggregate regions, as well as methods that do not use fixed-shape windows, such as the upper level set scanning method [Patil and Taillie, 2004] and the dynamic minimum spanning tree method [Assuncao et al., 2006]. These methods offer more information on the distribution of disease counts across space than simply aggregating case counts, but most are designed for use specifically for retrospective analysis and not for prospective surveillance.
In this thesis, we aim to create an improved version of the CUSUM method that incorporates a higher degree of spatial information than previous methods such as the nearest neighbor CUSUM. Given the strengths of the CUSUM as a prospective surveillance tool and the power of spatial scan methods to identify local disease clusters, we propose a combination of the two methods. By taking the cumulative sum of scan statistics over time, we can use the strengths of each to be able to quickly and accurately detect a local increase in case counts corresponding to the beginning of a new outbreak.
The structure of this thesis is as follows: in Chapter II, we examine the properties of the non-restarting CUSUM that make it difficult to use in practice, and propose a modified hypothesis test to control post-outbreak false alarms while retaining power during outbreaks. With its primary weakness mitigated, the non-restarting CUSUM is the best choice for use with our new surveillance method. In Chapter III, we describe in detail the proposed methodology for computing the cumulative sum of scan statistics. This method, which we call the CUSCAN, uses a nonparametric form of the non-restarting CUSUM in conjunction with spatial scan methods to identify both time periods and spatial locations where an outbreak is present. Using simulated benchmark data and the circular and elliptic scan methods, we demonstrate that the CUSCAN has considerable power to rapidly detect both large and small disease clusters. In Chapter IV, we provide an extension to the CUSCAN that allows for the use of other scan methods, specifically those that do not use windows of fixed shape. This allows the CUSCAN to detect disease clusters of arbitrary shape, creating a modular surveillance method with the capacity to be used in a wide variety of applications. Finally, in Chapter V, we summarize the findings of this research and discuss future directions of work.
CHAPTER II
A MODIFIED CUSUM TEST TO CONTROL POST-OUTBREAK FALSE ALARMS
Acknowledgements
The content of this chapter was accepted for publication in Statistics in Medicine in 2019. In accordance with the copyright agreement, the original submitted version of the manuscript appears in this chapter. Other than formatting, no changes have been made. The final accepted version, coauthored by Joshua French, may be found at https://doi.org/10.1002/sim.8088.
2.1 Introduction
Disease surveillance is a key tool in public health applications. By monitoring the occurrence and spread of disease, we gain information that allows us to better react to potential threats to public safety. One of the most important features of surveillance is early detection of new outbreaks, as the sooner a health event is identified, the more effective any available intervention will be. In addition to timeliness (the speed at which a surveillance method can detect the presence of an outbreak), a method should also be able to accurately separate outbreak time periods from non-outbreak periods [Frisen, 1992]. In other words, a good surveillance method should be able to sound an alarm during time periods when an outbreak is present (high sensitivity/power in detecting an ongoing outbreak) and sound no alarm during periods where there is no outbreak (high specificity).
While timeliness, sensitivity, and specificity are all important in creating a useful surveillance tool, no method will perform perfectly in all settings. Researchers often have to choose which features to prioritize depending on their specific needs and the purpose of their surveillance. A method that prioritizes quick detection of an outbreak and high sensitivity may experience a higher rate of false alarms during non-outbreak periods. Likewise, a method that prioritizes controlling false alarms may have lower power to detect true outbreaks, due to the natural trade-off between power and type I error rate.
One commonly used surveillance method where such prioritization decisions need to be made is the cumulative sum (CUSUM) control chart. Originally developed for industrial process control by Page [1954], the CUSUM control chart is designed to detect persistent shifts in the mean of a process by aggregating deviations from the mean over time. The process is deemed "out-of-control" when a persistent shift is detected. Lucas [1985] extended the CUSUM control chart to Poisson count data. The Poisson CUSUM has recently been used in many prospective surveillance applications [Woodall, 2006], including surveillance systems such as BioSense and the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) [Tsui et al., 2008]. In the disease surveillance context, the counts are the disease incidence counts at each time, and a process is declared out-of-control when there is an outbreak of disease beyond the standard incidence behavior.
The CUSUM method has several user-chosen parameters that directly affect the timeliness and sensitivity of detecting an outbreak. Additionally, an important decision to be made when using the CUSUM method is whether to restart the control chart when an alarm is sounded. The restarting CUSUM, the traditional form of the chart, prioritizes type I error control and necessarily lacks the ability to identify all active outbreak time periods, as the chart is reset immediately after an outbreak is identified. In contrast, the non-restarting CUSUM retains information about the length of an ongoing outbreak by having the ability to detect a continuous set of time periods with abnormal disease activity, and thus has higher overall power during an outbreak. In return, the non-restarting CUSUM is known to suffer from high false alarm rates following the end of an outbreak [Gandy and Lau, 2012, Dassanayake and French, 2016].
In this paper, we take a closer look at the false alarm problem inherent to the non-restarting CUSUM. We propose a modification to the non-restarting CUSUM that will allow the method to retain its ability to monitor a continuous outbreak, while reducing the number of post-outbreak false alarms. In Section 2.2, we describe the non-restarting CUSUM and explain the origin of the post-outbreak false alarms. In Section 2.3, we detail the proposed modification to the CUSUM. We then demonstrate the use and effects of this modification, first in Section 2.4 with simulated data, and again in Section 2.5, by applying the modified CUSUM to the detection of a known outbreak of Salmonella Newport in Germany in 2011 [Bayer et al., 2014]. Lastly, we provide further discussion in Section 2.6.
2.2 The Non-Restarting CUSUM False Alarm Problem
The aim of the CUSUM method is to assess whether the mean of a time series process $\{Y_t, t = 1, 2, \dots\}$ has a persistent mean shift over time. Most commonly, at each time step $t$, the CUSUM method decides between $H_0: \lambda = \lambda_0$ vs $H_a: \lambda > \lambda_0$, where $\lambda$ is the (stationary) mean of the process. We define the CUSUM statistic for Poisson count data at time $t$, $C_t$, by the following recursive formula [Lucas, 1985, Hawkins and Olwell, 1998]:
$$C_t = \max\{0, C_{t-1} + Y_t - k\}, \tag{2.1}$$
where $C_0 = 0$, $Y_t$ is the observed count at time $t$, and $k$ is a constant calculated to control a type I error rate. When $Y$ is Poisson distributed, we define the constant $k$ by
$$k = \frac{\lambda_1 - \lambda_0}{\ln(\lambda_1) - \ln(\lambda_0)}, \tag{2.2}$$
where $\lambda_0$ is the in-control mean and $\lambda_1$ is typically the smallest out-of-control mean we want to detect. Based on these values, we determine the critical value $h$ that controls the error criterion at the desired level. At each time step, the CUSUM statistic $C_t$ is compared to $h$, and if $C_t > h$, we declare the process to be out-of-control.
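As a minimal sketch of equations (2.1) and (2.2), the recursion can be written in a few lines of Python. The function names `poisson_cusum_k` and `cusum_path` and the five example counts are illustrative assumptions, not part of the thesis, and the critical value $h$ is deliberately left out because its calibration is not specified here.

```python
import math


def poisson_cusum_k(lam0, lam1):
    """Reference value k from equation (2.2); requires lam1 > lam0 > 0."""
    return (lam1 - lam0) / (math.log(lam1) - math.log(lam0))


def cusum_path(counts, k, c0=0.0):
    """CUSUM statistics C_1, ..., C_T from equation (2.1), never reset."""
    stats = []
    c = c0
    for y in counts:
        c = max(0.0, c + y - k)  # accumulate deviations, floored at zero
        stats.append(c)
    return stats


# Hypothetical counts, chosen only to show the recursion at work.
k = poisson_cusum_k(4.0, 6.0)
path = cusum_path([3, 5, 8, 7, 2], k)
print(round(k, 4), [round(c, 4) for c in path])
```

Note that $k$ always lies between $\lambda_0$ and $\lambda_1$, so counts near the in-control mean pull the statistic toward zero while counts near the out-of-control mean push it upward.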
In an industrial process control setting, when a process is declared out-of-control, the data-generating process is shut down and can be restarted in an in-control state. For this reason, the CUSUM is traditionally reset after an outbreak, either to zero [Page, 1954] or to another value such as $h/2$ [Lucas and Crosier, 1982]. In a public health setting, we cannot repair and restart our disease process, so we should not reset the CUSUM following an alarm. This is known as a non-restarting CUSUM (NRCUSUM) control chart. By allowing the CUSUM to continue monitoring the process, we continue to receive signals as long as the process remains out-of-control, providing valuable information about the timeline of the detected outbreak.
One of the assumptions we make in disease surveillance is that outbreaks of disease represent a transient shift in our process, and the process will self-correct over time and return to an in-control state. For this reason, when monitoring a data stream using a CUSUM process, we often prefer to use a NRCUSUM, where the monitoring process is not reset following the detection of an outbreak [Gandy and Lau, 2012]. By allowing the process to continue, we can gain information about the length and severity of an outbreak, and monitor a process for evidence that it has returned to its in-control state.
Determining the time period when a process returns to its in-control state can be difficult if the recorded deviation was significantly high, either from a long out-of-control period or the presence of a particularly large outbreak. These situations cause the CUSUM statistic to rise rapidly, and it may take many time periods after the end of an outbreak for the CUSUM statistic to return to previous levels.
We demonstrate the CUSUM post-outbreak false alarm problem using a simulated example. Consider a set of count data observed at 100 consecutive time steps. We assume the responses are independent random variables having a Poisson distribution. The first 30 time periods are in-control with mean $\lambda_0 = 4$, the next 40 periods are out-of-control with mean $\lambda_1 = 6$, and the last 30 time periods return to previous in-control levels. Figure 2.1 shows the time series of the CUSUM statistic over the 100 times. We see that the CUSUM statistic remains elevated for several time periods after the process returns to an in-control state. During this time, the process will continue to signal an outbreak at each time period it remains above the significance threshold $h$, resulting in a large number of post-outbreak false alarms. A recent example of this problem can be found in Dassanayake and French [2016], where a non-restarting CUSUM framework was used. Gandy and Lau [2012] propose placing an upper limit on the CUSUM statistic, limiting how high it can grow and increasing the speed at which it can drop back below the threshold following an outbreak. While this method does reduce the false alarm rate, the rate still remains high relative to a restarting CUSUM, and limiting the CUSUM statistic growth can result in a loss of information on the size and duration of an outbreak.
Figure 2.1: An example of a non-restarting CUSUM displaying elevated levels post-outbreak. (Vertical axis: CUSUM value.)
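The simulated example above (30 in-control periods with mean 4, 40 outbreak periods with mean 6, 30 in-control periods) can be reproduced with a short sketch. Several choices here are assumptions made for illustration: the helper names, the seed, the use of Knuth's multiplication method for Poisson draws, and averaging over 200 replications rather than plotting a single stream as in Figure 2.1.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_cusum(rng, lam0=4.0, lam1=6.0, segments=(30, 40, 30)):
    """Simulate one stream (in-control, outbreak, in-control) and its CUSUM."""
    k = (lam1 - lam0) / (math.log(lam1) - math.log(lam0))
    means = [lam0] * segments[0] + [lam1] * segments[1] + [lam0] * segments[2]
    c, path = 0.0, []
    for lam in means:
        c = max(0.0, c + poisson_draw(rng, lam) - k)  # never reset after an alarm
        path.append(c)
    return path


rng = random.Random(2019)  # arbitrary seed, for reproducibility only
paths = [simulate_cusum(rng) for _ in range(200)]


def mean_at(t):
    """Average CUSUM value at 1-indexed time t across the replications."""
    return sum(path[t - 1] for path in paths) / len(paths)


in_control = mean_at(30)    # last period before the outbreak
end_outbreak = mean_at(70)  # last outbreak period
ten_after = mean_at(80)     # ten periods after the outbreak ends
print(f"t=30: {in_control:.1f}, t=70: {end_outbreak:.1f}, t=80: {ten_after:.1f}")
```

Because each in-control period lowers the statistic by only about $k - \lambda_0$ on average, the value ten periods after the outbreak typically remains far above its pre-outbreak level, which is the source of the post-outbreak false alarms.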
2.3 An Adaptive Hypothesis Test for the Non-Restarting CUSUM
In this section, we propose a solution to the post-outbreak false alarm problem that controls the false alarm rate without sacrificing information from the CUSUM. By accounting for the presence of an outbreak in recent time periods, we can adjust the way we perform our hypothesis tests to better detect the end of an outbreak and the return to an in-control state.
Traditionally, the CUSUM method uses a threshold $h$ chosen to control the average run length before a false alarm. When the CUSUM statistic crosses this threshold, an alarm is sounded and we conclude that an outbreak has occurred. We choose instead to use a p-value approach to hypothesis testing. Our adaptive test will have a nonstationary null distribution over time, so there cannot be a predetermined threshold $h$. A p-value approach facilitates a uniform testing approach and tells us how strong the evidence of an outbreak is at each time step.
To obtain a p-value for the NRCUSUM test, we simulate $N_{sim}$ data streams assuming the null hypothesis is true and calculate the CUSUM statistics for each stream. Let $C_t^{(i)}$ denote the CUSUM statistic at time $t$ for simulated data stream $i$. The p-value for the test at each time step $t$ is computed as the proportion of statistics, including the observed data [Waller and Gotway, 2004], which are at least as large as the observed:
$$p_t = \frac{1 + \sum_{i=1}^{N_{sim}} I\left(C_t^{(i)} \geq C_t\right)}{1 + N_{sim}}. \tag{2.3}$$
The inclusion of the observed statistic in the calculation prevents observing a p-value of zero in the case where none of the simulated statistics are as large as the observed. The primary cause of post-outbreak false alarms for the p-value method of hypothesis testing is the use of null hypothesis assumptions while simulating the data streams. An outbreak in recent time periods will cause the observed CUSUM statistic to remain elevated compared to the simulated streams with no prior outbreak, resulting in false alarms once the outbreak ends.
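A minimal sketch of the Monte Carlo p-value in equation (2.3) follows. The function name `monte_carlo_p_value` and the toy inputs are assumptions for illustration; in practice the simulated statistics would come from $N_{sim}$ simulated data streams.

```python
def monte_carlo_p_value(observed, simulated):
    """Proportion of statistics (observed included) at least as large as observed."""
    exceed = sum(1 for c in simulated if c >= observed)
    return (1 + exceed) / (1 + len(simulated))


# Toy values: three simulated statistics, none as large as the observed one.
p_small = monte_carlo_p_value(5.0, [1.0, 2.0, 3.0])
# An observed statistic of zero is never larger than any simulated statistic.
p_one = monte_carlo_p_value(0.0, [0.0, 2.0, 3.0])
print(p_small, p_one)
```

With $N_{sim}$ simulated streams, the smallest attainable p-value is $1/(1 + N_{sim})$, so $N_{sim}$ should be chosen large enough for the significance level in use.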
2.3.1 A Modified NRCUSUM Test
We propose a modified hypothesis test that allows us to simulate data conditional on the presence of an outbreak. Let $Y_t^{(i)}$ denote the simulated count of disease cases at time $t$ for data stream $i$. Beginning with our initial CUSUM value $C_0 = 0$, we perform the following steps at each time step $t$:
1. Simulate $N_{sim}$ counts under the null hypothesis, i.e., simulate $\{Y_t^{(i)}, i = 1, 2, \dots, N_{sim}\}$, where $Y_t^{(i)} \sim \text{Poisson}(\lambda_0)$ for $i = 1, 2, \dots, N_{sim}$.
2. Calculate $C_t^{(i)}$ for $i = 1, 2, \dots, N_{sim}$.
3. Calculate the p-value, $p_t$, for the hypothesis test by (2.3).
4. Determine whether there is an outbreak.
a. If we fail to reject the null hypothesis and no outbreak is detected, we increment $t$ and return to 1.
b. If the null hypothesis is rejected and we conclude that an outbreak is present in the current time step:
i. Simulate $N_{sim}$ counts under the alternative hypothesis, i.e., resimulate $\{Y_t^{(i)}, i = 1, 2, \dots, N_{sim}\}$. One can either assume $Y_t^{(i)} \sim \text{Poisson}(\lambda_a)$ for $i = 1, 2, \dots, N_{sim}$, or obtain a sample via bootstrap. We will discuss this step in more detail in what follows.
ii. Recalculate $C_t^{(i)}$ for $i = 1, 2, \dots, N_{sim}$ using the counts simulated in 4.b.i.
iii. Increment $t$ and return to 1.
This process results in $N_{sim}$ simulations over $t$ time periods, with elevated rates simulated for time periods where we have evidence of an outbreak. This allows our simulated CUSUM statistics to more closely follow the observed statistics, allowing for more specificity to detect a return to the in-control state and thus fewer post-outbreak false alarms. We note that there is an implicit assumption that the outbreak mean $\lambda_a$ is constant.
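The stepwise procedure can be sketched as a single loop over time. The code below is an illustrative reading of the steps, under several assumptions that are not stated in the text: the null is rejected when the p-value is at most `alpha`, outbreak counts are drawn from a Poisson distribution with a fixed mean `lam_alt`, and the function name `modified_nrcusum` and its arguments are hypothetical.

```python
import numpy as np

def modified_nrcusum(observed, lam0, lam_alt, k, alpha=0.05, n_sim=999, seed=0):
    """Sketch of the modified non-restarting CUSUM test.

    Assumptions (not from the source): reject when p <= alpha, and simulate
    outbreak periods as Poisson(lam_alt) with lam_alt held fixed.
    """
    rng = np.random.default_rng(seed)
    c_obs = 0.0                 # observed CUSUM, C_0 = 0
    c_sim = np.zeros(n_sim)     # simulated CUSUM values carried between steps
    pvalues, alarms = [], []
    for y in observed:
        c_obs = max(0.0, c_obs + y - k)
        # Steps 1-2: simulate null counts and update each simulated CUSUM.
        y_null = rng.poisson(lam0, size=n_sim)
        c_null = np.maximum(0.0, c_sim + y_null - k)
        # Step 3: Monte Carlo p-value, counting the observed statistic.
        p = (1 + np.sum(c_null >= c_obs)) / (1 + n_sim)
        alarm = bool(p <= alpha)
        if alarm:
            # Step 4b: replace this step's simulated counts with outbreak-level
            # counts and recompute the simulated CUSUM values from them.
            y_alt = rng.poisson(lam_alt, size=n_sim)
            c_sim = np.maximum(0.0, c_sim + y_alt - k)
        else:
            # Step 4a: keep the null-based simulated CUSUM values.
            c_sim = c_null
        pvalues.append(p)
        alarms.append(alarm)
    return np.array(pvalues), np.array(alarms)
```

Note that the p-value at each step is always computed from null-simulated counts for that step; only the values carried forward to the next step change after an alarm.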
2.3.2 Simulating Under the Alternative Hypothesis
The biggest question raised by the proposed NRCUSUM modification is how to simulate data under the alternative hypothesis. We propose four methods for simulating data during outbreak periods:
1. Simulate using a known $\lambda_a$: if the mean of the disease outbreak process, $\lambda_a$, is known, we can simulate data as random draws from a Poisson distribution with mean $\lambda_a$. This is unrealistic for real data.
2. Simulate using chosen $\lambda_1$: when the true out-of-control mean is unknown, we can simulate Poisson counts using the $\lambda_1$ parameter we select to calculate the CUSUM constant $k$, i.e., setting $\lambda_a = \lambda_1$.
3. Average out-of-control counts: with this approach, we estimate $\lambda_a$ by the average of the counts over all time periods where an outbreak has been detected up to that time. Note that we re-estimate the outbreak mean each time a new outbreak time period is observed to include the new data from that time period. Denote this estimate as $\lambda_m$.
4. Bootstrap observations from outbreak time periods: rather than simulate observations from a Poisson distribution, this method generates new counts by sampling with replacement the disease counts for times where an outbreak has been detected. The pool of observations we sample from increases each time we identify a new outbreak time period.
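The four options differ only in how the counts for a detected outbreak period are drawn. A minimal sketch follows, assuming a hypothetical helper `simulate_outbreak_counts` whose `method` labels ("known", "chosen", "mean", "bootstrap") are names invented for this example; `outbreak_counts` is assumed to hold the observed counts from all time periods flagged as outbreak so far.

```python
import numpy as np

def simulate_outbreak_counts(method, n_sim, rng, outbreak_counts=(),
                             lam_a=None, lam_1=None):
    """Draw n_sim simulated counts for one detected-outbreak time step."""
    outbreak_counts = np.asarray(outbreak_counts)
    if method == "known":       # method 1: true outbreak mean is known
        return rng.poisson(lam_a, size=n_sim)
    if method == "chosen":      # method 2: use the mean chosen to set k
        return rng.poisson(lam_1, size=n_sim)
    if method == "mean":        # method 3: average of flagged outbreak counts
        return rng.poisson(outbreak_counts.mean(), size=n_sim)
    if method == "bootstrap":   # method 4: resample flagged counts with replacement
        return rng.choice(outbreak_counts, size=n_sim, replace=True)
    raise ValueError(f"unknown method: {method}")
```

For the "mean" and "bootstrap" options, the caller would append the newly flagged observation to `outbreak_counts` before each call, so the estimate or sampling pool grows as the outbreak continues.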
2.4 Data Demonstration: Conditional Hypothesis Testing on Simulated Data
We will now demonstrate the validity of this approach using simulated data when the true outbreak mean and duration are known. We created 100 sets of simulated Poisson counts. Each data set contains observations for 125 time steps, with an outbreak simulated over 25 time periods from $t = 51$ through $t = 75$. We simulated the data under the null hypothesis using $\lambda_0 = 5$, with an outbreak level of $\lambda_a = 10$. For all tests, we utilize $\lambda_1 = 1.5\lambda_0 = 7.5$, resulting in
$$k = \frac{\lambda_1 - \lambda_0}{\ln(\lambda_1) - \ln(\lambda_0)} = \frac{7.5 - 5}{\ln(7.5) - \ln(5)} \approx 6.1658.$$
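As a quick check of the arithmetic, the value of k can be reproduced from the in-control mean of 5 and the mean to detect of 7.5, using the usual Poisson CUSUM reference value (difference of the means divided by the difference of their logarithms). The function name `cusum_reference_value` is a hypothetical label for this example.

```python
import math

def cusum_reference_value(lam0, lam1):
    """Poisson CUSUM constant k for in-control mean lam0 and target mean lam1."""
    return (lam1 - lam0) / (math.log(lam1) - math.log(lam0))

k = cusum_reference_value(5.0, 7.5)
print(round(k, 4))  # 6.1658
```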
For each of these data sets, we performed the analysis five times: once using the uncorrected NRCUSUM test and then the modified (corrected) NRCUSUM for each of the simulation approaches proposed in Section 2.3.2. A significance level of $\alpha = 0.05$ was used for all tests. We note that Gandy and Lau [2012] propose a method that lowers the false alarm rate of NRCUSUMs. However, their method controls the False Discovery Rate (FDR [Benjamini and Hochberg, 1995]), so the results are not comparable with our method, which controls the traditional type I error rate.
We randomly selected one of these data sets (set #27) to serve as a demonstration of the effects of the test. The time series of case counts is shown in Figure 2.2. We see a clear increase in the case counts, on average, between $t = 51$ and $t = 75$ during the simulated outbreak time period.
The average results of the uncorrected test are shown in Table 2.1. These results will be replicated in what follows for easier reading.
Figure 2.2: Cases over time for simulated data set #27.

Table 2.1: Average simulation results for uncorrected NRCUSUM

Pre-Outbreak        Outbreak      Post-Outbreak
False Alarm Rate    Alarm Rate    False Alarm Rate
0.053               0.958         0.990
2.4.1 Results using known $\lambda_a$
We begin with the scenario when $\lambda_a$ is known. While unrealistic in practice, this simulation will allow us to establish a baseline for the performance of the modified hypothesis test under ideal conditions.
The results for the uncorrected and modified test (using known $\lambda_a$) for data set #27 are shown in Figure 2.3. The points marked with an "x" in the plots indicate time periods where we rejected the null hypothesis and the process was declared to be out of control. The uncorrected CUSUM test has three pre-outbreak false alarms at t = 24, t = 26, and t = 27. The outbreak is detected at time t = 53 and this approach continues to sound alarms until the end of the simulated data. With the corrected hypothesis test used, the only pre-outbreak false alarm is at t = 24, and the process is declared back in control at t = 78, with one additional false alarm at t = 93. The tradeoff, then, comes in terms of our power during an outbreak. With no hypothesis test correction, an alarm is sounded during each time period where the process was out of control following initial detection of the outbreak (t = 53 to t = 75). After applying the correction, we experience additional false negatives at t = 54 and t = 61.
The averaged results from running the tests on all 100 simulated data sets are summarized in Table 2.2. While the number of outbreak time periods correctly identified decreased with the corrected hypothesis test, we see dramatic improvement in the post-outbreak false alarm rate. Notably, using the corrected hypothesis test, the post-outbreak false alarm rate is now controlled at the $\alpha = 0.05$ level.
Figure 2.3: A comparison of NRCUSUM results for the uncorrected test (a) and modified test with known $\lambda_a$ (b) for data set #27. Panels: (a) No Correction: $\lambda_1$ to Detect = 7.5, k = 6.17; (b) Simulation Correction with $\lambda_a = 10$. Vertical axis: CUSUM.

Table 2.2: Average simulation results when $\lambda_a$ is known.

Method               Pre-Outbreak        Outbreak      Post-Outbreak
                     False Alarm Rate    Alarm Rate    False Alarm Rate
Uncorrected          0.053               0.958         0.990
Known $\lambda_a$    0.019               0.782         0.047
2.4.2 Results using $\lambda_1$
In this scenario, we assume $\lambda_0 = 5$ is known, but the true value of $\lambda_a$ is unknown. However, we set $\lambda_a = \lambda_1 = 7.5$ when simulating outbreak data in the modified test. A comparison of the results for the uncorrected and corrected tests for data set #27 is shown in Figure 2.4.
Using the corrected version of the test, there is only one pre-outbreak false positive at t = 26. The outbreak is detected at t = 53, and the alarm is continuously sounded until the process is declared back in control at t = 99. The averaged results from running this version of the test on all simulated data sets are summarized in Table 2.3.
As the choice of $\lambda_1$ we wanted to detect was lower than the true outbreak mean of 10, our simulation underestimates the size of the outbreak and the simulated statistics remain too far below the observed values to effectively control the post-outbreak false alarm rate at the desired level. Despite this, the false alarm rate did see drastic reduction (from 99% alarm rate with no correction to 36.5% with correction), so this method provides improvement if no better estimates of $\lambda_a$ are available.
2.4.3 Results using $\lambda_m$
As in the previous scenario, we assume $\lambda_0 = 5$ is known and $\lambda_a$ is unknown. With this method, when simulating under the alternative hypothesis, we estimate $\lambda_a$ by taking the mean of observed counts over all current and previous outbreak time periods, and then simulating new Poisson counts using that mean as our parameter. A comparison of the results for the uncorrected and corrected tests for data set #27 is shown in Figure 2.5.
In Figure 2.5, we see one false alarm at t = 24 and no false alarms following the end of the outbreak. In exchange, we experience more false negatives, with missing alarms at t = 59, 60, 61, 69 and 72. Averaged results are summarized in Table 2.4.
Figure 2.4: A comparison of NRCUSUM results for the uncorrected test (a) and modified test with $\lambda_a = \lambda_1$ (b) for data set #27. Panels: (a) No Correction: $\lambda_1$ to Detect = 7.5, k = 6.17; (b) Simulation Correction with $\lambda_a = \lambda_1$. Vertical axis: CUSUM.

Table 2.3: Average simulation results with $\lambda_a = \lambda_1$

Method                     Pre-Outbreak        Outbreak      Post-Outbreak
                           False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction              0.053               0.958         0.990
$\lambda_a = \lambda_1$    0.023               0.936         0.365

Figure 2.5: A comparison of NRCUSUM results for the uncorrected test (a) and modified test with $\lambda_a = \lambda_m$ (b) for data set #27. Panels: (a) No Correction: $\lambda_1$ to Detect = 7.5, k = 6.17; (b) Simulation Correction with $\lambda_a = \lambda_m$. Vertical axis: CUSUM.

Table 2.4: Average results for simulation with $\lambda_a = \lambda_m$

Method                     Pre-Outbreak        Outbreak      Post-Outbreak
                           False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction              0.053               0.958         0.990
$\lambda_a = \lambda_m$    0.018               0.614         0.024
The type II error rate is higher with this method compared to simulating with a fixed value of $\lambda_1$, but the post-outbreak type I error rate is effectively controlled at the $\alpha = 0.05$ level, and we require no assumptions about the value of $\lambda_1$.
2.4.4 Simulation with Bootstrap
We once again assume that $\lambda_0 = 5$ is known and $\lambda_a$ is unknown. With this method, rather than attempting to guess or estimate the value of $\lambda_a$, we use a bootstrap method to generate simulated outbreak observations by sampling with replacement from previously observed outbreak time periods.
With this method, we observe a single false alarm at t = 24 and no post-outbreak false alarms. False negatives occur at t = 59, 61, 68 and 72, as seen in Figure 2.6. Average results across 100 data sets for the bootstrap simulation method are presented in Table 2.5. As with the previous method where we used a sample mean to estimate $\lambda_a$, we see an increase in the type II error rate in exchange for control of the type I error rate.
2.4.5 Summary of Simulation Results
In this section, we examined the effectiveness of the proposed modified hypothesis test using simulated data. We considered four methods of simulating data under the alternative hypothesis. We summarize the results across all four demonstrations in Table 2.6.
In all cases, changing the way we simulate data during outbreak time periods has a noticeable effect on the post-outbreak false alarm rate, lowering it considerably. When our simulated statistics underestimate the size of the outbreak, the control on the false alarm rate is weakened; however, when the outbreak is simulated at close to the correct size, the post-outbreak type I error rate is effectively controlled at the chosen level of $\alpha$.
Figure 2.6: A comparison of NRCUSUM results for the uncorrected test (a) and modified test using bootstrap samples (b) for data set #27. Panels: (a) No Correction: $\lambda_1$ to Detect = 7.5, k = 6.17; (b) Simulation Correction with Bootstrap. Vertical axis: CUSUM.

Table 2.5: Average results for bootstrap correction

Method           Pre-Outbreak        Outbreak      Post-Outbreak
                 False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction    0.053               0.958         0.990
Bootstrap        0.021               0.671         0.022

Table 2.6: Summary of average results across all methods

Method                     Pre-Outbreak        Outbreak      Post-Outbreak
                           False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction              0.053               0.958         0.990
Known $\lambda_a$          0.019               0.782         0.047
$\lambda_a = \lambda_1$    0.023               0.936         0.365
$\lambda_a = \lambda_m$    0.018               0.614         0.024
Bootstrap                  0.021               0.671         0.022

2.5 Demonstration: Salmonella data
To further demonstrate the effectiveness of the proposed modified hypothesis test, we used a CUSUM approach to detect a recorded outbreak of Salmonella Newport in Germany in 2011. The data contain the number of Salmonella Newport cases reported across 16 German states between 2004 and 2013. Analysis by Bayer et al. [2014] concluded that the outbreak in question occurred approximately between October 20th and November 8th, 2011, corresponding to weeks 408 through 410 in the data. Figure 2.7 shows the total number of reported Salmonella Newport cases across all 16 states during the study period.
As the state of Saarland reported no cases of Salmonella Newport during the outbreak periods, it was excluded from the demonstration. Three years of data were used to estimate the baseline number of expected cases in each of the remaining 15 states, and a separate CUSUM was calculated for each state individually with size of outbreak to detect set at $\lambda_1 = 1.5\lambda_0$. Each CUSUM test was performed once with an unmodified hypothesis test and again with the modified hypothesis test proposed in Section 2.3.2. The modified and unmodified tests were compared based on how quickly each was able to identify the outbreak and how many alarms occur following week 410, when the outbreak presumably ended.
As the number of reported cases during this outbreak was small in some states, the outbreak was not detected in all places. In Berlin, Hamburg, and North Rhine-Westphalia, the outbreak was detected at week 410. In Brandenburg, Hesse, Saxony, and Lower Saxony, it was detected one week later at week 411, and in Schleswig-Holstein it was detected four weeks later at week 414. This outbreak was not detected in Baden-Württemberg, Saxony-Anhalt, Rhineland-Palatinate, Bremen, Bavaria, or Thuringia. For each state where the outbreak was detected, the corrected methods detected the outbreak at the same time period as the uncorrected method, with the exception of the state of North Rhine-Westphalia, where the outbreak was detected one time period later than the uncorrected method at week 411.
For the states where the outbreak was detected, the number of alarms following the initial detection for the corrected and uncorrected methods is summarized in Table 2.7.
Figure 2.7: Total number of Salmonella Newport cases reported in 16 German states between 2004 and 2014. A clear spike in cases can be seen starting around week 408. Horizontal axis: Time (in Weeks).

Table 2.7: Number of alarms following initial detection time between t = 411 and t = 528 for the four approaches.

Region                    Uncorrected    $\lambda_1$    $\lambda_m$    Bootstrap
Berlin                    78             50             0              0
Brandenburg               116            116            3              3
Hamburg                   7              4              0              0
Hesse                     12             7              0              0
Lower Saxony              90             53             1              2
North Rhine-Westphalia    116            81             8              7
Saxony                    55             27             1              1
Schleswig-Holstein        115            71             4              5

In summary, in 7 of the 8 states where the outbreak was initially detected, the corrected methods retained the same power to detect the outbreak while drastically lowering the post-outbreak false alarm rate. In one state, the power to detect the outbreak was slightly reduced by delaying the detection of the outbreak by one week. However, the corrected tests resulted in dramatically fewer false alarms after the outbreak ended.
2.6 Discussion
When using a non-restarting CUSUM control chart to monitor a disease process, false alarms following the end of a disease outbreak are common. We proposed a modified non-restarting CUSUM solution that utilizes p-values. By changing the way we perform our Monte Carlo simulations and simulating case counts under the alternative hypothesis during time periods where an outbreak has been detected, we create a more realistic set of simulated data for calculating p-values in the presence of an outbreak. We demonstrated the effectiveness of this method on both simulated and real data, and found that when a reasonable estimate of the outbreak intensity can be obtained, this method controls the post-outbreak false alarm rate at the desired level while retaining the ability to quickly detect an emerging outbreak.
The modified test typically has less power to signal an alarm during outbreak time periods. In the simulation study, we observed data streams where an outbreak was initially detected, but the modified NRCUSUM test failed to produce a continuous stream of alarms during the outbreak period. The uncorrected test did not have this problem. However, the corrected CUSUM tests retain the timeliness of the uncorrected CUSUM (with the same time to first detection), while controlling the post-outbreak false alarm rate at the desired level ($\alpha = 0.05$ for simulations) for the bootstrap and estimation methods, and lowering the false alarm rate in the case where we simulate outbreak data at the level of $\lambda_1$ we want to detect. In the data study, the false negative rate is more difficult to determine, as it is not precisely known when the outbreak we aimed to detect truly took place. Given the assumption that the outbreak took place approximately between week 408 and 410, the false negative rate for the corrected CUSUM test was no worse than the uncorrected in 7 out of 8 regions where the outbreak was detected, and delayed the detection of the outbreak by one time period in 1 out of 8 states.
The loss of power induced by the corrected hypothesis test stands as the primary weakness of this method. While not strongly visible in our data study, the effect on power to detect a continuing outbreak can be seen in the simulation studies, with several false negatives introduced in the middle of the outbreak time periods. Care should be taken when using this method to avoid prematurely declaring an outbreak to have ended, for example, by requiring a string of negative results rather than an isolated negative (as successive false negatives are rarer). Power to detect the beginning of an outbreak may also be affected in certain cases, such as we saw with the state of North Rhine-Westphalia in the data study, where detection was delayed by one time period when using the corrected tests. When monitoring a data stream, the time of the first alarm following initialization of the CUSUM is the same between the corrected and uncorrected tests, as the two behave identically until an outbreak is detected. However, the results of the tests may diverge when subsequent outbreaks occur. In the case of North Rhine-Westphalia, which experienced several smaller outbreaks of Salmonella Newport prior to the 2011 outbreak of interest, this resulted in slightly elevated simulated CUSUM streams and an additional false negative period. This effect becomes more pronounced if prior outbreaks were of larger magnitude than future outbreaks, as the simulated CUSUM statistics may remain elevated enough to make detection difficult. One strategy that may help reduce issues of this type is to reset the simulated CUSUM streams to zero when the CUSUM has returned to zero following an outbreak, or to reset the simulated streams to the current CUSUM value after a prespecified number of negative test results in the case where the statistic may not return to zero between outbreaks.
CHAPTER III
CUSCAN: DETECTING EMERGING DISEASE CLUSTERS WITH THE CUMULATIVE SUM OF SCAN STATISTICS
3.1 Introduction
Disease surveillance is an important aspect of public health. Early detection of disease outbreaks is necessary for proper responsive measures to be taken. Prospective disease surveillance methods monitor changes in disease counts or syndromic indicators that precede formal diagnosis [Das et al., 2005] over time and use statistical methods to determine whether monitored levels exceed expected counts enough to determine that an outbreak is present. To protect patient privacy, case data is often reported as counts aggregated over areas such as zip codes, counties, or census tracts. Demographic information, when available, often describes the population at risk for a region rather than the specific individuals comprising the cases. For this reason, surveillance methods are often designed to make use of the limited structure of available data. A variety of methods can be used for surveillance depending on the data available. Regression methods offer a flexible set of tools for disease surveillance, with options for time series regression (such as autoregressive integrated moving average or ARIMA models [Diebold, 2007]) for time-indexed data, spatial regression (such as simultaneous autoregressive, SAR, or conditional autoregressive, CAR models [Waller and Gotway, 2004]) for spatially-indexed data, or some combination of the two. When demographic or environmental covariate data are available, regression models can be updated to get improved estimates of expected counts of cases, allowing for better detection of aberrant cases. Overviews of these and other regression methods can be found in surveillance method review papers such as Unkel et al. [2012], Robertson et al. [2010], or Tsui et al. [2008].
Regression methods can be powerful, but require a trained hand to use effectively, and many of the benefits to regression (such as the inclusion of covariates in modeling) are dependent on data availability. It is common in practice for surveillance data to consist only of disease counts in aggregate regions for patient privacy, with no demographic covariates reported. In these cases, simpler methods can be used that rely only on case counts, spatial location of agglomeration districts, or both. Spatial scanning methods are a prominent family of methods that require only regional case counts and their spatial locations. These scan methods examine the disease incidence rate in a set of regions (often called a window) relative to the rate in the surrounding regions, and declare that a cluster of cases is present in those regions if the rate inside the window is significantly higher than the rate outside. A classic example is Kulldorff's circular scan statistic [Kulldorff, 1997], which uses a likelihood ratio approach to compare the disease rate inside circular-shaped windows to the rate outside the windows. Variations on Kulldorff's scan method exist that allow for the detection of non-circular disease clusters, such as the elliptic scan method [Kulldorff et al., 2006], which uses an ellipse rather than a circle to define windows, and the flexible and restricted flexible scan methods [Tango and Takahashi, 2005, 2012], which search over arbitrarily shaped connected subsets of regions. These statistics can also be generalized to data that are indexed over both space and time by expanding the windows to include multiple time periods, such as the space-time scan statistic [Kulldorff, 2001], which uses cylindrical windows whose height is measured in time. The space-time scan statistic is popular due to its simplicity, as well as its availability in the software package SaTScan [Kulldorff, 2003].
Statistical process control methods have also proven effective for disease surveillance, requiring only case counts over time as data, with key examples being the Shewhart chart [Shewhart, 1931], the exponentially weighted moving average (EWMA) chart [Roberts, 1959], and the cumulative sum (CUSUM) control chart [Page, 1954]. Of particular interest is the CUSUM chart, which has been used for disease surveillance purposes in both the Centers for Disease Control and Prevention's BioSense system and the Department of Defense's Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) system [Tsui et al., 2008]. These methods assume that the data being monitored come from some known distribution when the process is in control (i.e., there is no outbreak), and seek to identify any changes in the data-generating process that could indicate the process has gone out of control (i.e., there is an outbreak). When observed counts differ from what is expected, these charts increase until they cross some predetermined threshold, at which point an alarm is sounded. In their traditional use in monitoring industrial processes, these charts are typically reset following an alarm, as the data-generating process (for example, malfunctioning machinery) would be shut down and restarted in an in-control state. Since natural disease processes cannot be reset in this way, Gandy and Lau [2012] proposed the use of a non-restarting CUSUM for use in public health contexts, as disease outbreaks are naturally transient and return to in-control states on their own over time, and resetting the control chart to zero following an outbreak alarm reduces power to detect ongoing outbreaks and results in a loss of information regarding the length and intensity of the outbreak. Dassanayake and French [2016] is one example of a disease surveillance method that utilizes the non-restarting CUSUM.
One of the strengths of the CUSUM is its ability to detect small but persistent shifts in the mean of a process. However, it may take several time periods of accumulated data for an outbreak to be recognized if its intensity is particularly small, which reduces the timeliness of detection. An outbreak that results in a 5% increase in case counts across the entire study area may not be immediately recognized as such. However, if that 5% increase in total cases is the result of a 20% local increase in cases in a smaller subsection of the study area, then incorporating that information can increase the chances of early detection. For this reason, the use of spatial information when monitoring counts over time has become increasingly common. A prime example of a method that incorporates spatial information into a CUSUM framework is the nearest neighbor CUSUM [Raubertas, 1989], which groups regions with their nearest neighbors and computes a CUSUM of the combined counts. Sonesson [2007] proposed an extension of the nearest neighbor CUSUM called the circular CUSUM, which uses the circular windows defined by the circular scan method [Kulldorff, 1997] to aggregate case counts for the CUSUM. Additionally, the previously mentioned Dassanayake and French [2016] also uses a nearest-neighbor approach to aggregate case counts for the non-restarting CUSUM.
We propose an improved method for incorporating spatial information into a CUSUM framework for disease surveillance. The method takes advantage of the relative simplicity and minimal data requirements of the CUSUM and spatial scan method approaches to outbreak detection. Similar to Sonesson [2007], we search over a set of potential clusters defined by spatial scanning methods. However, while Sonesson's circular CUSUM simply used circular windows to define expanded neighborhoods for the nearest neighbors CUSUM, we instead compute the Poisson scan statistic for each potential cluster, and use a non-restarting CUSUM to monitor how these statistics change over time. By using scan statistics as data, we are able to incorporate more spatial information into the monitoring process, improving our ability to detect emerging spatial disease clusters. This flexible framework allows for potential clusters to be defined in a variety of ways as desired by the practitioner. We demonstrate the proposed method, which we term the CUSCAN, using circular [Kulldorff, 1997] and elliptic [Kulldorff et al., 2006] scan statistics. Other methods for determining potential clusters may be used within this framework as well, assuming the set of potential clusters to search is fixed across time periods.
Given the similarities between the CUSCAN method and other methods based on scan statistics or CUSUM charts, it is natural to consider the strengths and weaknesses of each when deciding the best method to use for a given situation. However, direct comparisons between methods are often difficult, as different authors use different data sets and different measures of performance to test their methods. For example, Dassanayake and French's modified nearest neighbors CUSUM uses an FDR-based approach to controlling false positives while the CUSCAN controls the type I error rate, so performance is not directly comparable between the two. Many proposed surveillance methods also do not offer publicly available software for implementation, making it difficult to apply multiple methods to the same data. As other spatially aggregated CUSUM methods mentioned here (the circular CUSUM, the nearest neighbors CUSUM, and Dassanayake and French's modified nearest neighbors CUSUM) fall into this category, we are not able to make direct comparisons between these methods and the CUSCAN. However, since Kulldorff's space-time scan statistic is publicly available as software through SaTScan, we offer comparisons between this statistic and the CUSCAN. Additionally, Kulldorff et al. [2004] applied the space-time scan statistic to benchmark data that is publicly available, allowing us to make a direct comparison between methods using the same data.
The structure of this chapter is as follows: In Section 3.2, we review properties of spatial scan methods, including the space-time scan method, and the non-restarting CUSUM. In Section 3.3, we describe the proposed CUSCAN methodology in detail. In Section 3.4, we demonstrate the properties of the CUSCAN method using simulated data. In Section 3.5, we apply the CUSCAN to benchmark data from Kulldorff et al. [2004] and provide a power analysis and comparison to the space-time scan method. In Section 3.6, we summarize our conclusions on the overall effectiveness of the CUSCAN method.
3.2 Review of Methods
3.2.1 Spatial Scan Methods
Poisson spatial scan methods are popular for identifying clusters of cases given regional counts in a given study area [Waller and Gotway, 2004]. These methods create a set of potential cluster locations from subsets of the study area, and compare the observed disease rate inside each potential cluster to the rate outside each cluster. This approach was popularized by Kulldorff and Nagarwalla [1995] and Kulldorff [1997].
For a study area consisting of $N$ disjoint spatial regions, we define $n_1, \ldots, n_N$ to be the at-risk population in each region, with a total population of $n_+ = \sum_{i=1}^{N} n_i$. Let $Y_1, \ldots, Y_N$ be the associated Poisson case counts in each region, with $y_+ = \sum_{i=1}^{N} y_i$. In the absence of a disease cluster, we would expect the disease risk at each location, $r_i = y_i / n_i$, to be consistent across the study area and equal to the global risk, $r = y_+ / n_+$. The expected count for each region in the study area, $E_i$, is computed by $E_i = r n_i$.
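For a potential cluster of contiguous regions $Z \subset \{1, 2, \ldots, N\}$, we define $Y_{in} = \sum_{i \in Z} Y_i$, $Y_{out} = \sum_{i \notin Z} Y_i$, $E_{in} = \sum_{i \in Z} E_i$, and $E_{out} = \sum_{i \notin Z} E_i$. We then calculate the Poisson scan statistic for $Z$ as:
$$S_Z = \left(\frac{Y_{in}}{E_{in}}\right)^{Y_{in}} \left(\frac{Y_{out}}{E_{out}}\right)^{Y_{out}} I\left(\frac{Y_{in}}{E_{in}} > \frac{Y_{out}}{E_{out}}\right). \quad (3.1)$$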
For a set of potential clusters $\mathcal{Z}$, the test statistic for the Poisson spatial scan test is computed by:
$$\max_{Z \in \mathcal{Z}} \{S_Z\}. \quad (3.2)$$
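The scan statistic and its maximum over a set of candidate zones can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the statistic is computed on the log scale to avoid overflow for large counts (so a zone with no excess risk, where the statistic is zero, is reported as negative infinity), each zone is assumed to be a proper subset of the regions given as a list of indices, and all function names are hypothetical.

```python
import numpy as np

def expected_counts(y, pop):
    """E_i = r * n_i with global risk r = y_+ / n_+."""
    y, pop = np.asarray(y, float), np.asarray(pop, float)
    return y.sum() / pop.sum() * pop

def _ylog(y, e):
    # y * log(y / e), using the convention 0 * log(0) = 0.
    return 0.0 if y == 0 else y * np.log(y / e)

def log_scan_statistic(y, e, zone):
    """Log of the Poisson scan statistic for one zone (list of region indices)."""
    y, e = np.asarray(y, float), np.asarray(e, float)
    mask = np.zeros(len(y), dtype=bool)
    mask[list(zone)] = True
    y_in, y_out = y[mask].sum(), y[~mask].sum()
    e_in, e_out = e[mask].sum(), e[~mask].sum()
    if y_in / e_in <= y_out / e_out:
        return -np.inf   # indicator is zero: no excess risk inside the zone
    return _ylog(y_in, e_in) + _ylog(y_out, e_out)

def max_log_scan_statistic(y, e, zones):
    """Largest log statistic over all candidate zones and the zone attaining it."""
    stats = [log_scan_statistic(y, e, z) for z in zones]
    best = int(np.argmax(stats))
    return stats[best], zones[best]
```

Working on the log scale does not change which zone attains the maximum, since the logarithm is monotone.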
The significance of this test statistic is typically assessed through simulation as follows: let $S$ be the test statistic from the observed data. We simulate $N_{sim}$ data sets under the null hypothesis of constant risk, $Y_i \sim \text{Poisson}(E_i)$, $i = 1, \ldots, N$, and compute the maximum scan statistic $S^{(j)}$ for each simulated data set $j = 1, \ldots, N_{sim}$. We calculate the p-value for the test statistic as the proportion of maximum test statistics, including the observed statistic $S$, that are at least as large as the observed statistic:
$$p = \frac{1 + \sum_{j=1}^{N_{sim}} I\left(S^{(j)} \geq S\right)}{1 + N_{sim}}. \quad (3.3)$$
There are many ways to determine $\mathcal{Z}$, the set of potential clusters. The choice of potential clusters frequently distinguishes one Poisson spatial scan method from another. The two methods specifically used in this study are:
1. Circular scan method [Kulldorff, 1997]. Beginning at the centroid of each region in the study area, expand a circular window to include nearby regions until the population inside the largest window reaches some prespecified upper bound, commonly 50% of the total at-risk population. A region is included in a potential cluster if its centroid is within the circular window.
2. Elliptic scan method [Kulldorff et al., 2006]. Similar to the circular scan statistic, but instead of circular windows, a series of elliptical windows centered at each region are used. These ellipses are defined in terms of their shape (c = a/b, the ratio of the major and minor axes) and angle ($\theta$, the angle between the major axis and horizontal axis). Different combinations of shapes and angles may be used, and each set of ellipses is increased in size along the major axis until the population within the window reaches some prespecified upper bound.
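The circular window construction can be sketched as below. This is a simplified illustration with assumptions of its own: centroids are planar coordinates with Euclidean distance, regions exactly equidistant from a center are added one at a time rather than together, and the name `circular_zones` and the `max_frac` argument are hypothetical.

```python
import numpy as np

def circular_zones(coords, pop, max_frac=0.5):
    """Candidate zones from circles grown around each region centroid.

    Each zone is a sorted tuple of region indices; a zone is kept only while
    its population does not exceed max_frac of the total population.
    """
    coords, pop = np.asarray(coords, float), np.asarray(pop, float)
    limit = max_frac * pop.sum()
    zones = set()
    for center in coords:
        # Distance from this centroid to every centroid (including itself).
        dist = np.hypot(*(coords - center).T)
        order = np.argsort(dist, kind="stable")
        cum_pop = np.cumsum(pop[order])
        for m in range(1, len(order) + 1):
            if cum_pop[m - 1] > limit:
                break
            zones.add(tuple(sorted(order[:m].tolist())))
    return sorted(zones)
```

Storing zones in a set removes duplicates that arise when circles grown from different centers cover the same regions.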
The flexibly-shaped scan method [Tango and Takahashi, 2005], which searches all connected subsets of regions whose total population is below the specified upper bound, can also be used with the CUSCAN. However, the number of potential clusters increases exponentially as the number of regions in the study area increases, making it a computationally demanding method that may not be ideal when multiple data sets need to be examined or when a study area is particularly large. The computation time necessary can be reduced in most cases by using the restricted flexible scan statistic [Tango and Takahashi, 2012], which only searches connected subsets consisting of regions where the observed incidence rate is higher than some tolerance; however, this means the set of potential clusters varies over time depending on observed counts, and so the restricted flexible scan statistic cannot be used with the CUSCAN. For these reasons, we do not provide implementation of the flexible scan statistic for the CUSCAN in this paper.
Kulldorff [2001] extended the circular scan method to create a space-time scan method useful for prospective surveillance on time-indexed data. The space-time scan statistic is computed as in (3.2), except that the set of potential clusters $Z$ consists of cylinders whose bases are the original circular windows and whose height is increased incrementally to include multiple sequential time periods of data within the window. The most likely cluster then consists of a set of spatial regions as well as a set of time periods where evidence for space-time clustering is highest. The space-time scan method can also be used to detect temporal-only clusters when case counts are aggregated across the entire study area at once.
3.2.2 The Non-Restarting CUSUM
In addition to the spatial information available through the spatial scan method, we utilize the cumulative sum control chart (CUSUM) to monitor changes in the data over time. The CUSUM detects shifts in the mean of a process by accumulating deviations from the expected mean over time, allowing detection of large, sudden changes as well as smaller, sustained changes in the mean.
For a single stream of data, the CUSUM statistic at time t is defined by the following recursive formula:
$$C_0 = 0, \qquad C_t = \max\{0,\; C_{t-1} + Y_t - k\}, \qquad (3.4)$$
where $C_{t-1}$ is the CUSUM statistic from the previous time period, $Y_t$ is the observation at the current time period, and $k$ is a number chosen based on the distribution of $Y$ to slow the growth of the CUSUM during in-control time periods and control the false alarm rate. When no change has occurred and the process is in control, $C_t$ tends to remain near zero. When a shift in the mean occurs and the process is out of control, $C_t$ tends to increase rapidly. When monitoring a process over time, a shift in the mean is detected when $C_t$ exceeds some predetermined threshold, $h$.
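As a small illustration of the recursion in (3.4), the sketch below computes the CUSUM path for a short, made-up data stream. The function name `cusum` and the example values are assumptions for illustration only.

```python
def cusum(observations, k):
    """CUSUM path with C_0 = 0 and C_t = max(0, C_{t-1} + Y_t - k), never reset."""
    c, path = 0.0, []
    for y in observations:
        c = max(0.0, c + y - k)  # accumulate the deviation above k, floored at zero
        path.append(c)
    return path

# Hypothetical stream: three quiet periods, two elevated periods, one quiet period.
path = cusum([1, 2, 1, 5, 6, 1], k=2)
```

The statistic stays at zero while observations are at or below $k$, climbs during the elevated periods, and only decays gradually afterwards.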
In addition to thresholds, the presence of a shift can be detected using p-values to determine whether $C_t$ is significantly higher than we would expect if the process were in control. As with the spatial scan statistic, we compute this p-value using simulations. $N_{sim}$ in-control data streams are simulated from a Poisson distribution with in-control mean $\lambda_0$, and the CUSUM statistic $C_t^{(i)}$ is computed for each data stream $i$. The p-value for $C_t$ is then the proportion of CUSUM statistics, including $C_t$ itself, that are at least as large as $C_t$:
$$p_t = \frac{1 + \sum_{i=1}^{N_{sim}} I\left(C_t^{(i)} \geq C_t\right)}{1 + N_{sim}} \qquad (3.5)$$
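The simulation-based p-value in Equation (3.5) reduces to a simple count. The following sketch assumes the simulated in-control CUSUM statistics at time $t$ are already available as a list; the function name and the numbers are hypothetical.

```python
def monte_carlo_p_value(observed, simulated):
    """(1 + number of simulated statistics >= observed) / (1 + N_sim)."""
    exceed = sum(1 for c in simulated if c >= observed)
    return (1 + exceed) / (1 + len(simulated))

# Nine made-up in-control statistics; three are at least as large as the observed 4.0.
p = monte_carlo_p_value(4.0, [0.0, 1.5, 4.0, 6.2, 0.3, 0.0, 2.1, 0.0, 5.0])
```

Counting the observed statistic in both the numerator and the denominator means the p-value can never be exactly zero, whatever the number of simulations.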
In the industrial process control setting where the CUSUM was developed, an out-of-control alarm would result in the offending process being shut down and restarted in an in-control state. For this reason, the CUSUM statistic is traditionally reset to zero once a shift has been detected. However, when considering use in a public health setting, it is not possible to "shut down" a disease process, so resetting the CUSUM is not desirable. Additionally, while an industrial process may remain out of control until it is shut down and adjusted, disease outbreaks are transient, resolving themselves over time. For these reasons, Gandy and Lau [2012] suggest the use of a non-restarting CUSUM (NRCUSUM) for disease surveillance applications, which we utilize in our proposed methodology. The non-restarting CUSUM is identical in application to the standard CUSUM, except that the chart is not reset to zero when an outbreak is detected.
3.3 Proposed Methodology
3.3.1 The CUSCAN Method
For the CUSCAN method, we assume the following: that our data are regional counts, that the number of cases in each region is distributed as a Poisson random variable, and that we have multiple time periods of data and would like to perform a hypothesis test for each period. We propose combining the purely spatial scan statistic with the non-restarting CUSUM to create a surveillance method that includes both spatially and temporally aggregated information. We will refer to this method as the CUSCAN, as it is the cumulative sum of scan statistics. By combining these two methods, we create a surveillance system that can both quickly detect a new emerging disease cluster and indicate its spatial location.
To implement this combined methodology, we perform the following:
1. For each time period, compute a scan statistic for each potential cluster using Equation (3.1).
2. Compute the CUSUM statistic in Equation (3.4) for each potential cluster over time.
3. At each time period, take the maximum of the CUSUM streams for that particular time. This is our CUSCAN statistic, and the regions that produced it comprise our most likely cluster of cases.
4. Assess the significance of the statistic and determine whether there is evidence of an outbreak. This is done by applying steps 1-3 to data simulated under the null hypothesis and computing the p-value for the test statistic found in step 3.
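Steps 2 and 3 above can be sketched as follows, taking the per-cluster scan statistics from step 1 as given. This is an illustrative sketch under stated assumptions: the function name `cuscan`, the nested-list input layout, and the toy values are not from the original analysis, and the significance test in step 4 is omitted.

```python
def cuscan(scan_stats, k):
    """Return (CUSCAN statistic, index of most likely cluster) for each time period.

    scan_stats[t][z] is assumed to hold the scan statistic of potential
    cluster z at time period t, with the same set of clusters at every t.
    """
    n_zones = len(scan_stats[0])
    streams = [0.0] * n_zones  # one non-restarting CUSUM stream per potential cluster
    results = []
    for row in scan_stats:
        streams = [max(0.0, c + y - k) for c, y in zip(streams, row)]
        best = max(range(n_zones), key=streams.__getitem__)
        results.append((streams[best], best))
    return results

# Two hypothetical potential clusters observed over three time periods.
out = cuscan([[1.0, 0.5], [0.5, 3.0], [0.5, 4.0]], k=1.0)
```

Because every stream is indexed by a fixed potential cluster, the sketch relies on the set of potential clusters staying the same over time, which is why window sets that change with the observed counts do not fit this scheme directly.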
3.3.2 Selection of k
When monitoring a data stream with a CUSUM control chart, two user-specified parameters are required: a rejection threshold $h$ and a tuning parameter $k$. When the observed data are known to come from a member of the exponential family, it is simple to find optimal values of $k$ and $h$ [Hawkins and Olwell, 1998]. However, the data stream we are monitoring (the spatial scan statistics) does not follow any specific known distribution, so these methods cannot be used.
We can eliminate the need for the rejection threshold $h$ by performing a p-value based hypothesis test. With a p-value approach, the type I error rate is controlled at the desired significance level, $\alpha$, without the need to limit the CUSUM statistic below a given value during in-control periods. For this reason, the choice of $k$ is more important for computational efficiency and the power to detect an outbreak than for controlling the false alarm rate. If no $k$ were specified in the CUSUM formula given in Equation (3.4), i.e., $k = 0$, the false alarm rate would be controlled by p-value testing, but the statistic would grow without bound over time, as the data values $Y_t$ are nonnegative. This continued growth would eventually become taxing on computational resources and would make it difficult to visually examine the evolution of the CUSCAN statistic over time. Thus, it is important to specify a $k$ to constrain the CUSUM statistic near zero and prevent it from growing without bound during in-control time periods. Likewise, the choice of $k$ has a direct effect on the power to detect an outbreak; an unnecessarily large value of $k$ prevents the CUSUM statistic from growing above zero even during out-of-control times. So while the choice of $k$ is less important in the p-value model than in the threshold model, it is still necessary to put some care into its selection. If no analytic equation for $k$ exists, we want to be able to select a $k$ such that the following are true: (i) the value of $k$ is of appropriate magnitude relative to the data, (ii) the value of $k$ is constant, and (iii) $k$ can be selected in a computationally manageable way. We propose that $k$ be determined using available in-control baseline data and in-control simulations, which will ensure that $k$ is on the correct scale relative to the in-control statistics.
We propose selecting $k$ to control the maximum number of time periods the CUSCAN statistic is expected to remain positive during in-control time periods within some specified tolerance level $\tau$, that is, the value of $k$ that satisfies
$$\min_k : P\left(C_t = 0 \text{ for some } t \in \{1, \dots, \delta_0\}\right) > \tau. \qquad (3.6)$$
This method is similar to that proposed by Chatterjee and Qiu [2009], which used bootstrapping methods to determine a value of $k$ that controlled the average sprint length, i.e., the average amount of time the CUSUM remains positive before returning to zero. Since the distribution of our baseline data is assumed to be Poisson, we are able to use standard simulation methods rather than needing the distribution-free methodology described by Chatterjee and Qiu [2009]. Additionally, by choosing $k$ to control the maximum sprint length rather than the average sprint length, fewer time periods of data need to be simulated and $k$ can be chosen with relative computational ease. Define $\delta_0$ to be the maximum sprint length under the null, i.e., the number of time periods for the CUSCAN statistic to return to zero with no outbreak present. We perform a grid search within a range of values chosen based on the size of the null scan statistics. More specifically, $k$ can be selected using the following algorithm:
1. Select the maximum run length until return to zero, $\delta_0$, and the tolerance level, $\tau$.
2. For each time period where baseline data is available, compute the spatial scan statistic for each potential cluster, and take the maximum at each time period.
3. Select the range of potential $k$ values to consider based on the size of the maximum statistics.
4. For each potential value of $k$:
(a) Simulate $N_{sim}$ data sets under the null hypothesis for a total of $\delta_0$ time periods.
(b) For each simulated data set, compute the spatial scan statistics and take the maximum scan statistic at each time period.
(c) For the selected value of $k$, compute the CUSUM statistics for the maximum scan statistics.
(d) Calculate the proportion of CUSUM streams that returned to zero during the simulated $\delta_0$ time periods.
5. Select the smallest value of $k$ that brings the proportion of streams that return to zero within $\delta_0$ time periods closest to the user-defined tolerance level, $\tau$.
This selection of $k$, based on the distribution of maximum statistics during in-control periods, ensures that the CUSCAN statistic will remain bounded near zero for all potential clusters while also allowing the CUSCAN to grow above zero when the scan statistics increase in the presence of an outbreak.
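The grid search in steps 4 and 5 can be sketched as below, assuming the simulated null maximum scan statistics are already available as one short stream per simulated data set. The function names, the toy streams, and the reuse of a single set of simulated streams across all candidate values of $k$ are simplifying assumptions for illustration.

```python
def proportion_returning_to_zero(max_stat_streams, k):
    """Share of streams whose CUSUM equals zero at some time period in the stream."""
    hits = 0
    for stream in max_stat_streams:
        c = 0.0
        for y in stream:
            c = max(0.0, c + y - k)
            if c == 0.0:
                hits += 1
                break
    return hits / len(max_stat_streams)

def select_k(max_stat_streams, k_grid, tau):
    """Smallest k on the grid whose return-to-zero proportion is closest to tau."""
    props = {k: proportion_returning_to_zero(max_stat_streams, k) for k in k_grid}
    best_gap = min(abs(p - tau) for p in props.values())
    return min(k for k, p in props.items() if abs(p - tau) == best_gap)

# Three made-up streams of null maximum statistics, each two time periods long.
streams = [[1, 1], [3, 3], [5, 1]]
k = select_k(streams, k_grid=[2, 4, 6], tau=0.95)
```

In this toy case both 4 and 6 bring every stream back to zero, so the smaller value is chosen, which keeps $k$ from being larger than the return-to-zero requirement needs.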
When deciding on initial values for $k$ and $\delta_0$, it is important to consider the effect these choices have on the final value of $k$. For initial values of $k$, we suggest using quantiles from the maximum scan statistics to determine a range of potential values. For example, the median of the maximum statistics may be chosen as the lower bound on $k$, so that even if there are many time periods with large null statistics, we can reasonably expect the CUSCAN statistic to increase in no more than about 50% of time periods. Likewise, the upper bound on $k$ need not be larger than the largest maximum scan statistic, as the probability of observing statistics above this level is very small. For selecting $\delta_0$, it is important to recognize the effect this value has on the selected value of $k$. Smaller values of $\delta_0$ will result in larger values of $k$, keeping the CUSCAN statistic near zero until there is a large increase in the scan statistics, as would be seen in the case of larger spatial clusters. Likewise, larger values of $\delta_0$ will result in smaller values of $k$, allowing the CUSCAN statistic to rise above zero in the presence of only a small increase in the scan statistics, potentially allowing for the detection of smaller clusters.
3.4 Demonstration: Simulated Data based on New York Leukemia Data
In order to demonstrate the potential of the CUSCAN method, we simulate data based on the New York Leukemia data provided by Waller and Gotway [2004]. In this data set, counts of leukemia cases were recorded in regions across upstate New York. The original data contain 592 cases across 281 regions, with a total at-risk population of 1,057,673. Using this data to provide a realistic neighborhood map and baseline incidence rate, we generated three sets of time series data representing three cluster models.
The global leukemia incidence rate was estimated from the provided data to be approximately $r = 5.597 \times 10^{-4}$. This rate was used as the global incidence rate under the null hypothesis of no outbreak. The expected case count in region $i$ with population $n_i$ is computed as $E_i = r n_i$. When an outbreak was present in the simulated data, the local incidence rate inside the simulated cluster was set to twice the null incidence rate. Three clusters of varying sizes were simulated: cluster A contains 11.3% of the total population of the study area over 31 spatial regions, cluster B contains 4.1% of the population over 10 regions, and cluster C contains 5.9% of the population over 24 regions (Figure 3.1). In total, 100 data sets were simulated for each cluster model, each containing 20 periods of null data followed by a 10-period outbreak.
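As a quick arithmetic check, the stated rate is consistent with dividing the total case count by the total at-risk population reported above. The region population of 5,000 in the second line is a hypothetical value chosen only to illustrate the expected-count formula.

```latex
r \approx \frac{592}{1{,}057{,}673} \approx 5.597 \times 10^{-4},
\qquad
E_i = r\,n_i \approx 5.597 \times 10^{-4} \times 5{,}000 \approx 2.8
\quad \text{(null)}, \qquad
2\,r\,n_i \approx 5.6 \quad \text{(inside an outbreak cluster)}.
```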
When applying the CUSCAN method to these simulated clusters, we use the circular scan statistic as our base with a population upper bound of 50% of the total population. For the CUSUM portion of the method, we used the k-selection procedure described in Section 3.3.2. We used 999 simulated null data sets, with a starting range between the 50th and 90th percentiles of the null maximums, using $\delta_0 = 5$ time periods with a tolerance of $\tau = 0.95$, i.e., the CUSUM returned to zero within 5 time periods in 95% of simulations. This resulted in a value of $k = 6.061$.

Figure 3.1: The three simulated clusters on the map of 281 regions in upstate New York.
Significance of the CUSCAN statistic was determined via simulation. Conditioning on the observed number of cases in the current time period, we simulate 999 null data sets from a multinomial distribution with probabilities based on the expected case count in each region. The p-value is then calculated as in Equation (3.5). A significance threshold of $\alpha = 0.05$ was used for all hypothesis testing decisions. If the CUSCAN statistic at time $t$ is determined to be statistically significant, then we say there is an alarm and declare an outbreak is present at $t$. The location of the outbreak is recorded as the most likely cluster, determined as described in Section 3.3.1.
To analyze the performance of the CUSCAN method, we use the following metrics:
1. Basic power: the basic power at time t is the proportion of data sets for which the outbreak was detected by time t.
2. Delay: the difference between the time of detection $t^*$ and the true start of the outbreak $t_s$, i.e., delay $= t^* - t_s$. If the outbreak was detected in the first time period it was present, delay = 0.
3. Spatial precision: the precision is the proportion of the identified cluster contained in the true cluster. Let $A_t$ be the identified cluster at time $t$, $A$ be the true cluster, and $n(Z)$ the population of the set of regions $Z$. The precision at time $t$ is defined based on population and is equal to 1 when $A_t \subseteq A$:
$$\text{precision}_t = \frac{n(A_t \cap A)}{n(A_t)}$$
4. Spatial recall: the recall is the proportion of the true cluster contained in the identified cluster. The recall at time $t$ is likewise defined based on population and is equal to 1 when $A \subseteq A_t$:
$$\text{recall}_t = \frac{n(A_t \cap A)}{n(A)}$$
5. False alarm rate: the false alarm rate at time t is the proportion of data sets that produced an alarm at time t when no outbreak was present.
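The population-weighted precision and recall in items 3 and 4 can be computed directly from sets of region identifiers. The sketch below is illustrative only; the function name, the region labels, and the populations are hypothetical.

```python
def precision_recall(identified, true_cluster, population):
    """Population-weighted precision and recall of an identified cluster.

    identified and true_cluster are sets of region ids; population maps
    each region id to its population.
    """
    def n(regions):
        return sum(population[i] for i in regions)

    overlap = n(identified & true_cluster)
    return overlap / n(identified), overlap / n(true_cluster)

# Hypothetical example: the identified cluster misses region 0 and adds region 3.
pop = {0: 100, 1: 200, 2: 300, 3: 400}
prec, rec = precision_recall({1, 2, 3}, {0, 1, 2}, pop)
```

Here the overlap holds 500 people, so precision is 500/900 and recall is 500/600; weighting by population means that wrongly including a populous region costs more precision than including a sparsely populated one.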
When a cluster detection method is performing well, we expect values of power, precision, and recall close to 1, as well as short delay and a false alarm rate at or below the level of significance ($\alpha = 0.05$).
Table 3.1 summarizes the average power, delay, and false alarm rate of the simulation study. We see that in general, the power is very high and the delay very low across all outbreak times (the average delay is far less than one day to detection from start of outbreak). The power for cluster B (10 regions, 4.1% population) is somewhat lower in the first day of the outbreak (0.69), but increases rapidly to 0.94 by day two. Clusters A and C, containing more regions or a larger proportion of the at-risk population (31 regions, 11.3% population for A and 24 regions, 5.9% population for C), were much easier to detect, with first-day power of 0.98 and 0.89, respectively.
Figure 3.2 shows plots of average spatial precision and recall results for each day of the outbreak. Spatial precision starts at 0.89 for cluster A, 0.68 for cluster B, and 0.82 for cluster C, and increases rapidly to 0.93, 0.81, and 0.93 by the second time period and to 0.96, 0.94, and 0.98 by the fifth. Recall starts at 0.91 for cluster A, 0.72 for cluster B, and 0.79 for cluster C and increases to 0.94, 0.84, and 0.91, respectively, on the second day of the outbreak and to 0.98, 0.95, and 0.98 by the fifth.
Figure 3.3 demonstrates the behavior of the CUSCAN statistic for three randomly selected data sets: data set #19 for cluster A, data set #70 for cluster B, and data set #57 for cluster C. The CUSCAN statistic behaves as expected for a non-restarting CUSUM, with larger clusters A and C causing a larger increase in the statistic during outbreak periods relative to smaller cluster B.
43
Precision
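Table 3.1: Average power, detection delay, and false alarm rates for the CUSCAN method.
| Cluster | Power, day 21 | Power, day 22 | Power, day 23 | Power, days 24-30 | Delay (days) | False alarm rate |
|---|---|---|---|---|---|---|
| A | 0.98 | 1.00 | 1.00 | 1.00 | 0.02 | 0.054 |
| B | 0.69 | 0.94 | 0.98 | 1.00 | 0.39 | 0.052 |
| C | 0.89 | 1.00 | 1.00 | 1.00 | 0.11 | 0.045 |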
[Figure 3.2 panels: (a) precision and (b) recall, vertical axis from 0.50 to 1.00, horizontal axis Time (periods 21 to 30), one line each for clusters A, B, and C.]
Figure 3.2: Average results for (a) precision and (b) recall for each cluster during the outbreak period.
Figure 3.4 demonstrates the difference between the true cluster and the most likely cluster identified at t = 30 for the same randomly selected data set (#19) for cluster A.
The most likely clusters identified at t = 30 for clusters B and C were identical to the true simulated clusters.
In summary, the CUSCAN statistic with circular base has high power to quickly detect an emerging outbreak, with higher initial power when the outbreak covers more regions or a higher proportion of the at risk population, and a rapid increase in power as the outbreak persists for multiple time periods. The CUSCAN is also able to accurately detect the regions in space where the outbreak occurs, with both spatial precision and spatial recall starting high and increasing rapidly as more outbreak data becomes available.
3.5 Demonstration and Power Assessment: New York City Benchmark Data
3.5.1 Data Description
Kulldorff et al. [2004] provide a set of benchmark data for testing the effectiveness of space-time disease surveillance methods. They then conducted a power analysis on the benchmark data using the circular space-time scan statistic implemented in SaTScan. By using the same set of data, we are able to directly compare the power of the CUSCAN to the space-time scan statistic when detecting emerging clusters of this form.
The benchmark data use a map of New York City containing 176 zip codes and a total at-risk population of 8,003,510. There are 17 different localized disease outbreaks simulated within this data set, along with one citywide outbreak, for a total of 18 different outbreak scenarios. The locations and sizes of the localized clusters are shown in Figure 3.5. Five of the clusters consist of a single zip code, five consist of small (5-10 zip codes) clusters, five consist of outbreaks in entire boroughs of the city, and two consist of irregularly shaped (elongated) clusters on the edge of the city.
For each outbreak model, the data are simulated with 30 periods of no outbreak, followed by one, two, or three days where an outbreak is present, for a total of 31, 32, or 33 days. The data are simulated by randomly distributing an expected 100 cases per day (so 3,100 total cases for the 31-day data set, for example) across all regions and all days, with different days given equal probability of observing a case, and regions given probabilities proportional to their total population (i.e., each person in the city is considered equally likely to become a case). On days when an outbreak is present, the region(s) within the assigned outbreak have an increased relative risk of becoming a case. Two different outbreak scenarios were simulated for each cluster, one with a medium increased relative risk and one with a high increased relative risk.

Figure 3.3: CUSCAN statistic with alarms for three randomly selected data sets. (a) Cluster A (11.3% population), data set #19. (b) Cluster B (4.1% population), data set #70. (c) Cluster C (5.9% population), data set #57.

Figure 3.4: Identified cluster at time t = 30 for cluster A, data set #19. The precision is 0.959 and the recall is 0.992.
In total, the data contain 18 different clusters, three different outbreak lengths, and two different increased risks, for a total of 108 different outbreak scenarios. For each of these scenarios, 1,000 outbreak data sets were simulated. In addition to these cluster models, Kulldorff et al. also provide three sets of null data to use for hypothesis testing, with 31, 32, or 33 days of data with no outbreak, each with 9,999 simulated data sets. The null data were generated as described above, but with no excess risk assigned to any of the city regions on any of the days.
3.5.2 Analysis Design
For our power analysis, we chose to focus on the 31-day data sets for our comparison between the CUSCAN and the space-time scan method. We consider these data sets the most informative, as they represent our ability to rapidly detect a new outbreak. We mirror the power analysis performed with the space-time scan statistic so that our results are directly comparable. We describe the analysis briefly below. Additional details may be found in the original work [Kulldorff et al., 2004].
Power analysis for the space-time scan statistic was performed on the 17 localized cluster data sets. A circular base, a maximum temporal window size of 3 days, and a population upper bound of 50% were used to determine the set of potential clusters. A significance level of $\alpha = 0.05$ was used for all hypothesis testing decisions. To account for the possibility of citywide outbreaks, temporal-only clusters were included in the analysis. These temporal-only clusters were also computed using the space-time scan statistic, with 100% of the population included in the window at each time step. Significance was assessed using critical values, determined by applying the space-time scan statistic to 9,999 null data sets and identifying the 500th largest maximum statistic for each time period. The power was calculated as the proportion of maximum statistics from the 1,000 outbreak data sets that were higher than this critical value. No adjustments were made for multiple testing.

Figure 3.5: Simulated outbreak clusters in New York City. (a) Single zip codes (solid) and small clusters (solid+shaded). (b) Large clusters, whole borough. (c) Irregularly shaped clusters.
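The critical-value approach described above amounts to ranking the null maxima and counting exceedances. The sketch below is a minimal illustration with hypothetical function names and made-up inputs; it is not SaTScan's implementation.

```python
def critical_value(null_maxima, alpha=0.05):
    """Null maximum at rank alpha * (N + 1) from the top (the 500th largest for N = 9,999)."""
    rank = max(1, round(alpha * (len(null_maxima) + 1)))
    return sorted(null_maxima, reverse=True)[rank - 1]

def power(outbreak_maxima, crit):
    """Proportion of outbreak maximum statistics strictly above the critical value."""
    return sum(m > crit for m in outbreak_maxima) / len(outbreak_maxima)

# Made-up example: 19 null maxima, so the critical value is the largest one.
crit = critical_value(list(range(1, 20)))
pw = power([18, 20, 25, 19], crit)
```

With 9,999 null data sets and a significance level of 0.05, the rank evaluates to 500, matching the 500th largest maximum statistic described in the text.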
Kulldorff et al. also provide power from five selected outbreak models with the following changes to the above parameters: with temporal-only clusters excluded, with a population upper bound of 5% rather than 50%, setting the maximum temporal window size to 1 or 7 days rather than 3 days, and adjusting for multiple testing so that only one false alert would be expected per year. Power results from the temporal-only analysis are also included for these select data sets. In addition to three local cluster outbreak models (one single zip code, one small cluster, and one whole borough cluster), these models also include a citywide outbreak (whole city) and a null model (no outbreak). However, as the data set used to compute power in the no-outbreak setting for the space-time scan statistic is not specified, nor is any specific non-outbreak test data set provided in the benchmark data, we exclude this setting from our comparisons.
To make our analysis as comparable as possible to the analysis performed by Kulldorff et al., we likewise used circular windows with a population upper bound of 50% to determine potential clusters. A significance level of $\alpha = 0.05$ was used for all hypothesis testing decisions. When assessing significance, we chose to apply the CUSCAN exactly as designed, using p-values rather than critical values. Using the null data provided, we use the procedure described in Section 3.3.2 with $\delta_0 = 5$ days and $\tau = 0.95$ to select a value of $k$ for use in the CUSUM formula given in Equation (3.4). These parameters led us to choose $k = 6.115$ for the CUSCAN.
Unlike the space-time scan statistic, the CUSCAN is not a natural choice for use on temporal-only clusters, as the spatial scan statistic cannot be calculated when all regions are included in a potential cluster. Instead, we simply use the non-restarting CUSUM as described in Section 3.2.2 to search for temporal clusters by monitoring total case counts for the city. We assume that daily counts are distributed as a Poisson random variable. As the distribution of the data is assumed to be known in this case, we can compute $k$ in the CUSUM equation (3.4) analytically as follows [Hawkins and Olwell, 1998]:
$$k = \frac{\lambda_1 - \lambda_0}{\ln(\lambda_1) - \ln(\lambda_0)},$$
where $\lambda_0$ is the in-control mean estimated from null data, and $\lambda_1$ is the smallest shifted mean we want to detect. We estimate the in-control mean $\lambda_0$ using the first 15 days of data from each cluster model as a baseline. We consider values of $\lambda_1$ corresponding to a 10%, 20%, or 30% increase in the mean number of cases.
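The analytic reference value for Poisson counts is a one-line computation. In the sketch below the function name is hypothetical, and the in-control mean of 100 is an illustrative value based on the expected 100 cases per day in the benchmark data rather than an estimate from any particular baseline.

```python
import math

def poisson_cusum_k(lam0, lam1):
    """k = (lam1 - lam0) / (ln(lam1) - ln(lam0)) for in-control mean lam0 and shifted mean lam1."""
    return (lam1 - lam0) / (math.log(lam1) - math.log(lam0))

lam0 = 100.0  # illustrative in-control mean
for increase in (0.10, 0.20, 0.30):
    lam1 = lam0 * (1 + increase)
    print(f"{increase:.0%} increase: k = {poisson_cusum_k(lam0, lam1):.2f}")
```

The resulting $k$ always lies between the in-control mean and the shifted mean, so the CUSUM tends to drift back toward zero when counts are at the in-control level and to grow once they reach the shifted level.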
We also provide additional power results for the CUSCAN for all localized clusters with the following modifications: (1) with temporal-only analysis excluded, and (2) with a population upper bound of 5% instead of 50%.
For the outbreak models including the irregularly shaped clusters, we calculate the power of the CUSCAN using the elliptic scan statistic. For these models, in addition to the circular regions from the circular scan statistic, we consider elliptic windows with shape S = 1.5, 2, 3, 4, and 5, with 4, 6, 9, 12, and 15 different equally spaced rotational angles, respectively.
Finally, we also provide spatial precision and spatial recall calculations for the CUSCAN. As SaTScan does not include these metrics as part of the power analysis it can perform, we are not able to compare these results to the space-time scan statistic.
3.5.3 Results
A comparison of power between the CUSCAN and the space-time scan statistic (STSS) for the primary analysis (circular windows, 50% population upper bound) can be found in Table 3.2. A comparison between the CUSCAN and space-time scan statistic for the four select models described above and matching alternate parameter specifications can be found in Table 3.3. Full power results for the CUSCAN with temporal-only analysis excluded and with population upper bound decreased to 5% can be found in Table 3.4, and results with the population upper bound set to 5% and temporal-only analysis included can be found in Table 3.5. Power results for the CUSCAN using the elliptic scan statistic are in Table 3.6. Finally, spatial precision and recall results for the CUSCAN are in Table 3.7.
Beginning with Table 3.2, we see that the CUSCAN performs about as well as the space-time scan statistic on smaller clusters, but excels at identifying larger clusters, with first-day detection power for high excess risk ranging from 0.862 to 0.996 for clusters of 10 or more regions, including the irregularly shaped Hudson River cluster (Figure 3.5c). This is especially noticeable when the increased risk is lower; the CUSCAN demonstrates higher first-day power to detect lower-intensity outbreaks in 12 out of 17 outbreak models, including those with smaller simulated clusters.
Table 3.3 shows the comparison between the CUSCAN and the space-time scan statistic (STSS) for the four selected data sets for which the same data could be used. We see that the CUSCAN suffers a larger drop in power than the space-time scan statistic when temporal-only analysis is excluded, with the exception of the whole city outbreak, where the power was unaffected. We also see that, while the choice of population upper bound has a minimal effect on the power of the space-time scan statistic to detect small clusters (from 0.85 at 50% to 0.86 at 5% for the five zip code cluster, no change in the single zip code cluster), the power of the CUSCAN increases substantially, from 0.796 at 50% to 0.924 at 5% for the single zip code cluster and from 0.830 to 0.878 in the five zip code cluster. The CUSCAN also has higher power retention compared to the space-time scan statistic on the larger clusters, with a power of 0.868 at 5% in the large Manhattan cluster compared to 0.77 for the space-time scan statistic, and 0.969 for the whole city outbreak compared to 0.4 for the space-time scan statistic.
While space-time scan statistic power results for lower population bounds or with temporal-only analysis excluded are not available for the remaining simulated clusters,
Table 3.2: Power to detect new outbreak on first day for the CUSCAN and the space-time scan statistic (STSS). Temporal-only analysis is included when testing for an increase in expected case counts of 10%, 20%, or 30%. The highest value in each row is in bold text.
High Excess Risk
Outbreak area No. of regions Inc = 10% Inc = 20% Inc = 30% STSS
A. Williamsburg, Brooklyn 1 0.806 0.805 0.804 0.860
B. Roosevelt Island, Manhattan 1 0.919 0.915 0.915 0.920
C. Bulls Head, Staten Island 1 0.811 0.808 0.812 0.830
D. LaGuardia, Queens 1 0.855 0.859 0.856 0.850
E. West Farms, Bronx 1 0.838 0.842 0.841 0.830
A with 4 neighbors 5 0.838 0.841 0.837 0.850
B with 5 neighbors 6 0.872 0.894 0.902 0.820
C with 4 neighbors 5 0.812 0.816 0.818 0.830
D with 9 neighbors 10 0.914 0.923 0.929 0.880
E with 4 neighbors 5 0.835 0.844 0.844 0.860
Rockaways 5 0.824 0.825 0.830 0.840
Hudson River 20 0.849 0.884 0.903 0.660
Bronx 25 0.960 0.962 0.964 0.940
Brooklyn 37 0.995 0.996 0.996 0.980
Manhattan 40 0.975 0.975 0.976 0.920
Queens 62 0.996 0.995 0.995 0.980
Staten Island 12 0.862 0.867 0.870 0.870
Medium Excess Risk
Outbreak area No. of regions Inc = 10% Inc = 20% Inc = 30% STSS
A. Williamsburg, Brooklyn 1 0.334 0.324 0.328 0.350
B. Roosevelt Island, Manhattan 1 0.399 0.387 0.388 0.370
C. Bulls Head, Staten Island 1 0.291 0.283 0.281 0.340
D. LaGuardia, Queens 1 0.366 0.364 0.359 0.320
E. West Farms, Bronx 1 0.323 0.325 0.325 0.290
A with 4 neighbors 5 0.435 0.426 0.433 0.420
B with 5 neighbors 6 0.437 0.471 0.482 0.400
C with 4 neighbors 5 0.338 0.351 0.357 0.330
D with 9 neighbors 10 0.541 0.570 0.576 0.420
E with 4 neighbors 5 0.397 0.406 0.412 0.430
Rockaways 5 0.330 0.334 0.323 0.340
Hudson River 20 0.432 0.455 0.472 0.330
Bronx 25 0.704 0.712 0.716 0.940
Brooklyn 37 0.900 0.899 0.901 0.790
Manhattan 40 0.685 0.690 0.700 0.570
Queens 62 0.871 0.868 0.879 0.730
Staten Island 12 0.432 0.441 0.437 0.430
Table 3.3: Power results on day 31 for alternate parameter specifications on select data sets for CUSCAN and space-time scan statistic (STSS). All models are high excess risk. CUSCAN results with purely temporal clusters included are calculated with $\lambda_1 = 1.3\lambda_0$.
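| Outbreak area | Maximum size | Purely temporal | CUSCAN | STSS |
|---|---|---|---|---|
| A. Williamsburg, Brooklyn | 50% | Yes | 0.804 | 0.860 |
| A. Williamsburg, Brooklyn | 50% | No | 0.796 | 0.860 |
| A. Williamsburg, Brooklyn | 5% | No | 0.924 | 0.860 |
| A. Williamsburg, Brooklyn | N/A | Yes, only | 0.185 | 0.190 |
| A with 4 neighbors | 50% | Yes | 0.837 | 0.850 |
| A with 4 neighbors | 50% | No | 0.830 | 0.850 |
| A with 4 neighbors | 5% | No | 0.878 | 0.860 |
| A with 4 neighbors | N/A | Yes, only | 0.330 | 0.290 |
| Manhattan | 50% | Yes | 0.976 | 0.920 |
| Manhattan | 50% | No | 0.973 | 0.920 |
| Manhattan | 5% | No | 0.868 | 0.770 |
| Manhattan | N/A | Yes, only | 0.807 | 0.750 |
| Whole city | 50% | Yes | 1.000 | 0.860 |
| Whole city | 50% | No | 1.000 | 0.840 |
| Whole city | 5% | No | 0.969 | 0.400 |
| Whole city | N/A | Yes, only | 0.807 | 0.750 |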
Table 3.4: Power results with temporal-only analysis excluded and population upper bound reduced from 50% to 5%.
High Excess Risk
Outbreak area Regions Population 50% 5% Change
A. Williamsburg, Brooklyn 1 1.1% 0.796 0.924 0.128
B. Roosevelt Island, Manhattan 1 0.1% 0.913 0.912 0.001
C. Bulls Head, Staten Island 1 1.1% 0.799 0.907 0.108
D. LaGuardia, Queens 1 0.5% 0.850 0.929 0.079
E. West Farms, Bronx 1 0.7% 0.830 0.900 0.070
A with 4 neighbors 5 4.0% 0.830 0.878 0.048
B with 5 neighbors 6 3.1% 0.834 0.894 0.060
C with 4 neighbors 5 3.3% 0.799 0.910 0.111
D with 9 neighbors 10 8.2% 0.892 0.828 0.064
E with 4 neighbors 5 3.7% 0.827 0.911 0.084
Rockaways 5 1.3% 0.801 0.899 0.098
Hudson River 20 10.3% 0.761 0.707 0.054
Bronx 25 16.6% 0.96 0.832 0.128
Brooklyn 37 30.8% 0.995 0.836 0.159
Manhattan 40 19.0% 0.973 0.868 0.127
Queens 62 28.0% 0.995 0.897 0.098
Staten Island 12 5.5% 0.842 0.918 0.076
Medium Excess Risk
Outbreak area Regions Population 50% 5% Change
A. Williamsburg, Brooklyn 1 1.1% 0.305 0.441 0.136
B. Roosevelt Island, Manhattan 1 0.1% 0.375 0.364 0.011
C. Bulls Head, Staten Island 1 1.1% 0.259 0.460 0.201
D. LaGuardia, Queens 1 0.5% 0.335 0.462 0.127
E. West Farms, Bronx 1 0.7% 0.298 0.440 0.142
A with 4 neighbors 5 4.0% 0.427 0.481 0.054
B with 5 neighbors 6 3.1% 0.397 0.452 0.055
C with 4 neighbors 5 3.3% 0.308 0.478 0.170
D with 9 neighbors 10 8.2% 0.519 0.417 0.102
E with 4 neighbors 5 3.7% 0.382 0.506 0.124
Rockaways 5 1.3% 0.300 0.463 0.163
Hudson River 20 10.3% 0.387 0.349 0.038
Bronx 25 16.6% 0.701 0.474 0.227
Brooklyn 37 30.8% 0.893 0.494 0.399
Manhattan 40 19.0% 0.667 0.472 0.195
Queens 62 28.0% 0.860 0.542 0.318
Staten Island 12 5.5% 0.400 0.559 0.159
Table 3.5: Power results for population upper bound set to 5%. Temporal-only analysis is included when testing for an increase in expected case counts of 10%, 20%, or 30%.
High Excess Risk
Outbreak area No. of regions Inc = 10% Inc = 20% Inc = 30%
A. Williamsburg, Brooklyn 1 0.930 0.930 0.930
B. Roosevelt Island, Manhattan 1 0.920 0.916 0.917
C. Bulls Head, Staten Island 1 0.913 0.91 0.912
D. LaGuardia, Queens 1 0.932 0.935 0.932
E. West Farms, Bronx 1 0.908 0.911 0.910
A with 4 neighbors 5 0.887 0.892 0.889
B with 5 neighbors 6 0.919 0.932 0.938
C with 4 neighbors 5 0.920 0.920 0.920
D with 9 neighbors 10 0.859 0.879 0.890
E with 4 neighbors 5 0.918 0.921 0.920
Rockaways 5 0.908 0.908 0.912
Hudson River 20 0.804 0.847 0.872
Bronx 25 0.869 0.908 0.918
Brooklyn 37 0.927 0.971 0.976
Manhattan 40 0.907 0.938 0.947
Queens 62 0.936 0.973 0.983
Staten Island 12 0.932 0.933 0.936
Medium Excess Risk
Outbreak area No. of regions Inc = 10% Inc = 20% Inc = 30%
A. Williamsburg, Brooklyn 1 0.469 0.466 0.479
B. Roosevelt Island, Manhattan 1 0.401 0.400 0.402
C. Bulls Head, Staten Island 1 0.497 0.491 0.489
D. LaGuardia, Queens 1 0.495 0.499 0.493
E. West Farms, Bronx 1 0.472 0.479 0.478
A with 4 neighbors 5 0.516 0.523 0.529
B with 5 neighbors 6 0.503 0.525 0.537
C with 4 neighbors 5 0.519 0.524 0.527
D with 9 neighbors 10 0.469 0.494 0.504
E with 4 neighbors 5 0.540 0.549 0.551
Rockaways 5 0.491 0.499 0.500
Hudson River 20 0.430 0.483 0.509
Bronx 25 0.568 0.634 0.652
Brooklyn 37 0.685 0.765 0.807
Manhattan 40 0.559 0.619 0.656
Queens 62 0.655 0.741 0.787
Staten Island 12 0.597 0.609 0.609
Table 3.6: Power results for the elliptic scan statistic. Temporal-only analysis is included when testing for an increase in expected case counts of 10%, 20%, or 30%.
Outbreak area Excess Risk Maximum Size Inc = 10% Inc = 20% Inc = 30%
Rockaways High 50% 0.718 0.732 0.728
High 5% 0.891 0.900 0.895
Medium 50% 0.288 0.336 0.357
Medium 5% 0.498 0.534 0.564
Hudson River High 50% 0.887 0.909 0.922
High 5% 0.875 0.905 0.920
Medium 50% 0.460 0.480 0.496
Medium 5% 0.483 0.518 0.540
Table 3.7: Average precision, recall, and estimated cluster size on day 31, 50% population
upper bound.
High Excess Risk
Outbreak area Size (Population) Estimated Size Precision Recall
A. Williamsburg, Brooklyn 1.1% 7.4% 0.794 0.999
B. Roosevelt Island, Manhattan 0.1% 1.6% 0.966 0.999
C. Bulls Head, Staten Island 1.1% 3.2% 0.868 0.986
D. LaGuardia, Queens 0.5% 3.7% 0.904 0.999
E. West Farms, Bronx 0.7% 3.3% 0.912 0.992
A with 4 neighbors 4.0% 17.9% 0.558 0.924
B with 5 neighbors 3.1% 14.1% 0.612 0.915
C with 4 neighbors 3.3% 7.2% 0.759 0.905
D with 9 neighbors 8.2% 25.7% 0.519 0.938
E with 4 neighbors 3.7% 11.7% 0.730 0.927
Rockaways 1.3% 4.1% 0.880 0.887
Hudson River 10.3% 37.2% 0.285 0.796
Bronx 16.6% 31.4% 0.613 0.925
Brooklyn 30.8% 45.1% 0.676 0.969
Manhattan 19.0% 39.3% 0.498 0.902
Queens 28.0% 45.2% 0.597 0.939
Staten Island 5.5% 9.7% 0.837 0.855
Medium Excess Risk
Outbreak area Size (Population) Estimated Size Precision Recall
A. Williamsburg, Brooklyn 1.1% 17.7% 0.501 0.980
B. Roosevelt Island, Manhattan 0.1% 7.7% 0.812 0.987
C. Bulls Head, Staten Island 1.1% 9.9% 0.644 0.869
D. LaGuardia, Queens 0.5% 11.2% 0.700 0.970
E. West Farms, Bronx 0.7% 13.0% 0.650 0.956
A with 4 neighbors 4.0% 25.2% 0.406 0.898
B with 5 neighbors 3.2% 24.8% 0.373 0.894
C with 4 neighbors 3.3% 11.6% 0.619 0.783
D with 9 neighbors 8.2% 29.3% 0.418 0.898
E with 4 neighbors 3.7% 18.2% 0.523 0.874
Rockaways 1.3% 11.4% 0.654 0.785
Hudson River 10.3% 37.3% 0.278 0.735
Bronx 16.6% 31.9% 0.584 0.894
Brooklyn 30.8% 42.8% 0.684 0.915
Manhattan 19.0% 37.3% 0.504 0.836
Queens 28.0% 43.1% 0.606 0.893
Staten Island 5.5% 13.8% 0.722 0.786
Tables 3.4 and 3.5 show that the results in Table 3.3 appear to be typical for the CUSCAN. Table 3.4 shows that the CUSCAN, when performed without supplementary temporal-only analysis, performs only marginally worse in first-day power to detect, with an average decrease in power of 0.0227 in the high risk model (compared to the $\lambda_1 = 1.3\lambda_0$ model) and 0.0327 in the medium risk model when the population upper bound is 50%. Tables 3.4 and 3.5 also show the same pattern as described previously when the population upper bound is changed: power to detect a cluster increases for small clusters but decreases for larger ones, with a more dramatic change when the excess risk is lower. However, the power is still quite high across all clusters, with the exception of the whole-borough clusters in the medium excess risk case with no temporal-only analysis included.
In Table 3.6 we see that replacing the circular-based scan statistic with an elliptic scan statistic improves the power to detect the irregularly shaped Hudson River and Rockaways clusters. For the high excess risk model, the maximum power to detect the Hudson River cluster increases from 0.903 with the circular scan statistic to 0.922 with a population upper bound of 50%, and for the medium excess risk model, the power increases from 0.472 to 0.496. While the power to detect the Rockaways cluster does not improve in the high risk model, the power does improve in the medium risk model, increasing from 0.323 to 0.357 with 50% population and from 0.500 to 0.564 with 5% population. We see the same pattern here as in previous trials: the method performs better when detecting larger clusters (20 regions in the Hudson River cluster compared to 5 in the Rockaways cluster), but power to detect small clusters improves substantially when the population upper bound is lowered.
Finally, in Table 3.7, we see that the CUSCAN has high spatial recall on the first day of the outbreak, indicating that most of the at-risk population is correctly being identified as part of the cluster. However, the spatial precision is low, especially for irregularly shaped clusters like the Hudson River. In essence, the area identified by the CUSCAN is larger than the actual outbreak area on average. This can be primarily attributed to the use of the circular scan statistic as the base of the CUSCAN: if the true cluster is not circular, then the circular window that contains all the outbreak regions will necessarily include other regions as well that are added on to fill in the circle. This effect is most extreme on the elongated Hudson River cluster, where a large number of excess regions were added in when attempting to fit all of the outbreak regions inside a circular window.
3.6 Discussion
From the results of our simulation study, we demonstrate that the proposed CUSCAN methodology shows promise as a tool for the rapid detection of emerging disease clusters, with high power to detect a new cluster in the first time period of an outbreak and a high proportion of the true cluster population correctly identified. While no method will be universally most powerful in all cases, the CUSCAN performs well in identifying larger spatial clusters, and is strong in identifying clusters of any size when the intensity of the outbreak is lower. This makes it a powerful tool for detecting clusters with lower excess risk, which can often be missed.
While the CUSCAN method has more difficulty in identifying smaller clusters, such as the one-zip-code clusters in the benchmark data, the power can be improved by lowering the population upper bound, with the tradeoff of making it more difficult to detect larger clusters instead. In practice, as the size of a potential future disease cluster will be unknown, it is up to the practitioner to decide whether they want to prioritize the search for one size of cluster over another. In the case where the detection of smaller clusters is prioritized, a more moderate value for the population upper bound such as 20% or 30% will increase the ability to detect smaller clusters while sacrificing less power to detect larger ones.
We note in Section 3.5.3 that, while the spatial recall of the CUSCAN is high, the precision tends to be low. We believe this to primarily be a weakness of the circular scan statistic used as the base of the CUSCAN, as many non-outbreak regions are often included within a circular window containing a non-circular cluster. Indeed, when the CUSCAN was applied to clusters that were more circular in shape in Section 3.4, the precision was as high as the recall, meaning few excess regions needed to be added. In essence, low spatial precision in this method is primarily an artifact of using a scan method not ideally suited to the true shape of the clusters, and when paired with high recall, we see that the CUSCAN has a high rate of identifying the correct spatial location of a cluster as well.
Kulldorff et al. [2004] recommend that temporal-only analysis should be included in surveillance contexts where the space-time scan statistic is used. We likewise recommend that temporal-only analysis be considered for use alongside the CUSCAN. We do acknowledge, however, that the form of temporal-only analysis we use in our demonstration does increase the number of parameters needing to be specified by the user. While the CUSCAN emphasizes data-driven calculations and requires few parameters to be specified by the user (specifically, So, r, and the population upper bound), when temporal-only analysis is included using the non-restarting CUSUM, it is also required that the user specify a level of $\lambda_1$ to detect. However, should the practitioner find this undesirable, the method used to determine k in Section 3.3.2 may be used with observed counts rather than scan statistics, and the same levels of So and r may be used for simplicity. Additionally, as demonstrated in Table 3.3, the CUSCAN method is also able to detect a citywide outbreak on its own should one happen to occur, though more analysis may be necessary to recognize the outbreak as citywide, as the CUSCAN would only be able to point to a cluster of specified size within the study area.
While we demonstrate the CUSCAN method using the circular and elliptic scan statistics, other types of scan statistics may be used with this method, including the flexible scan statistic. Indeed, any scan statistic that searches over a fixed set of potential clusters may be used. This flexibility allows the CUSCAN to remain relatively efficient if computational resources are scarce (circular scan statistic), and allows for the detection of irregularly shaped clusters when resources are available (flexible, elliptic scan statistic). Methods such as the flexible scan statistic are by necessity computationally intensive, however, and while useful, may not always be a realistic choice for the practitioner. While
the more intensive scan statistics do increase detection power of irregular clusters, such as the Hudson River and Rockaways clusters in the benchmark data, the method still performs reasonably well in detecting these clusters using the more computationally efficient circular scan statistic. Regardless of the computational resources available, the method should be a boon for practitioners when used in a prospective surveillance setting.
One additional aspect to consider when utilizing this method is the use of the non-restarting CUSUM. While the non-restarting CUSUM improves power during a sustained outbreak [Gandy and Lau, 2012], this comes at the expense of additional false alarms following the end of an outbreak. As the benchmark data set does not extend beyond the beginning of an outbreak, this false alarm problem is not demonstrated by our simulation study, but can be seen in other works utilizing the non-restarting CUSUM such as Dassanayake and French [2016]. While not currently implemented for this method, Hall and French [2019] proposed a correction for the non-restarting CUSUM that controls post-outbreak false alarms, which may be used to address this issue.
CHAPTER IV
AN EXTENSION OF THE CUSCAN FOR DYNAMIC SCANNING METHODS: THE RESTRICTED FLEXIBLE SCAN STATISTIC
4.1 Introduction
Rapid detection of emerging disease outbreaks is an essential aspect of public health, and often the primary goal of prospective surveillance methods. The more quickly a new outbreak is detected, the more effective any available intervention will be. For this reason, surveillance methods that boast high power and speed of detection are often favored. However, power of detection is not the only desirable quality in a surveillance method. When data include a spatial element, such as counts of disease indexed by county or census tracts, it is also essential that the location of an outbreak be identified correctly. A method that quickly detects a new outbreak, but cannot accurately determine its location, is of little use when further action is required at the outbreak site.
Spatial scan methods are popular for locating clusters of disease in spatial data. These methods search over a variety of potential cluster locations called windows, comparing the disease incidence rate relative to expected levels inside the window to that outside the window to determine if an outbreak is likely to exist within that given window. One of the most common ways the windows are determined is using a series of concentric circular areas, as circular windows can be computed using only a measure of distance and require relatively little computational complexity. Kulldorff [1997] originally proposed the circular scan method as a method for examining cross-sectional data, and it was later extended for use with time-indexed data as a prospective tool by Kulldorff [2001]. While the circular and space-time scan methods have high power to detect when an outbreak is present, their precision in determining which regions are included in the disease cluster is low when clusters are not approximately circular in shape, as a circular window that fully contains a non-circular cluster will by necessity also include many non-cluster regions. For this reason, it can be difficult to determine which regions within the identified outbreak area are actually in need of health intervention.
To reduce the problems caused by fixed-shape scan methods, many scan methods have been developed with the goal of detecting irregularly shaped disease clusters. One such example is the flexibly shaped scan statistic [Tango and Takahashi, 2005], which searches over all connected subsets in a local neighborhood, allowing for the detection of clusters of arbitrary shape. However, as the number of regions in a study area or the size of the local neighborhoods increases, the number of potential clusters under consideration increases exponentially, and the search quickly becomes computationally infeasible. This makes detecting larger clusters or clusters within large regions extremely difficult. In order to mitigate this, Tango and Takahashi [2012] proposed a modified version of the flexibly shaped scan method that reduces the number of potential clusters under consideration by first filtering out regions that are not likely to be hotspots. This effectively reduces the number of regions under consideration, speeding up computation considerably, while still allowing for identified clusters to take on a variety of shapes.
In Chapter III, we proposed a novel prospective surveillance method based on spatial scanning methods that we called the CUSCAN statistic. This method, defined by computing the cumulative sum (CUSUM) of spatial scan statistics, was shown to have high power to detect emerging disease clusters, making it promising for use as an active surveillance tool. However, as the CUSCAN primarily uses the circular scan method to compute the scan statistics at its core, it suffers from the same low precision seen in other circular-based scan methods. This weakness can potentially be eliminated by replacing the circular scan method with one designed to detect clusters of arbitrary shape. However, a current requirement of the CUSCAN method is that the set of potential clusters under consideration must be fixed ahead of time, so that the CUSUM statistic may be computed. This means that methods such as the restricted flexible scan statistic, whose potential clusters are determined by the observed data and vary over time, could not be used with the CUSCAN. Here, we propose a modification to the CUSCAN to allow the use of spatial scanning methods with time-variable windows by linking together overlapping clusters between time periods. We demonstrate this extension by incorporating the restricted flexible scan method into the CUSCAN.
The structure of this chapter is as follows: in Section 4.2, we review the properties of the CUSCAN and restricted flexible scan methods. In Section 4.3, we describe in detail the process of connecting overlapping clusters through time that allows for the computation of the CUSCAN statistic. In Section 4.4, we demonstrate our modifications and compare the properties of the circular, elliptic, and restricted flexible CUSCAN methods using simulated benchmark data. Finally, in Section 4.5, we summarize our conclusions.
4.2 Review of Methods
4.2.1 The CUSCAN Method
In Chapter III, we introduced the CUSCAN method, which incorporates spatial data into a nonrestarting CUSUM framework by computing the cumulative sum of spatial scan statistics.
Suppose that we have a study area consisting of $N$ disjoint spatial regions. Let $n_1, \ldots, n_N$ denote the at-risk population within each region and $n_+ = \sum_{i=1}^{N} n_i$ the total population, both of which we assume to be fixed and unchanging over time. Let $Y_{(1,t)}, \ldots, Y_{(N,t)}$ denote the case counts in the associated regions at a given time $t$.
For a given study area with fixed population, we expect the global disease risk to remain constant over time, and to be equal to the local risk within each region when no outbreak is present. Suppose that we have $M$ time periods of baseline data with no outbreaks. Let $y_{t+} = \sum_{i=1}^{N} y_{(i,t)}$ be the total case count for time $t$. We can compute the average global rate of disease as
$$ r = \frac{\sum_{t=1}^{M} y_{t+}}{M n_+}. \tag{4.1} $$
We can then compute the expected case count at each region, also assumed to be constant, as $E_i = r n_i$.
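As a minimal sketch of equation (4.1) and the expected counts $E_i = r n_i$, the following Python function estimates the global rate from baseline data. The function name, the list-of-lists data layout, and the toy numbers are illustrative assumptions and do not come from the thesis.

```python
def expected_counts(baseline_counts, populations):
    """Estimate the global rate r and expected counts E_i = r * n_i.

    baseline_counts: list of M lists, one per baseline period, each holding
                     the N regional case counts (hypothetical layout).
    populations:     list of the N regional at-risk populations n_i.
    """
    M = len(baseline_counts)
    n_plus = sum(populations)                                    # total population n_+
    total_cases = sum(sum(period) for period in baseline_counts)  # sum over t of y_{t+}
    r = total_cases / (M * n_plus)                               # equation (4.1)
    return r, [r * n for n in populations]


# Toy example: 3 regions, 3 baseline periods, 10 cases per period.
populations = [1000, 3000, 6000]
baseline = [[1, 3, 6], [2, 2, 6], [0, 4, 6]]
r, E = expected_counts(baseline, populations)
```

With these toy inputs the estimated rate is one case per 1,000 people per period, so the expected counts are proportional to the regional populations.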
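For a given spatial scan method, we consider a set of possible clusters, $\mathcal{Z}$, assumed to be fixed, each consisting of a subset of contiguous regions within the study area. For each potential cluster $Z \in \mathcal{Z}$ and time $t$, define $Y_{\text{in}} = \sum_{i \in Z} Y_i$, $Y_{\text{out}} = \sum_{i \notin Z} Y_i$, $E_{\text{in}} = \sum_{i \in Z} E_i$, and $E_{\text{out}} = \sum_{i \notin Z} E_i$, where we suppress the dependency on $t$ for simplicity. We then begin by computing the spatial scan statistic for each potential cluster at each time period:
$$ S_{(Z,t)} = \left(\frac{Y_{\text{in}}}{E_{\text{in}}}\right)^{Y_{\text{in}}} \left(\frac{Y_{\text{out}}}{E_{\text{out}}}\right)^{Y_{\text{out}}} I\left(\frac{Y_{\text{in}}}{E_{\text{in}}} > \frac{Y_{\text{out}}}{E_{\text{out}}}\right). \tag{4.2} $$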
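The scan statistic in equation (4.2) can be sketched directly in Python. This is an illustration only: the function name and toy data are assumptions, the cluster is assumed to be a proper subset of the study area (so that $E_{\text{out}} > 0$), and for large counts one would likely work on the log scale to avoid overflow.

```python
def poisson_scan_statistic(counts, expected, cluster):
    """Poisson scan statistic of equation (4.2) for one candidate cluster.

    counts, expected: per-region observed and expected case counts.
    cluster:          iterable of region indices (a proper subset of regions).
    """
    cluster = set(cluster)
    y_in = sum(y for i, y in enumerate(counts) if i in cluster)
    e_in = sum(e for i, e in enumerate(expected) if i in cluster)
    y_out = sum(counts) - y_in
    e_out = sum(expected) - e_in
    # Indicator: only clusters with elevated risk inside the window count.
    if y_in / e_in <= y_out / e_out:
        return 0.0
    return (y_in / e_in) ** y_in * (y_out / e_out) ** y_out


counts = [10, 2, 3, 5]
expected = [5.0, 5.0, 5.0, 5.0]
s_high = poisson_scan_statistic(counts, expected, {0})  # elevated region
s_low = poisson_scan_statistic(counts, expected, {1})   # depressed region
```

In the toy data, region 0 has twice its expected count and yields a positive statistic, while region 1 falls below its expected count and the indicator sets the statistic to zero.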
Once we have computed the Poisson scan statistic for each potential cluster $Z \in \mathcal{Z}$, we compute the CUSUM statistic for each cluster by:
$$ C_{(Z,0)} = 0, \qquad C_{(Z,t)} = \max\{0,\; C_{(Z,t-1)} + S_{(Z,t)} - k\}, \tag{4.3} $$
where C(Z,t) is the CUSUM statistic for cluster Z at time t, and k is a constant fitted via simulation to keep the CUSUM bounded near zero when no outbreak is present (see Section 3.3.2 for details on the selection of k). The CUSCAN statistic at time t is then computed as the maximum CUSUM statistic for that time period:
$$ C_t = \max_{Z \in \mathcal{Z}} \{C_{(Z,t)}\}. \tag{4.4} $$
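The recursion in equation (4.3) and the maximization in equation (4.4) can be sketched as follows for a fixed set of clusters. The function name, the dictionary layout, the cluster labels, and the value of $k$ are hypothetical and chosen only for illustration; in the method itself $k$ is fitted via simulation.

```python
def cuscan_path(scan_stats, k):
    """Non-restarting CUSUM per cluster and the CUSCAN statistic per period.

    scan_stats: list over time of dicts {cluster label: scan statistic};
                the same clusters are assumed to appear in every period.
    k:          reference constant (hypothetical value in the example).
    Returns a list of (maximizing cluster, CUSCAN statistic) per period.
    """
    cusum = {z: 0.0 for z in scan_stats[0]}          # C_(Z,0) = 0
    path = []
    for stats_t in scan_stats:
        for z, s in stats_t.items():
            cusum[z] = max(0.0, cusum[z] + s - k)    # equation (4.3)
        best = max(cusum, key=cusum.get)             # equation (4.4)
        path.append((best, cusum[best]))
    return path


stats = [
    {"A": 1.0, "B": 3.0},
    {"A": 5.0, "B": 1.0},
    {"A": 4.0, "B": 2.5},
]
path = cuscan_path(stats, k=2.0)
```

Note that the statistic is never reset after an alarm, which matches the non-restarting form of the CUSUM used throughout this chapter.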
The significance of the CUSCAN statistic is assessed via simulation. $N_{\text{sim}}$ data sets are simulated under the null hypothesis of no outbreak, and the CUSCAN statistic is computed for each. Let $C_t^{(i)}$ be the CUSCAN statistic at time $t$ for simulated data set $i$. We compute the p-value for the CUSCAN statistic $C_t$ as the proportion of statistics, including $C_t$, that are at least as large as $C_t$:
$$ p_t = \frac{1 + \sum_{i=1}^{N_{\text{sim}}} I\left(C_t^{(i)} \geq C_t\right)}{1 + N_{\text{sim}}}. \tag{4.5} $$
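A short sketch of the Monte Carlo p-value in equation (4.5) is given below. The function name and the simulated values are illustrative assumptions; in practice the simulated statistics would come from running the CUSCAN on null data sets.

```python
def monte_carlo_p_value(observed, simulated):
    """Equation (4.5): proportion of statistics, counting the observed one,
    that are at least as large as the observed CUSCAN statistic."""
    exceed = sum(1 for c in simulated if c >= observed)
    return (1 + exceed) / (1 + len(simulated))


# Hypothetical null statistics from 9 simulated data sets.
simulated = [0.0, 1.5, 4.2, 5.0, 0.3, 2.0, 2.0, 2.0, 2.0]
p = monte_carlo_p_value(4.2, simulated)
```

Adding one to both the numerator and the denominator means the smallest attainable p-value is $1/(1 + N_{\text{sim}})$ rather than zero.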
If $p_t < \alpha$ for a given significance level $\alpha$, we declare that an outbreak is present at location $Z^*$, where $Z^*$ is the maximal cluster from equation (4.4).
Currently, the CUSCAN method requires that the set of potential clusters, $\mathcal{Z}$, be fixed ahead of time so that each $Z \in \mathcal{Z}$ appears in every time period and $C_{(Z,t)}$ may be computed. In Chapter III, the following two scan methods were used to determine $\mathcal{Z}$, each depending only on the fixed population and regions within the study area:
1. Circular scan statistic [Kulldorff, 1997]. Beginning at the centroid of each region in the study area, expand a circular window to include nearby regions until the population inside the largest window reaches some prespecified upper bound, commonly 50% of the total at-risk population. A region is included in a potential cluster if its centroid is within the circular window.
2. Elliptic scan statistic [Kulldorff et al., 2006]. Beginning at the centroid of each region in the study area, expand an elliptic window to include nearby regions until the population upper bound is reached. These ellipses are defined in terms of their shape (c = a/b, the ratio of the major and minor axes) and angle ($\theta$, the angle between the major axis and horizontal axis).
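The construction of circular windows in item 1 can be sketched as below. This is a simplified illustration under stated assumptions: the function name and toy coordinates are hypothetical, ties in distance are broken by region index, and a window is kept while its population is at most the upper bound (the exact boundary convention is an assumption).

```python
import math


def circular_windows(centroids, populations, max_fraction=0.5):
    """Enumerate candidate clusters for a circular scan.

    For each region, neighbors are added in order of inter-centroid distance
    until the window population would exceed max_fraction of the total.
    Returns a set of frozensets of region indices.
    """
    limit = max_fraction * sum(populations)
    windows = set()
    for center in centroids:
        order = sorted(range(len(centroids)),
                       key=lambda j: math.dist(center, centroids[j]))
        members, pop = [], 0
        for j in order:
            pop += populations[j]
            if pop > limit:
                break
            members.append(j)
            windows.add(frozenset(members))  # each nested circle is a window
    return windows


centroids = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (4.5, 0.0)]
populations = [10, 10, 10, 10]
windows = circular_windows(centroids, populations, max_fraction=0.5)
```

Because the windows depend only on centroids and populations, this set can be computed once and reused in every time period, which is the property the fixed-window CUSCAN relies on.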
4.2.2 The Restricted Flexible Scan Method
In Chapter III, it was additionally noted that the flexibly shaped scan statistic [Tango and Takahashi, 2005], which searches all connected subsets of regions within a neighborhood of fixed size, may also be used with the CUSCAN. However, the flexibly shaped scan statistic is severely computationally intensive, making it impractical for use in detecting clusters containing more than a handful of regions at a time. To remove this weakness, Tango and Takahashi [2012] proposed a restricted flexibly shaped scan method that significantly reduces the number of regions under consideration when computing connected subsets, drastically reducing the number of potential clusters and computation time.
Suppose as before that we have $N$ disjoint regions in our study area, each with fixed population size $n_1, \ldots, n_N$ and time-indexed case counts $Y_{(1,t)}, \ldots, Y_{(N,t)}$, with the global rate $r$ and expected counts $E_1, \ldots, E_N$. The flexibly shaped scan method determines the set of potential clusters $\mathcal{Z}$ as follows: for each region in the study area, create a local neighborhood consisting of the region and its $k - 1$ nearest neighbors, measured by inter-centroid distance. For each local neighborhood of $k$ regions, identify all connected subsets of regions within the neighborhood. This collection of connected subsets is our set of potential clusters $\mathcal{Z}$, and the Poisson scan statistic for each $Z \in \mathcal{Z}$ is then computed as:
$$ S_Z = \left(\frac{Y_{\text{in}}}{E_{\text{in}}}\right)^{Y_{\text{in}}} \left(\frac{Y_{\text{out}}}{E_{\text{out}}}\right)^{Y_{\text{out}}} I\left(\frac{Y_{\text{in}}}{E_{\text{in}}} > \frac{Y_{\text{out}}}{E_{\text{out}}}\right), \tag{4.6} $$
with the most likely cluster determined to be the $Z^*$ that satisfies:
$$ S_{Z^*} = \max_{Z \in \mathcal{Z}} \{S_Z\}. \tag{4.7} $$
By determining the set of potential clusters Z in this way, the flexibly shaped scan method is able to detect clusters of up to k regions with arbitrary shape, making it a useful tool for identifying irregularly shaped disease clusters.
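To make the enumeration of connected subsets concrete, the following brute-force sketch lists every connected subset of a single local neighborhood. The function names and the toy adjacency structure are assumptions for illustration; this naive version checks all $2^k - 1$ nonempty subsets, which is exactly why the approach becomes expensive as $k$ grows.

```python
from itertools import combinations


def is_connected(subset, adjacency):
    """Depth-first search restricted to the regions in `subset`."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nb in adjacency[node]:
            if nb in subset and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == subset


def connected_subsets(neighborhood, adjacency):
    """All connected subsets of one local neighborhood (brute force)."""
    found = []
    for size in range(1, len(neighborhood) + 1):
        for combo in combinations(neighborhood, size):
            if is_connected(combo, adjacency):
                found.append(frozenset(combo))
    return found


# Hypothetical study area: four regions arranged in a line, 0-1-2-3.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
subsets = connected_subsets([0, 1, 2], adjacency)
```

In this toy neighborhood the subset {0, 2} is excluded because regions 0 and 2 are only joined through region 1.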
It is easy to see that the number of possible clusters under consideration increases exponentially as either the neighborhood size k or the number of regions in the study area N increase. In order to reduce the number of potential clusters under consideration, Tango and Takahashi [2012] suggest that regions first be filtered based on their local risk in order to determine which regions are most likely to be experiencing an outbreak.
To determine whether a region has a sufficiently high local risk, we compute its middle p-value:
$$ mp_i = P\big(Y_i \geq y_i + 1 \mid Y_i \sim \text{Poisson}(E_i)\big) + \frac{1}{2} P\big(Y_i = y_i \mid Y_i \sim \text{Poisson}(E_i)\big), \tag{4.8} $$
where $y_i$ is the actual observed count in region $i$. This middle p-value is then compared to a predetermined threshold $\alpha_1$. Let $\mathcal{R} \subseteq \{1, \ldots, N\}$ be the set of regions with $mp_i < \alpha_1$. Rather than determine all connected subsets of all regions in the study area up to a maximum neighborhood size, we instead search only for connected subsets of regions in $\mathcal{R}$. This restricted search defines a much smaller set of potential clusters for $\mathcal{Z}$, significantly reducing the amount of computation necessary to compute the spatial scan statistics.
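The middle p-value of equation (4.8) and the resulting region filter can be sketched with the standard library alone. The function names and toy counts are illustrative assumptions; the Poisson probabilities are computed through `math.lgamma` to stay stable for larger counts.

```python
import math


def poisson_pmf(y, mean):
    """P(Y = y) for Y ~ Poisson(mean), computed on the log scale."""
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))


def middle_p_value(y, mean):
    """Equation (4.8): P(Y >= y + 1) + 0.5 * P(Y = y)."""
    cdf = sum(poisson_pmf(j, mean) for j in range(y + 1))  # P(Y <= y)
    return (1.0 - cdf) + 0.5 * poisson_pmf(y, mean)


def filter_regions(counts, expected, alpha1):
    """Indices of regions whose middle p-value falls below alpha1."""
    return {i for i, (y, e) in enumerate(zip(counts, expected))
            if middle_p_value(y, e) < alpha1}


counts = [2, 6, 0]
expected = [2.0, 2.0, 2.0]
hot = filter_regions(counts, expected, alpha1=0.1)
```

Only the region with six cases against an expected two passes the filter here, so connected subsets would be searched among that region alone.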
4.3 CUSCAN with the Restricted Flexible Scan Method
One effect of the region filtering described in Section 4.2.2 is that the set of potential clusters $\mathcal{Z}$ cannot be predetermined and will vary based on actual observed case counts. For this reason, when we have a time series of case counts $Y_{(1,t)}, \ldots, Y_{(N,t)}$, the filtering will produce a different set of potential clusters $\mathcal{Z}_t$ at each time index $t$. As a given cluster $Z$ is not guaranteed to be in every $\mathcal{Z}_t$, we could not originally compute the CUSCAN statistic as described in Chapter III and Section 4.2.1.
We propose that, rather than requiring the same cluster $Z$ for each time period, we can instead find a chain of overlapping clusters $Z_1, \ldots, Z_t$ to compute the CUSCAN statistic. Let $\mathcal{Z}_t$ be the set of potential clusters at time $t$. At time $t = 1$, we can compute the CUSCAN directly by calculating the Poisson scan statistic for each cluster in $\mathcal{Z}_1$ and using equation (4.3). For each subsequent $t \geq 2$, we use the following method to compute the CUSCAN:
1. Compute the middle p-value for each region using equation (4.8) and determine the set of potential clusters $\mathcal{Z}_t$.
2. Compute the Poisson scan statistic (4.2) for each potential cluster $Z \in \mathcal{Z}_t$.
3. Order the statistics $S_{(Z,t)}$ from largest to smallest, $S_t^{(1)}, \ldots, S_t^{(|\mathcal{Z}_t|)}$. Let $Z_t^{(1)}, \ldots, Z_t^{(|\mathcal{Z}_t|)}$ denote the regions associated with each statistic.
4. From this ordered list, determine the set of nonoverlapping likely clusters $\mathcal{N}_t$:
   a. Add $Z_t^{(1)}$ to $\mathcal{N}_t$, as it is associated with the largest scan statistic.
   b. Identify the next largest statistic $S_t^{(i)}$ such that $Z_t^{(i)}$ shares no regions in common with $Z_t^{(1)}$, and add $Z_t^{(i)}$ to $\mathcal{N}_t$.
   c. Identify the next largest statistic $S_t^{(j)}$ such that $Z_t^{(j)}$ shares no regions in common with $Z_t^{(1)}$ or $Z_t^{(i)}$, and add $Z_t^{(j)}$ to $\mathcal{N}_t$.
   d. Continue this process until no further nonoverlapping clusters can be found.
5. For each cluster $Z_t$ in $\mathcal{N}_t$:
   a. Identify the cluster $Z_{t-1}$ in $\mathcal{N}_{t-1}$ with the largest CUSCAN statistic that overlaps with $Z_t$.
      i. If no such cluster exists, compute the CUSUM statistic in equation (4.3) with the scan statistic from $Z_t$ and an initial CUSCAN statistic of 0.
      ii. Return to a. for the next cluster.
   b. Compute the CUSUM statistic as in equation (4.3) using the scan statistic from $Z_t$ and the CUSCAN statistic from $Z_{t-1}$.
6. Calculate the maximum CUSCAN statistic at time $t$ by taking the maximum of the CUSUM streams as in equation (4.4).
7. Increment $t$ and return to 1.
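Step 4 of the procedure above, the greedy selection of nonoverlapping likely clusters, might be sketched as follows. The function name, the dictionary layout, and the scan statistic values are hypothetical and serve only to illustrate the selection rule.

```python
def nonoverlapping_clusters(scan_stats):
    """Greedy selection of nonoverlapping clusters (step 4).

    scan_stats: dict {frozenset of region ids: scan statistic}.
    Returns the selected clusters in decreasing order of scan statistic.
    """
    selected, used = [], set()
    ranked = sorted(scan_stats.items(), key=lambda kv: kv[1], reverse=True)
    for cluster, _ in ranked:
        if used.isdisjoint(cluster):   # shares no region with earlier picks
            selected.append(cluster)
            used |= cluster
    return selected


fs = frozenset
stats = {                      # hypothetical statistics for one time period
    fs({1, 3, 4}): 9.0,
    fs({3, 4, 5}): 8.0,        # overlaps {1, 3, 4}, so it is skipped
    fs({7, 9, 11}): 6.5,
    fs({9, 11}): 6.0,          # overlaps {7, 9, 11}, so it is skipped
    fs({18, 21, 22}): 4.0,
}
likely = nonoverlapping_clusters(stats)
```

Tracking the union of already selected regions makes each overlap check a single set operation, so the cost is dominated by the initial sort.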
To demonstrate how this process links clusters together over time, we provide the following example. Suppose we have a study area that we want to calculate the CUSCAN for over three consecutive time periods. Using the restricted flexible scan method, we compute the scan statistics for each cluster in $\mathcal{Z}_1, \mathcal{Z}_2, \mathcal{Z}_3$. After ordering the scan statistics and identifying the nonoverlapping clusters, we have:
$$ \mathcal{N}_1 = \{Z_1^{(1)} = \{1, 3, 4\},\; Z_1^{(2)} = \{7, 9, 11\},\; Z_1^{(3)} = \{18, 21, 22\}\} $$
$$ \mathcal{N}_2 = \{Z_2^{(1)} = \{4, 5, 7\},\; Z_2^{(2)} = \{9, 12, 13\},\; Z_2^{(3)} = \{16, 17, 20\}\} $$
$$ \mathcal{N}_3 = \{Z_3^{(1)} = \{3, 4, 5\},\; Z_3^{(2)} = \{8, 10, 11\},\; Z_3^{(3)} = \{17, 18, 20\}\} $$
where $Z_t^{(1)}$ is the region associated with the largest scan statistic, $Z_t^{(2)}$ the second largest, and $Z_t^{(3)}$ the third largest. These regions would then be linked together in the following way:
Cluster        t = 1             t = 2             t = 3
$Z_t^{(1)}$    {1,3,4}      →    {4,5,7}      →    {3,4,5}
$Z_t^{(2)}$    {7,9,11}     →    {9,12,13}         {8,10,11}
$Z_t^{(3)}$    {18,21,22}        {16,17,20}   →    {17,18,20}
Analyzing the progression, we see that at $t = 2$:
• Cluster {4, 5, 7} in $\mathcal{N}_2$ overlaps with both {1, 3, 4} and {7, 9, 11} in $\mathcal{N}_1$. We link it to the cluster with the larger test statistic, {1, 3, 4}.
• Cluster {9, 12, 13} in $\mathcal{N}_2$ overlaps only with {7, 9, 11} in $\mathcal{N}_1$, so we link them together.
• Cluster {16, 17, 20} in $\mathcal{N}_2$ does not overlap any clusters in $\mathcal{N}_1$, and so a new chain begins assuming $S_{t-1} = 0$.
• Since no cluster in $\mathcal{N}_2$ overlaps with {18, 21, 22} in $\mathcal{N}_1$, the chain ends at $t = 1$.
When we increment $t$, we see that at $t = 3$:
• Cluster {3, 4, 5} in $\mathcal{N}_3$ only overlaps with {4, 5, 7} in $\mathcal{N}_2$, so we link them together.
• Cluster {8, 10, 11} in $\mathcal{N}_3$ does not overlap any clusters in $\mathcal{N}_2$, and so a new chain begins assuming $S_{t-1} = 0$.
• Cluster {17, 18, 20} in $\mathcal{N}_3$ only overlaps {16, 17, 20} in $\mathcal{N}_2$, so we link them together.
• Since no cluster in $\mathcal{N}_3$ overlaps {9, 12, 13} in $\mathcal{N}_2$, the chain ends at $t = 2$.
In this way, we are able to chain together potential clusters from $t = 1$ to $t = 3$, even though the set of nonoverlapping clusters changes over time.
Using this method of chaining together clusters that overlap between time periods, we are able to compute several CUSUM streams over time when the set of potential clusters is dynamic rather than fixed. After calculating the CUSUM streams, we would then take the maximum at each time period as our CUSCAN statistic.
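The chaining of overlapping clusters can be sketched in Python using the cluster sets from the three-period example. The scan statistic values and the constant `k` below are hypothetical, chosen only so that {1, 3, 4} carries the larger CUSUM value at $t = 1$; the function name and data layout are likewise assumptions rather than the thesis implementation.

```python
def link_clusters(prev_cusum, scan_stats, k):
    """One time step of chaining overlapping clusters.

    prev_cusum: {cluster: CUSUM value} for the nonoverlapping clusters at t-1.
    scan_stats: {cluster: scan statistic} for the nonoverlapping clusters at t.
    Returns ({cluster: CUSUM value}, {cluster: linked predecessor or None}).
    """
    cusum, parent = {}, {}
    for cluster, stat in scan_stats.items():
        overlapping = [p for p in prev_cusum if p & cluster]
        # Link to the overlapping predecessor with the largest CUSUM value;
        # with no overlap, a new chain starts from 0.
        parent[cluster] = max(overlapping, key=prev_cusum.get) if overlapping else None
        start = prev_cusum[parent[cluster]] if overlapping else 0.0
        cusum[cluster] = max(0.0, start + stat - k)   # equation (4.3)
    return cusum, parent


fs = frozenset
k = 1.0  # hypothetical reference value
periods = [  # hypothetical scan statistics for the example clusters
    {fs({1, 3, 4}): 5.0, fs({7, 9, 11}): 4.0, fs({18, 21, 22}): 3.0},
    {fs({4, 5, 7}): 6.0, fs({9, 12, 13}): 3.0, fs({16, 17, 20}): 2.0},
    {fs({3, 4, 5}): 5.0, fs({8, 10, 11}): 2.5, fs({17, 18, 20}): 2.0},
]
cusum, history = {}, []
for stats in periods:
    cusum, parent = link_clusters(cusum, stats, k)
    history.append((cusum, parent, max(cusum.values())))  # equation (4.4)
```

Each entry of `history` holds the CUSUM streams, the links to the previous period, and the CUSCAN statistic for that period, so the chains described in the example can be read off from the `parent` dictionaries.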
4.4 Demonstration: Simulated Data based on Northeast Benchmark Data
The data we use to demonstrate the CUSCAN with the restricted flexible scan method is based on benchmark data constructed by Kulldorff et al. [2003] and Duczmal et al. [2006]. These data sets were inspired by breast cancer mortality data from 1988-1992 in the northeastern United States, including various regions within Connecticut, Delaware, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, Vermont, and the District of Columbia. The map of the area contains 245 distinct regions with a total at-risk population of approximately 29.5 million, with the population in each region measured as the female population in the region as recorded in the 1990 U.S. Census.
The data sets we utilize here come from Duczmal et al. [2006], which contains many irregularly shaped disease clusters. For each cluster, 10,000 cross-sectional data sets were simulated, each totaling 600 cases. Additionally, 99,999 cross-sectional null data sets are provided where the 600 total cases are randomly distributed across all regions.
The clusters we chose to analyze are clusters B, C, E, F, and G from this data set, chosen to provide a varied selection of irregular shapes and sizes. These five clusters are shown in Figure 4.1 and summarized in Table 4.1.
As the restricted flexible scan method requires that subsets within a neighborhood be connected, a spatial adjacency matrix was required. The spatial adjacency matrix used was provided along with the simulated data sets as part of the neastbenchmark R package, which is available at http://www.githib.com/jpfrench81/neastbenchmark. This matrix was primarily auto-generated with the R package spdep, which automatically locates adjacent regions from the regional map, and updated to connect an island region to the rest of the contiguous regions. The neastbenchmark package has been updated since the time of this study, and the results that follow utilize the connectivity matrix in neastw_old.
Figure 4.1: Simulated disease clusters with irregular shapes.
Table 4.1: Size and population of selected simulated disease clusters with descriptions from original source.
Cluster   Description                    No. of Regions   Population
B         Hudson River                   16               5.7%
C         Lake Ontario Coast             7                2.4%
E         Susquehanna River              21               5.0%
F         New England Coast              23               10.8%
G         Pennsylvania External Border   26               8.4%
Since the CUSCAN is a method for detecting emerging clusters in time-series data, we simulated a set of time-indexed data using the cross-sectional data sets provided. For each outbreak data set, 30 cross-sectional null data sets and 3 cross-sectional outbreak data sets were randomly selected and placed together sequentially. For each null data set, 33 cross-sectional null sets were randomly selected. This allowed us to simulate a time-indexed study with 30 periods of no outbreak followed by 3 periods where an outbreak is present. We simulated 9,999 null data sets, and 1,000 outbreak data sets for each of the five selected clusters.
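The assembly of a time-indexed data set from cross-sectional data sets could look like the sketch below. Whether the original sampling was done with or without replacement is not stated, so sampling with replacement is an assumption here, as are the function name, the seed, and the toy data.

```python
import random


def build_series(null_sets, outbreak_sets, n_null=30, n_outbreak=3, seed=None):
    """Stack randomly chosen cross-sectional data sets into one time series:
    n_null outbreak-free periods followed by n_outbreak outbreak periods.
    Sampling is with replacement (an assumption, not stated in the text)."""
    rng = random.Random(seed)
    series = rng.choices(null_sets, k=n_null)
    if n_outbreak:
        series += rng.choices(outbreak_sets, k=n_outbreak)
    return series


# Toy cross-sectional data sets with three regions each (hypothetical).
null_sets = [[1, 0, 2], [0, 1, 1], [2, 1, 0]]
outbreak_sets = [[5, 0, 1], [6, 1, 0]]
outbreak_series = build_series(null_sets, outbreak_sets, seed=1)
null_series = build_series(null_sets, outbreak_sets, n_null=33, n_outbreak=0, seed=1)
```

Both series have 33 periods, so the null series can be used to calibrate the test against an outbreak series of the same length.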
In order to observe the effect that the restricted flexible method has on the CUSCAN, we compare the performance of the CUSCAN with restricted flexible base to the CUSCAN with circular and elliptic bases. Rather than using a k-nearest neighbor approach to the restricted flexible scan method as described in Section 4.2.2, we instead use a population-based nearest-neighbor approach to determining the neighborhoods, with each local neighborhood extending from its centroid until up to 50% of the total population of the study area is in the neighborhood. This ensures that the circular, elliptic, and restricted flexible scan windows all have the capability to detect clusters of the same size.
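A minimal sketch of the population-based neighborhood rule follows, assuming Euclidean distance between region centroids and a neighborhood that stops growing just before it would exceed the population cap. The function name and inputs are hypothetical and are not taken from any package used in this study.

```python
import math

def population_neighborhood(center, coords, pops, max_frac=0.5):
    """Return region indices, nearest first, whose cumulative population
    stays at or below max_frac of the total population."""
    total = sum(pops)
    # Order all regions by distance from the chosen centroid (ties keep index order).
    order = sorted(range(len(coords)),
                   key=lambda j: math.dist(coords[center], coords[j]))
    hood, running = [], 0
    for j in order:
        if running + pops[j] > max_frac * total:
            break  # adding this region would exceed the population cap
        hood.append(j)
        running += pops[j]
    return hood

# Four hypothetical regions on a line with equal populations.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
pops = [10, 10, 10, 10]
print(population_neighborhood(0, coords, pops))  # [0, 1]
```

Because the cap is on population rather than on the number of neighbors, neighborhoods in sparsely populated areas contain more regions than those in dense areas.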
To analyze the performance of the restricted flexible CUSCAN, we use the same metrics as in Chapter III:
1. Basic power: the basic power at time t is the proportion of data sets where the outbreak was detected by time t.
2. Delay: the difference between the time of detection $t^*$ and the true start of the outbreak $t_s$, i.e., delay $= t^* - t_s$. If the outbreak was detected in the first time period it was present, delay = 0.
3. Spatial precision: the precision is the proportion of the identified cluster contained in the true cluster. Let $A_t$ be the identified cluster at time $t$, $A$ be the true cluster, and $n(Z)$ the population of the set of regions $Z$. The precision at time $t$ is defined based on population and is equal to 1 when $A_t \subseteq A$:
$$\text{precision}_t = \frac{n(A_t \cap A)}{n(A_t)}$$
4. Spatial recall: the recall is the proportion of the true cluster contained in the identified cluster. The recall at time $t$ is likewise defined based on population and is equal to 1 when $A \subseteq A_t$:
$$\text{recall}_t = \frac{n(A_t \cap A)}{n(A)}$$
5. False alarm rate: the false alarm rate at time t is the proportion of data sets that produced an alarm at time t when no outbreak was present.
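As an illustration of the two spatial metrics, the following sketch computes population-based precision and recall for a single identified cluster. The region labels and populations are hypothetical.

```python
def population(regions, pops):
    """Total population n(Z) of a set of regions Z."""
    return sum(pops[r] for r in regions)

def precision_recall(identified, true_cluster, pops):
    """Population-based precision and recall of an identified cluster."""
    identified, true_cluster = set(identified), set(true_cluster)
    overlap = population(identified & true_cluster, pops)
    precision = overlap / population(identified, pops)   # share of identified cluster that is truly at risk
    recall = overlap / population(true_cluster, pops)    # share of true cluster that was identified
    return precision, recall

# Hypothetical populations for four regions.
pops = {"a": 100, "b": 200, "c": 300, "d": 400}
print(precision_recall({"a", "b"}, {"b", "c"}, pops))
```

In this toy case the identified cluster {a, b} shares only region b with the true cluster {b, c}, so precision is 200/300 and recall is 200/500.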
The restricted flexible CUSCAN method is implemented with $\alpha_1 = 0.10, 0.15, 0.20$. The elliptic CUSCAN was implemented with shape parameters $S = 1, 2, 4$ with 1, 6, and 12 rotational angles, respectively. A significance level of $\alpha = 0.05$ is used in all power calculations. Unlike in Chapter III, we did not include any temporal-only clusters in this analysis.
Basic power, false alarm rate, and average delay results for all three methods are found in Table 4.2, average spatial precision in Table 4.3, and average spatial recall in Table 4.4.
Starting with the basic power in Table 4.2, we see that while the restricted flexible CUSCAN has reasonably good power on the smaller or more connected clusters, with first-day power of 0.691 to 0.746 for cluster B, 0.721 to 0.787 for cluster C, and 0.725 to 0.740 for cluster E, the power drops off when clusters contain a large number of regions with few connections between them. For clusters F and G, which consist of long chains of regions with minimal connections between them, the power was considerably lower at 0.353 to 0.430 for cluster F and 0.421 to 0.484 for cluster G. All clusters had the highest power at either
Table 4.2: Basic power for the CUSCAN method with restricted flexible (R), circular (C), and elliptic (E) base.

                             Power on Day
Cluster  Method  $\alpha_1$  31     32     33     Delay (Days)  False Alarm Rate
B        R       0.10        0.740  0.948  0.990  0.295         0.050
B        R       0.15        0.746  0.959  0.994  0.285         0.049
B        R       0.20        0.691  0.932  0.987  0.356         0.049
B        C       -           0.784  0.962  0.995  0.245         0.049
B        E       -           0.827  0.975  0.993  0.185         0.049
C        R       0.10        0.787  0.958  0.991  0.239         0.050
C        R       0.15        0.768  0.952  0.990  0.263         0.049
C        R       0.20        0.721  0.924  0.983  0.327         0.049
C        C       -           0.887  0.992  1.000  0.121         0.049
C        E       -           0.909  0.991  1.000  0.100         0.050
E        R       0.10        0.725  0.925  0.980  0.316         0.051
E        R       0.15        0.740  0.950  0.991  0.295         0.050
E        R       0.20        0.735  0.952  0.991  0.298         0.050
E        C       -           0.802  0.970  0.977  0.223         0.051
E        E       -           0.851  0.984  0.996  0.158         0.051
F        R       0.10        0.413  0.689  0.809  0.638         0.051
F        R       0.15        0.430  0.698  0.818  0.621         0.050
F        R       0.20        0.353  0.628  0.764  0.716         0.049
F        C       -           0.698  0.905  0.970  0.347         0.049
F        E       -           0.738  0.936  0.978  0.288         0.049
G        R       0.10        0.474  0.734  0.863  0.600         0.052
G        R       0.15        0.484  0.744  0.871  0.590         0.051
G        R       0.20        0.421  0.703  0.837  0.657         0.051
G        C       -           0.452  0.711  0.842  0.619         0.050
G        E       -           0.557  0.796  0.916  0.523         0.049
Table 4.3: Average precision for the CUSCAN method with restricted flexible (R), circular (C), and elliptic (E) base.

                             Precision on Day
Cluster  Method  $\alpha_1$  31     32     33
B        R       0.10        0.881  0.890  0.894
B        R       0.15        0.884  0.858  0.851
B        R       0.20        0.803  0.809  0.803
B        C       -           0.679  0.723  0.753
B        E       -           0.672  0.719  0.749
C        R       0.10        0.928  0.939  0.950
C        R       0.15        0.909  0.925  0.933
C        R       0.20        0.863  0.876  0.875
C        C       -           0.735  0.804  0.830
C        E       -           0.733  0.803  0.828
E        R       0.10        0.853  0.878  0.882
E        R       0.15        0.808  0.821  0.824
E        R       0.20        0.779  0.793  0.793
E        C       -           0.548  0.566  0.574
E        E       -           0.547  0.561  0.572
F        R       0.10        0.851  0.884  0.894
F        R       0.15        0.819  0.863  0.873
F        R       0.20        0.754  0.798  0.802
F        C       -           0.639  0.664  0.663
F        E       -           0.672  0.719  0.749
G        R       0.10        0.760  0.807  0.923
G        R       0.15        0.710  0.758  0.754
G        R       0.20        0.633  0.676  0.675
G        C       -           0.547  0.556  0.573
G        E       -           0.672  0.719  0.749
Table 4.4: Average recall for the CUSCAN method with restricted flexible (R), circular (C), and elliptic (E) base.

                             Recall on Day
Cluster  Method  $\alpha_1$  31     32     33
B        R       0.10        0.407  0.392  0.373
B        R       0.15        0.478  0.459  0.441
B        R       0.20        0.539  0.507  0.481
B        C       -           0.606  0.638  0.665
B        E       -           0.605  0.634  0.664
C        R       0.10        0.560  0.597  0.586
C        R       0.15        0.637  0.621  0.605
C        R       0.20        0.705  0.682  0.662
C        C       -           0.758  0.755  0.761
C        E       -           0.754  0.756  0.760
E        R       0.10        0.321  0.294  0.296
E        R       0.15        0.371  0.346  0.350
E        R       0.20        0.446  0.414  0.415
E        C       -           0.643  0.690  0.722
E        E       -           0.632  0.680  0.714
F        R       0.10        0.243  0.238  0.224
F        R       0.15        0.285  0.280  0.267
F        R       0.20        0.327  0.311  0.296
F        C       -           0.622  0.656  0.693
F        E       -           0.601  0.648  0.682
G        R       0.10        0.206  0.195  0.188
G        R       0.15        0.243  0.222  0.214
G        R       0.20        0.312  0.267  0.261
G        C       -           0.388  0.389  0.385
G        E       -           0.350  0.369  0.359
$\alpha_1 = 0.10$ or $\alpha_1 = 0.15$, with power dropping off at $\alpha_1 = 0.20$. In contrast, the circular and elliptic CUSCAN boast higher overall power in four out of five clusters, especially on the minimally connected cluster F. However, on the large and spread-out cluster G, the circular CUSCAN fared no better than the restricted flexible CUSCAN, and the elliptic only slightly better. Despite low starting power in some areas, the power increases rapidly over time for all three methods in all cases.
The tradeoff in terms of power comes in the form of higher precision for the restricted flexible CUSCAN. In Table 4.3, we see that the restricted flexible method has high precision across the board, with the highest precision in the smallest cluster C and lowest in the largest cluster G. As expected, the precision increases as $\alpha_1$ decreases, since regions included in the identified cluster are more likely to be true hotspots. The circular and elliptic methods have noticeably lower precision for all clusters, as they require that non-hotspot regions be included in order to create a circle or ellipse of regions when clusters are not already approximately circular or approximately elliptic. Like basic power, the precision tends to increase over time as more time periods of data are accumulated.
Finally, in Table 4.4, we notice that the restricted flexible CUSCAN tends to have low spatial recall, especially in clusters with more regions. The recall improves as $\alpha_1$ increases, but does not surpass the recall of the circular and elliptic CUSCAN, which are uniformly more likely to contain a higher proportion of the true at-risk population. For the circular and elliptic CUSCAN, the recall either remains approximately flat or increases slightly as the outbreak continues, whereas with the restricted flexible CUSCAN, the recall often decreases slightly with time.
4.5 Discussion
In this chapter, we proposed an extension of the CUSCAN to allow the use of spatial scan methods with time-variable sets of potential clusters. Previously, the CUSCAN method required that the spatial scan method at its base have a fixed set of potential clusters based only on the study region and population, so that the same potential cluster
could be followed over time. Here, we described a modified approach, where instead of following clusters with the same sets of regions over time, we instead link overlapping clusters together to compute the CUSUM streams. This allows us to use methods without fixed sets of windows, here demonstrated with the restricted flexibly shaped scan method, to detect clusters of arbitrary shape.
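To make the idea of linking overlapping clusters concrete, the sketch below carries a CUSUM value from one time period to the next by matching each current potential cluster to the previous cluster with which it shares the most regions. The matching rule, the function names, and the standard recursion max(0, previous + statistic - k) are assumptions made for illustration; the linking rule actually used is the one described in Section 4.3.

```python
def update_streams(prev_streams, current, k):
    """prev_streams: list of (frozenset_of_regions, cusum_value) from time t-1.
    current: list of (frozenset_of_regions, scan_statistic) at time t.
    Returns the list of (regions, cusum_value) at time t."""
    new_streams = []
    for regions, stat in current:
        # Hypothetical rule: inherit the CUSUM of the previous cluster that
        # shares the most regions; clusters with no overlap start from zero.
        best = max(prev_streams, key=lambda s: len(s[0] & regions), default=None)
        carried = best[1] if best and best[0] & regions else 0.0
        new_streams.append((regions, max(0.0, carried + stat - k)))
    return new_streams

def cuscan_statistic(streams):
    """Maximum CUSUM value and the regions associated with it."""
    return max(((value, regions) for regions, value in streams),
               key=lambda pair: pair[0])

prev = [(frozenset({1, 2}), 3.0), (frozenset({7}), 5.0)]
cur = [(frozenset({2, 3}), 4.0), (frozenset({9}), 1.0)]
streams = update_streams(prev, cur, k=2.0)
print(cuscan_statistic(streams))
```

Here the cluster {2, 3} overlaps the earlier cluster {1, 2} and so continues its stream, while the cluster {9} has no predecessor and begins a new stream at zero.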
While the power of the restricted flexible CUSCAN was low relative to the circular or elliptic CUSCAN, it had considerably higher precision than either the circular or elliptic methods, as each region within the identified cluster was much more likely to be an actual hot spot. When the circular and elliptic methods are used to detect irregularly shaped clusters, many non-outbreak regions are by necessity included in the identified cluster in order to "fill out" the circle or ellipse of zones. The restricted flexible method, on the other hand, imposes no shape restrictions, and so does not need to add in extra zones. This results in higher precision and identified clusters of more correct shapes.
The downside to the restricted flexible method is that it requires connectivity in the subsets searched. Consider for example cluster G from the data demonstration in Section 4.4. This cluster consists of a long chain of regions connected in a line. If one region in the middle of this cluster produces a middle p-value high enough to be excluded from consideration, then the restricted flexible method will not be able to connect the two sides of the cluster, and will falter when trying to detect it. In addition to low power, spatial recall will also be low in these cases, as many regions in the true cluster will not be in the identified cluster, since without the connection no potential cluster contains all of the outbreak zones. Indeed, this is the primary reason why power and recall are lower in clusters F and G, where a single broken connection means that not all hotspot regions are able to appear in the same potential cluster. In effect, the weaknesses of the restricted flexible CUSCAN are due primarily to the weaknesses in the restricted flexible scan method at its base. However, even though the initial power to detect an outbreak is low, power increases considerably over multiple time periods, and by day three of the outbreak
the power of the CUSCAN is high across all the irregularly shaped clusters.
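The effect of a single excluded region on a chain-shaped cluster can be illustrated with a small connectivity check. The seven-region chain below is hypothetical and stands in for a cluster such as G; the breadth-first search is a generic way to find connected pieces and is not taken from any particular scan implementation.

```python
from collections import deque

def connected_components(regions, adjacency):
    """Split `regions` into connected pieces using only edges between kept regions."""
    regions, seen, pieces = set(regions), set(), []
    for start in sorted(regions):
        if start in seen:
            continue
        piece, queue = set(), deque([start])
        seen.add(start)
        while queue:
            r = queue.popleft()
            piece.add(r)
            for nb in adjacency[r]:
                if nb in regions and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        pieces.append(piece)
    return pieces

# Hypothetical chain of 7 regions: 0-1-2-3-4-5-6.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}

print(connected_components(range(7), chain))            # one piece
print(connected_components([0, 1, 2, 4, 5, 6], chain))  # region 3 filtered out
```

Once region 3 is filtered out, no connected subset can span both halves of the chain, so the largest possible identified cluster covers at most three of the seven outbreak regions.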
One question brought up by the results of the data demonstration in Section 4.4 is why the recall for the restricted flexible CUSCAN appears to diminish over time. We believe this is an effect caused by a combination of the region filtering that occurs with the restricted flexible scan method and the process of matching clusters by overlap. Due to random variation, the middle p-value in a specific region will fluctuate over time, and the region may or may not be included in potential clusters in any given time period. Because of this, the clusters detected by the CUSCAN tend to feature primarily regions that were included in the cluster in all outbreak times. Since the probability that a region will be included in potential clusters in all time periods diminishes as the number of periods increases, the clusters that are selected tend to diminish in size over time. This is one of the primary weaknesses in the restricted flexible CUSCAN, as the effect of underestimating the size of a disease cluster is typically undesirable.
Another weakness posed by the use of the restricted flexible CUSCAN is that of computational limitations. For the data demonstration in Section 4.4, we limited the $\alpha_1$ threshold for region filtering to values of 0.10, 0.15, and 0.20. This is due primarily to the fact that the number of potential clusters, and thus the computation time required to calculate the scan statistics and the CUSCAN statistics, increases exponentially as additional regions are added for consideration. At $\alpha_1 = 0.25$ and above, the number of regions is high enough that benchmarking the process on data with a large number of regions or large numbers of data sets becomes infeasible. While there is additional computation time added to this from the CUSCAN as we compute the CUSUM statistics for each data set, since we deal primarily with non-overlapping sets of clusters, the number of CUSUM streams computed is often small and the additional computation time is minimal. The bulk of the computational complexity comes from the restricted flexible scan method and computing the initial scan statistics, which is a known weakness of the all-connected-subsets approach to determining potential clusters.
Due to the limitations of the restricted flexible CUSCAN, which are primarily the weaknesses of the restricted flexible scan method, a different dynamic scan method may be more appropriate if we wish to detect clusters of arbitrary shape. Minimum spanning tree methods, such as the dynamic minimum spanning tree [Assuncao et al., 2006] or constrained spanning tree [Costa et al., 2012], may offer higher power than the restricted flexible scan method, and those with early stopping criteria may additionally reduce the amount of computational time and complexity needed to compute the scan statistics. With these methods, clusters may still be linked together over time as described in Section 4.3, and so the CUSCAN statistic may be computed as easily as with the restricted flexible method.
A final consideration is that the restricted flexible CUSCAN, like the circular and elliptic CUSCAN methods described in Chapter III, makes use of the non-restarting form of the CUSUM to increase power of detection during extended outbreaks. As discussed in Chapter II, the non-restarting CUSUM often suffers from a high false alarm rate following the end of an outbreak. While the correction proposed in Chapter II may be applied to the CUSCAN, care must be taken when deciding how to simulate data under the alternative hypothesis, as the regions within the detected clusters are not guaranteed to include all of the hot spots in the actual data, nor is every region within the detected clusters guaranteed to be a hot spot at all.
CHAPTER V
CONCLUSION
In the preceding chapters, we explored many of the properties of CUSUM control charts as a prospective surveillance tool and proposed updated versions that demonstrate the potential for improved functionality. In Chapter II, we examined the false alarm problem inherent to the non-restarting CUSUM method and demonstrated that false alarms could be controlled at the desired level through the use of modified simulations for hypothesis testing. In Chapter III, we proposed a new means of incorporating spatial information into a CUSUM framework by computing the cumulative sum of scan statistics (CUSCAN).
The CUSCAN method with circular and elliptic windows was shown to have high power to detect emerging outbreaks as early as the first day they appear. In Chapter IV, we provided an extension of the CUSCAN method to allow the use of spatial scan methods with time-variable windows. By linking overlapping clusters through time, we are able to compute the CUSCAN statistic and identify disease clusters of arbitrary shape.
When the non-restarting CUSUM method is used to monitor a disease process, the statistic grows large in the presence of an outbreak lasting for multiple time periods. This allows the CUSUM to sound an alarm for each time period the process is out of control, but often results in many false alarms once the outbreak ends as the statistic takes time to return to in-control levels. We proposed a solution to this problem in the form of a modified approach to the p-value method of hypothesis testing. When using a p-value approach to hypothesis testing, false alarms occur when the elevated CUSUM statistic is compared to simulated CUSUM streams that have been in control since the start. We proposed changing the way the Monte Carlo simulations are performed
by simulating the outbreak during times it was identified, so that the simulated CUSUM streams likewise experience the outbreak and provide a more realistic comparison. We provided three examples for how to simulate Poisson counts under the alternative hypothesis that an outbreak is present: as random draws from a Poisson distribution with a user-specified mean, as random draws from a Poisson distribution with the mean estimated from observed counts during outbreak time periods, and by sampling with replacement from the observed outbreak counts (bootstrapping).
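The three simulation options listed above can be sketched as follows. This is an illustrative outline rather than the implementation used in Chapter II: the function names are hypothetical, the estimated mean is taken to be the sample mean of the observed outbreak counts (an assumption), and Poisson draws use Knuth's multiplication method so that only the Python standard library is needed.

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for modest lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_outbreak_count(method, rng, outbreak_counts=None, user_mean=None):
    """Simulate one count for a time period flagged as an outbreak."""
    if method == "user":        # Poisson draw with a user-specified mean
        return rpois(user_mean, rng)
    if method == "estimate":    # Poisson draw with mean estimated from outbreak counts
        return rpois(sum(outbreak_counts) / len(outbreak_counts), rng)
    if method == "bootstrap":   # resample the observed outbreak counts with replacement
        return rng.choice(outbreak_counts)
    raise ValueError(f"unknown method: {method}")

rng = random.Random(2019)
observed = [12, 15, 9]  # hypothetical counts from periods flagged as outbreak
for method in ("user", "estimate", "bootstrap"):
    print(method, simulate_outbreak_count(method, rng, observed, user_mean=10.0))
```

Each simulated count would then be fed into the simulated CUSUM stream for that period, so that the reference streams rise alongside the observed statistic during the outbreak.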
We demonstrated the effectiveness of the modified CUSUM method using both simulated and real data. In the simulation study, the corrected CUSUM tests were able to detect the new outbreak at the same speed as the uncorrected CUSUM while also controlling the post-outbreak false alarm rate at the desired level ($\alpha = 0.05$ for simulations) for the bootstrap and estimation methods (Table 2.6). When simulating at a user-specified level that underestimated the true size of the outbreak, the false alarm rate was substantially reduced, though not completely controlled at the chosen level of significance. In the data study, false alarm and false negative rates are more difficult to determine. Under the assumption that the outbreak occurred between weeks 408 and 410, the false negative rate for the corrected CUSUM test was no worse than the uncorrected in 7 out of 8 regions where the outbreak was detected, with detection time in the remaining state delayed by a single time period. Additionally, the uncorrected test continued to produce alarms for 7 to 12 weeks for smaller states and up to 116 weeks for larger states, while the bootstrap and estimation corrections resulted in a significant reduction in alarms and a total of 0 to 8 alarms in each state over the remaining 116 weeks (Table 2.7).
In exchange for adequately controlling the post-outbreak false alarm rate, the modified test typically has less power to continuously signal an alarm during outbreak time periods. Since the simulated CUSUM streams are meant to approximately match the level of the elevated statistic, lower levels of observed cases are less likely to register as extreme, and so false negatives occasionally occur. However, false negatives were primarily isolated and
sequential time periods of false negatives were rare. Consequently, some care should be taken in interpreting the hypothesis tests, as a string of negative results is more likely to signal the true end of an outbreak than an isolated negative result. Additionally, while the first alarm from the start of monitoring will occur at the same time in both the corrected and uncorrected CUSUM, subsequent alarms for the corrected CUSUM may be delayed as seen in the data study in Section 2.5. The state of North Rhine-Westphalia had experienced a prior outbreak of Salmonella Newport shortly before the countrywide outbreak at week 408, and so the simulated CUSUM streams remained elevated enough at the start of the new outbreak to result in an additional false negative. This effect is more pronounced if prior outbreaks were large relative to future outbreaks, or if several outbreaks occur in quick succession. One strategy that may help reduce issues of this type is to reset the simulated CUSUM streams to zero when the CUSUM has returned to zero following an outbreak, or to reset the simulated streams to the current CUSUM value after a prespecified number of negative test results in the case where the statistic may not return to zero between outbreaks. The issue of false negatives may also potentially be reduced by selecting a level of significance $\alpha$ conditional on the presence of an outbreak. While the post-outbreak false alarm rate is controlled at the specified level $\alpha$, the CUSUM correction has the effect of reducing the incidence of pre-outbreak false alarms, resulting in an effectively more conservative test. In the simulation study in Section 2.4, where a significance level of $\alpha = 0.05$ was used, the pre-outbreak false alarm rate for the corrected tests was approximately 0.02 on average, well below $\alpha$.
In essence, it may be possible to control the pre-outbreak false alarm rate at the desired 0.05 level while allowing $\alpha > 0.05$, resulting in increased power during outbreak time periods. The desired type I error rate (for example, 0.05) should be used following the end of an outbreak, as the post-outbreak alarm rate has only been shown to be controlled at this level.
One final consideration when using the modified non-restarting CUSUM test is that controlling the post-outbreak false alarm rate at the desired significance level $\alpha$ requires a
reasonable approximation of the true outbreak size. When simulating under the alternative hypothesis, we make the assumption that an outbreak has a stationary mean that does not vary over the course of the outbreak. If the mean of an outbreak is non-stationary, the estimation and bootstrap corrections will underestimate an increasing mean and overestimate a decreasing mean. As demonstrated in the simulation study with the correction using a user-specified mean $\lambda_1$, underestimating the size of the outbreak reduces the effectiveness of the CUSUM correction and false alarms are reduced rather than controlled. While not addressed in Chapter II, overestimating the size of the outbreak can also significantly impact the performance of the CUSUM. When the simulated CUSUM statistics are larger than needed, power to detect subsequent outbreak periods diminishes. While the proposed simulation methods were shown in Section 2.5 to adequately estimate the size of an outbreak in real data, the outbreak studied was relatively short in duration. More prolonged outbreaks or outbreaks in data with more frequent time updates (such as daily rather than weekly counts) are more likely to experience negative effects from incorrectly estimating the intensity of the outbreak. Additionally, when more than one outbreak occurs in a given study period, the outbreaks will not necessarily be of the same intensity. As such, in order to facilitate more accurate estimation of the outbreak mean, we suggest that the pool of outbreak time periods used in the estimation and bootstrap methods be reset along with the simulated CUSUM streams once an outbreak is determined to have ended.
In addition to being able to identify both the beginning and the end of an outbreak, a good surveillance method should be able to accurately determine the location of the outbreak when geographic information is available. In Chapter III and Chapter IV, we described new methods for incorporating spatial information into the non-restarting CUSUM using data from spatial scan methods. Spatial scan methods provide more evidence of local clusters of disease than simple aggregated counts, and so computing the cumulative sum of scan statistics (CUSCAN) allows us to determine both when and where
clusters of cases associated with disease outbreaks exist.
The CUSCAN method begins with the selection of an appropriate spatial scan method that will determine the type of clusters the method can detect. For example, the circular scan method will result in the detection of circular clusters, while the elliptic scan method can identify more elongated ellipse-shaped clusters. Once a scan method is chosen, the Poisson scan statistic is calculated for each potential cluster defined by that method and each time period of data, and the non-restarting CUSUM is used to monitor the scan statistics through time. When the chosen scan method uses a fixed set of potential clusters that does not vary over time, we can compute a separate CUSUM stream for each potential cluster. When the chosen scan method results in sets of potential clusters that differ from one time period to the next (such as with the restricted flexible scan statistic, where the set of potential clusters depends on observed counts), the non-restarting CUSUM is instead computed from the scan statistics of overlapping clusters. The final CUSCAN statistic is then determined by taking the maximum of the CUSUM statistics at each time step, with the most likely cluster being the set of regions associated with the maximum statistic.
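For the fixed-window case, the steps above can be sketched as follows. The sketch assumes the usual log likelihood ratio form of Kulldorff's Poisson scan statistic, expected counts that sum to the total number of cases in each period, and the standard one-sided CUSUM recursion; the function names and toy inputs are hypothetical, and the exact form of equation (3.4) should be taken from Chapter III.

```python
import math

def poisson_scan_stat(c, e, total):
    """Log likelihood ratio for a window with c observed and e expected cases
    out of `total`; zero unless the window shows excess risk."""
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside

def cuscan_fixed_windows(counts_by_time, expected, windows, k):
    """counts_by_time: one list of regional counts per time period.
    expected: expected count per region (assumed to sum to the period total).
    windows: fixed list of region-index tuples (the potential clusters).
    Returns one (CUSCAN statistic, most likely cluster) pair per period."""
    streams = [0.0] * len(windows)   # one non-restarting CUSUM stream per window
    results = []
    for counts in counts_by_time:
        total = sum(counts)
        for w, window in enumerate(windows):
            c = sum(counts[i] for i in window)
            e = sum(expected[i] for i in window)
            streams[w] = max(0.0, streams[w] + poisson_scan_stat(c, e, total) - k)
        best = max(range(len(windows)), key=streams.__getitem__)
        results.append((streams[best], windows[best]))
    return results

expected = [2.0, 2.0, 2.0, 2.0]
windows = [(0,), (0, 1), (2, 3)]
data = [[2, 2, 2, 2], [5, 1, 1, 1], [6, 1, 1, 0]]  # excess cases appear in region 0
for stat, cluster in cuscan_fixed_windows(data, expected, windows, k=0.5):
    print(round(stat, 3), cluster)
```

In this toy example the stream for window (0,) accumulates over the two periods with excess cases, so it is reported as the most likely cluster with a growing statistic.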
Since Poisson scan statistics do not follow any known distribution, a nonparametric approach was required to determine the value of the tuning parameter $k$ in the CUSUM equation (3.4). Since we use a p-value approach to hypothesis testing, where the statistics from the Monte Carlo simulations are computed in the same way as the observed statistic, the type I error rate will always be controlled at the significance level $\alpha$ regardless of the choice of $k$. The selection of $k$ is instead motivated by the desire to keep the CUSUM constrained near zero when no outbreak is present to conserve computational resources and allow for easier visual inspection of the evolution of the statistic. In addition, the selection of $k$ has an impact on the power of the CUSUM to detect emerging outbreaks, as too large a value of $k$ can prevent the statistic from growing even in the presence of an outbreak. We suggested choosing a value of $k$ to control the maximum sprint length under the null $H_0$, that is, the number of consecutive time periods the CUSUM statistic is expected to remain
positive before returning to zero when no outbreak is present. This process results in a value of $k$ large enough to prevent the CUSUM from growing without bound in the absence of an outbreak, while also small enough to allow the CUSUM to increase quickly when an outbreak occurs.
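One way to picture the sprint-length criterion is the sketch below, which picks the smallest candidate $k$ for which the longest run of positive CUSUM values across a set of null series does not exceed a target. The recursion max(0, previous + x - k), the selection rule, and all names are assumptions made for illustration; the procedure in Chapter III may differ in its details.

```python
def cusum(xs, k):
    """One-sided, non-restarting CUSUM path for a series of statistics xs."""
    s, path = 0.0, []
    for x in xs:
        s = max(0.0, s + x - k)
        path.append(s)
    return path

def max_sprint_length(path):
    """Longest run of consecutive periods with a positive CUSUM value."""
    longest = run = 0
    for s in path:
        run = run + 1 if s > 0 else 0
        longest = max(longest, run)
    return longest

def choose_k(null_series, candidates, target):
    """Smallest candidate k whose longest null sprint is at most `target`."""
    for k in sorted(candidates):
        if max(max_sprint_length(cusum(xs, k)) for xs in null_series) <= target:
            return k
    return None  # no candidate satisfies the target

# Hypothetical null scan statistics for one potential cluster.
null_series = [[1, 3, 0, 0, 5]]
print(cusum(null_series[0], k=2))                       # [0.0, 1.0, 0.0, 0.0, 3.0]
print(choose_k(null_series, [0.5, 2], target=1))        # 2
```

Searching the candidates in increasing order reflects the stated goal of keeping $k$ as small as the sprint-length constraint allows, so that the statistic can still rise quickly during an outbreak.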
We demonstrated the effectiveness of the CUSCAN method using collections of benchmark data. In Section 3.5, we applied the CUSCAN with the circular scan method to benchmark data provided by Kulldorff et al. [2004], which includes 17 distinct clusters of varying size with outbreaks simulated at both high and medium levels of intensity. Since this data set was provided as part of a demonstration of the effectiveness of the space-time scan method as a prospective surveillance tool, we were able to directly compare the performance of the circular CUSCAN and the space-time scan method. The results of this demonstration showed that the circular CUSCAN has considerable power to detect emerging clusters as early as the first day they appear, with first-day power ranging from 0.804 to 0.996 in the high excess risk models and 0.281 to 0.901 in the medium excess risk models (Table 3.2). Notably, while the circular CUSCAN performs about as well as the space-time scan method for detecting small clusters in the high-risk model, the CUSCAN outperforms the space-time scan method in detecting larger clusters in the high-risk models as well as clusters of all sizes in the medium-risk model. When the two methods were adjusted to detect smaller clusters by reducing the maximum cluster size from 50% of the total population to 5%, both saw increased power to detect small clusters in exchange for reduced power to detect large clusters. However, the power improvement was higher for the CUSCAN, causing it to surpass the space-time scan method in the detection of small clusters. The CUSCAN also retained higher power to detect large clusters than the space-time scan method (Table 3.3). Additionally, the CUSCAN demonstrated increased power to detect the elongated Rockaways and Hudson River clusters when elliptic instead of circular windows were used to identify potential clusters (Table 3.6).
In addition to basic power, we also assessed the spatial accuracy of the circular CUSCAN method by computing the population-based spatial precision (the proportion of the population in the identified cluster that was part of the true at-risk population) and population-based spatial recall (the proportion of the true at-risk population that was included in the identified cluster). The circular CUSCAN had high recall across the board, ranging from 0.796 to 0.999 for the high-risk model (Table 3.7), indicating that the CUSCAN was able to locate up to 99.9% of the population in outbreak regions on the first day that the outbreak appeared. In contrast, the precision of the circular CUSCAN was lower, particularly for clusters such as the elongated Hudson River cluster that are not approximately circular in shape. Low precision is often an effect of using circular windows to define the set of potential clusters, as a circular window that fully contains a non-circular cluster will by necessity contain additional neighboring regions that are not part of the true cluster. While precision and recall tended to be slightly lower for the medium-risk model than the high-risk model, the differences were not considerable, indicating that the spatial accuracy of the CUSCAN is not dependent on outbreak size.
As spatial precision and spatial recall were neither computed by Kulldorff et al. [2004] for the space-time scan method nor included in power analysis in SaTScan, we are not able to make direct comparisons in the performance of the two methods in this area. Since the space-time scan method uses the same set of circular windows to determine potential clusters as the circular CUSCAN, the space-time scan method is expected to have a similar pattern of performance to the circular CUSCAN, where high spatial recall is met with low spatial precision due to the addition of excess regions.
In Section 4.4, we used data sets based on benchmark data provided by Kulldorff et al. [2003] and Duczmal et al. [2006] to demonstrate how the CUSCAN method can be used with spatial scan methods with time-variable windows to detect irregularly shaped clusters, using the restricted flexible scan method as our example. The restricted flexible scan method starts by filtering out regions that are unlikely to be experiencing an outbreak by computing the middle p-value for each region based on observed case counts and excluding
regions where the middle p-value exceeds a predetermined threshold $\alpha_1$. The method then searches over all connected subsets of the remaining regions up to some maximum size, allowing for the detection of arbitrarily shaped clusters. Since the regions included in the search depend on the observed case data, the set of potential clusters changes from one time period to the next, and so the CUSCAN was computed from overlapping clusters as described in Section 4.3. The performance of the restricted flexible CUSCAN was then compared to the performance of the circular and elliptic CUSCAN methods used in the previous benchmark study.
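The region-filtering step can be illustrated with a short sketch. It assumes the usual upper-tail definition of the middle p-value for a Poisson count, P(Y > y) + 0.5 P(Y = y) with Y following a Poisson distribution with the region's expected count; the function names and example counts are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for Y ~ Poisson(mu), computed on the log scale (mu > 0)."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def middle_p_value(observed, expected):
    """P(Y > observed) + 0.5 * P(Y = observed) for Y ~ Poisson(expected)."""
    cdf = sum(poisson_pmf(y, expected) for y in range(observed + 1))
    return (1.0 - cdf) + 0.5 * poisson_pmf(observed, expected)

def keep_regions(observed, expected, alpha1):
    """Indices of regions whose middle p-value does not exceed alpha1."""
    return [i for i, (o, e) in enumerate(zip(observed, expected))
            if middle_p_value(o, e) <= alpha1]

# Three hypothetical regions, each with 2 expected cases.
print(keep_regions([10, 2, 0], [2.0, 2.0, 2.0], alpha1=0.10))  # [0]
```

Only the first region, with 10 cases against 2 expected, passes the filter; the connected-subset search would then be carried out over the retained regions alone.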
We found that, in general, the restricted flexible CUSCAN had lower power than the circular or elliptic CUSCAN methods to detect an emerging disease cluster on the first day of the outbreak (Table 4.2). However, the power increased rapidly over time, and by the third day of the outbreak the restricted flexible CUSCAN approximately matched the power of the circular and elliptic CUSCAN methods. Since the potential clusters from the restricted flexible scan method do not have fixed shape, the addition of non-outbreak regions to identified clusters is not often necessary, and so the restricted flexible CUSCAN recorded a high degree of spatial precision relative to the circular and elliptic CUSCAN (Table 4.3). In contrast, the circular and elliptic CUSCAN methods had much higher spatial recall than the restricted flexible CUSCAN (Table 4.4). The low recall for the restricted flexible CUSCAN method results from the way the potential clusters are determined in the restricted flexible method. If an outbreak region is mistakenly excluded from consideration, it can become difficult or impossible for the remaining outbreak regions to form a connected set. As such, there will be times when no potential cluster contains more than some small subset of outbreak regions, resulting in the identification of only a small part of the at-risk population and low spatial recall. The "disconnecting" of outbreak regions likewise has a detrimental effect on power as the smaller pieces may not provide sufficient evidence to trigger an alarm. This effect is most pronounced in clusters F and G, which consist of long strings of minimally connected regions that are at high risk of
90
accidental separation (Figure 4.1). While this effect can be mitigated by using a higher middle pvalue threshold op to remove fewer regions from consideration, it is often not computationally feasible to do so. As the number of considered regions increases, the number of connected subsets increases exponentially, resulting in a significant increase in the computation time required to compute the scan statistics.
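The growth in the number of connected subsets can be seen on a toy example. The sketch below is not the search used by the restricted flexible scan method; it is a brute-force enumeration on a hypothetical square grid of regions with rook adjacency, and the function names are invented for illustration.

```python
def connected_subsets(adjacency, max_size):
    """Enumerate all connected subsets of at most max_size regions.

    adjacency maps each region to the set of its neighbours. Subsets are
    grown one region at a time, so every subset visited is connected.
    """
    current = {frozenset([r]) for r in adjacency}
    found = set(current)
    for _ in range(max_size - 1):
        grown = set()
        for subset in current:
            neighbours = set().union(*(adjacency[r] for r in subset)) - subset
            for r in neighbours:
                grown.add(subset | {r})
        found |= grown
        current = grown
    return found

def grid_adjacency(n):
    """Rook adjacency for an n x n grid (a hypothetical study area)."""
    adj = {}
    for i in range(n):
        for j in range(n):
            nbrs = set()
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    nbrs.add((a, b))
            adj[(i, j)] = nbrs
    return adj

if __name__ == "__main__":
    for n in (2, 3, 4):
        total = len(connected_subsets(grid_adjacency(n), max_size=n * n))
        print(f"{n} x {n} grid: {total} connected subsets")
```

Running the loop for slightly larger grids quickly becomes slow, which is the practical reason for capping both the number of retained regions and the maximum cluster size.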
As demonstrated by the benchmark studies in Section 3.5 and Section 4.4, one of the most important aspects of the CUSCAN is the choice of scan method used to compute the scan statistics. The properties of the CUSCAN tend to follow the properties of the chosen scan method: the circular CUSCAN and the circular scan method are both powerful for outbreak detection but suffer from low precision, the elliptic CUSCAN and elliptic scan method both increase power to detect elongated clusters in exchange for increased computational complexity, and the restricted flexible CUSCAN and restricted flexible scan method both offer high precision but low recall and overall power. As such, care must be taken to choose a scan method whose qualities suit the study at hand. Fortunately, the CUSCAN is a highly flexible method and can be adapted for use with many different scan methods with a variety of properties. While the initial power to detect an outbreak differed for the three scan methods presented in this thesis (circular, elliptic, and restricted flexible), the use of the non-restarting CUSUM resulted in a rapid increase to equivalent power levels over the course of an extended outbreak, making the CUSCAN a powerful surveillance tool with the potential for broad application.
While many of the properties of the CUSCAN have been explored in this thesis, the properties of the CUSCAN using scan methods other than the circular, elliptic, and restricted flexible methods have not yet been studied. Due to the weaknesses inherent to the restricted flexible method, other scan methods for detecting irregularly shaped clusters are likely to be more appropriate for use with the CUSCAN. Additionally, while the use of the non-restarting CUSUM in the CUSCAN method provides high power during outbreak periods, applying the non-restarting CUSUM correction previously described in Chapter II to the CUSCAN is not trivial. While the CUSCAN is able to accurately determine the time periods in which an outbreak is present, the ability to distinguish between outbreak and non-outbreak regions is heavily affected by the scan method used. Since the effectiveness of the non-restarting CUSUM correction is dependent on the ability to accurately recreate the outbreak in simulated data, ensuring that the outbreak is simulated in the correct regions is essential. This makes it difficult if not impossible to implement the corrections that simulate new Poisson counts (i.e., simulating at a user-specified mean or a mean estimated from the data). It should be possible, however, to control the post-outbreak false alarm rate by applying a bootstrap correction to the CUSCAN. If new counts are created by drawing with replacement from case counts during outbreak times within the same region, then the simulated data sets will have outbreak-level counts in outbreak regions and non-outbreak-level counts in non-outbreak regions without the requirement that the true location of the outbreak be precisely known.
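The resampling step proposed above might look like the following minimal sketch. It covers only the within-region bootstrap draw, not a full CUSCAN correction, and the data layout (`counts[r][t]` for region r and time t) and the function name are assumptions made for illustration.

```python
import random

def bootstrap_outbreak_counts(counts, outbreak_times, n_sim, seed=None):
    """Resample outbreak-period counts separately within each region.

    counts[r][t] is the observed count in region r at time t, and
    outbreak_times lists the time indices flagged as outbreak periods.
    Returns sims[s][r]: one simulated count per region for each of the
    n_sim simulated data sets.
    """
    rng = random.Random(seed)
    # Each region draws only from its own counts at the flagged times.
    pools = [[region[t] for t in outbreak_times] for region in counts]
    return [[rng.choice(pool) for pool in pools] for _ in range(n_sim)]

if __name__ == "__main__":
    # Hypothetical data: region 0 is affected at times 2 and 3, region 1 is not.
    counts = [[1, 2, 9, 11, 2], [3, 2, 3, 4, 3]]
    for row in bootstrap_outbreak_counts(counts, [2, 3], n_sim=5, seed=1):
        print(row)
```

Because each region is resampled from its own history, the affected region reproduces elevated counts and the unaffected region reproduces ordinary counts without any region being labelled as part of the outbreak.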
REFERENCES
R. Assunção, M. Costa, A. Tavares, and S. Ferreira. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine, 25(5):723-742, 2006.
C. Bayer, H. Bernard, R. Prager, W. Rabsch, P. Hiller, B. Malorny, B. Pfefferkorn, C. Frank, A. De Jong, I. Friesema, et al. An outbreak of Salmonella Newport associated with mung bean sprouts in Germany and the Netherlands, October to November 2011. 2014.
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289-300, 1995.
S. Chatterjee and P. Qiu. Distribution-free cumulative sum control charts using bootstrap-based control limits. The Annals of Applied Statistics, 3(1):349-369, 2009.
M. Coory, S. Duckett, and K. Sketcher-Baker. Using control charts to monitor quality of hospital care with administrative data. International Journal for Quality in Health Care, 20(1):31-39, 2007.
M. A. Costa, R. M. Assunção, and M. Kulldorff. Constrained spanning tree algorithms for irregularly-shaped spatial clustering. Computational Statistics & Data Analysis, 56(6):1771-1783, 2012.
D. Das, K. Metzger, R. Heffernan, S. Balter, D. Weiss, and F. Mostashari. Monitoring over-the-counter medication sales for early detection of disease outbreaks - New York City. MMWR Morb Mortal Wkly Rep, 54(Suppl):41-46, 2005.
S. Dassanayake and J. P. French. An improved cumulative sum-based procedure for prospective disease surveillance for count data in multiple regions. Statistics in Medicine, 35(15):2593-2608, 2016.
F. X. Diebold. Elements of Forecasting. Thomson South-Western, 2007.
L. Duczmal, M. Kulldorff, and L. Huang. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics, 15(2):428-442, 2006.
S. Fasting and S. E. Gisvold. Statistical process control methods allow the analysis and improvement of anesthesia care. Canadian Journal of Anesthesia, 50(8):767-774, 2003.
M. Frisén. Evaluations of methods for statistical surveillance. Statistics in Medicine, 11(11):1489-1502, 1992.
A. Gandy and F. D.-H. Lau. Non-restarting cumulative sum charts and control of the false discovery rate. Biometrika, 100(1):261-268, 2012.
L. M. Hall and J. P. French. A modified CUSUM test to control post-outbreak false alarms. Statistics in Medicine, 2019.
D. M. Hawkins and D. H. Olwell. Cumulative sum charts and charting for quality improvement. Springer Science & Business Media, 1998.
M. Kulldorff. A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6):1481-1496, 1997.
M. Kulldorff. Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164(1):61-72, 2001.
M. Kulldorff. SaTScan version 9.6: software for the spatial and space-time scan statistics. http://www.satscan.org/, 2003.
M. Kulldorff and N. Nagarwalla. Spatial disease clusters: detection and inference. Statistics in Medicine, 14(8):799-810, 1995.
M. Kulldorff, T. Tango, and P. J. Park. Power comparisons for disease clustering tests. Computational Statistics & Data Analysis, 42(4):665-684, 2003.
M. Kulldorff, Z. Zhang, J. Hartman, R. Heffernan, L. Huang, and F. Mostashari. Benchmark data and power calculations for evaluating disease outbreak detection methods. Morbidity and Mortality Weekly Report, pages 144-151, 2004.
M. Kulldorff, L. Huang, L. Pickle, and L. Duczmal. An elliptic spatial scan statistic. Statistics in Medicine, 25(22):3929-3943, 2006.
J. M. Lucas. Counted data CUSUM's. Technometrics, 27(2):129-144, 1985.
J. M. Lucas and R. B. Crosier. Fast initial response for CUSUM quality-control schemes: give your CUSUM a head start. Technometrics, 24(3):199-205, 1982.
E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100-115, 1954.
G. P. Patil and C. Taillie. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 11(2):183-197, 2004.
R. F. Raubertas. An analysis of disease surveillance data that uses the geographic locations of the reporting units. Statistics in Medicine, 8(3):267-271, 1989.
S. Roberts. Control chart tests based on geometric moving averages. Technometrics, 1(3):239-250, 1959.
C. Robertson, T. A. Nelson, Y. C. MacNab, and A. B. Lawson. Review of methods for space-time disease surveillance. Spatial and Spatio-temporal Epidemiology, 1(2-3):105-116, 2010.

Full Text 
PROSPECTIVE DISEASE SURVEILLANCE WITH THE CUSUM AND SPATIAL SCAN METHODS
by
LAUREN M. HALL
B.A., Colorado State University, 2012
B.S., University of Colorado Denver, 2014
M.S., University of Colorado Denver, 2017
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Applied Mathematics Program
2019
This thesis for the Doctor of Philosophy degree by Lauren M. Hall has been approved for the Applied Mathematics Program by
Erin Austin, Chair
Joshua French, Advisor
Stephanie Santorico
Stephen Hartke
Peter Anthamatten
Date: May 18, 2019
Hall, Lauren M. (Ph.D., Applied Mathematics)
Prospective Disease Surveillance with the CUSUM and Spatial Scan Methods
Thesis directed by Associate Professor Joshua French
ABSTRACT
The cumulative sum (CUSUM) control chart is a method for detecting whether the mean of a time series process has shifted beyond some tolerance (i.e., is out-of-control). CUSUM control charts have been widely used for prospective surveillance due to their ability to quickly detect both large, sudden increases and small, persistent increases in reported cases of a disease associated with an outbreak. Originally developed in an industrial process control setting, the CUSUM statistic is typically reset to zero once a process is discovered to be out-of-control since the industrial process is then recalibrated to be in-control. In a disease surveillance setting, resetting the CUSUM statistic is unrealistic, and a non-restarting CUSUM chart is used instead. In practice, the non-restarting CUSUM provides more information, but suffers from a high false alarm rate following the end of an outbreak. In this thesis, we propose a modified hypothesis test for use with the non-restarting CUSUM when testing whether a process is out-of-control. By simulating statistics conditional on the presence of an out-of-control process in recent time periods, we are able to retain the CUSUM's power to detect an out-of-control process while controlling the post-out-of-control false alarm rate at the desired level. Additionally, we propose a new method for incorporating spatial information into CUSUM control charts by combining the non-restarting CUSUM with spatial scan methods. We compute the cumulative sum of scan statistics, or CUSCAN statistics, using a nonparametric form of the CUSUM in conjunction with the circular and elliptic scan methods. Using publicly available benchmark data, we demonstrate that the CUSCAN has high power to both quickly detect a new outbreak and identify the spatial regions where the outbreak exists. Lastly, we propose a method for extending the CUSCAN for dynamic scan methods that do not use a fixed set of windows for detecting spatial clusters, using the restricted flexible scan method as an example. By linking overlapping windows across time, we are able to compute the CUSCAN statistic for scan methods with time-varying cluster windows, which allows the CUSCAN to detect clusters of arbitrary shape.
TABLE OF CONTENTS
CHAPTER
I. INTRODUCTION
II. A MODIFIED CUSUM TEST TO CONTROL POST-OUTBREAK FALSE ALARMS
  2.1 Introduction
  2.2 The Non-Restarting CUSUM False Alarm Problem
  2.3 An Adaptive Hypothesis Test for the Non-Restarting CUSUM
    2.3.1 A Modified NR-CUSUM Test
    2.3.2 Simulating Under the Alternative Hypothesis
  2.4 Data Demonstration: Conditional Hypothesis Testing on Simulated Data
    2.4.1 Results using known $\lambda_a$
    2.4.2 Results using $\lambda_1$
    2.4.3 Results using $\hat{\lambda}_m$
    2.4.4 Simulation with Bootstrap
    2.4.5 Summary of Simulation Results
  2.5 Demonstration: Salmonella data
  2.6 Discussion
III. CUSCAN: DETECTING EMERGING DISEASE CLUSTERS WITH THE CUMULATIVE SUM OF SCAN STATISTICS
  3.1 Introduction
  3.2 Review of Methods
    3.2.1 Spatial Scan Methods
    3.2.2 The Non-Restarting CUSUM
  3.3 Proposed Methodology
    3.3.1 The CUSCAN Method
    3.3.2 Selection of $k$
  3.4 Demonstration: Simulated Data based on New York Leukemia Data
  3.5 Demonstration and Power Assessment: New York City Benchmark Data
    3.5.1 Data Description
    3.5.2 Analysis Design
    3.5.3 Results
  3.6 Discussion
IV. AN EXTENSION OF THE CUSCAN FOR DYNAMIC SCANNING METHODS: THE RESTRICTED FLEXIBLE SCAN STATISTIC
  4.1 Introduction
  4.2 Review of Methods
    4.2.1 The CUSCAN Method
    4.2.2 The Restricted Flexible Scan Method
  4.3 CUSCAN with the Restricted Flexible Scan Method
  4.4 Demonstration: Simulated Data based on Northeast Benchmark Data
  4.5 Discussion
V. CONCLUSION
REFERENCES
CHAPTER I
INTRODUCTION
Disease surveillance is a key tool in public health applications. By monitoring the occurrence and spread of disease, we gain information that allows us to better react to potential threats to public safety. One of the most important features of surveillance is early detection of new outbreaks, as the sooner a health event is identified, the more effective any available intervention will be. Prospective and syndromic surveillance methods assess data in real time as updates become available and search for evidence of any unusual counts or distributions of cases that could indicate a new or ongoing health crisis. In prospective disease surveillance, the data streams monitored are primarily counts of confirmed diagnoses of a disease of interest, whereas syndromic surveillance seeks patterns in data on related variables, such as medication sales or reports of flu-like symptoms that precede formal diagnosis and may provide earlier detection of an emerging outbreak [Das et al., 2005].
In addition to rapid detection of a new disease outbreak, prospective surveillance methods should also be able to accurately identify the beginning and end of an outbreak, i.e., to accurately distinguish outbreak from non-outbreak time periods [Frisén, 1992]. When counts are aggregated within physical regions, such as cities, counties, or census tracts, it is also important that the location of the outbreak be accurately determined. In essence, the performance of prospective surveillance methods is usually measured by the time required to detect a new outbreak, sensitivity to determine the times/locations an outbreak is present, and specificity to determine the times/locations with no outbreak.
When testing for the presence of an outbreak in surveillance data, statistical methods often work by comparing observed case counts to a baseline level that is expected in the absence of an outbreak. If observed counts are significantly higher than expected, then we say there is evidence of an outbreak of disease. Different techniques exist to estimate the baseline expected counts depending on the type and quantity of data available. When covariate information such as population demographics exists, regression methods offer a flexible set of tools for computing expected counts. When case counts are indexed over time, time series regression such as autoregressive integrated moving average (ARIMA) models can be used [Diebold, 2007]. When counts are indexed over space, spatial regression techniques such as simultaneous autoregressive (SAR) or conditional autoregressive (CAR) models can be used [Waller and Gotway, 2004]. These methods can be combined for data that is indexed over both time and space, along with other types of regression methods including generalized linear models (GLMs) and Bayesian models. When covariates are available, regression methods can produce informed and more accurate expected counts, improving the ability to identify abnormal counts of disease. Further discussion of these and other regression methods can be found in surveillance method review papers such as Unkel et al. [2012], Robertson et al. [2010], or Tsui et al. [2008].
When population demographics or other covariates are unavailable, other techniques may be used that are designed to work with more sparse information. Another popular approach for prospective surveillance is statistical process control, where control charts are used to monitor the mean of a process over time. Originally developed to monitor industrial processes, control charts are used with time series data to detect the point in time when the mean of a data-generating process changes. In a public health setting, we may think of the spread of disease as a process we wish to monitor. When no outbreak is present, we expect case counts to follow some known distribution. When an outbreak occurs, the mean of the distribution shifts, and the number of reported cases increases. When no outbreak is present, the process is considered in-control, and when an outbreak occurs and the mean shifts, the process is declared to be out-of-control and the control chart will signal an alarm. Since the spread of disease can be simply modeled as a data-generating process, statistical process control methods are a natural choice for monitoring disease cases over time, and many have been adapted for use in a surveillance context [Sonesson and Bock, 2003]. One method commonly used for surveillance is the cumulative sum (CUSUM) control chart, which is currently in use in surveillance systems like the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) and the Centers for Disease Control and Prevention's BioSense system [Tsui et al., 2008]. The CUSUM is designed to detect shifts in the mean of a process by accumulating deviations from expected levels over time. This allows for the detection of both large changes and smaller but sustained changes, making it a powerful tool for identifying outbreaks of varying size. While the CUSUM was originally developed for use with normally distributed data [Page, 1954], the Poisson CUSUM [Lucas, 1985] is the version more commonly used due to the discrete count nature of the data. Other common types of control charts are exponentially weighted moving average (EWMA) charts [Roberts, 1959] and Shewhart charts [Shewhart, 1931], the latter of which have been used in many public health settings, as they can be used to show a proportion of incidents for a fixed period of time [Fasting and Gisvold, 2003, Coory et al., 2007]. Due to their strength and broad applications for use, this thesis will focus primarily on the properties of CUSUM control charts. While demographics or other covariates may be used when estimating the baseline distribution of a disease process, the CUSUM does not require that such data be available and can easily be used to full effect when only the total case counts are available.
The Poisson CUSUM has several user-chosen parameters that directly affect the timeliness and sensitivity of detecting an outbreak. The mean of the disease process in the absence of an outbreak (called the in-control mean) needs to be known or estimated from baseline data, and a potential size for the shifted mean needs to be provided. The value given for the shifted mean is typically the smallest shifted mean that we want the chart to be able to detect. If the value provided is too large and overestimates the size of a future outbreak, it may take much longer for the outbreak to be detected, causing false negatives and low sensitivity. If the value provided is instead too small, normal variation in the in-control process may seem more significant than it is and result in additional false positives and low specificity. Additionally, an important decision to be made when using the CUSUM method is whether to restart the control chart following an alarm. In its traditional use in industrial process control, resetting the chart is standard. Once an industrial process has been determined to be out-of-control, the process is typically shut down and restarted back in an in-control state. Since a disease process cannot be forcibly returned to an in-control state, resetting the chart in a surveillance setting is less appropriate. Gandy and Lau [2012] described a version of the CUSUM that does not reset following an alarm, and this non-restarting CUSUM is a better choice for use with surveillance data. The restarting CUSUM prioritizes type I error control and reduces the total number of alarms, but if the process remains out-of-control following the reset, it may take several time periods before another alarm is sounded, resulting in a high type II error rate. In contrast, the non-restarting CUSUM retains the ability to detect a continuous set of time periods with abnormal disease activity, and thus has higher overall power during an outbreak. In return, the non-restarting CUSUM is known to suffer from high false alarm rates following the end of an outbreak, as it may take several time periods to register as back in-control [Gandy and Lau, 2012, Dassanayake and French, 2016].
While the CUSUM has the ability to detect small shifts in a mean, it may take several periods of accumulated data for an outbreak to be recognized. For example, an outbreak that results in a total 5% increase in case counts may be difficult to identify without several days to accumulate evidence. However, if that 5% increase in total cases corresponds to a 20% local increase in cases in a smaller subset of the data, then incorporating that information can increase the chances of early detection. As such, we want to make use of spatial information when it is available. As computing technology improves, surveillance methods using spatial data are becoming more commonplace. For this reason, CUSUM control charts that utilize spatial information are preferred. A prime example of a method that incorporates spatial information into a CUSUM framework is the nearest neighbor CUSUM [Raubertas, 1989], which groups regions with their nearest neighbors and computes a CUSUM of the combined counts. This allows for the isolation of a set of regions affected by an outbreak, making them easier to detect. The previously mentioned Dassanayake and French [2016] also uses a nearest neighbor approach to aggregate case counts for the non-restarting CUSUM. Sonesson [2007] proposed an extension of the nearest neighbor CUSUM called the circular CUSUM, which aggregates counts across all regions within a set of circular windows of various size. These circular windows are centered at each region in the study area, and increase in size until the windows contain a prespecified amount of the total population. Once the windows are fitted, the case counts in the regions in each window are aggregated, the nearest neighbor CUSUM is computed, and the largest statistic is used to determine if and where an outbreak is present.
The circular windows described by the circular CUSUM are derived from the Poisson spatial scan method developed by Kulldorff [1997], which is used to detect clusters in spatial data. In the spatial scan method, the same circular windows are used to aggregate counts from across multiple regions. The aggregate disease incidence rate inside each window is compared with the rate in the regions outside the window using a likelihood ratio statistic, with the maximum over all windows used to test for the presence of a disease cluster. While originally developed for use on retrospective data, Kulldorff [2001] extended the circular scan method to include windows spanning multiple time periods, allowing it to be used as a prospective tool with time series data. Many other spatial scan methods exist, such as the elliptic scan method [Kulldorff et al., 2006], which uses ellipses rather than circles to aggregate regions, as well as methods that do not use fixed shape windows, such as the upper level set scanning method [Patil and Taillie, 2004] and the dynamic minimum spanning tree method [Assunção et al., 2006]. These methods offer more information on the distribution of disease counts across space than simply aggregating case counts, but most are designed specifically for retrospective analysis and not for prospective surveillance.
In this thesis, we aim to create an improved version of the CUSUM method that incorporates a higher degree of spatial information than previous methods such as the nearest neighbors CUSUM. Given the strengths of the CUSUM as a prospective surveillance tool and the power of spatial scan methods to identify local disease clusters, we propose a combination of the two methods. By taking the cumulative sum of scan statistics over time, we can use the strengths of each to be able to quickly and accurately detect a local increase in case counts corresponding to the beginning of a new outbreak.
The structure of this thesis is as follows: in Chapter II, we examine the properties of the non-restarting CUSUM that make it difficult to use in practice, and propose a modified hypothesis test to control post-outbreak false alarms while retaining power during outbreaks. With its primary weakness mitigated, the non-restarting CUSUM is the best choice for use with our new surveillance method. In Chapter III, we describe in detail the proposed methodology for computing the cumulative sum of scan statistics. This method, which we call the CUSCAN, uses a nonparametric form of the non-restarting CUSUM in conjunction with spatial scan methods to identify both time periods and spatial locations where an outbreak is present. Using simulated benchmark data and the circular and elliptic scan methods, we demonstrate that the CUSCAN has considerable power to rapidly detect both large and small disease clusters. In Chapter IV, we provide an extension to the CUSCAN that allows for the use of other scan methods, specifically those that do not use windows of fixed shape. This allows the CUSCAN to detect disease clusters of arbitrary shape, creating a modular surveillance method with the capacity to be used in a wide variety of applications. Finally, in Chapter V, we summarize the findings of this research and discuss future directions of work.
CHAPTER II
A MODIFIED CUSUM TEST TO CONTROL POST-OUTBREAK FALSE ALARMS
Acknowledgements
The content of this chapter was accepted for publication in Statistics in Medicine in 2019. In accordance with the copyright agreement, the original submitted version of the manuscript appears in this chapter. Other than formatting, no changes have been made. The final accepted version, coauthored by Joshua French, may be found at https://doi.org/10.1002/sim.8088.
2.1 Introduction
Disease surveillance is a key tool in public health applications. By monitoring the occurrence and spread of disease, we gain information that allows us to better react to potential threats to public safety. One of the most important features of surveillance is early detection of new outbreaks, as the sooner a health event is identified, the more effective any available intervention will be. In addition to timeliness (the speed at which a surveillance method can detect the presence of an outbreak), a method should also be able to accurately separate outbreak time periods from non-outbreak periods [Frisén, 1992]. In other words, a good surveillance method should be able to sound an alarm during time periods when an outbreak is present (high sensitivity/power in detecting an ongoing outbreak) and sound no alarm during periods where there is no outbreak (high specificity).
While timeliness, sensitivity, and specificity are all important in creating a useful surveillance tool, no method will perform perfectly in all settings. Researchers often have to choose which features to prioritize depending on their specific needs and the purpose of their surveillance. A method that prioritizes quick detection of an outbreak and high sensitivity may experience a higher rate of false alarms during non-outbreak periods. Likewise, a method that prioritizes controlling false alarms may have lower power to detect true outbreaks, due to the natural trade-off between power and type I error rate.
One commonly used surveillance method where such prioritization decisions need to be made is the cumulative sum (CUSUM) control chart. Originally developed for industrial process control by Page [1954], the CUSUM control chart is designed to detect persistent shifts in the mean of a process by aggregating deviations from the mean over time. The process is deemed "out-of-control" when a persistent shift is detected. Lucas [1985] extended the CUSUM control chart to Poisson count data. The Poisson CUSUM has recently been used in many prospective surveillance applications [Woodall, 2006], including surveillance systems such as BioSense and the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) [Tsui et al., 2008]. In the disease surveillance context, the counts are the disease incidence counts at each time, and a process is declared out-of-control when there is an outbreak of disease beyond the standard incidence behavior.
The CUSUM method has several user-chosen parameters that directly affect the timeliness and sensitivity of detecting an outbreak. Additionally, an important decision to be made when using the CUSUM method is whether to restart the control chart when an alarm is sounded. The restarting CUSUM, the traditional form of the chart, prioritizes type I error control and necessarily lacks the ability to identify all active outbreak time periods, as the chart is reset immediately after an outbreak is identified. In contrast, the non-restarting CUSUM retains information about the length of an ongoing outbreak by having the ability to detect a continuous set of time periods with abnormal disease activity, and thus has higher overall power during an outbreak. In return, the non-restarting CUSUM is known to suffer from high false alarm rates following the end of an outbreak [Gandy and Lau, 2012, Dassanayake and French, 2016].
In this paper, we take a closer look at the false alarm problem inherent to the non-restarting CUSUM. We propose a modification to the non-restarting CUSUM that will allow the method to retain its ability to monitor a continuous outbreak, while reducing the number of post-outbreak false alarms. In Section 2.2, we describe the non-restarting CUSUM and explain the origin of the post-outbreak false alarms. In Section 2.3, we detail the proposed modification to the CUSUM. We then demonstrate the use and effects of this modification, first in Section 2.4 with simulated data, and again in Section 2.5, by applying the modified CUSUM to the detection of a known outbreak of Salmonella Newport in Germany in 2011 [Bayer et al., 2014]. Lastly, we provide further discussion in Section 2.6.
2.2 The Non-Restarting CUSUM False Alarm Problem
The aim of the CUSUM method is to assess whether the mean of a time series process $\{Y_t; t = 1, 2, \ldots\}$ has a persistent mean shift over time. Most commonly, at each time step $t$, the CUSUM method decides between $H_0: \lambda = \lambda_0$ vs $H_a: \lambda > \lambda_0$, where $\lambda$ is the stationary mean of the process. We define the CUSUM statistic for Poisson count data at time $t$, $C_t$, by the following recursive formula [Lucas, 1985, Hawkins and Olwell, 1998]:
$$C_t = \max\{0, C_{t-1} + Y_t - k\}, \qquad (2.1)$$
where $C_0 = 0$, $Y_t$ is the observed count at time $t$, and $k$ is a constant calculated to control a type I error rate. When $Y$ is Poisson distributed, we define the constant $k$ by
$$k = \frac{\lambda_1 - \lambda_0}{\ln \lambda_1 - \ln \lambda_0}, \qquad (2.2)$$
where $\lambda_0$ is the in-control mean and $\lambda_1$ is typically the smallest out-of-control mean we want to detect. Based on these values, we determine the critical value $h$ that controls the error criterion at the desired level. At each time step, the CUSUM statistic $C_t$ is compared to $h$, and if $C_t > h$, we declare the process to be out-of-control.
In an industrial process control setting, when a process is declared out-of-control, the data-generating process is shut down and can be restarted in an in-control state. For this reason, the CUSUM is traditionally reset after an outbreak, either to zero [Page, 1954] or to another value such as $h/2$ [Lucas and Crosier, 1982]. In a public health setting, we cannot repair and restart our disease process, so we should not reset the CUSUM following an alarm. This is known as a non-restarting CUSUM (NR-CUSUM) control chart. By allowing the CUSUM to continue monitoring the process, we continue to receive signals as long as the process remains out-of-control, providing valuable information about the timeline of the detected outbreak.
One of the assumptions we make in disease surveillance is that outbreaks of disease represent a transient shift in our process, and the process will self-correct over time and return to an in-control state. For this reason, when monitoring a data stream using a CUSUM process, we often prefer to use a NR-CUSUM, where the monitoring process is not reset following the detection of an outbreak [Gandy and Lau, 2012]. By allowing the process to continue, we can gain information about the length and severity of an outbreak, and monitor a process for evidence that it has returned to its in-control state.
Determining the time period when a process returns to its in-control state can be difficult if the recorded deviation was significantly high, either from a long out-of-control period or the presence of a particularly large outbreak. These situations cause the CUSUM statistic to rise rapidly, and it may take many time periods after the end of an outbreak for the CUSUM statistic to return to previous levels.
We demonstrate the CUSUM post-outbreak false alarm problem using a simulated example. Consider a set of count data observed at 100 consecutive time steps. We assume the responses are independent random variables having a Poisson distribution. The first 30 time periods are in-control with mean $\lambda_0 = 4$, the next 40 periods are out-of-control with mean $\lambda_1 = 6$, and the last 30 time periods return to previous in-control levels. Figure 2.1 shows the time series of the CUSUM statistic over the 100 times. We see that the CUSUM statistic remains elevated for several time periods after the process returns to an in-control state. During this time, the process will continue to signal an outbreak at each time period it remains above the significance threshold $h$, resulting in a large number of post-outbreak false alarms. A recent example of this problem can be found in Dassanayake and French [2016], where a non-restarting CUSUM framework was used. Gandy and Lau [2012]
Figure 2.1: An example of a non-restarting CUSUM displaying elevated levels post-outbreak.
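A series like the one in Figure 2.1 can be approximated with the sketch below. It is only an illustration: the means 4 and 6 and the 30/40/30 split follow the simulated example in the text, the reference value uses the standard Poisson CUSUM form k = (lambda1 - lambda0) / (ln lambda1 - ln lambda0), and the helper names, the seed, and the small hand-written Poisson sampler are assumptions. No threshold h is set because the text does not state one.

```python
import math
import random

def lucas_k(lam0, lam1):
    """Reference value k = (lam1 - lam0) / (ln lam1 - ln lam0)."""
    return (lam1 - lam0) / (math.log(lam1) - math.log(lam0))

def rpois(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def nr_cusum(counts, k):
    """Non-restarting CUSUM path: C_t = max(0, C_{t-1} + Y_t - k), C_0 = 0."""
    path, c = [], 0.0
    for y in counts:
        c = max(0.0, c + y - k)  # never reset after crossing a threshold
        path.append(c)
    return path

def simulate_example(seed=1):
    """30 in-control periods, 40 outbreak periods, then 30 in-control periods."""
    rng = random.Random(seed)
    means = [4.0] * 30 + [6.0] * 40 + [4.0] * 30
    counts = [rpois(m, rng) for m in means]
    return nr_cusum(counts, lucas_k(4.0, 6.0))

if __name__ == "__main__":
    path = simulate_example()
    print("CUSUM at the end of the outbreak (t = 70):", round(path[69], 2))
    print("CUSUM ten periods later (t = 80):", round(path[79], 2))
```

Because the statistic is never reset, whatever has accumulated by the end of the outbreak must be worked off by counts falling below k, which is why the chart can stay above a threshold for some time after the mean has returned to 4.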
proposeplacinganupperlimitontheCUSUMstatistic,limitinghowhighitcangrowand increasingthespeedatwhichitcandropbackbelowthethresholdfollowinganoutbreak. Whilethismethoddoesreducethefalsealarmrate,theratestillremainshighrelativetoa restartingCUSUM,andlimitingtheCUSUMstatisticgrowthcanresultinalossof informationonthesizeanddurationofanoutbreak. 2.3AnAdaptiveHypothesisTestfortheNonRestartingCUSUM Inthissection,weproposeasolutiontothepostoutbreakfalsealarmproblemthat controlsthefalsealarmratewithoutsacricinginformationfromtheCUSUM.By accountingforthepresenceofanoutbreakinrecenttimeperiods,wecanadjusttheway weperformourhypothesisteststobetterdetecttheendofanoutbreakandthereturnto anincontrolstate. Traditionally,theCUSUMmethodusesathreshold h chosentocontroltheaveragerun lengthbeforeafalsealarm.WhentheCUSUMstatisticcrossesthisthreshold,analarmis soundedandweconcludethatanoutbreakhasoccurred.Wechooseinsteadtousea pvalueapproachtohypothesistesting.Ouradaptivetestwillhaveanonstationarynull distributionovertime,sotherecannotbeapredeterminedthreshold h .Apvalue approachfacilitatesauniformtestingapproachandtellsushowstrongtheevidenceofan outbreakisateachtimestep. ToobtainapvaluefortheNRCUSUMtest,wesimulate N sim datastreamsassuming thenullhypothesisistrueandcalculatetheCUSUMstatisticsforeachstream.Let C i t denotetheCUSUMstatisticattime t forsimulateddatastream i .Thepvalueforthetest ateachtimestep t iscomputedastheproportionofstatistics,includingtheobserveddata [WallerandGotway,2004],whichareatleastaslargeastheobserved: p t = 1+ Nsim P i =1 I h C i t C t i 1+ N sim : .3 Theinclusionoftheobservedstatisticinthecalculationpreventsobservingapvalue 12
of zero in the case where none of the simulated statistics are as large as the observed. The primary cause of post-outbreak false alarms for the p-value method of hypothesis testing is the use of null hypothesis assumptions while simulating the data streams. An outbreak in recent time periods will cause the observed CUSUM statistic to remain elevated compared to the simulated streams with no prior outbreak, resulting in false alarms once the outbreak ends.

2.3.1 A Modified NRCUSUM Test

We propose a modified hypothesis test that allows us to simulate data conditional on the presence of an outbreak. Let $Y_t^{(i)}$ denote the simulated count of disease cases at time $t$ for data stream $i$. Beginning with our initial CUSUM value $C_0 = 0$, we perform the following steps at each time step $t$:

1. Simulate $N_{sim}$ counts under the null hypothesis, i.e., simulate $\{Y_t^{(i)}; i = 1, 2, \ldots, N_{sim}\}$, where $Y_t^{(i)} \sim \text{Poisson}(\lambda_0)$ for $i = 1, 2, \ldots, N_{sim}$.
2. Calculate $C_t^{(i)}$ for $i = 1, 2, \ldots, N_{sim}$.
3. Calculate the p-value, $p_t$, for the hypothesis test using the p-value formula of Section 2.3.
4. Determine whether there is an outbreak.
   a. If we fail to reject the null hypothesis and no outbreak is detected, we increment $t$ and return to 1.
   b. If the null hypothesis is rejected and we conclude that an outbreak is present in the current time step:
      i. Simulate $N_{sim}$ counts under the alternative hypothesis, i.e., re-simulate $\{Y_t^{(i)}; i = 1, 2, \ldots, N_{sim}\}$. One can either assume $Y_t^{(i)} \sim \text{Poisson}(\lambda_a)$ for $i = 1, 2, \ldots, N_{sim}$, or obtain a sample via bootstrap. We will discuss this step in more detail in what follows.
      ii. Recalculate $C_t^{(i)}$ for $i = 1, 2, \ldots, N_{sim}$ using the counts simulated in 4.b.i.
      iii. Increment $t$ and return to 1.

This process results in $N_{sim}$ simulations over $t$ time periods, with elevated rates simulated for time periods where we have evidence of an outbreak. This allows our simulated CUSUM statistics to more closely follow the observed statistics, allowing for more specificity to detect a return to the in-control state and thus fewer post-outbreak false alarms. We note that there is an implicit assumption that the outbreak mean $\lambda_a$ is constant.

2.3.2 Simulating Under the Alternative Hypothesis

The biggest question raised by the proposed NRCUSUM modification is how to simulate data under the alternative hypothesis. We propose four methods for simulating data during outbreak periods:

1. Simulate using a known $\lambda_a$: if the mean of the disease outbreak process, $\lambda_a$, is known, we can simulate data as random draws from a Poisson distribution with mean $\lambda_a$. This is unrealistic for real data.
2. Simulate using chosen $\lambda_1$: when the true out-of-control mean is unknown, we can simulate Poisson counts using the $\lambda_1$ parameter we select to calculate the CUSUM constant $k$, i.e., setting $\lambda_a = \lambda_1$.
3. Average out-of-control counts: with this approach, we estimate $\lambda_a$ by the average of the counts over all time periods where an outbreak has been detected up to that time. Note that we re-estimate the outbreak mean each time a new outbreak time period is observed to include the new data from that time period. Denote this estimate as $\hat{\lambda}_m$.
4. Bootstrap observations from outbreak time periods: rather than simulate observations from a Poisson distribution, this method generates new counts by sampling with replacement the disease counts for times where an outbreak has been detected. The
pool of observations we sample from increases each time we identify a new outbreak time period.

2.4 Data Demonstration: Conditional Hypothesis Testing on Simulated Data

We will now demonstrate the validity of this approach using simulated data when the true outbreak mean and duration are known. We created 100 sets of simulated Poisson counts. Each data set contains observations for 125 time steps, with an outbreak simulated over 25 time periods from $t = 51$ through $t = 75$. We simulated the data under the null hypothesis using $\lambda_0 = 5$, with an outbreak level of $\lambda_a = 10$. For all tests, we utilize $\lambda_1 = 1.5\lambda_0 = 7.5$, resulting in

$$k = \frac{7.5 - 5}{\ln(7.5) - \ln(5)} \approx 6.1658.$$

For each of these data sets, we performed the analysis five times: once using the uncorrected NRCUSUM test and then the modified (corrected) NRCUSUM for each of the simulation approaches proposed in Section 2.3.2. A significance level of $\alpha = 0.05$ was used for all tests. We note that Gandy and Lau [2012] propose a method that lowers the false alarm rate of NRCUSUMs. However, their method controls the False Discovery Rate (FDR) [Benjamini and Hochberg, 1995], so the results are not comparable with our method, which controls the traditional type I error rate.

We randomly selected one of these data sets (set #27) to serve as a demonstration of the effects of the test. The time series of case counts is shown in Figure 2.2. We see a clear increase in the case counts, on average, between $t = 51$ and $t = 75$ during the simulated outbreak time period.

The average results of the uncorrected test are shown in Table 2.1. These results will be replicated in what follows for easier reading.
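The simulation design described in this section can be sketched in Python as follows. This is only an illustrative sketch: the function names, the seed, and the Knuth-style Poisson sampler (used to stay within the standard library) are assumptions, and only the design constants (100 data sets, 125 time steps, outbreak at $t = 51$ through $t = 75$, $\lambda_0 = 5$, $\lambda_a = 10$, $\lambda_1 = 7.5$) come from the text.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def cusum_reference_value(lam0, lam1):
    """Poisson CUSUM constant k = (lam1 - lam0) / (ln lam1 - ln lam0)."""
    return (lam1 - lam0) / (math.log(lam1) - math.log(lam0))


def simulate_data_set(rng, n_steps=125, start=51, end=75, lam0=5.0, lam_a=10.0):
    """One count series: mean lam_a for t = start..end (1-indexed), lam0 otherwise."""
    return [
        poisson_draw(lam_a if start <= t <= end else lam0, rng)
        for t in range(1, n_steps + 1)
    ]


rng = random.Random(2019)  # arbitrary seed, chosen only for reproducibility
data_sets = [simulate_data_set(rng) for _ in range(100)]
k = cusum_reference_value(5.0, 7.5)
print(round(k, 4))  # 6.1658
```

For large Poisson means the Knuth sampler becomes slow, so a library sampler would be a more practical choice outside of a small sketch like this one.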
Figure 2.2: Cases over time for simulated data set #27

Table 2.1: Average simulation results for uncorrected NRCUSUM

Pre-Outbreak        Outbreak      Post-Outbreak
False Alarm Rate    Alarm Rate    False Alarm Rate
0.053               0.958         0.990
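One plausible way to compute the three rates reported in Table 2.1 from a sequence of alarm indicators is sketched below. The function name and the exact definition of each rate (the fraction of time periods with an alarm before, during, and after the simulated outbreak window) are assumptions made for illustration; the text does not spell out the definitions.

```python
def alarm_rates(alarms, start=51, end=75):
    """Return (pre-outbreak, outbreak, post-outbreak) alarm fractions.

    `alarms` holds one boolean per time step (t = 1 is index 0); the outbreak
    is assumed to span t = start..end inclusive.
    """
    def rate(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    pre = alarms[: start - 1]
    during = alarms[start - 1 : end]
    post = alarms[end:]
    return rate(pre), rate(during), rate(post)


# Hypothetical alarm pattern: one early false alarm at t = 24, then alarms
# from t = 53 through the end of a 125-step series.
demo_alarms = [t == 24 or t >= 53 for t in range(1, 126)]
print(alarm_rates(demo_alarms))
```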
2.4.1 Results using known $\lambda_a$

We begin with the scenario when $\lambda_a$ is known. While unrealistic in practice, this simulation will allow us to establish a baseline for the performance of the modified hypothesis test under ideal conditions.

The results for the uncorrected and modified test using known $\lambda_a$ for data set #27 are shown in Figure 2.3. The points marked with an "x" in the plots indicate time periods where we rejected the null hypothesis and the process was declared to be out-of-control. The uncorrected CUSUM test has three pre-outbreak false alarms at $t = 24$, $t = 26$, and $t = 27$. The outbreak is detected at time $t = 53$ and this approach continues to sound alarms until the end of the simulated data. With the corrected hypothesis test used, the only pre-outbreak false alarm is at $t = 24$, and the process is declared back in-control at $t = 78$, with one additional false alarm at $t = 93$. The trade-off, then, comes in terms of our power during an outbreak. With no hypothesis test correction, an alarm is sounded during each time period where the process was out-of-control following initial detection of the outbreak ($t = 53$ to $t = 75$). After applying the correction, we experience additional false negatives at $t = 54$ and $t = 61$.

The averaged results from running the tests on all 100 simulated data sets are summarized in Table 2.2. While the number of outbreak time periods correctly identified decreased with the corrected hypothesis test, we see dramatic improvement in the post-outbreak false alarm rate. Notably, using the corrected hypothesis test, the post-outbreak false alarm rate is now controlled at the $\alpha = 0.05$ level.
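As a rough illustration, the modified test of Section 2.3.1 with a known outbreak mean could be sketched in Python as follows. The function name, the default of 199 simulated streams, the rejection rule `p_t <= alpha`, the seed, and the Knuth-style Poisson sampler are all assumptions of this sketch rather than details given in the text.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def modified_nrcusum(y, lam0, lam_a, k, n_sim=199, alpha=0.05, seed=0):
    """Return (p_values, alarms) for the observed count series y."""
    rng = random.Random(seed)
    c_obs = 0.0                 # observed CUSUM, never reset (non-restarting)
    c_sim = [0.0] * n_sim       # simulated CUSUM streams carried across time
    p_values, alarms = [], []
    for y_t in y:
        c_obs = max(0.0, c_obs + y_t - k)
        # Steps 1-2: simulate counts under the null and update each stream.
        c_null = [max(0.0, c + poisson_draw(lam0, rng) - k) for c in c_sim]
        # Step 3: Monte Carlo p-value that includes the observed statistic.
        exceed = sum(1 for c in c_null if c >= c_obs)
        p_t = (1 + exceed) / (1 + n_sim)
        alarm = p_t <= alpha
        if alarm:
            # Step 4b: outbreak declared, so re-simulate this time period
            # under the alternative, starting from the previous stream values.
            c_sim = [max(0.0, c + poisson_draw(lam_a, rng) - k) for c in c_sim]
        else:
            # Step 4a: no outbreak, keep the null-simulated streams.
            c_sim = c_null
        p_values.append(p_t)
        alarms.append(alarm)
    return p_values, alarms


# Toy series: flat baseline, a large 5-period jump, then baseline again.
y_demo = [5] * 20 + [30] * 5 + [5] * 20
p_vals, alarm_flags = modified_nrcusum(y_demo, lam0=5.0, lam_a=10.0, k=6.1658)
```

Replacing `lam_a` with $\lambda_1$ gives the second simulation approach of Section 2.3.2; the data-driven approaches only change how the counts in the alarm branch are generated.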
Figure 2.3: A comparison of NRCUSUM results for the uncorrected test (a) and modified test with known $\lambda_a$ (b) for data set #27.

Table 2.2: Average simulation results when $\lambda_a$ is known.

Method               Pre-Outbreak        Outbreak      Post-Outbreak
                     False Alarm Rate    Alarm Rate    False Alarm Rate
Uncorrected          0.053               0.958         0.990
Known $\lambda_a$    0.019               0.782         0.047
2.4.2 Results using $\lambda_1$

In this scenario, we assume $\lambda_0 = 5$ is known, but the true value of $\lambda_a$ is unknown. Instead, we set $\lambda_a = \lambda_1 = 7.5$ when simulating outbreak data in the modified test. A comparison of the results for the uncorrected and corrected tests for data set #27 is shown in Figure 2.4.

Using the corrected version of the test, there is only one pre-outbreak false positive at $t = 26$. The outbreak is detected at $t = 53$, and the alarm is continuously sounded until the process is declared back in-control at $t = 99$. The averaged results from running this version of the test on all simulated data sets are summarized in Table 2.3.

As the choice of $\lambda_1$ we wanted to detect was lower than the true outbreak mean of 10, our simulation underestimates the size of the outbreak and the simulated statistics remain too far below the observed values to effectively control the post-outbreak false alarm rate at the desired level. Despite this, the post-outbreak false alarm rate was reduced drastically, from 99% with no correction to 36.5% with correction, so this method provides improvement if no better estimates of $\lambda_a$ are available.

2.4.3 Results using $\hat{\lambda}_m$

As in the previous scenario, we assume $\lambda_0 = 5$ is known and $\lambda_a$ is unknown. With this method, when simulating under the alternative hypothesis, we estimate $\lambda_a$ by taking the mean of observed counts over all current and previous outbreak time periods, and then simulating new Poisson counts using that mean as our parameter. A comparison of the results for the uncorrected and corrected tests for data set #27 is shown in Figure 2.5.

In Figure 2.5, we see one false alarm at $t = 24$ and no false alarms following the end of the outbreak. In exchange, we experience more false negatives, with missing alarms at $t = 59$, 60, 61, 69 and 72. Averaged results are summarized in Table 2.4.
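The two data-driven ways of generating outbreak counts, the running-mean estimate $\hat{\lambda}_m$ used in this scenario and the bootstrap option listed in Section 2.3.2, might be sketched as below. The function name, its arguments, and the example counts are hypothetical; the pool `outbreak_counts` is assumed to hold the observed counts from every time period flagged as an outbreak so far, including the current one.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_outbreak_counts(outbreak_counts, n_sim, rng, method="mean"):
    """Generate n_sim simulated counts for a time period declared an outbreak."""
    if not outbreak_counts:
        raise ValueError("need at least one detected outbreak period")
    if method == "mean":
        # Re-estimate the outbreak mean from all flagged periods, then
        # draw Poisson counts with that mean.
        lam_hat = sum(outbreak_counts) / len(outbreak_counts)
        return [poisson_draw(lam_hat, rng) for _ in range(n_sim)]
    if method == "bootstrap":
        # Sample with replacement from the observed outbreak counts.
        return [rng.choice(outbreak_counts) for _ in range(n_sim)]
    raise ValueError("method must be 'mean' or 'bootstrap'")


rng = random.Random(1)           # arbitrary seed
pool = [8, 12, 10]               # hypothetical counts from flagged periods
mean_draws = simulate_outbreak_counts(pool, 5, rng, method="mean")
boot_draws = simulate_outbreak_counts(pool, 5, rng, method="bootstrap")
```

Because the pool grows each time a new outbreak period is flagged, both approaches adapt to the observed outbreak size without a prior guess at $\lambda_a$.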
Figure 2.4: A comparison of NRCUSUM results for the uncorrected test (a) and modified test with $\lambda_a = \lambda_1$ (b) for data set #27.

Table 2.3: Average simulation results with $\lambda_a = \lambda_1$

Method                         Pre-Outbreak        Outbreak      Post-Outbreak
                               False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction                  0.053               0.958         0.990
$\lambda_a = \lambda_1$        0.023               0.936         0.365

Figure 2.5: A comparison of NRCUSUM results for the uncorrected test (a) and modified test with $\lambda_a = \hat{\lambda}_m$ (b) for data set #27.

Table 2.4: Average results for simulation with $\lambda_a = \hat{\lambda}_m$

Method                           Pre-Outbreak        Outbreak      Post-Outbreak
                                 False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction                    0.053               0.958         0.990
$\lambda_a = \hat{\lambda}_m$    0.018               0.614         0.024
The type II error rate is higher with this method compared to simulating with a fixed value of $\lambda_1$, but the post-outbreak type I error rate is effectively controlled at the $\alpha = 0.05$ level, and we require no assumptions about the value of $\lambda_1$.

2.4.4 Simulation with Bootstrap

We once again assume that $\lambda_0 = 5$ is known and $\lambda_a$ is unknown. With this method, rather than attempting to guess or estimate the value of $\lambda_a$, we use a bootstrap method to generate simulated outbreak observations by sampling with replacement from previously observed outbreak time periods.

With this method, we observe a single false alarm at $t = 24$ and no post-outbreak false alarms. False negatives occur at $t = 59$, 61, 68 and 72, as seen in Figure 2.6. Average results across 100 data sets for the bootstrap simulation method are presented in Table 2.5. As with the previous method where we used a sample mean to estimate $\lambda_a$, we see an increase in the type II error rate in exchange for control of the type I error rate.

Figure 2.6: A comparison of NRCUSUM results for the uncorrected test (a) and modified test using bootstrap samples (b) for data set #27.

Table 2.5: Average results for bootstrap correction

Method           Pre-Outbreak        Outbreak      Post-Outbreak
                 False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction    0.053               0.958         0.990
Bootstrap        0.021               0.671         0.022

2.4.5 Summary of Simulation Results

In this section, we examined the effectiveness of the proposed modified hypothesis test using simulated data. We considered four methods of simulating data under the alternative hypothesis. We summarize the results across all four demonstrations in Table 2.6.

In all cases, changing the way we simulate data during outbreak time periods has a noticeable effect on the post-outbreak false alarm rate, lowering it considerably. When our simulated statistics underestimate the size of the outbreak, the control on the false alarm rate is weakened; however, when the outbreak is simulated at close to the correct size, the post-outbreak type I error rate is effectively controlled at the chosen level of $\alpha$.

Table 2.6: Summary of average results across all methods

Method                           Pre-Outbreak        Outbreak      Post-Outbreak
                                 False Alarm Rate    Alarm Rate    False Alarm Rate
No Correction                    0.053               0.958         0.990
Known $\lambda_a$                0.019               0.782         0.047
$\lambda_a = \lambda_1$          0.023               0.936         0.365
$\lambda_a = \hat{\lambda}_m$    0.018               0.614         0.024
Bootstrap                        0.021               0.671         0.022

2.5 Demonstration: Salmonella data

To further demonstrate the effectiveness of the proposed modified hypothesis test, we used a CUSUM approach to detect a recorded outbreak of Salmonella Newport in Germany in 2011. The data contain the number of Salmonella Newport cases reported across 16 German states between 2004 and 2013. Analysis by Bayer et al. [2014] concluded that the outbreak in question occurred approximately between October 20th and November 8th, 2011, corresponding to weeks 408 through 410 in the data. Figure 2.7 shows the total number of reported Salmonella Newport cases across all 16 states during the study period.

Figure 2.7: Total number of Salmonella Newport cases reported in 16 German states between 2004 and 2014. A clear spike in cases can be seen starting around week 408.

As the state of Saarland reported no cases of Salmonella Newport during the outbreak periods, it was excluded from the demonstration. Three years of data were used to estimate the baseline number of expected cases in each of the remaining 15 states, and a separate CUSUM was calculated for each state individually with the size of outbreak to detect set at $\lambda_1 = 1.5\lambda_0$. Each CUSUM test was performed once with an unmodified hypothesis test and again with the modified hypothesis test proposed in Section 2.3.2. The modified and unmodified tests were compared based on how quickly each was able to identify the outbreak and how many alarms occurred following week 410, when the outbreak presumably ended.

As the number of reported cases during this outbreak was small in some states, the outbreak was not detected in all places. In Berlin, Hamburg, and North Rhine-Westphalia, the outbreak was detected at week 410. In Brandenburg, Hesse, Saxony, and Lower Saxony, it was detected one week later at week 411, and in Schleswig-Holstein it was detected four weeks later at week 414. This outbreak was not detected in Baden-Württemberg, Saxony-Anhalt, Rhineland-Palatinate, Bremen, Bavaria, or Thuringia. For each state where the outbreak was detected, the corrected methods detected the outbreak at the same time period as the uncorrected method, with the exception of the state of North Rhine-Westphalia, where the corrected methods detected the outbreak one time period later than the uncorrected method, at week 411.

For the states where the outbreak was detected, the number of alarms following the initial detection for the corrected and uncorrected methods is summarized in Table 2.7.

Table 2.7: Number of alarms following initial detection time (between $t = 411$ and $t = 528$) for the four approaches.

Region (columns: Uncorrected, $\lambda_1$, $\hat{\lambda}_m$, Bootstrap)
Berlin 785000
Brandenburg 11611633
Hamburg 7400
Hesse 12700
Lower Saxony 905312
North Rhine-Westphalia 1168187
Saxony 552711
Schleswig-Holstein 1157145

In summary, in 7 of the 8 states where the outbreak was initially detected, the corrected methods retained the same power to detect the outbreak while drastically lowering the post-outbreak false alarm rate. In one state, the power to detect the outbreak was slightly reduced by delaying the detection of the outbreak by one week. However, the corrected tests resulted in dramatically fewer false alarms after the outbreak ended.

2.6 Discussion

When using a non-restarting CUSUM control chart to monitor a disease process, false alarms following the end of a disease outbreak are common. We proposed a modified non-restarting CUSUM solution that utilizes p-values. By changing the way we perform our Monte Carlo simulations and simulating case counts under the alternative hypothesis during time periods where an outbreak has been detected, we create a more realistic set of simulated data for calculating p-values in the presence of an outbreak. We demonstrated the effectiveness of this method on both simulated and real data, and found that when a reasonable estimate of the outbreak intensity can be obtained, this method controls the post-outbreak false alarm rate at the desired level while retaining the ability to quickly detect an emerging outbreak.

The modified test typically has less power to signal an alarm during outbreak time periods. In the simulation study, we observed data streams where an outbreak was initially detected, but the modified NRCUSUM test failed to produce a continuous stream of alarms during the outbreak period. The uncorrected test did not have this problem. However, the corrected CUSUM tests retain the timeliness of the uncorrected CUSUM with the same time to first detection, while controlling the post-outbreak false alarm rate at the desired level $\alpha = 0.05$ in simulations for the bootstrap and estimation methods, and lowering the false alarm rate in the case where we simulate outbreak data at the level of $\lambda_1$ we want to detect. In the data study, the false negative rate is more difficult to determine, as it is not precisely known when the outbreak we aimed to detect truly took place. Given the assumption that the outbreak took place approximately between week 408 and 410, the false negative rate for the corrected CUSUM test was no worse than the uncorrected in 7 out of 8 regions where the outbreak was detected, and delayed the
detection of the outbreak by one time period in 1 out of 8 states.

The loss of power induced by the corrected hypothesis test stands as the primary weakness of this method. While not strongly visible in our data study, the effect on power to detect a continuing outbreak can be seen in the simulation studies, with several false negatives introduced in the middle of the outbreak time periods. Care should be taken when using this method to avoid prematurely declaring an outbreak to have ended, for example, by requiring a string of negative results rather than an isolated negative, as successive false negatives are rarer. Power to detect the beginning of an outbreak may also be affected in certain cases, as we saw with the state of North Rhine-Westphalia in the data study, where detection was delayed by one time period when using the corrected tests. When monitoring a data stream, the time of the first alarm following initialization of the CUSUM is the same between the corrected and uncorrected tests, as the two behave identically until an outbreak is detected. However, the results of the tests may diverge when subsequent outbreaks occur. In the case of North Rhine-Westphalia, which experienced several smaller outbreaks of Salmonella Newport prior to the 2011 outbreak of interest, this resulted in slightly elevated simulated CUSUM streams and an additional false negative period. This effect becomes more pronounced if prior outbreaks were of larger magnitude than future outbreaks, as the CUSUM statistics may remain elevated enough to make detection difficult. One strategy that may help reduce issues of this type is to reset the simulated CUSUM streams to zero when the CUSUM has returned to zero following an outbreak, or to reset the simulated streams to the current CUSUM value after a pre-specified number of negative test results in the case where the statistic may not return to zero between outbreaks.
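The suggestion of requiring a string of negative results before declaring an outbreak over could be implemented along the following lines. This is a hedged sketch: the function name and the run length `m = 3` are illustrative assumptions, and the text does not prescribe a particular value.

```python
def outbreak_status(alarms, m=3):
    """Track whether an outbreak is considered ongoing at each time period.

    An outbreak starts at the first alarm and is only declared over once
    m consecutive time periods pass without an alarm.
    """
    active, quiet_run, status = False, 0, []
    for alarm in alarms:
        if alarm:
            active, quiet_run = True, 0
        elif active:
            quiet_run += 1
            if quiet_run >= m:
                active, quiet_run = False, 0
        status.append(active)
    return status


# An isolated negative (index 3) does not end the outbreak; three in a row do.
demo = [False, True, True, False, True, False, False, False, True]
print(outbreak_status(demo))
```

Larger values of `m` make a premature "all clear" less likely, at the cost of a longer delay before a true return to the in-control state is recognized.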
CHAPTER III

CUSCAN: DETECTING EMERGING DISEASE CLUSTERS WITH THE CUMULATIVE SUM OF SCAN STATISTICS

3.1 Introduction

Disease surveillance is an important aspect of public health. Early detection of disease outbreaks is necessary for proper responsive measures to be taken. Prospective disease surveillance methods monitor changes in disease counts or syndromic indicators that precede formal diagnosis [Das et al., 2005] over time and use statistical methods to determine whether monitored levels exceed expected counts enough to determine that an outbreak is present. To protect patient privacy, case data is often reported as counts aggregated over areas such as zip codes, counties, or census tracts. Demographic information, when available, often describes the population at risk for a region rather than the specific individuals comprising the cases. For this reason, surveillance methods are often designed to make use of the limited structure of available data. A variety of methods can be used for surveillance depending on the data available. Regression methods offer a flexible set of tools for disease surveillance, with options for time series regression such as autoregressive integrated moving average (ARIMA) models [Diebold, 2007] for time-indexed data, spatial regression such as simultaneous autoregressive (SAR) or conditional autoregressive (CAR) models [Waller and Gotway, 2004] for spatially-indexed data, or some combination of the two. When demographic or environmental covariate data are available, regression models can be updated to get improved estimates of expected counts of cases, allowing for better detection of aberrant cases. Overviews of these and other regression methods can be found in surveillance method review papers such as Unkel et al. [2012], Robertson et al. [2010], or Tsui et al. [2008].

Regression methods can be powerful, but require a trained hand to use effectively, and many of the benefits of regression, such as the inclusion of covariates in modeling, are dependent on data availability. It is common in practice for surveillance data to consist
only of disease counts in aggregate regions for patient privacy, with no demographic covariates reported. In these cases, simpler methods can be used that rely only on case counts, spatial location of agglomeration districts, or both. Spatial scanning methods are a prominent family of methods that require only regional case counts and their spatial locations. These scan methods examine the disease incidence rate in a set of regions (often called a window) relative to the rate in the surrounding regions, and declare that a cluster of cases is present in those regions if the rate inside the window is significantly higher than the rate outside. A classic example is Kulldorff's circular scan statistic [Kulldorff, 1997], which uses a likelihood ratio approach to compare the disease rate inside circular-shaped windows to the rate outside the windows. Variations on Kulldorff's scan method exist that allow for the detection of non-circular disease clusters, such as the elliptic scan method [Kulldorff et al., 2006], which uses an ellipse rather than a circle to define windows, and the flexible and restricted flexible scan methods [Tango and Takahashi, 2005, 2012], which search over arbitrarily-shaped connected subsets of regions. These statistics can also be generalized to data that are indexed over both space and time by expanding the windows to include multiple time periods, such as the space-time scan statistic [Kulldorff, 2001], which uses cylindrical windows whose height is measured in time. The space-time scan statistic is popular due to its simplicity, as well as its availability in the software package SaTScan [Kulldorff, 2003].

Statistical process control methods have also proven effective for disease surveillance, requiring only case counts over time as data, with key examples being the Shewhart chart [Shewhart, 1931], the exponentially weighted moving average (EWMA) chart [Roberts, 1959], and the cumulative sum (CUSUM) control chart [Page, 1954]. Of particular interest is the CUSUM chart, which has been used for disease surveillance purposes in both the Centers for Disease Control and Prevention's BioSense system and the Department of Defense's Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) system [Tsui et al., 2008]. These methods assume that the data
being monitored come from some known distribution when the process is in-control (i.e., there is no outbreak), and seek to identify any changes in the data-generating process that could indicate the process has gone out-of-control (i.e., there is an outbreak). When observed counts differ from what is expected, these charts increase until they cross some predetermined threshold, at which point an alarm is sounded. In their traditional use in monitoring industrial processes, these charts are typically reset following an alarm, as the data-generating process (for example, malfunctioning machinery) would be shut down and restarted in an in-control state. Since natural disease processes cannot be reset in this way, Gandy and Lau [2012] proposed the use of a non-restarting CUSUM in public health contexts. Disease outbreaks are naturally transient and return to in-control states on their own over time, and resetting the control chart to zero following an outbreak alarm reduces power to detect ongoing outbreaks and results in a loss of information regarding the length and intensity of the outbreak. Dassanayake and French [2016] is one example of a disease surveillance method that utilizes the non-restarting CUSUM.

One of the strengths of the CUSUM is its ability to detect small but persistent shifts in the mean of a process. However, it may take several time periods of accumulated data for an outbreak to be recognized if its intensity is particularly small, which reduces the timeliness of detection. An outbreak that results in a 5% increase in case counts across the entire study area may not be immediately recognized as such. However, if that 5% increase in total cases is the result of a 20% local increase in cases in a smaller subsection of the study area, then incorporating that information can increase the chances of early detection.

For this reason, the use of spatial information when monitoring counts over time has become increasingly common. A prime example of a method that incorporates spatial information into a CUSUM framework is the nearest neighbor CUSUM [Raubertas, 1989], which groups regions with their nearest neighbors and computes a CUSUM of the combined counts. Sonesson [2007] proposed an extension of the nearest neighbor CUSUM called the circular CUSUM, which uses the circular windows defined by the circular scan
method [Kulldorff, 1997] to aggregate case counts for the CUSUM. Additionally, the previously mentioned Dassanayake and French [2016] also uses a nearest neighbor approach to aggregate case counts for the non-restarting CUSUM.

We propose an improved method for incorporating spatial information into a CUSUM framework for disease surveillance. The method takes advantage of the relative simplicity and minimal data requirements of the CUSUM and spatial scan method approaches to outbreak detection. Similar to Sonesson [2007], we search over a set of potential clusters defined by spatial scanning methods. However, while Sonesson's circular CUSUM simply used circular windows to define expanded neighborhoods for the nearest neighbors CUSUM, we instead compute the Poisson scan statistic for each potential cluster, and use a non-restarting CUSUM to monitor how these statistics change over time. By using scan statistics as data, we are able to incorporate more spatial information into the monitoring process, improving our ability to detect emerging spatial disease clusters. This flexible framework allows for potential clusters to be defined in a variety of ways as desired by the practitioner. We demonstrate the proposed method, which we term the CUSCAN, using circular [Kulldorff, 1997] and elliptic [Kulldorff et al., 2006] scan statistics. Other methods for determining potential clusters may be used within this framework as well, assuming the set of potential clusters to search is fixed across time periods.

Given the similarities between the CUSCAN method and other methods based on scan statistics or CUSUM charts, it is natural to consider the strengths and weaknesses of each when deciding the best method to use for a given situation. However, direct comparisons between methods are often difficult, as different authors use different data sets and different measures of performance to test their methods. For example, Dassanayake and French's modified nearest neighbors CUSUM uses an FDR-based approach to controlling false positives while the CUSCAN controls the type I error rate, so performance is not directly comparable between the two. Many proposed surveillance methods also do not offer publicly available software for implementation, making it difficult to apply multiple
methods to the same data. As the other spatially-aggregated CUSUM methods mentioned here (the circular CUSUM, the nearest neighbors CUSUM, and Dassanayake and French's modified nearest neighbors CUSUM) fall into this category, we are not able to make direct comparisons between these methods and the CUSCAN. However, since Kulldorff's space-time scan statistic is publicly available as software through SaTScan, we offer comparisons between this statistic and the CUSCAN. Additionally, Kulldorff et al. [2004] applied the space-time scan statistic to benchmark data that is publicly available, allowing us to make a direct comparison between methods using the same data.

The structure of this chapter is as follows: In Section 3.2, we review properties of spatial scan methods, including the space-time scan method, and the non-restarting CUSUM. In Section 3.3, we describe the proposed CUSCAN methodology in detail. In Section 3.4, we demonstrate the properties of the CUSCAN method using simulated data. In Section 3.5, we apply the CUSCAN to benchmark data from Kulldorff et al. [2004] and provide a power analysis and comparison to the space-time scan method. In Section 3.6, we summarize our conclusions on the overall effectiveness of the CUSCAN method.

3.2 Review of Methods

3.2.1 Spatial Scan Methods

Poisson spatial scan methods are popular for identifying clusters of cases given regional counts in a given study area [Waller and Gotway, 2004]. These methods create a set of potential cluster locations from subsets of the study area, and compare the observed disease rate inside each potential cluster to the rate outside each cluster. This approach was popularized by Kulldorff and Nagarwalla [1995] and Kulldorff [1997].

For a study area consisting of $N$ disjoint spatial regions, we define $n_1, \ldots, n_N$ to be the at-risk population in each region, with a total population of $n_+ = \sum_{i=1}^{N} n_i$. Let $Y_1, \ldots, Y_N$ be the associated Poisson case counts in each region, with $y_+ = \sum_{i=1}^{N} Y_i$. In the absence of a disease cluster, we would expect the disease risk at each location, $r_i = y_i / n_i$, to be
consistent across the study area and equal to the global risk, $r = y_+ / n_+$. The expected count for each region in the study area, $E_i$, is computed by $E_i = r n_i$.

For a potential cluster of contiguous regions $Z \subseteq \{1, 2, \ldots, N\}$, we define $Y_{in} = \sum_{i \in Z} y_i$, $Y_{out} = \sum_{j \notin Z} y_j$, $E_{in} = \sum_{i \in Z} E_i$, and $E_{out} = \sum_{j \notin Z} E_j$. We then calculate the Poisson scan statistic for $Z$ as:

$$S_Z = \left(\frac{Y_{in}}{E_{in}}\right)^{Y_{in}} \left(\frac{Y_{out}}{E_{out}}\right)^{Y_{out}} I\left(\frac{Y_{in}}{E_{in}} > \frac{Y_{out}}{E_{out}}\right). \tag{3.1}$$

For a set of potential clusters $\mathcal{Z}$, the test statistic for the Poisson spatial scan test is computed by:

$$\max_{Z \in \mathcal{Z}} \{S_Z\}. \tag{3.2}$$

The significance of this test statistic is typically assessed through simulation as follows: let $S$ be the test statistic from the observed data. We simulate $N_{sim}$ data sets under the null hypothesis of constant risk, $Y_i \sim \text{Poisson}(E_i)$, $i = 1, \ldots, N$, and compute the maximum scan statistic $S^{(i)}$ for each simulated data set. We calculate the p-value for the test statistic as the proportion of maximum test statistics, including the observed statistic $S$, that are at least as large as the observed statistic:

$$p = \frac{1 + \sum_{i=1}^{N_{sim}} I\left(S^{(i)} \geq S\right)}{1 + N_{sim}}. \tag{3.3}$$

There are many ways to determine $\mathcal{Z}$, the set of potential clusters. The choice of potential clusters frequently distinguishes one Poisson spatial scan method from another. The two methods specifically used in this study are:

1. Circular scan method [Kulldorff, 1997]. Beginning at the centroid of each region in the study area, expand a circular window to include nearby regions until the population inside the largest window reaches some pre-specified upper bound, commonly 50% of the total at-risk population. A region is included in a potential
cluster if its centroid is within the circular window.
2. Elliptic scan method [Kulldorff et al., 2006]. Similar to the circular scan statistic, but instead of circular windows, a series of elliptical windows centered at each region are used. These ellipses are defined in terms of their shape ($a/b$, the ratio of the major and minor axes) and angle (the angle between the major axis and the horizontal axis). Different combinations of shapes and angles may be used, and each set of ellipses is increased in size along the major axis until the population within the window reaches some pre-specified upper bound.

The flexibly-shaped scan method [Tango and Takahashi, 2005], which searches all connected subsets of regions whose total population is below the specified upper bound, can also be used with the CUSCAN. However, the number of potential clusters increases exponentially as the number of regions in the study area increases, making it a computationally demanding method that may not be ideal when multiple data sets need to be examined or when a study area is particularly large. The computation time necessary can be reduced in most cases by using the restricted flexible scan statistic [Tango and Takahashi, 2012], which only searches connected subsets consisting of regions where the observed incidence rate is higher than some tolerance; however, this means the set of potential clusters varies over time depending on observed counts, and so the restricted flexible scan statistic cannot be used with the CUSCAN. For these reasons, we do not provide an implementation of the flexible scan statistic for the CUSCAN in this paper.

Kulldorff [2001] extended the circular scan method to create a space-time scan method useful for prospective surveillance on time-indexed data. The space-time scan statistic is computed as in (3.2), except that the set of potential clusters $\mathcal{Z}$ consists of cylinders whose bases are the original circular windows and whose height is increased incrementally to include multiple sequential time periods of data within the window. The most likely cluster then consists of a set of spatial regions as well as a set of time periods where evidence for space-time clustering is highest. The space-time scan method can also be used to detect
temporal-only clusters when case counts are aggregated across the entire study area at once.

3.2.2 The Non-Restarting CUSUM

In addition to the spatial information available through the spatial scan method, we utilize the cumulative sum (CUSUM) control chart to monitor changes in the data over time. The CUSUM detects shifts in the mean of a process by accumulating deviations from the expected mean over time, allowing detection of large, sudden changes as well as smaller, sustained changes in the mean.

For a single stream of data, the CUSUM statistic at time $t$ is defined by the following recursive formula:

$$C_0 = 0; \quad C_t = \max\{0, C_{t-1} + Y_t - k\}, \tag{3.4}$$

where $C_{t-1}$ is the CUSUM statistic from the previous time period, $Y_t$ is the observation at the current time period, and $k$ is a number chosen based on the distribution of $Y$ to slow the growth of the CUSUM during in-control time periods and control the false alarm rate. When no change has occurred and the process is in-control, $C_t$ tends to remain near zero. When a shift in the mean occurs and the process is out-of-control, $C_t$ tends to increase rapidly. When monitoring a process over time, a shift in the mean is detected when $C_t$ exceeds some predetermined threshold, $h$.

In addition to thresholds, the presence of a shift can be detected using p-values to determine if $C_t$ is significantly higher than we would expect if the process were in-control. As with the spatial scan statistic, we compute this p-value using simulations. $N_{sim}$ in-control data streams are simulated from a Poisson distribution with in-control mean $\lambda_0$, and the CUSUM statistic $C_t^{(i)}$ is computed for each data stream. The p-value for $C_t$ is then the proportion of CUSUM statistics, including $C_t$, that are at least as large as $C_t$:

$$p_t = \frac{1 + \sum_{i=1}^{N_{sim}} I\left[C_t^{(i)} \geq C_t\right]}{1 + N_{sim}}. \tag{3.5}$$
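The quantities reviewed in this section, the Poisson scan statistic (3.1), the CUSUM recursion (3.4), and the Monte Carlo p-value (3.5), can each be written in a few lines of Python. The function names and the toy inputs below are assumptions chosen for illustration, and the scan statistic is evaluated on its raw scale exactly as written in (3.1), which may overflow for very large counts.

```python
def poisson_scan_statistic(y_in, e_in, y_out, e_out):
    """Scan statistic of Equation (3.1) for one potential cluster."""
    ratio_in, ratio_out = y_in / e_in, y_out / e_out
    if ratio_in <= ratio_out:
        return 0.0  # the indicator is zero unless the inside rate is higher
    # Python evaluates 0.0 ** 0 as 1.0, matching the usual convention.
    return (ratio_in ** y_in) * (ratio_out ** y_out)


def nr_cusum(values, k):
    """Non-restarting CUSUM of Equation (3.4): C_t = max(0, C_{t-1} + Y_t - k)."""
    c, path = 0.0, []
    for v in values:
        c = max(0.0, c + v - k)
        path.append(c)
    return path


def monte_carlo_p_value(observed, simulated):
    """P-value of Equation (3.5): the observed statistic counts as one draw."""
    exceed = sum(1 for s in simulated if s >= observed)
    return (1 + exceed) / (1 + len(simulated))


# Toy inputs, not taken from the thesis data.
s_cluster = poisson_scan_statistic(y_in=10, e_in=5.0, y_out=90, e_out=95.0)
path = nr_cusum([3, 9, 2, 1], k=4)
p = monte_carlo_p_value(5, [1, 5, 7, 0])
```

Working with the logarithm of (3.1) is a common way to avoid overflow; the monotone transformation does not change which cluster maximizes the statistic, but it would change the scale of the values fed into the CUSUM, so it is not a drop-in replacement here.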
In the industrial process control setting where the CUSUM was developed, an out-of-control alarm would result in the offending process being shut down and restarted in an in-control state. For this reason, the CUSUM statistic is traditionally reset to zero once a shift has been detected. However, when considering use in a public health setting, it is not possible to "shut down" a disease process, so resetting the CUSUM is not desirable. Additionally, while an industrial process may remain out-of-control until it is shut down and adjusted, disease outbreaks are transient, resolving themselves over time. For these reasons, Gandy and Lau [2012] suggest the use of a non-restarting CUSUM (NRCUSUM) for disease surveillance applications, which we utilize in our proposed methodology. The non-restarting CUSUM is identical in application to the standard CUSUM, except that the chart is not reset to zero when an outbreak is detected.

3.3 Proposed Methodology

3.3.1 The CUSCAN Method

For the CUSCAN method, we assume the following: that our data are regional counts, that the number of cases in each region is distributed as a Poisson random variable, and that we have multiple time periods of data and would like to perform a hypothesis test for each period. We propose combining the purely spatial scan statistic with the non-restarting CUSUM to create a surveillance method that includes both spatially and temporally aggregated information. We will refer to this method as the CUSCAN, as it is the cumulative sum of scan statistics. By combining these two methods, we create a surveillance system that can both quickly detect a new emerging disease cluster and indicate its spatial location.

To implement this combined methodology, we perform the following:

1. For each time period, compute a scan statistic for each potential cluster using Equation (3.1).
2. Compute the CUSUM statistic in Equation (3.4) for each potential cluster over time.
3. At each time period, take the maximum of the CUSUM streams for that particular time. This is our CUSCAN statistic, and the regions that produced it comprise our most likely cluster of cases.
4. Assess the significance of the statistic and determine whether there is evidence of an outbreak. This is done by applying steps 1-3 to data simulated under the null hypothesis and computing the p-value for the test statistic found in step 3.

3.3.2 Selection of $k$

When monitoring a data stream with a CUSUM control chart, two user-specified parameters are required: a rejection threshold $h$ and a tuning parameter $k$. When the observed data are known to come from a member of the exponential family, it is simple to find optimal values of $k$ and $h$ [Hawkins and Olwell, 1998]. However, the data stream we are monitoring (the spatial scan statistics) does not follow any specific known distribution, so these methods cannot be used.

We can eliminate the need for the rejection threshold $h$ by performing a p-value-based hypothesis test. With a p-value approach, the type I error rate is controlled at the desired significance level, $\alpha$, without the need to limit the CUSUM statistic below a given value during in-control periods. For this reason, the choice of $k$ is more important for computational efficiency and the power to detect an outbreak than for controlling the false alarm rate. If no $k$ were specified in the CUSUM formula given in Equation (3.4), i.e., $k = 0$, the false alarm rate would be controlled by p-value testing, but the statistic would grow without bound over time, as the data values $Y_t$ are nonnegative. This continued growth would eventually become taxing on computational resources and make it difficult to visually examine the evolution of the CUSCAN statistic over time. Thus, it is important to specify a $k$ to constrain the CUSUM statistic near zero and prevent it from growing without bound during in-control time periods. Likewise, the choice of $k$ has a direct effect on the power to detect an outbreak; an unnecessarily large value of $k$ prevents
the CUSUM statistic from growing above zero even during out-of-control times. So while the choice of $k$ is less important in the p-value model than the threshold model, it is still necessary to put some care into its selection. If no analytic equation for $k$ exists, we want to be able to select a $k$ such that the following are true: (i) the value of $k$ is of appropriate magnitude relative to the data, (ii) the value of $k$ is constant, and (iii) $k$ can be selected in a computationally manageable way. We propose that $k$ be determined using available in-control baseline data and in-control simulations, which will ensure that $k$ is on the correct scale relative to the in-control statistics.

We propose selecting $k$ to control the maximum number of time periods the CUSCAN statistic is expected to remain positive during in-control time periods within some specified tolerance level $\gamma$, that is, the value of $k$ that satisfies

\[ \min\left\{ k : P\left( C_t = 0 \text{ for some } t \in \{1, \ldots, \tau_0\} \right) \ge \gamma \right\}. \tag{3.6} \]

This method is similar to that proposed by Chatterjee and Qiu [2009], which used bootstrapping methods to determine a value of $k$ that controlled the average sprint length, i.e., the average amount of time the CUSUM remains positive before returning to zero. Since the distribution of our baseline data is assumed to be Poisson, we are able to use standard simulation methods rather than needing the distribution-free methodology described by Chatterjee and Qiu [2009]. Additionally, by choosing $k$ to control the maximum sprint length rather than average sprint length, fewer time periods of data need to be simulated and $k$ can be chosen with relative computational ease. Define $\tau_0$ to be the maximum sprint length under the null, i.e., the number of time periods for the CUSCAN statistic to return to zero with no outbreak present. We perform a grid search within a range of values chosen based on the size of the null scan statistics. More specifically, $k$ can be selected using the following algorithm:

1. Select the maximum run length until return to zero, $\tau_0$, and the tolerance level, $\gamma$.
2. For each time period where baseline data is available, compute the spatial scan
statistic for each potential cluster, and take the maximum at each time period.
3. Select the range of potential $k$ values to consider based on the size of the maximum statistics.
4. For each potential value of $k$:
   (a) Simulate $n_{sim}$ data sets under the null hypothesis for a total of $\tau_0$ time periods.
   (b) For each simulated data set, compute the spatial scan statistics and take the maximum scan statistic at each time period.
   (c) For the selected value of $k$, compute the CUSUM statistics for the maximum scan statistics.
   (d) Calculate the proportion of CUSUM streams that returned to zero during the simulated $\tau_0$ time periods.
5. Select the smallest value of $k$ that brings the proportion of streams that return to zero within $\tau_0$ time periods closest to the user-defined tolerance level, $\gamma$.

This selection of $k$, based on the distribution of maximum statistics during in-control periods, ensures that the CUSCAN statistic will remain bounded near zero for all potential clusters while also allowing the CUSCAN to grow above zero when the scan statistics increase in the presence of an outbreak.

When deciding on initial values for $k$ and $\tau_0$, it is important to consider the effect these choices have on the final value of $k$. For initial values of $k$, we suggest using quantiles from the maximum scan statistics to determine a range of potential values. For example, the median of the maximum statistics may be chosen as the lower bound on $k$, so that even if there are many time periods with large null statistics, we can reasonably expect the CUSCAN statistic to increase in no more than about 50% of time periods. Likewise, the upper bound on $k$ need not be larger than the largest maximum scan statistic, as the probability of observing statistics above this level is very small. For selecting $\tau_0$, it's
important to recognize the effect this value has on the selected value of $k$. Smaller values of $\tau_0$ will result in larger values of $k$, keeping the CUSCAN statistic near zero until there is a large increase in the scan statistics, as would be seen in the case of larger spatial clusters. Likewise, larger values of $\tau_0$ will result in smaller values of $k$, allowing the CUSCAN statistic to rise above zero in the presence of only a small increase in the scan statistics, potentially allowing for the detection of smaller clusters.

3.4 Demonstration: Simulated Data based on New York Leukemia Data

In order to demonstrate the potential of the CUSCAN method, we simulate data based on the New York leukemia data provided by Waller and Gotway [2004]. In this data set, counts of leukemia cases were recorded in regions across upstate New York. The original data contain 592 cases across 281 regions, with a total at-risk population of 1,057,673. Using this data to provide a realistic neighborhood map and baseline incidence rate, we generated three sets of time series data representing three cluster models.

The global leukemia incidence rate was estimated from the provided data to be approximately $r$ = 5.597E-04. This rate was used as the global incidence rate under the null hypothesis of no outbreak. The expected case count in region $i$ with population $n_i$ is computed as $E_i = r n_i$. When an outbreak was present in the simulated data, the local incidence rate inside the simulated cluster was set to twice the null incidence rate. Three clusters of varying sizes were simulated: cluster A contains 11.3% of the total population of the study area over 31 spatial regions, cluster B contains 4.1% of the population over 10 regions, and cluster C contains 5.9% of the population over 24 regions (Figure 3.1). In total, 100 data sets were simulated for each cluster model, each containing 20 periods of null data followed by a 10-period outbreak.

When applying the CUSCAN method to these simulated clusters, we use the circular scan statistic as our base with a population upper bound of 50% of the total population. For the CUSUM portion of the method, we used the $k$ selection procedure described in Section 3.3.2. We used 999 simulated null data sets, with a starting range of the 50th and
Figure 3.1: The three simulated clusters on the map of 281 regions in upstate New York.
90th percentile of the null maximums, using $\tau_0 = 5$ time periods with a tolerance of $\gamma = 0.95$, i.e., the CUSUM returned to zero within 5 time periods in 95% of simulations. This resulted in a value of $k = 6.061$.

Significance of the CUSCAN statistic was determined via simulation. Conditioning on the observed number of cases in the current time period, we simulate 999 null data sets from a multinomial distribution with probabilities based on the expected case count in each region. The p-value is then calculated as in Equation (3.5). A significance threshold of $\alpha = 0.05$ was used for all hypothesis testing decisions. If the CUSCAN statistic at time $t$ is determined to be statistically significant, then we say there is an alarm and declare an outbreak is present at $t$. The location of the outbreak is recorded as the most likely cluster, determined as described in Section 3.3.1.

To analyze the performance of the CUSCAN method, we use the following metrics:

1. Basic power: the basic power at time $t$ is the proportion of data sets for which the outbreak was detected by time $t$.
2. Delay: the difference between the time of detection $t$ and the true start of the outbreak $t_s$, i.e., $\text{delay} = t - t_s$. If the outbreak was detected in the first time period it was present, $\text{delay} = 0$.
3. Spatial precision: the precision is the proportion of the identified cluster that is contained in the true cluster. Let $A_t$ be the identified cluster at time $t$, $A$ be the true cluster, and $n(Z)$ the population of the set of regions $Z$. The precision at time $t$ is defined based on population and is equal to 1 when $A_t \subseteq A$:

\[ \text{precision}_t = \frac{n(A_t \cap A)}{n(A_t)}. \]

4. Spatial recall: the recall is the proportion of the true cluster that is contained in the identified cluster. The recall at time $t$ is likewise defined based on population and is equal to 1
when $A \subseteq A_t$:

\[ \text{recall}_t = \frac{n(A_t \cap A)}{n(A)}. \]

5. False alarm rate: the false alarm rate at time $t$ is the proportion of data sets that produced an alarm at time $t$ when no outbreak was present.

When a cluster detection method is performing well, we expect values of power, precision, and recall close to 1, as well as short delay and a false alarm rate at or below the level of significance $\alpha = 0.05$.

Table 3.1 summarizes the average power, delay, and false alarm rate of the simulation study. We see that in general, the power is very high and the delay very low across all outbreak times (the average delay is far less than one day to detection from start of outbreak). The power for cluster B (10 regions, 4.1% population) is somewhat lower in the first day of the outbreak (0.69), but increases rapidly to 0.94 by day two. Clusters A and C, containing more regions or a larger proportion of the at-risk population (31 regions, 11.3% population for A and 24 regions, 5.9% population for C), were much easier to detect, with first-day power of 0.98 and 0.89, respectively.

Figure 3.2 shows plots of average spatial precision and recall results for each day of the outbreak. Spatial precision starts at 0.89 for cluster A, 0.68 for cluster B, and 0.82 for cluster C, increases rapidly to 0.93, 0.81, and 0.93 by the second time period, and reaches 0.96, 0.94, and 0.98 by the fifth. Recall starts at 0.91 for cluster A, 0.72 for cluster B, and 0.79 for cluster C and increases to 0.94, 0.84, and 0.91, respectively, on the second day of the outbreak and 0.98, 0.95, and 0.98 by the fifth.

Figure 3.3 demonstrates the behavior of the CUSCAN statistic for three randomly selected data sets: data set #19 for cluster A, data set #70 for cluster B, and data set #57 for cluster C. The CUSCAN statistic behaves as expected for a non-restarting CUSUM, with larger clusters (A and C) causing a larger increase in the statistic during outbreak periods relative to smaller cluster B.
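The population-based spatial precision and recall metrics defined in this section can be computed directly from sets of regions. The sketch below is illustrative only: the region labels and populations are made up, and the function names are assumptions rather than part of the original analysis.

```python
def population(regions, pop_by_region):
    """Total population n(Z) of a set of regions Z."""
    return sum(pop_by_region[r] for r in regions)


def precision_recall(identified, true_cluster, pop_by_region):
    """Population-weighted precision n(A_t & A)/n(A_t) and recall n(A_t & A)/n(A)."""
    overlap = population(identified & true_cluster, pop_by_region)
    precision = overlap / population(identified, pop_by_region)
    recall = overlap / population(true_cluster, pop_by_region)
    return precision, recall


# Hypothetical example: the identified cluster shares only region r2 with the truth.
pop = {"r1": 100, "r2": 300, "r3": 600}
precision, recall = precision_recall({"r1", "r2"}, {"r2", "r3"}, pop)
# The overlap is r2 (300 people): precision = 300/400 and recall = 300/900.
```

An identified cluster that is larger than the true cluster lowers precision without lowering recall.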
Table 3.1: Average power, detection delay, and false alarm rates for the CUSCAN method.

           Power on Day
Cluster    21      22      23      24-30    Delay (Days)    False Alarm Rate
A          0.98    1.00    1.00    1.00     0.02            0.054
B          0.69    0.94    0.98    1.00     0.39            0.052
C          0.89    1.00    1.00    1.00     0.11            0.045

Figure 3.2: Average results for (a) precision and (b) recall for each cluster during the outbreak period.
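For reference, the grid search for $k$ described in Section 3.3.2, which supplied the value of $k$ used in this demonstration, can be sketched as follows. This is a simplified illustration under stated assumptions: the simulated null maximum scan statistics are taken as given (one list of values per simulated data set, one value per time period), the same simulated streams are reused for every candidate $k$ instead of being re-simulated, and the selection rule is read as the smallest $k$ whose return-to-zero proportion reaches the tolerance. The names `returns_to_zero` and `select_k` are hypothetical.

```python
def returns_to_zero(stream, k):
    """True if the CUSUM of `stream` equals zero at some period in the stream."""
    c = 0.0
    for y in stream:
        c = max(0.0, c + y - k)
        if c == 0.0:
            return True
    return False


def select_k(null_max_streams, k_grid, tolerance):
    """Smallest k in k_grid whose return-to-zero proportion reaches `tolerance`.

    Returns None if no candidate in the grid qualifies.
    """
    for k in sorted(k_grid):
        hits = sum(returns_to_zero(s, k) for s in null_max_streams)
        if hits / len(null_max_streams) >= tolerance:
            return k
    return None


# Toy input: three simulated streams of maximum scan statistics, three periods each.
streams = [[1, 1, 1], [5, 1, 1], [9, 9, 9]]
k_small = select_k(streams, k_grid=[2, 4, 10], tolerance=0.6)
k_large = select_k(streams, k_grid=[2, 4, 10], tolerance=0.95)
```

For a fixed set of streams, a larger $k$ can only make the CUSUM path smaller, so the return-to-zero proportion is non-decreasing in $k$ and the first qualifying grid value is the smallest one.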
Figure 3.4 demonstrates the difference between the true cluster and the most likely cluster identified at $t = 30$ for the same randomly selected data set (#19) for cluster A. The most likely clusters identified at $t = 30$ for clusters B and C were identical to the true simulated clusters.

In summary, the CUSCAN statistic with circular base has high power to quickly detect an emerging outbreak, with higher initial power when the outbreak covers more regions or a higher proportion of the at-risk population, and a rapid increase in power as the outbreak persists for multiple time periods. The CUSCAN is also able to accurately detect the regions in space where the outbreak occurs, with both spatial precision and spatial recall starting high and increasing rapidly as more outbreak data becomes available.

3.5 Demonstration and Power Assessment: New York City Benchmark Data

3.5.1 Data Description

Kulldorff et al. [2004] provide a set of benchmark data for testing the effectiveness of space-time disease surveillance methods. They then conducted a power analysis on the benchmark data using the circular space-time scan statistic implemented in SaTScan. By using the same set of data, we are able to directly compare the power of the CUSCAN to the space-time scan statistic when detecting emerging clusters of this form.

The benchmark data uses a map of New York City containing 176 zip codes and a total at-risk population of 8,003,510. There are 17 different localized disease outbreaks simulated within this data set, along with one city-wide outbreak, for a total of 18 different outbreak scenarios. The locations and sizes of the localized clusters are shown in Figure 3.5. Five of the clusters consist of a single zip code, five consist of small (5-10 zip codes) clusters, five consist of outbreaks in entire boroughs of the city, and two consist of irregularly shaped elongated clusters on the edge of the city.

For each outbreak model, the data are simulated with 30 periods of no outbreak, followed by one, two, or three days where an outbreak is present, for a total of 31, 32, or 33 days. The data are simulated by randomly distributing an expected 100 cases per day, so
Figure 3.3: CUSCAN statistic with alarms for three randomly selected data sets. (a) Cluster A (11.3% population), data set #19. (b) Cluster B (4.1% population), data set #70. (c) Cluster C (5.9% population), data set #57.
Figure 3.4: Identified cluster at time $t = 30$ for cluster A, data set #19. The precision is 0.959 and the recall is 0.992.
3,100 total cases for the 31-day data set, for example, across all regions and all days, with different days given equal probability of observing a case, and regions given probabilities proportional to their total population (i.e., each person in the city is considered equally likely to become a case). On days when an outbreak is present, the regions within the assigned outbreak have an increased relative risk of becoming a case. Two different outbreak scenarios were simulated for each cluster, one with a medium increased relative risk and one with a high increased relative risk.

In total, the data contain 18 different clusters, three different outbreak lengths, and two different increased risks, for a total of 108 different outbreak scenarios. For each of these scenarios, 1,000 outbreak data sets were simulated. In addition to these cluster models, Kulldorff et al. also provide three sets of null data to use for hypothesis testing, with 31, 32, or 33 days of data with no outbreak, each with 9,999 simulated data sets. The null data were generated as described above, but with no excess risk assigned to any of the city regions on any of the days.

3.5.2 Analysis Design

For our power analysis, we chose to focus on the 31-day data sets for our comparison between the CUSCAN and the space-time scan method. We consider these data sets the most informative, as they represent our ability to rapidly detect a new outbreak. We mirror the power analysis performed with the space-time scan statistic so that our results are directly comparable. We describe the analysis briefly below. Additional details may be found in the original work [Kulldorff et al., 2004].

Power analysis for the space-time scan statistic was performed on the 17 localized cluster data sets. A circular base, a maximum temporal window size of 3 days, and a population upper bound of 50% were used to determine the set of potential clusters. A significance level of $\alpha = 0.05$ was used for all hypothesis testing decisions. To account for the possibility of city-wide outbreaks, temporal-only clusters were included in the analysis. These temporal-only clusters were also computed using the space-time scan statistic, with
Figure 3.5: Simulated outbreak clusters in New York City. (a) Single zip codes (solid) and small clusters (solid + shaded). (b) Large clusters, whole borough. (c) Irregularly shaped clusters.
100% of the population included in the window at each time step. Significance was assessed using critical values, determined by applying the space-time scan statistic to 9,999 null data sets and identifying the 500th largest maximum statistic for each time period. The power was calculated as the proportion of maximum statistics from the 1,000 outbreak data sets that were higher than this critical value. No adjustments were made for multiple testing.

Kulldorff et al. also provide power from five selected outbreak models with the following changes to the above parameters: with temporal-only clusters excluded, with a population upper bound of 5% rather than 50%, setting the maximum temporal window size to 1 or 7 days rather than 3 days, and adjusting for multiple testing so that only one false alert would be expected per year. Power results from the temporal-only analysis are also included for these select data sets. In addition to three local cluster outbreak models (one single zip code, one small cluster, and one whole borough cluster), these models also include a city-wide outbreak (whole city) and null model (no outbreak). However, as the data set used to compute power in the no-outbreak setting for the space-time scan statistic is not specified, nor is any specific non-outbreak test data set provided in the benchmark data, we exclude this setting from our comparisons.

To make our analysis as comparable as possible to the analysis performed by Kulldorff et al., we likewise used circular windows with a population upper bound of 50% to determine potential clusters. A significance level of $\alpha = 0.05$ was used for all hypothesis testing decisions. When assessing significance, we chose to apply the CUSCAN exactly as designed, using p-values rather than critical values. Using the null data provided, we use the procedure described in Section 3.3.2 with $\tau_0 = 5$ days and $\gamma = 0.95$ to select a value of $k$ for use in the CUSUM formula given in Equation (3.4). These parameters led us to choose $k = 6.115$ for the CUSCAN.

Unlike the space-time scan statistic, the CUSCAN is not a natural choice for use on temporal-only clusters, as the spatial scan statistic cannot be calculated when all regions are included in a potential cluster. Instead, we simply use the non-restarting CUSUM as
described in Section 3.2.2 to search for temporal clusters by monitoring total case counts for the city. We assume that daily counts are distributed as a Poisson random variable. As the distribution of the data is assumed to be known in this case, we can compute $k$ in the CUSUM equation (3.4) analytically as follows [Hawkins and Olwell, 1998]:

\[ k = \frac{\mu_1 - \mu_0}{\ln \mu_1 - \ln \mu_0}, \]

where $\mu_0$ is the in-control mean estimated from null data, and $\mu_1$ is the smallest shifted mean we want to detect. We estimate the in-control mean $\mu_0$ using the first 15 days of data from each cluster model as a baseline. We consider values of $\mu_1$ corresponding to a 10%, 20%, or 30% increase in the mean number of cases.

We also provide additional power results for the CUSCAN for all localized clusters with the following modifications: (1) with temporal-only analysis excluded, and (2) with a population upper bound of 5% instead of 50%.

For the outbreak models including the irregularly shaped clusters, we calculate the power of the CUSCAN using the elliptic scan statistic. For these models, in addition to the circular regions from the circular scan statistic, we consider elliptic windows with shape $S = 1.5, 2, 3, 4, 5$, with 4, 6, 9, 12, and 15 different equally spaced rotational angles, respectively.

Finally, we also provide spatial precision and spatial recall calculations for the CUSCAN. As SaTScan does not include these metrics as part of the power analysis it can perform, we are not able to compare these results to the space-time scan statistic.

3.5.3 Results

A comparison of power between the CUSCAN and the space-time scan statistic (STSS) for the primary analysis (circular windows, 50% population upper bound) can be found in Table 3.2. A comparison between the CUSCAN and space-time scan statistic for the four select models described above and matching alternate parameter specifications can
be found in Table 3.3. Full power results for the CUSCAN with temporal-only analysis excluded and with population upper bound decreased to 5% can be found in Table 3.4, and results with the population upper bound set to 5% and temporal-only analysis included can be found in Table 3.5. Power results for the CUSCAN using the elliptic scan statistic are in Table 3.6. Finally, spatial precision and recall results for the CUSCAN are in Table 3.7.

Beginning with Table 3.2, we see that the CUSCAN performs about as well as the space-time scan statistic on smaller clusters, but excels at identifying larger clusters, with first-day detection power for high excess risk ranging from 0.862 to 0.996 for clusters of 10 or more regions, including the irregularly shaped Hudson River cluster (Figure 3.5c). This is especially noticeable when the increased risk is lower; the CUSCAN demonstrates higher first-day power to detect lower-intensity outbreaks in 12 out of 17 outbreak models, including those with smaller simulated clusters.

Table 3.3 shows the comparison between the CUSCAN and the space-time scan statistic (STSS) for the four selected data sets for which the same data could be used. We see that the CUSCAN suffers a larger drop in power than the space-time scan statistic when temporal-only analysis is excluded, with the exception of the whole-city outbreak, where the power was unaffected. We also see that, while the choice of population upper bound has a minimal effect on the power of the space-time scan statistic to detect small clusters (from 0.85 at 50% to 0.86 at 5% for the five-zip-code cluster, no change in the single-zip-code cluster), the power of the CUSCAN increases substantially, from 0.796 at 50% to 0.924 at 5% for the single-zip-code cluster and from 0.830 to 0.878 in the five-zip-code cluster. The CUSCAN also has higher power retention compared to the space-time scan statistic on the larger clusters, with a power of 0.868 at 5% in the large Manhattan cluster compared to 0.77 for the space-time scan statistic, and 0.969 for the whole-city outbreak compared to 0.4 for the space-time scan statistic.

While space-time scan statistic power results for lower population bounds or with temporal-only analysis excluded are not available for the remaining simulated clusters,
Table 3.2: Power to detect new outbreak on first day for the CUSCAN and the space-time scan statistic (STSS). Temporal-only analysis is included when testing for an increase in expected case counts of 10%, 20%, or 30%. The highest value in each row is marked with an asterisk.

High Excess Risk
Outbreak area                     No. of regions   Inc=10%   Inc=20%   Inc=30%   STSS
A. Williamsburg, Brooklyn         1                0.806     0.805     0.804     0.860*
B. Roosevelt Island, Manhattan    1                0.919     0.915     0.915     0.920*
C. Bulls Head, Staten Island      1                0.811     0.808     0.812     0.830*
D. LaGuardia, Queens              1                0.855     0.859*    0.856     0.850
E. West Farms, Bronx              1                0.838     0.842*    0.841     0.830
A with 4 neighbors                5                0.838     0.841     0.837     0.850*
B with 5 neighbors                6                0.872     0.894     0.902*    0.820
C with 4 neighbors                5                0.812     0.816     0.818     0.830*
D with 9 neighbors                10               0.914     0.923     0.929*    0.880
E with 4 neighbors                5                0.835     0.844     0.844     0.860*
Rockaways                         5                0.824     0.825     0.830     0.840*
Hudson River                      20               0.849     0.884     0.903*    0.660
Bronx                             25               0.960     0.962     0.964*    0.940
Brooklyn                          37               0.995     0.996*    0.996*    0.980
Manhattan                         40               0.975     0.975     0.976*    0.920
Queens                            62               0.996*    0.995     0.995     0.980
Staten Island                     12               0.862     0.867     0.870*    0.870*

Medium Excess Risk
Outbreak area                     No. of regions   Inc=10%   Inc=20%   Inc=30%   STSS
A. Williamsburg, Brooklyn         1                0.334     0.324     0.328     0.350*
B. Roosevelt Island, Manhattan    1                0.399*    0.387     0.388     0.370
C. Bulls Head, Staten Island      1                0.291     0.283     0.281     0.340*
D. LaGuardia, Queens              1                0.366*    0.364     0.359     0.320
E. West Farms, Bronx              1                0.323     0.325*    0.325*    0.290
A with 4 neighbors                5                0.435*    0.426     0.433     0.420
B with 5 neighbors                6                0.437     0.471     0.482*    0.400
C with 4 neighbors                5                0.338     0.351     0.357*    0.330
D with 9 neighbors                10               0.541     0.570     0.576*    0.420
E with 4 neighbors                5                0.397     0.406     0.412     0.430*
Rockaways                         5                0.330     0.334     0.323     0.340*
Hudson River                      20               0.432     0.455     0.472*    0.330
Bronx                             25               0.704     0.712     0.716     0.940*
Brooklyn                          37               0.900     0.899     0.901*    0.790
Manhattan                         40               0.685     0.690     0.700*    0.570
Queens                            62               0.871     0.868     0.879*    0.730
Staten Island                     12               0.432     0.441*    0.437     0.430
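The Inc = 10%, 20%, and 30% columns correspond to temporal-only CUSUMs whose reference value $k$ is computed from the Poisson formula given in Section 3.5.2. As a worked illustration, the sketch below evaluates that formula for an in-control mean of 100 cases per day. That mean is an assumption based on the benchmark's expected daily count, not the baseline estimate used in the analysis, and the function name is hypothetical.

```python
import math


def poisson_reference_value(mean0, mean1):
    """Poisson CUSUM reference value k = (mu1 - mu0) / (ln mu1 - ln mu0)."""
    return (mean1 - mean0) / (math.log(mean1) - math.log(mean0))


mean0 = 100.0  # assumed in-control daily mean, for illustration only
reference_values = {
    increase: poisson_reference_value(mean0, mean0 * (1 + increase))
    for increase in (0.10, 0.20, 0.30)
}
for increase, k in reference_values.items():
    print(f"Inc = {increase:.0%}: k = {k:.2f}")
```

The resulting $k$ always lies between the in-control mean and the shifted mean, so a larger targeted increase yields a larger reference value.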
Table 3.3: Power results on day 31 for alternate parameter specifications on select data sets for CUSCAN and space-time scan statistic (STSS). All models are high excess risk. CUSCAN results with purely temporal clusters included are calculated with $\mu_1 = 1.3\mu_0$.

Outbreak area                Maximum Size   Purely Temporal   CUSCAN   STSS
A. Williamsburg, Brooklyn    50%            Yes               0.804    0.860
                             50%            No                0.796    0.860
                             5%             No                0.924    0.860
                             N/A            Yes, only         0.185    0.190
A with 4 neighbors           50%            Yes               0.837    0.850
                             50%            No                0.830    0.850
                             5%             No                0.878    0.860
                             N/A            Yes, only         0.330    0.290
Manhattan                    50%            Yes               0.976    0.920
                             50%            No                0.973    0.920
                             5%             No                0.868    0.770
                             N/A            Yes, only         0.807    0.750
Whole city                   50%            Yes               1.000    0.860
                             50%            No                1.000    0.840
                             5%             No                0.969    0.400
                             N/A            Yes, only         0.807    0.750
Table 3.4: Power results with temporal-only analysis excluded and population upper bound reduced from 50% to 5%.

High Excess Risk
Outbreak area                     Regions   Population   50%     5%      Change
A. Williamsburg, Brooklyn         1         1.1%         0.796   0.924    0.128
B. Roosevelt Island, Manhattan    1         0.1%         0.913   0.912   -0.001
C. Bulls Head, Staten Island      1         1.1%         0.799   0.907    0.108
D. LaGuardia, Queens              1         0.5%         0.850   0.929    0.079
E. West Farms, Bronx              1         0.7%         0.830   0.900    0.070
A with 4 neighbors                5         4.0%         0.830   0.878    0.048
B with 5 neighbors                6         3.1%         0.834   0.894    0.060
C with 4 neighbors                5         3.3%         0.799   0.910    0.111
D with 9 neighbors                10        8.2%         0.892   0.828   -0.064
E with 4 neighbors                5         3.7%         0.827   0.911    0.084
Rockaways                         5         1.3%         0.801   0.899    0.098
Hudson River                      20        10.3%        0.761   0.707   -0.054
Bronx                             25        16.6%        0.960   0.832   -0.128
Brooklyn                          37        30.8%        0.995   0.836   -0.159
Manhattan                         40        19.0%        0.973   0.868   -0.127
Queens                            62        28.0%        0.995   0.897   -0.098
Staten Island                     12        5.5%         0.842   0.918    0.076

Medium Excess Risk
Outbreak area                     Regions   Population   50%     5%      Change
A. Williamsburg, Brooklyn         1         1.1%         0.305   0.441    0.136
B. Roosevelt Island, Manhattan    1         0.1%         0.375   0.364   -0.011
C. Bulls Head, Staten Island      1         1.1%         0.259   0.460    0.201
D. LaGuardia, Queens              1         0.5%         0.335   0.462    0.127
E. West Farms, Bronx              1         0.7%         0.298   0.440    0.142
A with 4 neighbors                5         4.0%         0.427   0.481    0.054
B with 5 neighbors                6         3.1%         0.397   0.452    0.055
C with 4 neighbors                5         3.3%         0.308   0.478    0.170
D with 9 neighbors                10        8.2%         0.519   0.417   -0.102
E with 4 neighbors                5         3.7%         0.382   0.506    0.124
Rockaways                         5         1.3%         0.300   0.463    0.163
Hudson River                      20        10.3%        0.387   0.349   -0.038
Bronx                             25        16.6%        0.701   0.474   -0.227
Brooklyn                          37        30.8%        0.893   0.494   -0.399
Manhattan                         40        19.0%        0.667   0.472   -0.195
Queens                            62        28.0%        0.860   0.542   -0.318
Staten Island                     12        5.5%         0.400   0.559    0.159
Table 3.5: Power results for population upper bound set to 5%. Temporal-only analysis is included when testing for an increase in expected case counts of 10%, 20%, or 30%.

High Excess Risk
Outbreak area                     No. of regions   Inc=10%   Inc=20%   Inc=30%
A. Williamsburg, Brooklyn         1                0.930     0.930     0.930
B. Roosevelt Island, Manhattan    1                0.920     0.916     0.917
C. Bulls Head, Staten Island      1                0.913     0.910     0.912
D. LaGuardia, Queens              1                0.932     0.935     0.932
E. West Farms, Bronx              1                0.908     0.911     0.910
A with 4 neighbors                5                0.887     0.892     0.889
B with 5 neighbors                6                0.919     0.932     0.938
C with 4 neighbors                5                0.920     0.920     0.920
D with 9 neighbors                10               0.859     0.879     0.890
E with 4 neighbors                5                0.918     0.921     0.920
Rockaways                         5                0.908     0.908     0.912
Hudson River                      20               0.804     0.847     0.872
Bronx                             25               0.869     0.908     0.918
Brooklyn                          37               0.927     0.971     0.976
Manhattan                         40               0.907     0.938     0.947
Queens                            62               0.936     0.973     0.983
Staten Island                     12               0.932     0.933     0.936

Medium Excess Risk
Outbreak area                     No. of regions   Inc=10%   Inc=20%   Inc=30%
A. Williamsburg, Brooklyn         1                0.469     0.466     0.479
B. Roosevelt Island, Manhattan    1                0.401     0.400     0.402
C. Bulls Head, Staten Island      1                0.497     0.491     0.489
D. LaGuardia, Queens              1                0.495     0.499     0.493
E. West Farms, Bronx              1                0.472     0.479     0.478
A with 4 neighbors                5                0.516     0.523     0.529
B with 5 neighbors                6                0.503     0.525     0.537
C with 4 neighbors                5                0.519     0.524     0.527
D with 9 neighbors                10               0.469     0.494     0.504
E with 4 neighbors                5                0.540     0.549     0.551
Rockaways                         5                0.491     0.499     0.500
Hudson River                      20               0.430     0.483     0.509
Bronx                             25               0.568     0.634     0.652
Brooklyn                          37               0.685     0.765     0.807
Manhattan                         40               0.559     0.619     0.656
Queens                            62               0.655     0.741     0.787
Staten Island                     12               0.597     0.609     0.609
Table 3.6: Power results for elliptic scan statistic. Temporal-only analysis is included when testing for an increase in expected case counts of 10%, 20%, or 30%.

Outbreak area   Excess Risk   Maximum Size   Inc=10%   Inc=20%   Inc=30%
Rockaways       High          50%            0.718     0.732     0.728
                High          5%             0.891     0.900     0.895
                Medium        50%            0.288     0.336     0.357
                Medium        5%             0.498     0.534     0.564
Hudson River    High          50%            0.887     0.909     0.922
                High          5%             0.875     0.905     0.920
                Medium        50%            0.460     0.480     0.496
                Medium        5%             0.483     0.518     0.540
Table 3.7: Average precision, recall, and estimated cluster size on day 31, 50% population upper bound.

High Excess Risk
Outbreak area                     Size (Population)   Estimated Size   Precision   Recall
A. Williamsburg, Brooklyn         1.1%                7.4%             0.794       0.999
B. Roosevelt Island, Manhattan    0.1%                1.6%             0.966       0.999
C. Bulls Head, Staten Island      1.1%                3.2%             0.868       0.986
D. LaGuardia, Queens              0.5%                3.7%             0.904       0.999
E. West Farms, Bronx              0.7%                3.3%             0.912       0.992
A with 4 neighbors                4.0%                17.9%            0.558       0.924
B with 5 neighbors                3.1%                14.1%            0.612       0.915
C with 4 neighbors                3.3%                7.2%             0.759       0.905
D with 9 neighbors                8.2%                25.7%            0.519       0.938
E with 4 neighbors                3.7%                11.7%            0.730       0.927
Rockaways                         1.3%                4.1%             0.880       0.887
Hudson River                      10.3%               37.2%            0.285       0.796
Bronx                             16.6%               31.4%            0.613       0.925
Brooklyn                          30.8%               45.1%            0.676       0.969
Manhattan                         19.0%               39.3%            0.498       0.902
Queens                            28.0%               45.2%            0.597       0.939
Staten Island                     5.5%                9.7%             0.837       0.855

Medium Excess Risk
Outbreak area                     Size (Population)   Estimated Size   Precision   Recall
A. Williamsburg, Brooklyn         1.1%                17.7%            0.501       0.980
B. Roosevelt Island, Manhattan    0.1%                7.7%             0.812       0.987
C. Bulls Head, Staten Island      1.1%                9.9%             0.644       0.869
D. LaGuardia, Queens              0.5%                11.2%            0.700       0.970
E. West Farms, Bronx              0.7%                13.0%            0.650       0.956
A with 4 neighbors                4.0%                25.2%            0.406       0.898
B with 5 neighbors                3.2%                24.8%            0.373       0.894
C with 4 neighbors                3.3%                11.6%            0.619       0.783
D with 9 neighbors                8.2%                29.3%            0.418       0.898
E with 4 neighbors                3.7%                18.2%            0.523       0.874
Rockaways                         1.3%                11.4%            0.654       0.785
Hudson River                      10.3%               37.3%            0.278       0.735
Bronx                             16.6%               31.9%            0.584       0.894
Brooklyn                          30.8%               42.8%            0.684       0.915
Manhattan                         19.0%               37.3%            0.504       0.836
Queens                            28.0%               43.1%            0.606       0.893
Staten Island                     5.5%                13.8%            0.722       0.786
Tables 3.4 and 3.5 show that the results in Table 3.3 appear to be typical for the CUSCAN. Table 3.4 shows that the CUSCAN, when performed without supplementary temporal-only analysis, performs only marginally worse in first-day power to detect, with an average decrease in power of 0.0227 in the high-risk model compared to the $\mu_1 = 1.3\mu_0$ model and 0.0327 in the medium-risk model when the population upper bound is 50%. Tables 3.4 and 3.5 also show the same pattern as described previously when the population upper bound is changed, where power to detect a cluster increases for small clusters but decreases for larger ones, with a more dramatic change when the excess risk is lower. However, the power is still quite high across all clusters, with the exception of the whole-borough clusters in the medium excess risk case with no temporal-only analysis included.

In Table 3.6 we see that replacing the circular-based scan statistic with an elliptic scan statistic improves the power to detect the irregularly shaped Hudson River and Rockaways clusters. For the high excess risk model, the maximum power to detect the Hudson River cluster increases from 0.903 with the circular scan statistic to 0.922 with a population upper bound of 50%, and for the medium excess risk model, the power increases from 0.472 to 0.496. While the power to detect the Rockaways cluster does not improve in the high-risk model, the power does improve in the medium-risk model, increasing from 0.323 to 0.357 with 50% population and from 0.500 to 0.564 with 5% population. We see the same pattern here as in previous trials: the method performs better when detecting larger clusters (20 regions in the Hudson River cluster compared to 5 in the Rockaways cluster), but power to detect small clusters improves substantially when the population upper bound is lowered.

Finally, in Table 3.7, we see that the CUSCAN has high spatial recall on the first day of the outbreak, indicating that most of the at-risk population is correctly being identified as part of the cluster. However, the spatial precision is low, especially for irregularly shaped clusters like the Hudson River. In essence, the area identified by the CUSCAN is larger than the actual outbreak area on average. This can be primarily attributed to the use of the circular scan statistic as the base of the CUSCAN: if the true cluster is not
circular, then the circular window that contains all the outbreak regions will necessarily include other regions as well that are added on to "fill in" the circle. This effect is most extreme on the elongated Hudson River cluster, where a large number of excess regions were added in when attempting to fit all of the outbreak regions inside a circular window.

3.6 Discussion

From the results of our simulation study, we demonstrate that the proposed CUSCAN methodology shows promise as a tool for the rapid detection of emerging disease clusters, with high power to detect a new cluster in the first time period of an outbreak and a high proportion of the true cluster population correctly identified. While no method will be universally most powerful in all cases, the CUSCAN performs well in identifying larger spatial clusters, and is strong in identifying clusters of any size when the intensity of the outbreak is lower. This makes it a powerful tool for detecting clusters with lower excess risk, which can often be missed.

While the CUSCAN method has more difficulty in identifying smaller clusters, such as the one-zip-code clusters in the benchmark data, the power can be improved by lowering the population upper bound, with the trade-off of making it more difficult to detect larger clusters instead. In practice, as the size of a potential future disease cluster will be unknown, it is up to the practitioner to decide whether they want to prioritize the search for one size of cluster over another. In the case where the detection of smaller clusters is prioritized, a more moderate value for the population upper bound such as 20% or 30% will increase the ability to detect smaller clusters while sacrificing less power to detect larger ones.

We note in Section 3.5.3 that, while the spatial recall of the CUSCAN is high, the precision tends to be low. We believe this to primarily be a weakness of the circular scan statistic used as the base of the CUSCAN, as many non-outbreak regions are often included within a circular window containing a non-circular cluster. Indeed, when the CUSCAN was applied to clusters that were more circular in shape in Section 3.4, the precision was as high as the recall, meaning few excess regions needed to be added. In
essence, low spatial precision in this method is primarily an artifact of using a scan method not ideally suited to the true shape of the clusters, and when paired with high recall, we see that the CUSCAN has a high rate of identifying the correct spatial location of a cluster as well.

Kulldorff et al. [2004] recommend that temporal-only analysis be included in surveillance contexts where the space-time scan statistic is used. We likewise recommend that temporal-only analysis be considered for use alongside the CUSCAN. We do acknowledge, however, that the form of temporal-only analysis we use in our demonstration does increase the number of parameters needing to be specified by the user. While the CUSCAN emphasizes data-driven calculations and requires few parameters to be specified by the user (specifically, $\tau_0$, $\gamma$, and the population upper bound), when temporal-only analysis is included using the non-restarting CUSUM, it is also required that the user specify a level of $\mu_1$ to detect. However, should the practitioner find this undesirable, the method used to determine $k$ in Section 3.3.2 may be used with observed counts rather than scan statistics, and the same levels of $\tau_0$ and $\gamma$ may be used for simplicity. Additionally, as demonstrated in Table 3.3, the CUSCAN method is also able to detect a city-wide outbreak on its own should one happen to occur, though more analysis may be necessary to recognize the outbreak as city-wide, as the CUSCAN would only be able to point to a cluster of specified size within the study area.

While we demonstrate the CUSCAN method using the circular and elliptic scan statistics, other types of scan statistics may be used with this method, including the flexible scan statistic. Indeed, any scan statistic that searches over a fixed set of potential clusters may be used. This flexibility allows the CUSCAN to remain relatively efficient if computational resources are scarce (circular scan statistic), and allows for the detection of irregularly shaped clusters when resources are available (flexible, elliptic scan statistic). Methods such as the flexible scan statistic are by necessity computationally intensive, however, and while useful, may not always be a realistic choice for the practitioner. While
the more intensive scan statistics do increase detection power of irregular clusters, such as the Hudson River and Rockaways clusters in the benchmark data, the method still performs reasonably well in detecting these clusters using the more computationally efficient circular scan statistic. Regardless of the computational resources available, the method should be a boon for practitioners when used in a prospective surveillance setting.

One additional aspect to consider when utilizing this method is the use of the nonrestarting CUSUM. While the nonrestarting CUSUM improves power during a sustained outbreak [Gandy and Lau, 2012], this comes at the expense of additional false alarms following the end of an outbreak. As the benchmark data set does not extend beyond the beginning of an outbreak, this false alarm problem is not demonstrated by our simulation study, but can be seen in other works utilizing the nonrestarting CUSUM such as in Dassanayake and French [2016]. While not currently implemented for this method, Hall and French [2019] proposed a correction for the nonrestarting CUSUM that controls post-outbreak false alarms that may be used to address this issue.
CHAPTER IV

AN EXTENSION OF THE CUSCAN FOR DYNAMIC SCANNING METHODS: THE RESTRICTED FLEXIBLE SCAN STATISTIC

4.1 Introduction

Rapid detection of emerging disease outbreaks is an essential aspect of public health, and often the primary goal of prospective surveillance methods. The more quickly a new outbreak is detected, the more effective any available intervention will be. For this reason, surveillance methods that boast high power and speed of detection are often favored. However, power of detection is not the only desirable quality in a surveillance method. When data include a spatial element, such as counts of disease indexed by county or census tracts, it is also essential that the location of an outbreak be identified correctly. A method that quickly detects a new outbreak, but cannot accurately determine its location, is of little use when further action is required at the outbreak site.

Spatial scan methods are popular for locating clusters of disease in spatial data. These methods search over a variety of potential cluster locations called windows, comparing the disease incidence rate relative to expected levels inside the window to that outside the window to determine if an outbreak is likely to exist within that given window. One of the most common ways the windows are determined is using a series of concentric circular areas, as circular windows can be computed using only a measure of distance and require relatively little computational complexity. Kulldorff [1997] originally proposed the circular scan method as a method for examining cross-sectional data, and it was later extended for use with time-indexed data as a prospective tool by Kulldorff [2001]. While the circular and space-time scan methods have high power to detect when an outbreak is present, their precision in determining which regions are included in the disease cluster is low when clusters are not approximately circular in shape, as a circular window that fully contains a noncircular cluster will by necessity also include many non-cluster regions. For this reason, it can be difficult to determine which regions within the identified outbreak area are actually in need of health intervention.
To reduce the problems caused by fixed-shaped scan methods, many scan methods have been developed with the goal of detecting irregularly shaped disease clusters. One such example is the flexibly shaped scan statistic [Tango and Takahashi, 2005], which searches over all connected subsets in a local neighborhood, allowing for the detection of clusters of arbitrary shape. However, as the number of regions in a study area or the size of the local neighborhoods increases, the number of potential clusters under consideration increases exponentially and the search quickly becomes computationally infeasible. This makes detecting larger clusters or clusters within large regions extremely difficult. In order to mitigate this, Tango and Takahashi [2012] proposed a modified version of the flexibly shaped scan method that reduces the number of potential clusters under consideration by first filtering out regions that are not likely to be hotspots. This effectively reduces the number of regions under consideration, speeding up computation considerably, while still allowing for identified clusters to take on a variety of shapes.

In Chapter III, we proposed a novel prospective surveillance method based on spatial scanning methods that we called the CUSCAN statistic. This method, defined by computing the cumulative sum (CUSUM) of spatial scan statistics, was shown to have high power to detect emerging disease clusters, making it promising for use as an active surveillance tool. However, as the CUSCAN primarily uses the circular scan method to compute the scan statistics at its core, it suffers from the same low precision seen in other circular-based scan methods. This weakness can potentially be eliminated by replacing the circular scan method with one designed to detect clusters of arbitrary shape. However, a current requirement of the CUSCAN method is that the set of potential clusters under consideration must be fixed ahead of time, so that the CUSUM statistic may be computed. This means that methods such as the restricted flexible scan statistic, whose potential clusters are determined by the observed data and vary over time, could not be used with the CUSCAN. Here, we propose a modification to the CUSCAN to allow the use of spatial scanning methods with time-variable windows by linking together overlapping
clusters between time periods. We demonstrate this extension by incorporating the restricted flexible scan method into the CUSCAN.

The structure of this chapter is as follows: in Section 4.2, we review the properties of the CUSCAN and restricted flexible scan methods. In Section 4.3, we describe in detail the process of connecting overlapping clusters through time that allows for the computation of the CUSCAN statistic. In Section 4.4, we demonstrate our modifications and compare the properties of the circular, elliptic, and restricted flexible CUSCAN methods using simulated benchmark data. Finally, in Section 4.5, we summarize our conclusions.

4.2 Review of Methods

4.2.1 The CUSCAN Method

In Chapter III, we introduced the CUSCAN method, which incorporates spatial data into a nonrestarting CUSUM framework by computing the cumulative sum of spatial scan statistics.

Suppose that we have a study area consisting of $N$ disjoint spatial regions. Let $n_1, \ldots, n_N$ denote the at-risk population within each region and $n_+ = \sum_{i=1}^{N} n_i$ the total population, both of which we assume to be fixed and unchanging over time. Let $Y_{1,t}, \ldots, Y_{N,t}$ denote the case counts in the associated regions at a given time $t$.

For a given study area with fixed population, we expect the global disease risk to remain constant over time, and to be equal to the local risk within each region when no outbreak is present. Suppose that we have $M$ time periods of baseline data with no outbreaks. Let $y_{t+} = \sum_{i=1}^{N} Y_{i,t}$ be the total case count for time $t$. We can compute the average global rate of disease as

$$r = \frac{\sum_{t=1}^{M} y_{t+}}{M n_+}. \tag{4.1}$$

We can then compute the expected case count at each region, also assumed to be constant, as $E_i = r n_i$.
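As an illustration only, the average global rate and the expected counts described above can be computed as in the following Python sketch. The function name `expected_counts` and the nested-list layout of the baseline counts are assumptions made for this example; they are not taken from the software used in this thesis.

```python
def expected_counts(baseline_counts, populations):
    """Return (r, E) for baseline data with no outbreaks.

    baseline_counts: list of M lists, one per time period, each holding
                     the N regional case counts for that period.
    populations:     list of the N fixed at-risk populations n_i.
    """
    num_periods = len(baseline_counts)                      # M
    total_population = sum(populations)                     # n_+
    total_cases = sum(sum(period) for period in baseline_counts)
    # Average global rate: total baseline cases over M * n_+.
    rate = total_cases / (num_periods * total_population)
    # Expected count in each region: E_i = r * n_i.
    return rate, [rate * n for n in populations]


# Hypothetical toy data: M = 2 baseline periods, N = 3 regions.
rate, expected = expected_counts([[2, 4, 6], [4, 4, 4]], [100, 200, 300])
```

With these toy inputs there are 24 baseline cases over 2 periods and a total population of 600, so the rate is 0.02 and the expected counts are proportional to the regional populations.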
For a given spatial scan method, we consider a set of possible clusters, $\mathcal{Z}$, assumed to be fixed, each consisting of a subset of contiguous regions within the study area. For each potential cluster $Z \in \mathcal{Z}$ and time $t$, define $Y_{in} = \sum_{i \in Z} Y_{i,t}$, $Y_{out} = \sum_{i \notin Z} Y_{i,t}$, $E_{in} = \sum_{i \in Z} E_i$, and $E_{out} = \sum_{i \notin Z} E_i$, where we suppress the dependency on $t$ for simplicity. We then begin by computing the spatial scan statistic for each potential cluster at each time period:

$$S_{Z,t} = \left(\frac{Y_{in}}{E_{in}}\right)^{Y_{in}} \left(\frac{Y_{out}}{E_{out}}\right)^{Y_{out}} I\left(\frac{Y_{in}}{E_{in}} > \frac{Y_{out}}{E_{out}}\right). \tag{4.2}$$

Once we have computed the Poisson scan statistic for each potential cluster $Z \in \mathcal{Z}$, we compute the CUSUM statistic for each cluster by:

$$C_{Z,0} = 0; \quad C_{Z,t} = \max\{0, C_{Z,t-1} + S_{Z,t} - k\}, \tag{4.3}$$

where $C_{Z,t}$ is the CUSUM statistic for cluster $Z$ at time $t$, and $k$ is a constant fitted via simulation to keep the CUSUM bounded near zero when no outbreak is present (see Section 3.3.2 for details on the selection of $k$). The CUSCAN statistic at time $t$ is then computed as the maximum CUSUM statistic for that time period:

$$C_t = \max_{Z \in \mathcal{Z}} \{C_{Z,t}\}. \tag{4.4}$$

The significance of the CUSCAN statistic is assessed via simulation. $N_{sim}$ data sets are simulated under the null hypothesis of no outbreak, and the CUSCAN statistic is computed for each. Let $C_t^{(i)}$ be the CUSCAN statistic at time $t$ for simulated data set $i$. We compute the p-value for the CUSCAN statistic $C_t$ as the proportion of statistics, including $C_t$, that are at least as large as $C_t$:

$$p_t = \frac{1 + \sum_{i=1}^{N_{sim}} I\left[C_t^{(i)} \geq C_t\right]}{1 + N_{sim}}. \tag{4.5}$$
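The four quantities defined above (the Poisson scan statistic, the per-cluster CUSUM recursion, the CUSCAN maximum, and the Monte Carlo p-value) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the dictionary-based layout are hypothetical, the set of clusters is taken to be fixed over time, and $E_{in}$ and $E_{out}$ are assumed to be strictly positive.

```python
def poisson_scan_stat(y_in, e_in, y_out, e_out):
    """Scan statistic for one cluster; zero unless risk inside exceeds risk outside."""
    rate_in = y_in / e_in
    rate_out = y_out / e_out
    if rate_in <= rate_out:          # indicator term is 0
        return 0.0
    return (rate_in ** y_in) * (rate_out ** y_out)


def cusum_update(previous, scan_stat, k):
    """One step of the nonrestarting CUSUM recursion."""
    return max(0.0, previous + scan_stat - k)


def cuscan_path(scan_stats_by_time, k):
    """CUSCAN statistic at each time for a fixed set of clusters.

    scan_stats_by_time: list over time of dicts {cluster_id: scan statistic}.
    """
    cusum = {z: 0.0 for z in scan_stats_by_time[0]}   # C_{Z,0} = 0
    path = []
    for stats in scan_stats_by_time:
        for z, s in stats.items():
            cusum[z] = cusum_update(cusum[z], s, k)
        path.append(max(cusum.values()))              # maximum over clusters
    return path


def monte_carlo_p_value(observed, simulated):
    """Proportion of statistics, including the observed one, at least as large."""
    exceed = sum(1 for c in simulated if c >= observed)
    return (1 + exceed) / (1 + len(simulated))
```

Because the statistic is a product of powers, it can become very large for high counts; working with its logarithm is a common alternative, although that is not what is written in the equations above.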
If $p < \alpha$ for a given significance level $\alpha$, we declare that an outbreak is present at location $Z$, where $Z$ is the maximal cluster from equation (4.4).

Currently, the CUSCAN method requires that the set of potential clusters, $\mathcal{Z}$, be fixed ahead of time so that each $Z \in \mathcal{Z}$ appears in every time period and $C_{Z,t}$ may be computed. In Chapter III, the following two scan methods were used to determine $\mathcal{Z}$, each depending only on the fixed population and regions within the study area:

1. Circular scan statistic [Kulldorff, 1997]. Beginning at the centroid of each region in the study area, expand a circular window to include nearby regions until the population inside the largest window reaches some prespecified upper bound, commonly 50% of the total at-risk population. A region is included in a potential cluster if its centroid is within the circular window.

2. Elliptic scan statistic [Kulldorff et al., 2006]. Beginning at the centroid of each region in the study area, expand an elliptic window to include nearby regions until the population upper bound is reached. These ellipses are defined in terms of their shape $S = a/b$, the ratio of the major and minor axes, and their angle, the angle between the major axis and the horizontal axis.

4.2.2 The Restricted Flexible Scan Method

In Chapter III, it was additionally noted that the flexibly shaped scan statistic [Tango and Takahashi, 2005], which searches all connected subsets of regions within a neighborhood of fixed size, may also be used with the CUSCAN. However, the flexibly shaped scan statistic is severely computationally intensive, making it impractical for use in detecting clusters containing more than a handful of regions at a time. To remove this weakness, Tango and Takahashi [2012] proposed a restricted flexibly shaped scan method that significantly reduces the number of regions under consideration when computing connected subsets, drastically reducing the number of potential clusters and computation time.
Suppose as before that we have $N$ disjoint regions in our study area, each with fixed population size $n_1, \ldots, n_N$ and time-indexed case counts $Y_{1,t}, \ldots, Y_{N,t}$, with the global rate $r$ and expected counts $E_1, \ldots, E_N$. The flexibly shaped scan method determines the set of potential clusters $\mathcal{Z}$ as follows: for each region in the study area, create a local neighborhood consisting of the region and its $k - 1$ nearest neighbors, measured by intercentroid distance. For each local neighborhood of $k$ regions, identify all connected subsets of regions within the neighborhood. This collection of connected subsets is our set of potential clusters $\mathcal{Z}$, and the Poisson scan statistic for each $Z \in \mathcal{Z}$ is then computed as:

$$S_Z = \left(\frac{Y_{in}}{E_{in}}\right)^{Y_{in}} \left(\frac{Y_{out}}{E_{out}}\right)^{Y_{out}} I\left(\frac{Y_{in}}{E_{in}} > \frac{Y_{out}}{E_{out}}\right), \tag{4.6}$$

with the most likely cluster determined to be the $Z$ that satisfies:

$$\max_{Z \in \mathcal{Z}} \{S_Z\}. \tag{4.7}$$

By determining the set of potential clusters $\mathcal{Z}$ in this way, the flexibly shaped scan method is able to detect clusters of up to $k$ regions with arbitrary shape, making it a useful tool for identifying irregularly shaped disease clusters.

It is easy to see that the number of possible clusters under consideration increases exponentially as either the neighborhood size $k$ or the number of regions in the study area $N$ increase. In order to reduce the number of potential clusters under consideration, Tango and Takahashi [2012] suggest that regions first be filtered based on their local risk in order to determine which regions are most likely to be experiencing an outbreak.

To determine whether a region has a sufficiently high local risk, we compute its middle p-value:

$$mp_i = P\left(Y_i \geq y_i + 1 \mid Y_i \sim \text{Poisson}(E_i)\right) + \frac{1}{2} P\left(Y_i = y_i \mid Y_i \sim \text{Poisson}(E_i)\right), \tag{4.8}$$

where $y_i$ is the actual observed count in region $i$. This middle p-value is then compared to
a predetermined threshold $\alpha_1$. Let $R \subseteq \{1, \ldots, N\}$ be the set of regions with $mp_i < \alpha_1$. Rather than determine all connected subsets of all regions in the study area up to a maximum neighborhood size, we instead search only for connected subsets of regions in $R$. This restricted search defines a much smaller set of potential clusters for $\mathcal{Z}$, significantly reducing the amount of computations necessary to compute the spatial scan statistics.

4.3 CUSCAN with the Restricted Flexible Scan Method

One effect of the region filtering described in Section 4.2.2 is that the set of potential clusters $\mathcal{Z}$ cannot be predetermined and will vary based on actual observed case counts. For this reason, when we have a time series of case counts $Y_{1,t}, \ldots, Y_{N,t}$, $\mathcal{Z}_t$ will describe a different set of potential clusters at each time index $t$. As a given cluster $Z$ is not guaranteed to be in every $\mathcal{Z}_t$, we could not originally compute the CUSCAN statistic as described in Chapter III and Section 4.2.1.

We propose that, rather than requiring the same cluster $Z$ for each time period, we can instead find a chain of overlapping clusters $Z_1, \ldots, Z_t$ to compute the CUSCAN statistic. Let $\mathcal{Z}_t$ be the set of potential clusters at time $t$. At time $t = 1$, we can compute the CUSCAN directly by calculating the Poisson scan statistic for each cluster in $\mathcal{Z}_1$ and using equation (4.3). For each subsequent $t \geq 2$, we use the following method to compute the CUSCAN:

1. Compute the middle p-value for each region using equation (4.8) and determine the set of potential clusters $\mathcal{Z}_t$.

2. Compute the Poisson scan statistic (4.2) for each potential cluster $Z \in \mathcal{Z}_t$.

3. Order the statistics $S_{Z,t}$ from largest to smallest, $S_t^{(1)}, \ldots, S_t^{(|\mathcal{Z}_t|)}$. Let $Z_t^{(1)}, \ldots, Z_t^{(|\mathcal{Z}_t|)}$ denote the regions associated with each statistic.

4. From this ordered list, determine the set of nonoverlapping likely clusters $\mathcal{N}_t$:

   a. Add $Z_t^{(1)}$ to $\mathcal{N}_t$, as it is associated with the largest scan statistic $S_t^{(1)}$.
   b. Identify the next largest statistic $S_t^{(i)}$ such that $Z_t^{(i)}$ shares no regions in common with $Z_t^{(1)}$, and add it to $\mathcal{N}_t$.

   c. Identify the next largest statistic $S_t^{(j)}$ such that $Z_t^{(j)}$ shares no regions in common with $Z_t^{(1)}$ or $Z_t^{(i)}$ and add it to $\mathcal{N}_t$.

   d. Continue this process until no further nonoverlapping clusters can be found.

5. For each cluster $Z_t$ in $\mathcal{N}_t$:

   a. Identify the cluster $Z_{t-1}$ in $\mathcal{N}_{t-1}$ with the largest CUSCAN statistic that overlaps with $Z_t$.

      i. If no such cluster exists, compute the CUSUM statistic in equation (4.3) with the scan statistic from $Z_t$ and an initial CUSCAN statistic of 0.

      ii. Return to a. for the next cluster.

   b. Compute the CUSUM statistic as in equation (4.3) using the scan statistic from $Z_t$ and the CUSCAN statistic from $Z_{t-1}$.

6. Calculate the maximum CUSCAN statistic at time $t$ by taking the maximum of the CUSUM streams as in equation (4.4).

7. Increment $t$ and return to 1.

To demonstrate how this process links clusters together over time, we provide the following example. Suppose we have a study area that we want to calculate the CUSCAN for over three consecutive time periods. Using the restricted flexible scan method, we compute the scan statistics for each cluster in $\mathcal{Z}_1, \mathcal{Z}_2, \mathcal{Z}_3$. After ordering the scan statistics and identifying the nonoverlapping clusters, we have:

$$\mathcal{N}_1 = \{Z_1^{(1)} = \{1, 3, 4\},\; Z_1^{(2)} = \{7, 9, 11\},\; Z_1^{(3)} = \{18, 21, 22\}\}$$
$$\mathcal{N}_2 = \{Z_2^{(1)} = \{4, 5, 7\},\; Z_2^{(2)} = \{9, 12, 13\},\; Z_2^{(3)} = \{16, 17, 20\}\}$$
$$\mathcal{N}_3 = \{Z_3^{(1)} = \{3, 4, 5\},\; Z_3^{(2)} = \{8, 10, 11\},\; Z_3^{(3)} = \{17, 18, 20\}\}$$
where $Z_t^{(1)}$ is the region associated with the largest scan statistic, $Z_t^{(2)}$ the second largest, and $Z_t^{(3)}$ the third largest. These regions would then be linked together in the following way:

Cluster      t = 1               t = 2               t = 3
Z_t^(1)      {1, 3, 4}      -->  {4, 5, 7}      -->  {3, 4, 5}
Z_t^(2)      {7, 9, 11}     -->  {9, 12, 13}         {8, 10, 11}
Z_t^(3)      {18, 21, 22}        {16, 17, 20}   -->  {17, 18, 20}

Analyzing the progression, we see that at $t = 2$:

• Cluster $\{4, 5, 7\}$ in $\mathcal{N}_2$ overlaps with both $\{1, 3, 4\}$ and $\{7, 9, 11\}$ in $\mathcal{N}_1$. We link it to the cluster with the larger test statistic, $\{1, 3, 4\}$.

• Cluster $\{9, 12, 13\}$ in $\mathcal{N}_2$ overlaps only with $\{7, 9, 11\}$ in $\mathcal{N}_1$, so we link them together.

• Cluster $\{16, 17, 20\}$ in $\mathcal{N}_2$ does not overlap any clusters in $\mathcal{N}_1$, and so a new chain begins assuming $S_{t-1} = 0$.

• Since no cluster in $\mathcal{N}_2$ overlaps with $\{18, 21, 22\}$ in $\mathcal{N}_1$, the chain ends at $t = 1$.

When we increment $t$, we see that at $t = 3$:

• Cluster $\{3, 4, 5\}$ in $\mathcal{N}_3$ only overlaps with $\{4, 5, 7\}$ in $\mathcal{N}_2$, so we link them together.

• Cluster $\{8, 10, 11\}$ in $\mathcal{N}_3$ does not overlap any clusters in $\mathcal{N}_2$, and so a new chain begins assuming $S_{t-1} = 0$.

• Cluster $\{17, 18, 20\}$ in $\mathcal{N}_3$ only overlaps $\{16, 17, 20\}$ in $\mathcal{N}_2$, so we link them together.

• Since no cluster in $\mathcal{N}_3$ overlaps $\{9, 12, 13\}$ in $\mathcal{N}_2$, the chain ends at $t = 2$.

In this way, we are able to chain together potential clusters from $t = 1$ to $t = 3$, even though the set of nonoverlapping clusters changes over time.
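The selection of nonoverlapping clusters and the linking of overlapping clusters through time can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used for the results in this chapter: the function names, the scan statistic values, and the constant `k = 2.0` are all hypothetical, and only the region sets are taken from the example above.

```python
def select_nonoverlapping(candidates):
    """Greedily keep clusters in order of decreasing scan statistic,
    skipping any cluster that shares a region with one already kept."""
    kept, used = [], set()
    for regions, stat in sorted(candidates, key=lambda c: c[1], reverse=True):
        if used.isdisjoint(regions):
            kept.append((regions, stat))
            used |= regions
    return kept


def link_and_update(previous, current, k):
    """previous: list of (regions, cusum) at time t - 1.
    current:  list of (regions, scan_stat) at time t.
    Returns a list of (regions, cusum, parent_regions) for time t."""
    linked = []
    for regions, stat in current:
        overlapping = [p for p in previous if p[0] & regions]
        if overlapping:
            # Link to the overlapping cluster with the largest CUSUM value.
            parent, start = max(overlapping, key=lambda p: p[1])
        else:
            # No overlap: a new chain begins from zero.
            parent, start = None, 0.0
        linked.append((regions, max(0.0, start + stat - k), parent))
    return linked


def run_chains(clusters_by_time, k):
    """Apply the linking step over all time periods."""
    previous, history = [], []
    for current in clusters_by_time:
        linked = link_and_update(previous, current, k)
        history.append(linked)
        previous = [(regions, cusum) for regions, cusum, _ in linked]
    return history


def fs(*regions):
    return frozenset(regions)


# Region sets from the worked example; the scan statistics are made up.
example = [
    [(fs(1, 3, 4), 5.0), (fs(7, 9, 11), 4.0), (fs(18, 21, 22), 3.0)],
    [(fs(4, 5, 7), 6.0), (fs(9, 12, 13), 3.5), (fs(16, 17, 20), 3.0)],
    [(fs(3, 4, 5), 5.5), (fs(8, 10, 11), 4.0), (fs(17, 18, 20), 3.0)],
]
history = run_chains(example, k=2.0)
# CUSCAN statistic at each time: the maximum over the CUSUM streams.
cuscan = [max(cusum for _, cusum, _ in step) for step in history]
```

With these made-up values, the cluster {4, 5, 7} is linked to {1, 3, 4} at $t = 2$ and the clusters {16, 17, 20} and {8, 10, 11} each start a new chain, matching the links described in the example.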
Using this method of chaining together clusters that overlap between time periods, we are able to compute several CUSUM streams over time when the set of potential clusters is dynamic rather than fixed. After calculating the CUSUM streams, we would then take the maximum at each time period as our CUSCAN statistic.

4.4 Demonstration: Simulated Data based on Northeast Benchmark Data

The data we use to demonstrate the CUSCAN with the restricted flexible scan method is based on benchmark data constructed by Kulldorff et al. [2003] and Duczmal et al. [2006]. These data sets were inspired by breast cancer mortality data from 1988–1992 in the northeastern United States, including various regions within Connecticut, Delaware, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, Vermont, and the District of Columbia. The map of the area contains 245 distinct regions with a total at-risk population of approximately 29.5 million, with the population in each region measured as the female population in the region as recorded in the 1990 U.S. Census.

The data sets we utilize here come from Duczmal et al. [2006], which contains many irregularly shaped disease clusters. For each cluster, 10,000 cross-sectional data sets were simulated, each totaling 600 cases. Additionally, 99,999 cross-sectional null data sets are provided where the 600 total cases are randomly distributed across all regions.

The clusters we chose to analyze are clusters B, C, E, F, and G from this data set, chosen to provide a varied selection of irregular shapes and sizes. These five clusters are shown in Figure 4.1 and summarized in Table 4.1.

As the restricted flexible scan method requires that subsets within a neighborhood be connected, a spatial adjacency matrix was required. The spatial adjacency matrix used was provided along with the simulated data sets as part of the neastbenchmark R package, which is available at http://www.githib.com/jpfrench81/neastbenchmark. This matrix was primarily auto-generated with the R package spdep, which automatically locates adjacent regions from the regional map, and updated to connect an island region to the rest
Figure 4.1: Simulated disease clusters with irregular shapes.

Table 4.1: Size and population of selected simulated disease clusters with descriptions from original source.

Cluster   Description                     No. of Regions   Population
B         Hudson River                    16               5.7%
C         Lake Ontario Coast              7                2.4%
E         Susquehanna River               21               5.0%
F         New England Coast               23               10.8%
G         Pennsylvania External Border    26               8.4%
of the contiguous regions. The neastbenchmark package has been updated since the time of this study, and the results that follow utilize the connectivity matrix in neastw_old.

Since the CUSCAN is a method for detecting emerging clusters in time series data, we simulated a set of time-indexed data using the cross-sectional data sets provided. For each outbreak data set, 30 cross-sectional null data sets and 3 cross-sectional outbreak data sets were randomly selected and placed together sequentially. For each null data set, 33 cross-sectional null sets were randomly selected. This allowed us to simulate a time-indexed study with 30 periods of no outbreak followed by 3 periods where an outbreak is present. We simulated 9,999 null data sets, and 1,000 outbreak data sets for each of the five selected clusters.

In order to observe the effect that the restricted flexible method has on the CUSCAN, we compare the performance of the CUSCAN with restricted flexible base to the CUSCAN with circular and elliptic bases. Rather than using a $k$-nearest neighbor approach to the restricted flexible scan method as described in Section 4.2.2, we instead use a population-based nearest neighbor approach to determining the neighborhoods, with each local neighborhood extending from its centroid until up to 50% of the total population of the study area is in the neighborhood. This ensures that the circular, elliptic, and restricted flexible scan windows all have the capability to detect clusters of the same size.

To analyze the performance of the restricted flexible CUSCAN, we use the same metrics as in Chapter III:

1. Basic power: the basic power at time $t$ is the proportion of data sets where the outbreak was detected by time $t$.

2. Delay: the difference between the time of detection $t$ and the true start of the outbreak $t_s$, i.e., $delay = t - t_s$. If the outbreak was detected in the first time period it was present, $delay = 0$.

3. Spatial precision: the precision is the proportion of the identified cluster contained in the true cluster. Let $A_t$ be the identified cluster at time $t$, $A$ be the true cluster, and $n_Z$ the population of the set of regions $Z$. The precision at time $t$ is defined based on population and is equal to 1 when $A_t \subseteq A$:

$$precision_t = \frac{n_{A_t \cap A}}{n_{A_t}}.$$

4. Spatial recall: the recall is the proportion of the true cluster contained in the identified cluster. The recall at time $t$ is likewise defined based on population and is equal to 1 when $A \subseteq A_t$:

$$recall_t = \frac{n_{A_t \cap A}}{n_A}.$$

5. False alarm rate: the false alarm rate at time $t$ is the proportion of data sets that produced an alarm at time $t$ when no outbreak was present.

The restricted flexible CUSCAN method is implemented with $\alpha_1 = 0.1, 0.15, 0.2$. The elliptic CUSCAN was implemented with shape parameters $S = 1, 2, 4$ with 1, 6, and 12 rotational angles respectively. A significance level of $\alpha = 0.05$ is used in all power calculations. Unlike in Chapter III, we did not include any temporal-only clusters in this analysis.

Basic power, false alarm rate, and average delay results for all three methods are found in Table 4.2, average spatial precision in Table 4.3, and average spatial recall in Table 4.4.

Starting with the basic power in Table 4.2, we see that while the restricted flexible CUSCAN has reasonably good power on the smaller or more connected clusters, with first-day power of 0.691–0.746 for cluster B, 0.721–0.787 for cluster C, and 0.725–0.740 for cluster E, the power drops off when clusters contain a large number of regions with few connections between them. For clusters F and G, which consist of long chains of regions with minimal connections between them, the power was considerably lower at 0.353–0.430 for cluster F and 0.421–0.484 for cluster G. All clusters had the highest power at either
Table 4.2: Basic power for the CUSCAN method with restricted flexible (R), circular (C), and elliptic (E) base.

                              Power on Day
Cluster   Method   α1      31      32      33      Delay (Days)   False Alarm Rate
B         R        0.10    0.740   0.948   0.990   0.295          0.050
          R        0.15    0.746   0.959   0.994   0.285          0.049
          R        0.20    0.691   0.932   0.987   0.356          0.049
          C        -       0.784   0.962   0.995   0.245          0.049
          E        -       0.827   0.975   0.993   0.185          0.049
C         R        0.10    0.787   0.958   0.991   0.239          0.050
          R        0.15    0.768   0.952   0.990   0.263          0.049
          R        0.20    0.721   0.924   0.983   0.327          0.049
          C        -       0.887   0.992   1.000   0.121          0.049
          E        -       0.909   0.991   1.000   0.100          0.050
E         R        0.10    0.725   0.925   0.980   0.316          0.051
          R        0.15    0.740   0.950   0.991   0.295          0.050
          R        0.20    0.735   0.952   0.991   0.298          0.050
          C        -       0.802   0.970   0.977   0.223          0.051
          E        -       0.851   0.984   0.996   0.158          0.051
F         R        0.10    0.413   0.689   0.809   0.638          0.051
          R        0.15    0.430   0.698   0.818   0.621          0.050
          R        0.20    0.353   0.628   0.764   0.716          0.049
          C        -       0.698   0.905   0.970   0.347          0.049
          E        -       0.738   0.936   0.978   0.288          0.049
G         R        0.10    0.474   0.734   0.863   0.600          0.052
          R        0.15    0.484   0.744   0.871   0.590          0.051
          R        0.20    0.421   0.703   0.837   0.657          0.051
          C        -       0.452   0.711   0.842   0.619          0.050
          E        -       0.557   0.796   0.916   0.523          0.049
Table 4.3: Average precision for the CUSCAN method with restricted flexible (R), circular (C), and elliptic (E) base.

                              Precision on Day
Cluster   Method   α1      31      32      33
B         R        0.10    0.881   0.890   0.894
          R        0.15    0.884   0.858   0.851
          R        0.20    0.803   0.809   0.803
          C        -       0.679   0.723   0.753
          E        -       0.672   0.719   0.749
C         R        0.10    0.928   0.939   0.950
          R        0.15    0.909   0.925   0.933
          R        0.20    0.863   0.876   0.875
          C        -       0.735   0.804   0.830
          E        -       0.733   0.803   0.828
E         R        0.10    0.853   0.878   0.882
          R        0.15    0.808   0.821   0.824
          R        0.20    0.779   0.793   0.793
          C        -       0.548   0.566   0.574
          E        -       0.547   0.561   0.572
F         R        0.10    0.851   0.884   0.894
          R        0.15    0.819   0.863   0.873
          R        0.20    0.754   0.798   0.802
          C        -       0.639   0.664   0.663
          E        -       0.672   0.719   0.749
G         R        0.10    0.760   0.807   0.923
          R        0.15    0.710   0.758   0.754
          R        0.20    0.633   0.676   0.675
          C        -       0.547   0.556   0.573
          E        -       0.672   0.719   0.749
Table 4.4: Average recall for the CUSCAN method with restricted flexible (R), circular (C), and elliptic (E) base.

                              Recall on Day
Cluster   Method   α1      31      32      33
B         R        0.10    0.407   0.392   0.373
          R        0.15    0.478   0.459   0.441
          R        0.20    0.539   0.507   0.481
          C        -       0.606   0.638   0.665
          E        -       0.605   0.634   0.664
C         R        0.10    0.560   0.597   0.586
          R        0.15    0.637   0.621   0.605
          R        0.20    0.705   0.682   0.662
          C        -       0.758   0.755   0.761
          E        -       0.754   0.756   0.760
E         R        0.10    0.321   0.294   0.296
          R        0.15    0.371   0.346   0.350
          R        0.20    0.446   0.414   0.415
          C        -       0.643   0.690   0.722
          E        -       0.632   0.680   0.714
F         R        0.10    0.243   0.238   0.224
          R        0.15    0.285   0.280   0.267
          R        0.20    0.327   0.311   0.296
          C        -       0.622   0.656   0.693
          E        -       0.601   0.648   0.682
G         R        0.10    0.206   0.195   0.188
          R        0.15    0.243   0.222   0.214
          R        0.20    0.312   0.267   0.261
          C        -       0.388   0.389   0.385
          E        -       0.350   0.369   0.359
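For reference, the population-based spatial precision and recall summarized in Tables 4.3 and 4.4 can be computed for a single identified cluster as in the Python sketch below. The function name and the dictionary of regional populations are assumptions made for this illustration, and the toy populations are not taken from the benchmark data.

```python
def spatial_precision_recall(identified, true_cluster, population):
    """Population-weighted precision and recall of an identified cluster.

    identified, true_cluster: sets of region identifiers.
    population: dict mapping region identifier to at-risk population.
    """
    overlap = sum(population[i] for i in identified & true_cluster)
    n_identified = sum(population[i] for i in identified)
    n_true = sum(population[i] for i in true_cluster)
    # Precision: share of the identified population that is truly in the cluster.
    # Recall: share of the true cluster population that was identified.
    return overlap / n_identified, overlap / n_true


# Hypothetical toy example with four regions.
population = {1: 100, 2: 200, 3: 300, 4: 400}
precision, recall = spatial_precision_recall({1, 2, 3}, {2, 3, 4}, population)
```

Here the overlap population is 500, the identified cluster holds 600 people and the true cluster 900, so precision exceeds recall, the same pattern seen for the restricted flexible base in the tables.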
$\alpha_1 = 0.10$ or $\alpha_1 = 0.15$, with power dropping off at $\alpha_1 = 0.2$. In contrast, the circular and elliptic CUSCAN boast higher overall power in four out of five clusters, especially on the minimally connected cluster F. However, on the large and spread-out cluster G, the circular CUSCAN fared no better than the restricted flexible CUSCAN, and the elliptic only slightly better. Despite low starting power in some areas, the power increases rapidly over time for all three methods in all cases.

The tradeoff in terms of power comes in the form of higher precision for the restricted flexible CUSCAN. In Table 4.3, we see that the restricted flexible method has high precision across the board, with the highest precision in the smallest cluster C and lowest in the largest cluster G. As expected, the precision increases as $\alpha_1$ decreases, since regions included in the identified cluster are more likely to be true hotspots. The circular and elliptic methods have noticeably lower precision for all clusters, as they require that non-hotspot regions be included in order to create a circle or ellipse of regions when clusters are not already approximately circular or approximately elliptic. Like basic power, the precision tends to increase over time as more time periods of data are accumulated.

Finally, in Table 4.4, we notice that the restricted flexible CUSCAN tends to have low spatial recall, especially in clusters with more regions. The recall improves as $\alpha_1$ increases, but does not surpass the recall of the circular and elliptic CUSCAN, which are uniformly more likely to contain a higher proportion of the true at-risk population. For the circular and elliptic CUSCAN, the recall either remains approximately flat or increases slightly as the outbreak continues, whereas with the restricted flexible CUSCAN, the recall often decreases slightly with time.

4.5 Discussion

In this chapter, we proposed an extension of the CUSCAN to allow the use of spatial scan methods with time-variable sets of potential clusters. Previously, the CUSCAN method required that the spatial scan method at its base have a fixed set of potential clusters based only on the study region and population, so that the same potential cluster
could be followed over time. Here, we described a modified approach, where instead of following clusters with the same sets of regions over time, we instead link overlapping clusters together to compute the CUSUM streams. This allows us to use methods without fixed sets of windows, here demonstrated with the restricted flexibly shaped scan method, to detect clusters of arbitrary shape.

While the power of the restricted flexible CUSCAN was low relative to the circular or elliptic CUSCAN, it had considerably higher precision than either the circular or elliptic methods, as each region within the identified cluster was much more likely to be an actual hotspot. When the circular and elliptic methods are used to detect irregularly shaped clusters, many non-outbreak regions are by necessity included in the identified cluster in order to "fill out" the circle or ellipse of zones. The restricted flexible method, on the other hand, imposes no shape restrictions, and so does not need to add in extra zones. This results in higher precision and identified clusters of more correct shapes.

The downside to the restricted flexible method is that it requires connectivity in the subsets searched. Consider for example cluster G from the data demonstration in Section 4.4. This cluster consists of a long chain of regions connected in a line. If one region in the middle of this cluster produces a middle p-value high enough to be excluded from consideration, then the restricted flexible method will not be able to connect the two sides of the cluster, and will falter when trying to detect it. In addition to low power, spatial recall will also be low in these cases, as many regions in the true cluster will not be in the identified cluster, since without the connection no potential cluster contains all of the outbreak zones. Indeed, this is the primary reason why power and recall are lower in clusters F and G, where a single broken connection means that not all hotspot regions are able to appear in the same potential cluster. In effect, the weaknesses of the restricted flexible CUSCAN are due primarily to the weaknesses in the restricted flexible scan method at its base. However, even though the initial power to detect an outbreak is low, power increases considerably over multiple time periods, and by day three of the outbreak
the power of the CUSCAN is high across all the irregularly shaped clusters.

One question brought up by the results of the data demonstration in Section 4.4 is why the recall for the restricted flexible CUSCAN appears to diminish over time. We believe this is an effect caused by a combination of the region filtering that occurs with the restricted flexible scan method and the process of matching clusters by overlap. Due to random variation, the middle p-value in a specific region will fluctuate over time, and the region may or may not be included in potential clusters in any given time period. Because of this, the clusters detected by the CUSCAN tend to feature primarily regions that were included in the cluster in all outbreak times. Since the probability that a region will be included in potential clusters in all time periods diminishes as the number of periods increases, the clusters that are selected tend to diminish in size over time. This is one of the primary weaknesses in the restricted flexible CUSCAN, as the effect of underestimating the size of a disease cluster is typically undesirable.

Another weakness posed by the use of the restricted flexible CUSCAN is that of computational limitations. For the data demonstration in Section 4.4, we limited the $\alpha_1$ threshold for region filtering to values of 0.10, 0.15, and 0.20. This is due primarily to the fact that the number of potential clusters, and thus the computation time required to calculate the scan statistics and the CUSCAN statistics, increases exponentially as additional regions are added for consideration. At $\alpha_1 = 0.25$ and above, the number of regions is high enough that benchmarking the process on data with a large number of regions or large numbers of data sets becomes infeasible. While there is additional computation time added to this from the CUSCAN as we compute the CUSUM statistics for each data set, since we deal primarily with nonoverlapping sets of clusters, the number of CUSUM streams computed is often small and the additional computation time is minimal. The bulk of the computational complexity comes from the restricted flexible scan method and computing the initial scan statistics, which is a known weakness of the all-connected-subsets approach to determining potential clusters.
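To make the filtering step and the growth in the number of potential clusters concrete, the following Python sketch computes the middle p-value, filters regions against the threshold $\alpha_1$, and enumerates every connected subset of the retained regions by brute force. It is an illustration under assumptions, not the implementation benchmarked in this chapter: the function names, the adjacency dictionaries, and the toy counts are hypothetical, and no neighborhood size or population bound is imposed.

```python
import math


def middle_p_value(y, expected):
    """P(Y >= y + 1) + 0.5 * P(Y = y) for Y ~ Poisson(expected)."""
    pmf = math.exp(-expected)      # P(Y = 0)
    cdf = pmf
    for j in range(1, y + 1):
        pmf *= expected / j        # P(Y = j) from P(Y = j - 1)
        cdf += pmf
    return (1.0 - cdf) + 0.5 * pmf


def filter_regions(observed, expected, alpha1):
    """Regions whose middle p-value falls below the threshold alpha1."""
    return {i for i in observed
            if middle_p_value(observed[i], expected[i]) < alpha1}


def connected_subsets(adjacency, allowed):
    """All connected subsets of the allowed regions (brute force).

    adjacency: dict mapping each region to the set of its neighbors.
    """
    allowed = set(allowed)
    frontier = [frozenset([r]) for r in allowed]
    found = set(frontier)
    while frontier:
        subset = frontier.pop()
        neighbors = set()
        for r in subset:
            neighbors |= adjacency[r] & allowed
        # Grow the subset by one adjacent region at a time.
        for n in neighbors - subset:
            bigger = subset | {n}
            if bigger not in found:
                found.add(bigger)
                frontier.append(bigger)
    return found


# Three regions in a line; the middle region has no excess cases.
line = {1: {2}, 2: {1, 3}, 3: {2}}
kept = filter_regions({1: 5, 2: 0, 3: 5}, {1: 1.0, 2: 1.0, 3: 1.0}, alpha1=0.1)
clusters = connected_subsets(line, kept)
```

In this toy case the middle region is filtered out, so the two outer regions can only appear as separate single-region clusters, which mirrors the broken-connection problem described for chain-like clusters. Without filtering, the same three regions yield six connected subsets, and the count grows quickly as more regions are retained.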
Due to the limitations of the restricted flexible CUSCAN, which are primarily the weaknesses of the restricted flexible scan method, a different dynamic scan method may be more appropriate if we wish to detect clusters of arbitrary shape. Minimum spanning tree methods, such as the dynamic minimum spanning tree [Assuncao et al., 2006] or constrained spanning tree [Costa et al., 2012], may offer higher power than the restricted flexible scan method, and those with early stopping criteria may additionally reduce the amount of computational time and complexity needed to compute the scan statistics. With these methods, clusters may still be linked together over time as described in Section 4.3, and so the CUSCAN statistic may be computed as easily as with the restricted flexible method.

A final consideration is that the restricted flexible CUSCAN, like the circular and elliptic CUSCAN methods described in Chapter III, makes use of the nonrestarting form of the CUSUM to increase power of detection during extended outbreaks. As discussed in Chapter II, the nonrestarting CUSUM often suffers from a high false alarm rate following the end of an outbreak. While the correction proposed in Chapter II may be applied to the CUSCAN, care must be taken when deciding how to simulate data under the alternative hypothesis, as the regions within the detected clusters are not guaranteed to include all of the hotspots in the actual data, nor is every region within the detected clusters guaranteed to be a hotspot at all.
CHAPTER V

CONCLUSION

In the preceding chapters, we explored many of the properties of CUSUM control charts as a prospective surveillance tool and proposed updated versions that demonstrate the potential for improved functionality. In Chapter II, we examined the false alarm problem inherent to the nonrestarting CUSUM method and demonstrated that false alarms could be controlled at the desired level through the use of modified simulations for hypothesis testing. In Chapter III, we proposed a new means of incorporating spatial information into a CUSUM framework by computing the cumulative sum of scan statistics (CUSCAN). The CUSCAN method with circular and elliptic windows was shown to have high power to detect emerging outbreaks as early as the first day they appear. In Chapter IV, we provided an extension of the CUSCAN method to allow the use of spatial scan methods with time-variable windows. By linking overlapping clusters through time, we are able to compute the CUSCAN statistic and identify disease clusters of arbitrary shape.

When the nonrestarting CUSUM method is used to monitor a disease process, the statistic grows large in the presence of an outbreak lasting for multiple time periods. This allows the CUSUM to sound an alarm for each time period the process is out-of-control, but often results in many false alarms once the outbreak ends as the statistic takes time to return to in-control levels. We proposed a solution to this problem in the form of a modified approach to the p-value method of hypothesis testing. When using a p-value approach to hypothesis testing, false alarms occur when the elevated CUSUM statistic is compared to simulated CUSUM streams that have been in-control since the start. We proposed changing the way the Monte Carlo simulations are performed
by simulating the outbreak during times it was identified, so that the simulated CUSUM streams likewise experience the outbreak and provide a more realistic comparison. We provided three examples for how to simulate Poisson counts under the alternative hypothesis that an outbreak is present: as random draws from a Poisson distribution with a user-specified mean, as random draws from a Poisson distribution with the mean estimated from observed counts during outbreak time periods, and by sampling with replacement from the observed outbreak counts (bootstrapping).

We demonstrated the effectiveness of the modified CUSUM method using both simulated and real data. In the simulation study, the corrected CUSUM tests were able to detect the new outbreak at the same speed as the uncorrected CUSUM while also controlling the post-outbreak false alarm rate at the desired level α = 0.05 for the bootstrap and estimation methods (Table 2.6). When simulating at a user-specified level that underestimated the true size of the outbreak, the false alarm rate was substantially reduced, though not completely controlled at the chosen level of significance.

In the data study, false alarm and false negative rates are more difficult to determine. Under the assumption that the outbreak occurred between weeks 408 and 410, the false negative rate for the corrected CUSUM test was no worse than the uncorrected in 7 out of 8 regions where the outbreak was detected, with detection time in the remaining state delayed by a single time period. Additionally, the uncorrected test continued to produce alarms for 7 to 12 weeks for smaller states and up to 116 weeks for larger states, while the bootstrap and estimation corrections resulted in a significant reduction in alarms and a total of 0 to 8 alarms in each state over the remaining 116 weeks (Table 2.7).

In exchange for adequately controlling the post-outbreak false alarm rate, the modified test typically has less power to continuously signal an alarm during outbreak time periods. Since the simulated CUSUM streams are meant to approximately match the level of the elevated statistic, lower levels of observed cases are less likely to register as extreme, and so false negatives occasionally occur. However, false negatives were primarily isolated and
sequential time periods of false negatives were rare. Consequently, some care should be taken in interpreting the hypothesis tests, as a string of negative results is more likely to signal the true end of an outbreak than an isolated negative result. Additionally, while the first alarm from the start of monitoring will occur at the same time in both the corrected and uncorrected CUSUM, subsequent alarms for the corrected CUSUM may be delayed, as seen in the data study in Section 2.5. The state of North Rhine-Westphalia had experienced a prior outbreak of Salmonella Newport shortly before the countrywide outbreak at week 408, and so the simulated CUSUM streams remained elevated enough at the start of the new outbreak to result in an additional false negative. This effect is more pronounced if prior outbreaks were large relative to future outbreaks, or if several outbreaks occur in quick succession. One strategy that may help reduce issues of this type is to reset the simulated CUSUM streams to zero when the CUSUM has returned to zero following an outbreak, or to reset the simulated streams to the current CUSUM value after a pre-specified number of negative test results in the case where the statistic may not return to zero between outbreaks.

The issue of false negatives may also potentially be reduced by selecting a level of significance α conditional on the presence of an outbreak. While the post-outbreak false alarm rate is controlled at the specified level α, the CUSUM correction has the effect of reducing the incidence of pre-outbreak false alarms, resulting in an effectively more conservative test. In the simulation study in Section 2.4, where a significance level of α = 0.05 was used, the pre-outbreak false alarm rate for the corrected tests was approximately 0.02 on average, well below α. In essence, it may be possible to control the pre-outbreak false alarm rate at the desired 0.05 level while allowing α > 0.05, resulting in increased power during outbreak time periods. The desired type I error rate (for example, 0.05) should be used following the end of an outbreak, as the post-outbreak alarm rate has only been shown to be controlled at this level.

One final consideration when using the modified non-restarting CUSUM test is that controlling the post-outbreak false alarm rate at the desired significance level α requires a
reasonable approximation of the true outbreak size. When simulating under the alternative hypothesis, we make the assumption that an outbreak has a stationary mean that does not vary over the course of the outbreak. If the mean of an outbreak is non-stationary, the estimation and bootstrap corrections will underestimate an increasing mean and overestimate a decreasing mean. As demonstrated in the simulation study with the correction using a user-specified mean, underestimating the size of the outbreak reduces the effectiveness of the CUSUM correction, and false alarms are reduced rather than controlled. While not addressed in Chapter II, overestimating the size of the outbreak can also significantly impact the performance of the CUSUM. When the simulated CUSUM statistics are larger than needed, power to detect subsequent outbreak periods diminishes. While the proposed simulation methods were shown in Section 2.5 to adequately estimate the size of an outbreak in real data, the outbreak studied was relatively short in duration. More prolonged outbreaks or outbreaks in data with more frequent time updates (such as daily rather than weekly counts) are more likely to experience negative effects from incorrectly estimating the intensity of the outbreak. Additionally, when more than one outbreak occurs in a given study period, the outbreaks will not necessarily be of the same intensity. As such, in order to facilitate more accurate estimation of the outbreak mean, we suggest that the pool of outbreak time periods used in the estimation and bootstrap methods be reset along with the simulated CUSUM streams once an outbreak is determined to have ended.

In addition to being able to identify both the beginning and the end of an outbreak, a good surveillance method should be able to accurately determine the location of the outbreak when geographic information is available. In Chapter III and Chapter IV, we described new methods for incorporating spatial information into the non-restarting CUSUM using data from spatial scan methods. Spatial scan methods provide more evidence of local clusters of disease than simple aggregated counts, and so computing the cumulative sum of scan statistics (CUSCAN) allows us to determine both when and where
clusters of cases associated with disease outbreaks exist.

The CUSCAN method begins with the selection of an appropriate spatial scan method that will determine the type of clusters the method can detect. For example, the circular scan method will result in the detection of circular clusters, while the elliptic scan method can identify more elongated ellipse-shaped clusters. Once a scan method is chosen, the Poisson scan statistic is calculated for each potential cluster defined by that method and each time period of data, and the non-restarting CUSUM is used to monitor the scan statistics through time. When the chosen scan method uses a fixed set of potential clusters that does not vary over time, we can compute a separate CUSUM stream for each potential cluster. When the chosen scan method results in sets of potential clusters that differ from one time period to the next (such as with the restricted flexible scan statistic, where the set of potential clusters depends on observed counts), the non-restarting CUSUM is instead computed from the scan statistics of overlapping clusters. The final CUSCAN statistic is then determined by taking the maximum of the CUSUM statistics at each time step, with the most likely cluster being the set of regions associated with the maximum statistic.
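The fixed-window case described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this thesis: the function name `cuscan_fixed_windows`, the array layout (time periods in rows, potential clusters in columns), and the one-sided recursion C_t = max(0, C_{t-1} + S_t - k) are assumptions made for the example, and the Poisson scan statistics S_t are taken as precomputed input.

```python
import numpy as np


def cuscan_fixed_windows(scan_stats, k):
    """Sketch of a CUSCAN statistic for a fixed set of potential clusters.

    scan_stats : array of shape (n_periods, n_windows); entry [t, w] is the
                 scan statistic of potential cluster w at time period t
                 (assumed to be computed elsewhere).
    k          : tuning parameter subtracted at every step (assumed form).

    Returns the maximum CUSUM value at each time period and the index of
    the potential cluster that attains it (the most likely cluster).
    """
    scan_stats = np.asarray(scan_stats, dtype=float)
    n_periods, n_windows = scan_stats.shape

    cusum = np.zeros(n_windows)            # one CUSUM stream per window
    cuscan_stat = np.empty(n_periods)
    best_window = np.empty(n_periods, dtype=int)

    for t in range(n_periods):
        # Non-restarting: the streams are never reset after an alarm;
        # they only return to zero through the max(0, .) floor.
        cusum = np.maximum(0.0, cusum + scan_stats[t] - k)
        best_window[t] = int(np.argmax(cusum))
        cuscan_stat[t] = cusum[best_window[t]]

    return cuscan_stat, best_window
```

Applying the same function to each Monte Carlo data set simulated under the null hypothesis would give the reference statistics needed for a p-value at each time period; that step is omitted here.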
Since Poisson scan statistics do not follow any known distribution, a nonparametric approach was required to determine the value of the tuning parameter k in the CUSUM equation (.4). Since we use a p-value approach to hypothesis testing, where the statistics from the Monte Carlo simulations are computed in the same way as the observed statistic, the type I error rate will always be controlled at the significance level α regardless of the choice of k. The selection of k is instead motivated by the desire to keep the CUSUM constrained near zero when no outbreak is present to conserve computational resources and allow for easier visual inspection of the evolution of the statistic. In addition, the selection of k has an impact on the power of the CUSUM to detect emerging outbreaks, as too large a value of k can prevent the statistic from growing even in the presence of an outbreak. We suggested choosing a value of k to control the maximum sprint length under the null, that is, the number of consecutive time periods the CUSUM statistic is expected to remain
positive before returning to zero when no outbreak is present. This process results in a value of k large enough to prevent the CUSUM from growing without bound in the absence of an outbreak, while also small enough to allow the CUSUM to increase quickly when an outbreak occurs.

We demonstrated the effectiveness of the CUSCAN method using collections of benchmark data. In Section 3.5, we applied the CUSCAN with the circular scan method to benchmark data provided by Kulldorff et al. [2004], which includes 17 distinct clusters of varying size with outbreaks simulated at both high and medium levels of intensity. Since this data set was provided as part of a demonstration of the effectiveness of the space-time scan method as a prospective surveillance tool, we were able to directly compare the performance of the circular CUSCAN and the space-time scan method. The results of this demonstration showed that the circular CUSCAN has considerable power to detect emerging clusters as early as the first day they appear, with first-day power ranging from 0.804 to 0.996 in the high excess risk models and 0.281 to 0.901 in the medium excess risk models (Table 3.2). Notably, while the circular CUSCAN performs about as well as the space-time scan method for detecting small clusters in the high-risk model, the CUSCAN outperforms the space-time scan method in detecting larger clusters in the high-risk models as well as clusters of all sizes in the medium-risk model. When the two methods were adjusted to detect smaller clusters by reducing the maximum cluster size from 50% of the total population to 5%, both saw increased power to detect small clusters in exchange for reduced power to detect large clusters. However, the power improvement was higher for the CUSCAN, causing it to surpass the space-time scan method in the detection of small clusters. The CUSCAN also retained higher power to detect large clusters than the space-time scan method (Table 3.3). Additionally, the CUSCAN demonstrated increased power to detect the elongated Rockaways and Hudson River clusters when elliptic instead of circular windows were used to identify potential clusters (Table 3.6).

In addition to basic power, we also assessed the spatial accuracy of the circular
CUSCAN method by computing the population-based spatial precision (the proportion of the population in the identified cluster that was part of the true at-risk population) and population-based spatial recall (the proportion of the true at-risk population that was included in the identified cluster). The circular CUSCAN had high recall across the board, ranging from 0.796 to 0.999 for the high-risk model (Table 3.7), indicating that the CUSCAN was able to locate up to 99.9% of the population in outbreak regions on the first day that the outbreak appeared. In contrast, the precision of the circular CUSCAN was lower, particularly for clusters such as the elongated Hudson River cluster that are not approximately circular in shape. Low precision is often an effect of using circular windows to define the set of potential clusters, as a circular window that fully contains a non-circular cluster will by necessity contain additional neighboring regions that are not part of the true cluster. While precision and recall tended to be slightly lower for the medium-risk model than the high-risk model, the differences were not considerable, indicating that the spatial accuracy of the CUSCAN is not dependent on outbreak size. As spatial precision and spatial recall were neither computed by Kulldorff et al. [2004] for the space-time scan method nor included in power analysis in SaTScan, we are not able to make direct comparisons in the performance of the two methods in this area. Since the space-time scan method uses the same set of circular windows to determine potential clusters as the circular CUSCAN, the space-time scan method is expected to have a similar pattern of performance to the circular CUSCAN, where high spatial recall is met with low spatial precision due to the addition of excess regions.

In Section 4.4, we used data sets based on benchmark data provided by Kulldorff et al. [2003] and Duczmal et al. [2006] to demonstrate how the CUSCAN method can be used with spatial scan methods with time-variable windows to detect irregularly shaped clusters, using the restricted flexible scan method as our example. The restricted flexible scan method starts by filtering out regions that are unlikely to be experiencing an outbreak by computing the middle p-value for each region based on observed case counts and excluding
regions where the middle p-value exceeds a predetermined threshold α₁. The method then searches over all connected subsets of the remaining regions up to some maximum size, allowing for the detection of arbitrarily shaped clusters. Since the regions included in the search depend on the observed case data, the set of potential clusters changes from one time period to the next, and so the CUSCAN was computed from overlapping clusters as described in Section 4.3. The performance of the restricted flexible CUSCAN was then compared to the performance of the circular and elliptic CUSCAN methods used in the previous benchmark study.

We found that, in general, the restricted flexible CUSCAN had lower power than the circular or elliptic CUSCAN methods to detect an emerging disease cluster on the first day of the outbreak (Table 4.2). However, the power increased rapidly over time, and by the third day of the outbreak the restricted flexible CUSCAN approximately matched the power of the circular and elliptic CUSCAN methods. Since the potential clusters from the restricted flexible scan method do not have fixed shape, the addition of non-outbreak regions to identified clusters is not often necessary, and so the restricted flexible CUSCAN recorded a high degree of spatial precision relative to the circular and elliptic CUSCAN (Table 4.3). In contrast, the circular and elliptic CUSCAN methods had much higher spatial recall than the restricted flexible CUSCAN (Table 4.4). The low recall for the restricted flexible CUSCAN method results from the way the potential clusters are determined in the restricted flexible method. If an outbreak region is mistakenly excluded from consideration, it can become difficult or impossible for the remaining outbreak regions to form a connected set. As such, there will be times when no potential cluster contains more than some small subset of outbreak regions, resulting in the identification of only a small part of the at-risk population and low spatial recall. The "disconnecting" of outbreak regions likewise has a detrimental effect on power, as the smaller pieces may not provide sufficient evidence to trigger an alarm. This effect is most pronounced in clusters F and G, which consist of long strings of minimally connected regions that are at high risk of
accidental separation (Figure 4.1). While this effect can be mitigated by using a higher middle p-value threshold α₁ to remove fewer regions from consideration, it is often not computationally feasible to do so. As the number of considered regions increases, the number of connected subsets increases exponentially, resulting in a significant increase in the time required to compute the scan statistics.

As demonstrated by the benchmark studies in Section 3.5 and Section 4.4, one of the most important aspects of the CUSCAN is the choice of scan method used to compute the scan statistics. The properties of the CUSCAN tend to follow the properties of the chosen scan method: the circular CUSCAN and the circular scan method are both powerful for outbreak detection but suffer from low precision, the elliptic CUSCAN and elliptic scan method both increase power to detect elongated clusters in exchange for increased computational complexity, and the restricted flexible CUSCAN and restricted flexible scan method both offer high precision but low recall and overall power. As such, care must be taken to choose a scan method appropriate for a given study that has the desired qualities. Fortunately, the CUSCAN is an incredibly flexible method, and can be adapted for use with many different scan methods with a variety of properties. While the initial power to detect an outbreak differed for the three scan methods presented in this thesis (circular, elliptic, and restricted flexible methods), the use of the non-restarting CUSUM resulted in a rapid increase to equivalent power levels over the course of an extended outbreak, making the CUSCAN a powerful surveillance tool with the potential for broad application.

While many of the properties of the CUSCAN have been explored in this thesis, the properties of the CUSCAN using scan methods other than the circular, elliptic, and restricted flexible methods have not yet been studied. Due to the weaknesses inherent to the restricted flexible method, other scan methods for detecting irregularly shaped clusters are likely to be more appropriate for use with the CUSCAN. Additionally, while the use of the non-restarting CUSUM in the CUSCAN method provides high power during outbreak periods, applying the non-restarting CUSUM correction previously described in Chapter II
to the CUSCAN is not trivial. While the CUSCAN is able to accurately determine the time periods an outbreak is present, the ability to distinguish between outbreak and non-outbreak regions is heavily affected by the scan method used. Since the effectiveness of the non-restarting CUSUM correction is dependent on the ability to accurately recreate the outbreak in simulated data, ensuring that the outbreak is simulated in the correct regions is essential. This makes it difficult if not impossible to implement the corrections that simulate new Poisson counts (i.e., simulating at a user-specified mean or a mean estimated from the data). It should be possible, however, to control the post-outbreak false alarm rate by applying a bootstrap correction to the CUSCAN. If new counts are created by drawing with replacement from case counts during outbreak times within the same region, then the simulated data sets will have outbreak-level counts in outbreak regions and non-outbreak-level counts in non-outbreak regions without the requirement that the true location of the outbreak be precisely known.
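The within-region bootstrap idea in the preceding paragraph can be illustrated with a short Python sketch. It is offered only as an assumption-laden example of the resampling step, not as a tested correction for the CUSCAN: the function name `bootstrap_outbreak_counts`, the array layout (time periods in rows, regions in columns), and the choice to draw each region independently of the others are all hypothetical, and the simulation of non-outbreak periods under the null hypothesis is not shown.

```python
import numpy as np


def bootstrap_outbreak_counts(counts, outbreak_mask, n_sims, seed=None):
    """Resample outbreak-period counts with replacement, region by region.

    counts        : array of shape (n_periods, n_regions) of observed cases.
    outbreak_mask : boolean sequence of length n_periods marking the time
                    periods in which an outbreak was identified.
    n_sims        : number of simulated data sets to create.

    Returns an array of shape (n_sims, n_outbreak, n_regions) in which every
    value for region r is drawn from that region's own outbreak-period
    counts, so no knowledge of which regions are affected is required.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    outbreak_mask = np.asarray(outbreak_mask, dtype=bool)

    pool = counts[outbreak_mask]           # outbreak periods only
    n_outbreak, n_regions = pool.shape
    if n_outbreak == 0:
        raise ValueError("no outbreak periods to resample from")

    # Assumption: an independent time index is drawn for every simulation,
    # outbreak period, and region; each draw stays within its own region.
    draws = rng.integers(0, n_outbreak, size=(n_sims, n_outbreak, n_regions))
    return pool[draws, np.arange(n_regions)]
```

Drawing whole time periods jointly across regions, rather than independently per region as above, would preserve any spatial correlation in the observed outbreak counts; which choice is more appropriate is left open here.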
REFERENCES

R. Assunção, M. Costa, A. Tavares, and S. Ferreira. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine, 25:723–742, 2006.

C. Bayer, H. Bernard, R. Prager, W. Rabsch, P. Hiller, B. Malorny, B. Pfefferkorn, C. Frank, A. De Jong, I. Friesema, et al. An outbreak of Salmonella Newport associated with mung bean sprouts in Germany and the Netherlands, October to November 2011. 2014.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.

S. Chatterjee and P. Qiu. Distribution-free cumulative sum control charts using bootstrap-based control limits. The Annals of Applied Statistics, 3:349–369, 2009.

M. Coory, S. Duckett, and K. Sketcher-Baker. Using control charts to monitor quality of hospital care with administrative data. International Journal for Quality in Health Care, 20:31–39, 2007.

M. A. Costa, R. M. Assunção, and M. Kulldorff. Constrained spanning tree algorithms for irregularly-shaped spatial clustering. Computational Statistics & Data Analysis, 56:1771–1783, 2012.

D. Das, K. Metzger, R. Heffernan, S. Balter, D. Weiss, and F. Mostashari. Monitoring over-the-counter medication sales for early detection of disease outbreaks - New York City. MMWR Morb Mortal Wkly Rep, 54(Suppl):41–46, 2005.

S. Dassanayake and J. P. French. An improved cumulative sum-based procedure for prospective disease surveillance for count data in multiple regions. Statistics in Medicine, 35:2593–2608, 2016.

F. X. Diebold. Elements of Forecasting. Thompson South-Western, 2007.

L. Duczmal, M. Kulldorff, and L. Huang. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics, 15:428–442, 2006.

S. Fasting and S. E. Gisvold. Statistical process control methods allow the analysis and improvement of anesthesia care. Canadian Journal of Anesthesia, 50:767–774, 2003.

M. Frisén. Evaluations of methods for statistical surveillance. Statistics in Medicine, 11:1489–1502, 1992.

A. Gandy and F. D.-H. Lau. Non-restarting cumulative sum charts and control of the false discovery rate. Biometrika, 100:261–268, 2012.
L. M. Hall and J. P. French. A modified CUSUM test to control post-outbreak false alarms. Statistics in Medicine, 2019.

D. M. Hawkins and D. H. Olwell. Cumulative sum charts and charting for quality improvement. Springer Science & Business Media, 1998.

M. Kulldorff. A spatial scan statistic. Communications in Statistics - Theory and Methods, 26:1481–1496, 1997.

M. Kulldorff. Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164:61–72, 2001.

M. Kulldorff. SaTScan version 9.6: software for the spatial and space-time scan statistics. http://www.satscan.org/, 2003.

M. Kulldorff and N. Nagarwalla. Spatial disease clusters: detection and inference. Statistics in Medicine, 14:799–810, 1995.

M. Kulldorff, T. Tango, and P. J. Park. Power comparisons for disease clustering tests. Computational Statistics & Data Analysis, 42:665–684, 2003.

M. Kulldorff, Z. Zhang, J. Hartman, R. Heffernan, L. Huang, and F. Mostashari. Benchmark data and power calculations for evaluating disease outbreak detection methods. Morbidity and Mortality Weekly Report, pages 144–151, 2004.

M. Kulldorff, L. Huang, L. Pickle, and L. Duczmal. An elliptic spatial scan statistic. Statistics in Medicine, 25:3929–3943, 2006.

J. M. Lucas. Counted data CUSUM's. Technometrics, 27:129–144, 1985.

J. M. Lucas and R. B. Crosier. Fast initial response for CUSUM quality-control schemes: give your CUSUM a head start. Technometrics, 24:199–205, 1982.

E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.

G. P. Patil and C. Taillie. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 11:183–197, 2004.

R. F. Raubertas. An analysis of disease surveillance data that uses the geographic locations of the reporting units. Statistics in Medicine, 8:267–271, 1989.

S. Roberts. Control chart tests based on geometric moving averages. Technometrics, 1:239–250, 1959.

C. Robertson, T. A. Nelson, Y. C. MacNab, and A. B. Lawson. Review of methods for space–time disease surveillance. Spatial and Spatio-temporal Epidemiology, 1(2–3):105–116, 2010.
W. A. Shewhart. Economic control of quality of manufactured product. ASQ Quality Press, 1931.

C. Sonesson. A CUSUM framework for detection of space–time disease clusters using scan statistics. Statistics in Medicine, 26:4770–4789, 2007.

C. Sonesson and D. Bock. A review and discussion of prospective statistical surveillance in public health. Journal of the Royal Statistical Society: Series A (Statistics in Society), 166:5–21, 2003.

T. Tango and K. Takahashi. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics, 4:11, 2005.

T. Tango and K. Takahashi. A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Statistics in Medicine, 31:4207–4218, 2012.

K.-L. Tsui, W. Chiu, P. Gierlich, D. Goldsman, X. Liu, and T. Maschek. A review of healthcare, public health, and syndromic surveillance. Quality Engineering, 20:435–450, 2008.

S. Unkel, C. Farrington, P. H. Garthwaite, C. Robertson, and N. Andrews. Statistical methods for the prospective detection of infectious disease outbreaks: a review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175:49–82, 2012.

L. A. Waller and C. A. Gotway. Applied spatial statistics for public health data, volume 368. John Wiley & Sons, 2004.

W. Woodall. The use of control charts in health-care and public-health surveillance. Journal of Quality Technology, 38:89–104, 2006.

