Citation
An investigation of the l2e method in fitting quadratic clusters in radio frequency data

Material Information

Title:
An investigation of the l2e method in fitting quadratic clusters in radio frequency data
Creator:
Varys, Leslie Anne
Publication Date:
Language:
English
Physical Description:
viii, 50 leaves ; 28 cm

Subjects

Subjects / Keywords:
Radio frequency discharges -- Measurement ( lcsh )
Atmospherics -- Measurement ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaf 50).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Leslie Anne Varys.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
57654202 ( OCLC )
ocm57654202
Classification:
LD1190.L622 2004m V37 ( lcc )

Full Text
AN INVESTIGATION OF THE L2E METHOD IN FITTING QUADRATIC CLUSTERS
IN RADIO FREQUENCY DATA
by
Leslie Anne Varys
B.S., Colorado State University at Pueblo, 2001
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Applied Mathematics
2004


This thesis for the Master of Science
degree by
Leslie Anne Varys
has been approved
by
Date


Varys, Leslie Anne (M.S., Applied Mathematics)
An Investigation of the L2E Method in Fitting Quadratic Clusters in Radio Frequency Data
Thesis directed by Assistant Professor Mark Fitzgerald
ABSTRACT
Data records of radio frequency events that exceed a certain threshold are collected by the FORTE
(Fast On-Orbit Recording of Transient Events) satellite. A collection of these data records is known
to contain well defined storms in a specialized cluster. Each record consists of 100 to 400 points that
represent total electron count (TEC) and time of occurrence. The clusters were best defined by fitting
TEC as a quadratic function of time. Thousands of these data records are downloaded to earth each
day. The goal of this analysis is to automate the process of identifying well-defined storm clusters so
that those data records may be applied in future analysis. The L2E method can be utilized as an
estimation tool for multivariate regression by focusing on the distribution of the error
terms; therefore it may be used to estimate the quadratic equation parameters, identify the points
that belong to each cluster, and determine when well-defined storms are present. Previous evaluation of
methods of identifying storms and corresponding signal points relied on the comparison to test data
produced by a human analyst. To eliminate the bias of a human analyst, a simulated data set was
created to mimic a variety of potential noise and signal interactions. The L2E method was evaluated on
its ability to correctly identify signals when 1) the noise-to-signal ratio was systematically
increased for varying types of interference and signal clusters and 2) the distance between two signal
clusters was systematically decreased. The investigation identified situations in which the L2E method
failed to find the present signal.
This abstract accurately represents the content of the candidate's thesis. I recommend its publication.
Signed


DEDICATION
I dedicate this thesis to my husband for his unconditional love and words of encouragement in every
one of my endeavors.


ACKNOWLEDGEMENT
My thanks to my advisor, Mark Fitzgerald, for his guidance in researching the topics for this thesis
and for his patience with my schedule. Special thanks to Tom Burr and David Scott for providing
their S-Plus codes. I also wish to thank the staff of the Mathematics Department for their
understanding and constant willingness to help. Finally, I would like to thank my fellow office mates
Amy, Erin and Kristin, for always making me smile.


CONTENTS
Figures..................................................................................vii
Tables..................................................................................viii
Chapter
1. Introduction...........................................................................1
1.1 Previous Methods and Research..........................................................1
2. Data...................................................................................3
2.1 Examples..............................................................................4
2.2 Issues and Problems...................................................................6
3. Signals and Simulated Data.............................................................8
4. The Goal of Model Fitting........................................................... 12
5. L2E Method............................................................................13
5.1 Theory...............................................................................13
5.2 Investigation........................................................................18
6. L2E Results...........................................................................19
7. Future Research.......................................................................24
Appendix
A. Relationship Between TEC and Radio Frequency..........................................27
B. Original R Code.......................................................................28
C. Cited S-Plus Code.....................................................................44
References................................................................................50


FIGURES
Figure
2.1 Sample Data Set D1........................................................................4
2.2 Real Data Sets D6, D7, D18, D16...........................................................5
3.1 Simulated Signals.........................................................................9
3.2 Signal and Noise Orientation .............................................................9
5.1 L2E Method Results on D1.................................................................17
6.1 Incorrect Signal Identification by L2E Method............................................21


TABLES
Table
6.1 Results of Increasing Noise Contamination........................................ 20
6.2 Results of Varying Signal Separation with Fixed Noise Contamination .............22


1. Introduction
Data records of radio frequency events that exceed a certain threshold are collected by the FORTE
(Fast On-Orbit Recording of Transient Events) satellite. Each record contains data points that
represent total electron count (TEC) and time of occurrence. Some data records contain well-defined
storms in the shape of concave-up bowls. The clusters were best defined by fitting TEC as a quadratic
function of time. Thousands of these data sets are downloaded to earth each day; therefore an
automated process that successfully identifies storm clusters is highly desirable. Burr, Mielke, and
Jacobson (2001) previously investigated several different methods of mixture fitting on a collection of
thirty queried data records to evaluate their performance in successfully finding signals in the
presence of noise. One of the mixture fitting methods investigated was the L2E method. Previous
evaluation of this method relied on the comparison to test data produced by a human analyst. To
eliminate the bias of a human analyst and to control the location and density of noise and signal
points, a simulated data set was created to mimic a variety of potential noise levels and signal
interactions. The L2E method was evaluated on its ability to correctly identify signals when 1)
the noise-to-signal ratio was systematically increased for varying types of interference and signal
clusters and 2) the distance between two signal clusters was systematically decreased. The
investigation identified situations in which the L2E method failed to find the present signal.
1.1 Previous Methods and Research
Burr et al. (2001) investigated the performance of four signal detection methods on a set of thirty
hand-chosen data records from a larger database of radio frequency events: hierarchical clustering,
principal curve clustering, predicted values from the function mixreg for use in S-Plus (mixture of
regression models), and predictions from the minimum integrated squared error (ISE or L2E distance).


The last three methods approach the problem as a mixture-fitting problem. For each method (with the
exception of the L2E method), each data set was first subjected to initial noise reduction using a
kth nearest neighbor algorithm with k equal to 3 or 5. All noise points were eligible to be included in a
signal later on. Next, signals were found using both (a) "optimal" parameter values (set A) and (b)
"near optimal" values (set B). Burr et al. (2001) found estimates of six defined parameters, using the
queried data record D1 as a training set, which gave the lowest false negative rate subject to having a
very low false positive rate. For each detected signal, uncertainty bands were defined around the
signal to form a cluster. Results from A and B were compared, and those clusters found in both A and
B were accepted as a storm. Diagnostics were applied to ensure that the shape was concave up, the
time duration was adequately long, and the residual variance about the signal was not overly large.
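As an illustration, here is a minimal sketch (not the authors' code) of a kth nearest neighbor noise screen of the kind described above; the cutoff rule is an assumption made for this sketch.

## Flag dense points as candidate signal points using the kth nearest neighbor
## distance; k and the cutoff are illustrative assumptions.
knn.screen <- function(time, tec, k = 3, cutoff = NULL) {
  xy <- scale(cbind(time, tec))            # put both axes on a common scale
  d  <- as.matrix(dist(xy))                # pairwise Euclidean distances
  kdist <- apply(d, 1, function(row) sort(row)[k + 1])  # kth neighbor (skip self)
  if (is.null(cutoff)) cutoff <- 2 * median(kdist)      # ad hoc default cutoff
  kdist < cutoff                           # TRUE = dense, eligible signal point
}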
Each method performed similarly on the 30 known data sets and a single simulated data set, as
reported by Burr et al. (2001). Each method successfully found the single largest signal with a
relatively low false positive rate; however, they all failed to find any other present features.


2. Data
The FORTE (Fast On-Orbit Recording of Transient Events) satellite collects records of radio
frequency events that exceed a threshold. The data supplied by Tom Burr were queried from archived
FORTE data records containing phenomena referred to as "recurrent emission storms," which may
also be defined as lightning emissions. These data are radio frequencies collected from a single
source while the satellite passes overhead. Previous methods of collecting lightning data by satellite
were limited to the brightest events. The new satellite is expected to provide new information in
conjunction with ground-based lightning data. Data obtained by the FORTE satellite go through an
automated process to correct for "chirp" or ionospheric dispersion (delay due to tracking through
atmospheric layers), "pre-whitening" or cleaning of on-board and other man-made background
noise (TV and FM carrier interference), and optimizing the signal.
Each micro event, or point, consists of a total electron count (TEC) emanated in a 409.6 μs window
and a time of occurrence (see Appendix A for a description of the relationship between TEC and radio
frequency). The length of record time is approximately 15 minutes (the length of time it
takes the satellite to pass over a region of the earth). TEC ranges from 0 to 10 × 10^17 m^-2 and time
ranges from 0 to 1000 seconds (approximately 15 minutes). Thirty hand-chosen data records were
supplied; each consisted of at least 100 micro events (100 pairs of points: TEC and time). Previous
investigation of these data sets revealed concave-up, bowl-shaped clusters to which a quadratic
equation provided an adequate fit. The bowl shape has been assumed to hold based on analyses by
Jacobson, Knox, Franz, and Enemark (1999), who claim to have never seen a counterexample, and
also based on the known shape of lightning radio frequency emissions.


2.1 Examples
The thirty manually selected data records are labeled D1 through D30. In each data set, the first storm
cluster is labeled with a "1," the second with a "2," and so on until all that remains are the noise points
(labeled with the highest integer). The assignment of the data labels was determined by the human
analyst. Each data set contains noise about the signal or cluster as well as outside scene noise. The
number of storm clusters in each data set ranges from 0 to 7. Most records had 1 or 2 clusters, but a
few had 3 or 4, one contained 7, and one case had 0 clusters (all noise). Figure 2.1 shows a plot of data
set D1, which was referred to as the training set by Burr et al. (2001). Storm detection in D1 is
moderately difficult due to the presence of two distinct, but somewhat close, clusters.
Figure 2.1 Sample Data Set D1. This data set was used as a training set to
identify optimal threshold values.


Although the thirty real data sets were manually selected, it is not always evident how the clusters
were defined by the human analyst. Figure 2.2 shows a selection of the 30 data sets that displays an
increasing degree of difficulty. Burr et al. (2001) defined a comparative measure of difficulty for each
data set that uses the median ratio of the within-to-between feature distances. Data set D6 is the
easiest example since only one signal is present and it has little scene noise interference. Data set D7
is slightly harder, having two signals relatively close to each other and interfering noise at the tails of
the signals. D18 is still harder, with a large amount of noise, and D16 is the most difficult of the four,
with 4 signals separated by a noise cluster. Clusters 3 and 4 in D16 also appear to have a large
amount of signal variation, so we may expect that any curve-fitting method may produce a large
number of falsely identified signal points.
[Figure: four scatter plots of TEC against Time for data sets D6, D7, D18, and D16.]
Figure 2.2 Real Data Sets D6, D7, D18, D16. Increasing difficulty in signal detection.


2.2 Issues and Problems
After a data set is subjected to a model-fitting algorithm, the results are compared to those of a
human analyst. The evaluation of each method tested by Burr et al. (2001) was based on how well the
test results matched those of the human analyst, specifically looking for a small number of falsely
identified clusters (as well as a small false positive rate, which is the number of false positive points
divided by the true number of points in a cluster). Burr's final results were evaluated on a single
simulated data set. A simulated data bank with a variety of signal and noise combinations will allow
more investigation into the confidence of finding a signal. Specific situations will be examined and
the performance of the L2E will be evaluated.
To simulate data, the distribution of the noise data points was examined. The noise about each signal
appears to be Gaussian, whereas the scene noise is harder to define. The scene noise appears
to have some shape or clustering; however, the noise is either too sparse or has a shape that
fails to match the shape criterion for signal clusters. The possibility that the noise distribution was
bivariate normal was explored. If the noise distribution of TEC and time were in fact bivariate
normal, then the marginal distributions of TEC and time would be normal. The Shapiro-Wilk
normality test was performed (using the shapiro.test function in R) on noise TEC and time individually
for each of the 30 real data sets. A majority of the tests (for both TEC and time) were
significant; that is, their reported p-values were small enough to reject the hypothesis that the data were
normally distributed. Visual inspection of normal probability plots (Q-Q plots) for TEC and time
confirmed that the marginals were not normally distributed. It should be noted that when the marginal
distributions of TEC and time are not normal, we may conclude that the joint distribution of TEC
and time is not bivariate normal. However, if the marginals had proven to be normal, then further


investigation would be needed to confirm that a bivariate normal distribution was present. Further
investigation shows that clustering of the noise data points may be the primary culprit in violating
normality. The clustering effect suggests that the noise data points may be modeled as a
mixture of normal densities.
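For reference, here is a minimal sketch of this marginal normality check, assuming each data set is stored as a matrix with columns (label, TEC, time) and noise labeled with the highest integer, which is the layout used elsewhere in this thesis.

## Marginal normality check for the noise points of one data set (a sketch).
check.noise.normality <- function(dat) {
  noise <- dat[dat[, 1] == max(dat[, 1]), , drop = FALSE]  # noise = highest label
  op <- par(mfrow = c(1, 2)); on.exit(par(op))
  qqnorm(noise[, 3], main = "Noise time"); qqline(noise[, 3])
  qqnorm(noise[, 2], main = "Noise TEC");  qqline(noise[, 2])
  c(p.time = shapiro.test(noise[, 3])$p.value,   # small p-value rejects normality
    p.TEC  = shapiro.test(noise[, 2])$p.value)
}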


3. Signals and Simulated Data
Simulated data sets were created from exploratory analysis by examining the different types of signal
curves in the 30 real data sets. The signal curves were categorized into single full curves, single
partial curves, and a combination of two signals (2 full, 2 partial, 1 full and 1 partial).
The full curve signal was created by randomly choosing 120 time values from a normal distribution,
N(500, 150²). TEC was obtained by evaluating a quadratic equation in time (a + b(time) + c(time)²).
To obtain variation about the signal, a random amount of noise was added to the equation. The noise
about the signal was randomly drawn from an N(0, 0.1²) distribution, with the exception of the case with
two partial curves, where the second partial curve had noise from an N(0, 0.08²) distribution. For the
full signal, the parameter values chosen were a = 10.427, b = -0.036, and c = 0.0000356. A similar
process was used for partial curve signals; however, the time variable was drawn from N(520, 150²)
and restricted so that no values exceeded 530. When two signals were simulated, a simple vertical
shift was applied (a + h, where h = 2.5 for the cases with increasing noise; h ranged from 2 to 4 for the
varying separation distance situation). No horizontal shift was employed in this analysis. Also, when two
signals were simulated, the first signal consisted of 120 data points and the second signal (shifted
above the first signal) consisted of 90 data points. Figure 3.1 shows two types of simulated signal
data sets. The left plot shows one partial signal (data points labeled with "1") and scene noise (data
points labeled with "*"). The right plot shows one partial signal, one full signal, and scene noise.
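A minimal sketch of the full-signal recipe just described, using the quoted parameter values:

## One simulated full signal: quadratic in time plus N(0, 0.1^2) noise.
a <- 10.427; b <- -0.036; cc <- 0.0000356
time <- rnorm(120, mean = 500, sd = 150)              # 120 time points
tec  <- a + b*time + cc*time^2 + rnorm(120, 0, 0.1)   # signal with noise about it
plot(time, tec, pch = "1", xlab = "Time", ylab = "TEC")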
Along with various forms of signal, various types of noise interference were also simulated. Although
previous investigation of the distribution of the noise points rejected the bivariate normal,


Figure 3.1 Simulated Signals. The left plot shows one partial signal. The right plot shows one
partial signal and one full signal. The signal data points are represented as "1" and "2" and the
scene noise data points are represented as "*".
Figure 3.2 Signal and Noise Orientation. The left plot shows two full curve signals with
parallel, non-intersecting scene noise. The right plot shows two partial signals with
perpendicular, intersecting scene noise.
it was observed that the individual noise cluster groups appeared to be bivariate normal. That is,
individual noise clusters in each data set appeared to have an organized shape that for one reason or


another did not match the criteria to be a signal. By focusing on a single noise cluster located near
the signal cluster(s), it was assumed that any other noise clusters would be less dense and located
farther away, so that the robustness of the L2E method would disregard them as signal points. The
proposed analysis creates increasingly severe "worst case" scenarios; if the noise is generated
from a structured distribution, such as the bivariate normal, then this may be regarded as a worst
case scenario.
Noise types were classified into intersecting and non-intersecting. Each classification was then
broken down into parallel and perpendicular orientation in relation to the signal(s). Figure 3.2 shows
examples of scene noise orientation. The left plot shows two full signals with parallel non-
intersecting noise. The right plot shows two partial curves and perpendicular intersecting noise. The
scene noise from the original data sets often displayed various patterns and clustering.
The parameters μ_time, μ_TEC, σ_time, σ_TEC, and ρ were varied to capture different combinations of noise
and signal. In the case of increasing the amount of noise, 20 noise points (time, TEC) were chosen
from an N₂(μ_time, μ_TEC, σ²_time, σ²_TEC, ρ) distribution, where μ_time ~ N(500, 75²), μ_TEC ~ N(4.7,
1.05²), σ_time ~ |N(100, 30²)|, σ_TEC ~ |N(1.85, 0.5²)|, and ρ ~ U(-0.8, 0.8). The parameters were
fixed for each simulation; the first twenty noise points were chosen from this bivariate normal
distribution, and then increasing increments of twenty noise points were chosen from the same
distribution. In the case of varying the separation distance between two curves, fifty noise points
were randomly selected from an N₂(650, 7, 100², 2², ρ) distribution, where ρ ~ U(-0.6, 0.6). When
noise and signal were combined into the simulated data records, each data point was assigned an
integer value where the highest integer value of a set corresponds to a noise data point and the lowest


integer value corresponds to a data point belonging to the strongest signal. This was the same
labeling convention used by the human analyst on the real data sets.
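A minimal sketch of one such noise-cluster draw under the parameter hierarchy above; TEC is generated from its conditional normal distribution given time (the same device used in Appendix B).

## Draw one bivariate normal noise cluster (a sketch of the hierarchy above).
mu.t <- rnorm(1, 500, 75);       mu.e <- rnorm(1, 4.7, 1.05)
sd.t <- abs(rnorm(1, 100, 30));  sd.e <- abs(rnorm(1, 1.85, 0.5))
rho  <- runif(1, -0.8, 0.8)
n.time <- rnorm(20, mu.t, sd.t)
n.tec  <- rnorm(20, mu.e + rho*sd.e*(n.time - mu.t)/sd.t,   # TEC | time is normal
                sd.e*sqrt(1 - rho^2))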


4. The Goal of Model Fitting
Thousands of data sets can be downloaded each day (Jacobson et al., 1999); therefore, one goal of the
model fitting process is to automate signal detection with a low false positive rate. Because of the
large volume of data records expected, it is more detrimental to falsely identify a signal than it would
be to throw away a weak signal. The desired outcome is to identify well-characterized storms.
Several model-fitting methods were examined in a paper by Burr et al. (2001) in which different
clustering techniques were applied to high-density regions of each data set. For each identified
cluster, TEC was fit as a quadratic function of time. All concave-down clusters were rejected as false
signals, as were signals with too short a time duration or too large a residual error variation.
The minimum integrated squared error (L2E) method was investigated in depth. Tom Burr and David
Scott kindly supplied the S-Plus code for fitting a mixture of regression models. Two performance
measures were examined for the L2E method: (a) the maximum noise contamination level, and (b) the
minimum distance between two signals, before the method fails to find the existing signal or signals.


5. L2E Method
Integrated squared error, or L2E distance, is best known in the area of non-parametric density
estimation. Scott (2000) investigated the use of the L2E distance as an estimation tool for parametric
statistical models. In the same paper, it was demonstrated that minimum distance estimators,
including L2E, are robust. Robustness of the L2E method was demonstrated by example and through
the induced M-estimator.
5.1 Theory
The L2E method is primarily known for use in non-parametric density estimation utilizing a mixture
of known densities (such as Gaussian). In density estimation, one tries to estimate the bin width, h, in
the density estimate:

$$\hat{h} = \arg\min_h \int \left[ \hat{f}_h(x) - f(x) \right]^2 dx. \qquad (5.1)$$

By expanding Equation (5.1) and ignoring terms that do not depend on h, an estimate of h should
satisfy:

$$\hat{h} = \arg\min_h \left( \int \hat{f}_h(x)^2 \, dx - 2 \int \hat{f}_h(x) f(x) \, dx \right). \qquad (5.2)$$

The first integral can be evaluated at any specified value of h, whereas the second term is the expected
height of the density estimate, $E\hat{f}_h(x)$, and is the term that must be estimated. In "Parametric
Statistical Modeling by Minimum Integrated Square Error," David Scott (2000) showed that
when the density function f(x | θ) is assumed known (the parametric setting), estimating h is
equivalent to estimating θ, a vector of density parameters, using

$$\hat{\theta}_{L2E} = \arg\min_{\theta} \left[ \int f(x \mid \theta)^2 \, dx - 2\, E f(x \mid \theta) \right], \qquad (5.3)$$

where the true parameter θ₀ is unknown. Again, the expected height of the density needs estimating,
which is done by averaging over the entire sample. The L2E estimator of θ reduces to

$$\hat{\theta}_{L2E} = \arg\min_{\theta} \left[ \int f(x \mid \theta)^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} f(x_i \mid \theta) \right]. \qquad (5.4)$$

Note that this process relies on the assumption that the correct model f(x | θ) has been chosen.

In the present case of model fitting, the L2E criterion can be used by focusing on the distribution of
the residuals obtained by fitting TEC as a single quadratic function of time with contamination. The
provided S-Plus code (Appendix C) for fitting concave-up quadratics (see Appendix B for a translation
into compatible R code) uses a weight option and the built-in non-linear minimization (nlm) R function
(Dennis and Schnabel, 1983; Schnabel, Koontz and Weiss, 1985) to estimate θ, where
θ = (a, b, c, σ_ε, w). The parameters come from fitting the model y = a + bx + cx² + ε, where
ε ~ N(0, σ_ε²), and the weight w represents the proportion of the data points that meet the criterion for
feature points. Adjusting Equation (5.4) to the regression situation with f(ε) = φ(ε | 0, σ_ε) gives:


$$(\hat{a}, \hat{b}, \hat{c}, \hat{\sigma}_\varepsilon, \hat{w}) = \arg\min \left[ \frac{w^2}{2\sqrt{\pi}\,\sigma_\varepsilon} - \frac{2w}{n} \sum_{i=1}^{n} \phi\!\left( y_i - a - b x_i - c x_i^2 \;\middle|\; 0, \sigma_\varepsilon \right) \right]. \qquad (5.5)$$
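The constant first term in (5.5) comes from the Gaussian squared-integral identity; as a quick check (not part of the original derivation, but consistent with it):

$$\int_{-\infty}^{\infty} \left[ w\,\phi(e \mid 0, \sigma_\varepsilon) \right]^2 de \;=\; w^2\,\phi\!\left(0 \;\middle|\; 0, \sqrt{2}\,\sigma_\varepsilon\right) \;=\; \frac{w^2}{2\sqrt{\pi}\,\sigma_\varepsilon},$$

using the fact that the product of two Gaussian densities integrates to a Gaussian density evaluated at the difference of their means.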
The following is an outline of the L2E process performed on the data sets:
1) Estimate θ (initial values are given as (0, 0, exp(0.5), 0.4, 1)).
2) Calculate the fits: ŷᵢ = â + b̂xᵢ + ĉxᵢ².
3) Calculate the standardized residuals: sᵢ = (yᵢ - ŷᵢ)/σ̂_ε.
4) Evaluate required conditions: if there are too few (less than 5) fits, or too few (less than 5) data
points with small enough residuals (for A: within 2 standard deviations; for B: within 2.2
standard deviations), STOP.
5) Define a point as belonging to a signal cluster (feature point) if it has a small enough
residual, or a residual less than the smallest residual of the w percent of the remaining data points.
6) Check normality of the residuals belonging to feature points (the correlation between the
theoretical normal quantiles and the sorted residuals should be greater than 0.9).
7) Repeat the process with the feature points removed.
The process is continued as long as the number of feature points found is greater than 10, the
estimated MSE (mean squared error) is less than a set optimal value (for A: .25; for B: .20), and no
more than 7 features or signals have been found. This process is completed for each of set A (optimal
values) and set B (near optimal values). Features that were found in the results from both A and B are
considered the final found signal(s).
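As an illustration, here is a minimal sketch of the outlined fit/remove/repeat loop with the set A thresholds; l2e.fit stands in for the supplied sim.reg.w routine (Appendix B), and its return values (yhat, sig, mse) are assumptions made for this sketch.

## Iteratively extract features: fit, flag low-residual points, remove, refit.
find.features <- function(x, y, max.features = 7) {
  features <- list()
  idx <- seq_along(y)                        # track original point indices
  while (length(features) < max.features && length(y) > 10) {
    fit <- l2e.fit(x, y)                     # one robust quadratic L2E fit
    res <- (y - fit$yhat) / fit$sig          # standardized residuals
    keep <- abs(res) < 2                     # set A: within 2 standard deviations
    if (sum(keep) <= 10 || fit$mse > 0.25) break   # set A stopping rules
    features[[length(features) + 1]] <- list(fit = fit, points = idx[keep])
    x <- x[!keep]; y <- y[!keep]; idx <- idx[!keep]  # remove feature points, repeat
  }
  features
}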


Figure 5.1 illustrates the L2E method applied to data set D1. The upper left shows the original scaled
data. The top right is the first feature found by the L2E method. The middle left plot shows the data
with the found feature points removed. The middle right plot shows the second feature found by the
L2E method on the remaining data. The bottom left plot is the remaining data with both sets of feature
points removed. The bottom right plot shows the final result, which appears to be in relatively good
agreement with the human analyst.


[Figure: six panels plotting TEC (scaled) against Time (scaled), showing successive L2E feature extractions from D1.]
Figure 5.1 L2E Method Results on D1.


5.2 Investigation
Under this specific investigation, the optimal outcome is to find the present signal(s) under various
noise situations and proportions, and for the fitted curve to track the signal data points without being
influenced by the presence of noise. For example, two signals fit with only a single curve would be
considered an incorrect fit, since two signals were present but only one was identified. This type of
error is referred to as "missed signals." Another example of an incorrect fit arises if only one
signal with a defined shape is present but the presence of noise data points directs the program to a
different shape of fitted curve (i.e., flattening the curve (ĉ < c) or shifting the curve above the true
signal (â > a)) to include noise in the signal. This type of error is referred to as a "biased fit." In
this analysis it is better to find no signal than the wrong signal. Similar investigations may be done
with different operating criteria, such as a small false positive rate or simply finding the single largest
(most dominant) signal. The first process of creating "bad" or contaminated data involves one signal
and systematically adding noise (increasing the noise contamination) until the method fails to
correctly find the signal. The second process increases the noise-to-signal ratio with two signal
curves. The third process involves a fixed amount of noise contamination but varies the distance
between signal clusters, to investigate when the L2E method can distinguish between two separate
signals.


6. L2E Results
Simulated data sets were created for each of the three situations under investigation: one signal with
increasing noise, two signals with increasing noise, and two signals with fixed noise and varying
distances between signals. For each signal combination in the increasing noise situation, 28 data sets
were created. Each group of seven simulated data sets consisted of the same simulated signal(s), and
across these seven data sets the noise parameters varied. Twenty different data sets were
simulated for each of the two-signal combinations in the varying separation distance situation.
For each of these twenty data sets, the vertical separation ranged from 2 to 4 (measured on the same
scale as TEC) by increments of 0.1. The mode of the failure contamination level and the range of failure
distances are reported. The supplied L2E code was modified to accommodate the different data record
titles as well as provide detailed step-by-step plots for each iteration. Single noise clusters were
simulated for the investigation and were located relatively close to the signals to represent the "worst
case" scenario. Any other noise data points that are too far away from the signals would
theoretically have residual values higher than the critical value set by the L2E method. Once the L2E
method failed to correctly identify the present signal(s) at a given contamination level, no case was
witnessed during this analysis in which the correct signal was identified at a higher level.
Signal data tends to be dense at the vertex of the curve and less dense at the tails. The L2E method
performed relatively better when the noise cluster intersected the signal near the vertex of the signal
compared to noise clusters that intersected near the tails. In general, the L2E method also performed
better with full signal cluster compared to partial signal clusters. Table 6.1 presents the results of
when the L2E method failed to correctly identify the present signal(s) in the situation of increasing the
noise contamination. Noise contamination is defined as the number of noise data points divided by


the number of signal data points. When one full signal cluster was present, the L2E method failed
when the noise contamination reached 200 / 120 = 5 / 3 (5 noise points for every 3 signal points) with
noise intersecting the vertex compared to 160 / 120 = 4/3 with noise intersecting the tail. When one
partial signal cluster was present, the L2E method performed similarly for both noise positions (vertex
and tail intersecting) failing when the noise contamination reached 80 / 120 = 2 / 3.
Signal Type                           Description of Noise                    Noise-to-Signal Ratio at Failure
One Full Signal                       intersecting near vertex of signal      200/120 = 5/3
                                      intersecting near tail of the signal    160/120 = 4/3
One Partial Signal                    intersecting near vertex of signal      80/120 = 2/3
                                      intersecting near tail of the signal    80/120 = 2/3
Two Full Signals                      intersecting near vertex of signal      80/210 = 8/21
                                      intersecting near tail of the signal    40/210 = 4/21
Two Partial Signals                   intersecting near vertex of signal      60/210 = 2/7
                                      intersecting near tail of the signal    40/210 = 4/21
One Full Signal, One Partial Signal   intersecting near vertex of signal      80/210 = 8/21
                                      intersecting near tail of the signal    60/210 = 2/7
Table 6.1 Results of Increasing Noise Contamination.
When two signal clusters were present the L2E method proved to be sensitive to any noise data points
that were located between the two signals. In this situation, the L2E method consistently fit a single
curve between the two signals and created one signal. Figure 6.1 displays an example of when the
L2E method failed to correctly identify two signals. In this investigation, fitting a single curve when


Figure 6.1 Incorrect Signal Identification by L2E Method. There are two
signals present; however, the L2E method was unable to correctly identify
them.
two signals are present is considered a failure. Table 6.1 shows that the L2E method performs
better with two full signals (vertex intersecting: 80/210 = 8/21; tail intersecting: 4/21) than with two
partial signals (vertex intersecting: 2/7; tail intersecting: 4/21). The L2E method performed best
when one partial signal cluster and one full signal cluster were present (vertex intersecting: 8/21; tail
intersecting: 2/7).
The L2E method also appeared sensitive to the density of the noise cluster. When the noise cluster
was less dense (spread out), the L2E method succeeded in finding the present signal(s) regardless of
the location of the noise. For example, in one data record where the noise cluster variation was high


and one partial signal cluster was present, the L2E method was able to correctly identify the signal
curve with a noise contamination of 260/120 = 13/6.
The results of varying the vertical separation distance between two signals with a fixed amount of
noise are shown in Table 6.2. The vertical distance is defined as the amount of vertical shift between
the two signals before random noise is added to each signal. Vertical separation distance is measured
on the same scale as TEC and was allowed to vary from 2 to 4. The amount of fixed noise was chosen
such that, at a vertical separation distance of 2.5, the L2E method was able to correctly
identify the two signals as separate clusters. The L2E method was able to correctly distinguish two full
signal clusters when the vertical distance between the two ranged from 2.3 to 3.4. For correctly
identifying two partial clusters, the average range of separation is 2.3 to 2.6. For correctly identifying
one partial cluster and one full cluster, the average range of separation is 2.4 to 3.2. Overall, the L2E
method has difficulty distinguishing between signal clusters that are too close and signal clusters that
are so far apart that too many noise points appear between the signals. Signals that are too close
may in fact be the same signal, or possibly echoes. When signals are too far apart and too many noise
data points appear between them, the L2E method interprets this as a single signal cluster, and by
fitting a single curve between them creates residuals that are below the set L2E threshold.
Signal Type                           Fixed Noise Contamination   Distance at Correct Identification
Two Full Signals                      50:210                      2.3-3.4
Two Partial Signals                   50:210                      2.3-2.6
One Full Signal, One Partial Signal   50:210                      2.4-3.2
Table 6.2 Results of Varying Signal Separation with Fixed Noise Contamination.


Measuring the performance of the L2E method based on the simulated data results depends partly
on the main purpose of finding signals. If the desired outcome is to correctly identify each signal in a
data record (i.e., it is important to know how many signals are present), or when the exact equation of
the signal is desired, then the L2E method performs accurately when finding single clusters with a
noise-to-signal ratio of 80:120, and when finding two clusters separated by an average of 2.45 with a
noise-to-signal ratio of 40:120. These results hold for noise clusters in close proximity to the
signal(s). If each data record were considered a piece of a mosaic over a section of the earth's
surface, then perhaps it would be satisfactory to simply identify records with at least one cluster. In
this situation it may be acceptable to fit only a single curve between two clusters, and so the L2E
method performs relatively better and can handle a greater noise-to-signal ratio.


7. Future Research
Areas for further investigation include manipulation of simulated data to extend the investigation of
the L2E performance, the versatility and sensitivity of the supplied L2E code, and collaboration with
scientists to interpret signal behavior and results.
The L2E code appeared to be sensitive to the amount of random noise variation about the signal. For
example, when the noise variation about the simulated signals was 0.25, the L2E method failed to
distinguish between two signals even with a minimal amount of noise present and a moderate
amount of vertical separation between the two signals. When the noise variation was reduced to 0.1
and 0.08, the L2E method was successful in finding both signals. Perhaps introducing another
constraint during the L2E process, comparing the scene noise variation (σ_noise) to the signal noise
variation (σ_signal), could reduce the tendency to fit one curve to two signals. Another possible
solution would be to include an iteration process in which, after the curves are fit, the process is
repeated using information learned from the first fit. It is likely that the L2E method is more
successful in detecting straight lines than curves; therefore, it would be interesting to investigate
whether a square root transformation of TEC versus time would flatten the signal curves into
straight signal clusters that the L2E method could more easily identify.
The L2E method currently fits a single quadratic curve to the data in the presence of noise
contamination. The identified signal data points are removed and the process is repeated. Another
approach would be to fit the L2E method as a mixture to look for multiple signals. That is, the L2E


criterion could be modified to directly fit two curves at once rather than fitting one, removing points,
and fitting again.
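As a sketch of what such a direct two-component criterion might look like (this is not part of the supplied code; all names are illustrative, and the cross term uses the standard Gaussian product-integral identity):

## Hypothetical two-component L2E criterion; theta packs
## (a1, b1, log c1, log sig1, w1, a2, b2, log c2, log sig2, w2).
l2e.two.crit <- function(theta, x, y) {
  m1 <- theta[1] + theta[2]*x + exp(theta[3])*x^2   # first concave-up quadratic
  s1 <- exp(theta[4]); w1 <- theta[5]
  m2 <- theta[6] + theta[7]*x + exp(theta[8])*x^2   # second concave-up quadratic
  s2 <- exp(theta[9]); w2 <- theta[10]
  ## integral of the squared two-component density: two "self" terms plus a
  ## cross term, since the product of two Gaussians integrates to a Gaussian
  self  <- w1^2/(2*sqrt(pi)*s1) + w2^2/(2*sqrt(pi)*s2)
  cross <- 2*w1*w2*mean(dnorm(m1 - m2, 0, sqrt(s1^2 + s2^2)))
  ## expected height term, estimated by the sample average (as in Eq. 5.4)
  eht <- mean(w1*dnorm(y - m1, 0, s1) + w2*dnorm(y - m2, 0, s2))
  self + cross - 2*eht
}
## e.g.:  nlm(l2e.two.crit, theta0, x = time, y = tec)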
Future work will also examine how much noise variation about the signals, on average, can be
present before the L2E method fails to correctly identify both signals. Another manipulation of the
simulated data would be to systematically reduce the strength of the signal (reduce the number of
signal data points) with a fixed amount of noise, to investigate how weak the signal can be before the
L2E method fails to identify it. Simulating multiple curves that are not parallel, by
incorporating a horizontal shift, is also on the agenda for future research, since this behavior is evident
in the 30 real data sets.
The noise was simulated from a bivariate normal distribution; however, diagnostic checks of the 30
real data sets used by Burr et al. (2001) showed that this assumption was false. Part of the reason for
the failure may be the multiple clusters of noise points in each data record. Future data
simulation will involve more realistic noise simulation. The first step of this process will be to
estimate the density distribution using a non-parametric technique to fit a mixture of normal densities.
Such a process is described by Scott and Szewczyk (2000).
The L2E code provided was created for the original supplied data that was queried from a much larger
and more diverse database. For the goal of automated signal detection, the sensitivity of the starting
values of the L2E method as well as the dynamics of the nlm (non-linear minimization) function
should be examined. Specifically, a constraint that the weight parameter cannot be greater than one
should be included.
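One simple way to impose that constraint, sketched here as an assumption rather than a tested fix, is to optimize an unconstrained parameter and map it into (0, 1):

## hypothetical modification inside reg.w.crit: replace  w <- x[5]  with
w <- plogis(x[5])   # logistic map keeps the weight strictly between 0 and 1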


Finally, collaboration with FORTE satellite scientists may be organized to investigate whether
there are significant differences in interpretation between full and partial signals. Parallel signals
may also be an example of an "echo" (a high altitude discharge, based on the well-demonstrated upward
discharge above thunderclouds during storm electrification conditions (Jacobson et al., 1999)). If
parallel signals are not necessarily echoes, then being able to distinguish between the
two signals may be favorable.


Appendix A. Relationship Between TEC and Radio Frequency
Radio frequencies that exceed a threshold trigger the recording system on the satellite. There exists a
relationship between ionospheric dispersion, radio frequency, and TEC (Jacobson et al., 1999):

$$t(\mu s) = 1.34 \times \left\{ N / (10^{17}\ \mathrm{m}^{-2}) \right\} \times \left\{ f / (100\ \mathrm{MHz}) \right\}^{-2}, \qquad (A.1)$$

where t is the group delay, f is the radio frequency, and N is the line-of-sight-integrated total electron
content (TEC).
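As an illustration, Equation (A.1) can be evaluated directly; this small helper is a sketch, not part of the thesis code.

## Group delay in microseconds from Equation (A.1).
group.delay <- function(N, f) {
  ## N: line-of-sight integrated TEC in m^-2; f: radio frequency in Hz
  1.34 * (N / 1e17) * (f / 1e8)^(-2)
}
group.delay(N = 1e17, f = 100e6)   # returns 1.34 microseconds at 100 MHz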


Appendix B. Original R Code
####################
### The following process increases noise to signal ratio
### to investigate the L2E effectiveness in picking out signals
# Start with data: full signal
# Fix number of signal points
# increase noise data points
# When do we fail to find the signal?
II <- 15           # number of iterations, increasing the noise
A  <- 10.427       # parameters obtained from sample data set
B  <- -0.036
CC <- 0.0000356
X.sample <- matrix(0, 1, 120)   # arrays for one full curve signal data
Y.sample <- matrix(0, 1, 120)
Xp <- matrix(0, 1, 120)         # arrays for one partial curve signal data
Yp <- matrix(0, 1, 120)

## simulates one full signal ##
X.sample <- rnorm(120, mean = 500, sd = 150)
Y.sample <- A + B*X.sample + CC*(X.sample^2) + rnorm(120, mean = 0, sd = .4)
gs.temp <- cbind(rep(1, 120))
xs.temp <- cbind(X.sample)
ys.temp <- cbind(Y.sample)
sig.data <- cbind(gs.temp, ys.temp, xs.temp)

## simulates one partial signal ##
Xp <- rnorm(120, mean = 520, sd = 150)
for(j in 1:120){ if (Xp[j] >= 530) Xp[j] <- (2*520 - Xp[j] - 60) }   # reflect the upper tail
Yp <- (A + 2) + B*Xp + CC*(Xp^2) + rnorm(120, mean = 0, sd = .4)
gp.temp <- cbind(rep(1, 120))
xp.temp <- cbind(Xp)
yp.temp <- cbind(Yp)
sigp.data <- cbind(gp.temp, yp.temp, xp.temp)

Nxsigma <- matrix(0, 7, 1)
Nysigma <- matrix(0, 7, 1)
rho  <- matrix(0, 7, 1)
Nxmu <- matrix(0, 7, 1)
Nymu <- matrix(0, 7, 1)
para <- matrix(0, 7, 5)   # records parameters of noise distribution
samp <- paste("Sample", 1:30)


## The process below simulates increasing amounts of noise
## for one full curve
## The distribution of noise was assumed to be bivariate normal
for (k in 1:7){
  Nxsigma[k] <- abs(rnorm(1, mean = 100, sd = 30))
  Nysigma[k] <- abs(rnorm(1, mean = 1.85, sd = .5))
  rho[k]  <- runif(1, -.8, .8)
  Nxmu[k] <- rnorm(1, mean = 500, sd = 75)
  Nymu[k] <- rnorm(1, mean = 4.7, sd = 1.05)
  para[k, ] <- cbind(Nxsigma[k], Nysigma[k], rho[k], Nxmu[k], Nymu[k])
  Nx <- matrix(0, 20*II, 1)
  Ny <- matrix(0, 20*II, 1)
  Nx <- rnorm(20*II, mean = Nxmu[k], sd = Nxsigma[k])   # starting values
  for(i in 1:(20*II)){
    Ny[i] <- rnorm(1, mean = Nymu[k] + (rho[k]*Nysigma[k]*(Nx[i] - Nxmu[k]))/Nxsigma[k],
                   sd = Nysigma[k]*sqrt(1 - rho[k]^2))
  }
  ## Combining noise and signal into one data set
  for (ii in 1:II) {
    ss <- 20*ii
    g.temp <- matrix(0, ss, 1)
    x.temp <- matrix(0, ss, 1)
    y.temp <- matrix(0, ss, 1)
    plot(X.sample, Y.sample, xlim = c(0, 1000), ylim = c(0, 10),
         main = paste(samp[ii], "x20", k, sep = " "))
    points(Nx[1:ss], Ny[1:ss], pch = "*", cex = 1.3)
    #browser()
    hh <- 120 + ss
    simul.data <- matrix(0, hh, 3)   # array for the entire simulated data set (noise & signal)
    simul.data.noise <- matrix(0, ss, 3)
    g.temp <- cbind(rep(2, ss))
    x.temp <- cbind(Nx[1:ss])
    y.temp <- cbind(Ny[1:ss])
    simul.data.noise <- cbind(g.temp, y.temp, x.temp)
    simul.data <- rbind(sig.data, simul.data.noise)
    assign(paste("full1", ii, k, sep = "."), simul.data)
  }
}
## The process below simulates increasing amounts of noise
## for one partial curve
## The distribution of noise was assumed to be bivariate normal
for (k in 1:7){
  Nxsigma[k] <- abs(rnorm(1, mean = 100, sd = 30))
  Nysigma[k] <- abs(rnorm(1, mean = 1.85, sd = .5))
  rho[k]  <- runif(1, -.8, .8)
  Nxmu[k] <- rnorm(1, mean = 200, sd = 75)
  Nymu[k] <- rnorm(1, mean = 4.7, sd = 1.05)
  para[k, ] <- cbind(Nxsigma[k], Nysigma[k], rho[k], Nxmu[k], Nymu[k])
  Nx <- matrix(0, 20*II, 1)
  Ny <- matrix(0, 20*II, 1)
  Nx <- rnorm(20*II, mean = Nxmu[k], sd = Nxsigma[k])   # starting values
  for(i in 1:(20*II)){
    Ny[i] <- rnorm(1, mean = Nymu[k] + (rho[k]*Nysigma[k]*(Nx[i] - Nxmu[k]))/Nxsigma[k],
                   sd = Nysigma[k]*sqrt(1 - rho[k]^2))
  }
  ## Combining noise and signal into one data set
  for (ii in 1:II) {
    ss <- 20*ii
    g.temp <- matrix(0, ss, 1)
    x.temp <- matrix(0, ss, 1)
    y.temp <- matrix(0, ss, 1)
    #plot(Xp, Yp, xlim = c(0, 1000), ylim = c(0, 10),
    #     main = paste(samp[ii], "x20", k, sep = " "))
    #points(Nx[1:ss], Ny[1:ss], pch = "*", cex = 1.3)
    #browser()
    hh <- 120 + ss
    simul.data <- matrix(0, hh, 3)   # array for the entire simulated data set (noise & signal)
    simul.data.noise <- matrix(0, ss, 3)
    g.temp <- cbind(rep(2, ss))
    x.temp <- cbind(Nx[1:ss])
    y.temp <- cbind(Ny[1:ss])
    simul.data.noise <- cbind(g.temp, y.temp, x.temp)
    simul.data <- rbind(sigp.data, simul.data.noise)
    assign(paste("part1", ii, k, sep = "."), simul.data)
  }
}
####################
### The following code simulates 2 signals with increasing noise
II <- 15           # number of sample data sets to construct
A  <- 10.427       # parameters obtained from above sample data set
B  <- -0.036
CC <- 0.0000356
X.sample <- matrix(0, 1, 120)    # arrays for signal data
Y.sample <- matrix(0, 1, 120)
X2.sample <- matrix(0, 1, 90)    # arrays for signal data
Y2.sample <- matrix(0, 1, 90)

################# simulating 2 signals: one full, one partial #################
X.sample <- rnorm(120, mean = 500, sd = 120)
Y.sample <- A + B*X.sample + CC*(X.sample^2) + rnorm(120, mean = 0, sd = .25)
X2.sample <- rnorm(90, mean = 520, sd = 150)
for(j in 1:90){ if (X2.sample[j] >= 530) X2.sample[j] <- (2*520 - X2.sample[j] - 60) }
Y2.sample <- (A + 2.5) + B*X2.sample + CC*(X2.sample^2) + rnorm(90, mean = 0, sd = .3)
g1s.temp <- cbind(rep(1, 120))
x1s.temp <- cbind(X.sample)
y1s.temp <- cbind(Y.sample)
sig1.data <- cbind(g1s.temp, y1s.temp, x1s.temp)
g2s.temp <- cbind(rep(2, 90))
x2s.temp <- cbind(X2.sample)
y2s.temp <- cbind(Y2.sample)
sig2.data <- cbind(g2s.temp, y2s.temp, x2s.temp)
sig.data <- rbind(sig1.data, sig2.data)

Nxsigma <- matrix(0, 7, 1)
Nysigma <- matrix(0, 7, 1)
rho  <- matrix(0, 7, 1)
Nxmu <- matrix(0, 7, 1)
Nymu <- matrix(0, 7, 1)
para <- matrix(0, 7, 5)   # records parameters of noise distribution
samp22 <- paste("Sample", 1:30, "w/2 signals")
## The process below simulates an increasing single noise cluster
## The distribution was assumed to be bivariate normal
for(k in 1:7){
  Nxsigma[k] <- abs(rnorm(1, mean = 100, sd = 40))
  Nysigma[k] <- abs(rnorm(1, mean = 2, sd = 1))
  rho[k]  <- runif(1, -.8, .8)
  Nxmu[k] <- rnorm(1, mean = 400, sd = 195)
  Nymu[k] <- rnorm(1, mean = 6, sd = 2)
  para[k, ] <- cbind(Nxsigma[k], Nysigma[k], rho[k], Nxmu[k], Nymu[k])
  Nx <- matrix(0, 20*II, 1)
  Ny <- matrix(0, 20*II, 1)
  Nx <- rnorm(20*II, mean = Nxmu[k], sd = Nxsigma[k])   # starting values
  for(i in 1:(20*II)){
    Ny[i] <- rnorm(1, mean = Nymu[k] + (rho[k]*Nysigma[k]*(Nx[i] - Nxmu[k]))/Nxsigma[k],
                   sd = Nysigma[k]*sqrt(1 - rho[k]^2))
  }
  ## Combining noise and signals into one data set
  for (ii in 1:II){
    ss <- 20*ii
    g.temp <- matrix(0, ss, 1)
    x.temp <- matrix(0, ss, 1)
    y.temp <- matrix(0, ss, 1)
    plot(X.sample, Y.sample, pch = "1", xlim = c(0, 1000), ylim = c(0, 10),
         main = paste(samp22[k], "20x", ii, sep = " "))
    points(X2.sample, Y2.sample, pch = "2")
    points(Nx[1:ss], Ny[1:ss], pch = "*", cex = 1.3)
    #browser()
    hh <- 120 + 90 + ss
    simul.data <- matrix(0, hh, 3)
    simul.data.noise <- matrix(0, ss, 3)
    g.temp <- cbind(rep(3, ss))
    x.temp <- cbind(Nx[1:ss])
    y.temp <- cbind(Ny[1:ss])
    simul.data.noise <- cbind(g.temp, y.temp, x.temp)
    simul.data <- rbind(sig.data, simul.data.noise)
    assign(paste("partfull", ii, k, sep = "."), simul.data)
  }
}
#################### simulating 2 signals: two full #######################
X.sample <- rnorm(120, mean = 500, sd = 150)
Y.sample <- A + B*X.sample + CC*(X.sample^2) + rnorm(120, mean = 0, sd = .25)
X2.sample <- rnorm(90, mean = 520, sd = 150)
Y2.sample <- (A + 2.5) + B*X2.sample + CC*(X2.sample^2) + rnorm(90, mean = 0, sd = .3)
g1s.temp <- cbind(rep(1, 120))
x1s.temp <- cbind(X.sample)
y1s.temp <- cbind(Y.sample)
sig1.data <- cbind(g1s.temp, y1s.temp, x1s.temp)
g2s.temp <- cbind(rep(2, 90))
x2s.temp <- cbind(X2.sample)
y2s.temp <- cbind(Y2.sample)
sig2.data <- cbind(g2s.temp, y2s.temp, x2s.temp)
sig.data <- rbind(sig1.data, sig2.data)

Nxsigma <- matrix(0, 7, 1)
Nysigma <- matrix(0, 7, 1)
rho  <- matrix(0, 7, 1)
Nxmu <- matrix(0, 7, 1)
Nymu <- matrix(0, 7, 1)
para <- matrix(0, 7, 5)   # records parameters of noise distribution
samp22 <- paste("Sample", 1:30, "w/2 signals")

## The process below simulates an increasing single noise cluster
## The distribution was assumed to be bivariate normal
for(k in 1:7){
  Nxsigma[k] <- abs(rnorm(1, mean = 100, sd = 40))
  Nysigma[k] <- abs(rnorm(1, mean = 2, sd = 1))
  rho[k]  <- runif(1, -.8, .8)
  Nxmu[k] <- rnorm(1, mean = 400, sd = 200)
  Nymu[k] <- rnorm(1, mean = 6, sd = 2)
  para[k, ] <- cbind(Nxsigma[k], Nysigma[k], rho[k], Nxmu[k], Nymu[k])
  Nx <- matrix(0, 20*II, 1)
  Ny <- matrix(0, 20*II, 1)
  Nx <- rnorm(20*II, mean = Nxmu[k], sd = Nxsigma[k])   # starting values
  for(i in 1:(20*II)){
    Ny[i] <- rnorm(1, mean = Nymu[k] + (rho[k]*Nysigma[k]*(Nx[i] - Nxmu[k]))/Nxsigma[k],
                   sd = Nysigma[k]*sqrt(1 - rho[k]^2))
  }
  ## Combining noise and signals into one data set
  for (ii in 1:II){
    ss <- 20*ii
    g.temp <- matrix(0, ss, 1)
    x.temp <- matrix(0, ss, 1)
    y.temp <- matrix(0, ss, 1)
    #plot(X.sample, Y.sample, pch = "1", xlim = c(0, 1000), ylim = c(0, 10),
    #     main = paste(samp22[k], "20x", ii, sep = " "))
    #points(X2.sample, Y2.sample, pch = "2")
    #points(Nx[1:ss], Ny[1:ss], pch = "*", cex = 1.3)
    #browser()
    hh <- 120 + 90 + ss
    simul.data <- matrix(0, hh, 3)
    simul.data.noise <- matrix(0, ss, 3)
    g.temp <- cbind(rep(3, ss))
    x.temp <- cbind(Nx[1:ss])
    y.temp <- cbind(Ny[1:ss])
    simul.data.noise <- cbind(g.temp, y.temp, x.temp)
    simul.data <- rbind(sig.data, simul.data.noise)
    assign(paste("full2", ii, k, sep = "."), simul.data)
  }
}
################### simulating 2 signals: two partial #####################
X.sample <- rnorm(120, mean = 500, sd = 150)
for(j in 1:120){ if (X.sample[j] >= 520) X.sample[j] <- (2*530 - X.sample[j] - 50) }
Y.sample <- A + B*X.sample + CC*(X.sample^2) + rnorm(120, mean = 0, sd = .25)
X2.sample <- rnorm(90, mean = 520, sd = 150)
for(j in 1:90){ if (X2.sample[j] >= 530) X2.sample[j] <- (2*520 - X2.sample[j] - 60) }
Y2.sample <- (A + 2.5) + B*X2.sample + CC*(X2.sample^2) + rnorm(90, mean = 0, sd = .3)
g1s.temp <- cbind(rep(1, 120))
x1s.temp <- cbind(X.sample)
y1s.temp <- cbind(Y.sample)
sig1.data <- cbind(g1s.temp, y1s.temp, x1s.temp)
g2s.temp <- cbind(rep(2, 90))
x2s.temp <- cbind(X2.sample)
y2s.temp <- cbind(Y2.sample)
sig2.data <- cbind(g2s.temp, y2s.temp, x2s.temp)
sig.data <- rbind(sig1.data, sig2.data)

Nxsigma <- matrix(0, 7, 1)
Nysigma <- matrix(0, 7, 1)
rho  <- matrix(0, 7, 1)
Nxmu <- matrix(0, 7, 1)
Nymu <- matrix(0, 7, 1)
para <- matrix(0, 7, 5)   # records parameters of noise distribution
samp22 <- paste("Sample", 1:30, "w/2 signals")

## The process below simulates an increasing single noise cluster
## The distribution was assumed to be bivariate normal
for(k in 1:7) {
  Nxsigma[k] <- abs(rnorm(1, mean = 100, sd = 40))
  Nysigma[k] <- abs(rnorm(1, mean = 2, sd = 1))
  rho[k]  <- runif(1, -.8, .8)
  Nxmu[k] <- rnorm(1, mean = 410, sd = 170)
  Nymu[k] <- rnorm(1, mean = 6, sd = 2)
  para[k, ] <- cbind(Nxsigma[k], Nysigma[k], rho[k], Nxmu[k], Nymu[k])
  Nx <- matrix(0, 20*II, 1)
  Ny <- matrix(0, 20*II, 1)
  Nx <- rnorm(20*II, mean = Nxmu[k], sd = Nxsigma[k])   # starting values
  for(i in 1:(20*II)){
    Ny[i] <- rnorm(1, mean = Nymu[k] + (rho[k]*Nysigma[k]*(Nx[i] - Nxmu[k]))/Nxsigma[k],
                   sd = Nysigma[k]*sqrt(1 - rho[k]^2))
  }
  ## Combining noise and signals into one data set
  for (ii in 1:II){
    ss <- 20*ii
    g.temp <- matrix(0, ss, 1)
    x.temp <- matrix(0, ss, 1)
    y.temp <- matrix(0, ss, 1)
    #plot(X.sample, Y.sample, pch = "1", xlim = c(0, 1000), ylim = c(0, 10),
    #     main = paste(samp22[k], "20x", ii, sep = " "))
    #points(X2.sample, Y2.sample, pch = "2")
    #points(Nx[1:ss], Ny[1:ss], pch = "*", cex = 1.3)
    #browser()
    hh <- 120 + 90 + ss
    simul.data <- matrix(0, hh, 3)
    simul.data.noise <- matrix(0, ss, 3)
    g.temp <- cbind(rep(3, ss))
    x.temp <- cbind(Nx[1:ss])
    y.temp <- cbind(Ny[1:ss])
    simul.data.noise <- cbind(g.temp, y.temp, x.temp)
    simul.data <- rbind(sig.data, simul.data.noise)
    assign(paste("part2", ii, k, sep = "."), simul.data)
  }
}
####################
### The following code simulates two signals and varies the distance between them
II <- 15          # number of sample data sets to construct
A  <- 10.427      # parameters obtained from above sample data set
B  <- -0.036
CC <- 0.0000356
samp22 <- paste("Sample", 1:30, "w/2 signals")
X.sample <- matrix(0, 1, 120)    # arrays for signal data
Y.sample <- matrix(0, 1, 120)
X2.sample <- matrix(0, 1, 90)    # arrays for signal data
Y2.sample <- matrix(0, II, 90)   # one row per separation distance

## The process below simulates a single noise cluster
## The distribution was assumed to be bivariate normal
Nxsigma <- matrix(0, 1, 1)
Nysigma <- matrix(0, 1, 1)
rho  <- matrix(0, 1, 1)
Nxmu <- matrix(0, 1, 1)
Nymu <- matrix(0, 1, 1)
para <- matrix(0, 1, 5)   # records parameters of noise distribution
Nxsigma <- abs(rnorm(1, mean = 100, sd = 40))
Nysigma <- abs(rnorm(1, mean = 2, sd = 1))
rho  <- runif(1, -.6, .6)
Nxmu <- rnorm(1, mean = 400, sd = 195)
Nymu <- rnorm(1, mean = 6, sd = 2)
para <- cbind(Nxsigma, Nysigma, rho, Nxmu, Nymu)
Nx <- matrix(0, 50, 1)
Ny <- matrix(0, 50, 1)
Nx <- rnorm(50, mean = Nxmu, sd = Nxsigma)   # starting values
for(i in 1:50) {
  Ny[i] <- rnorm(1, mean = Nymu + (rho*Nysigma*(Nx[i] - Nxmu))/Nxsigma,
                 sd = Nysigma*sqrt(1 - rho^2))
}
################ simulating 2 signals: one full, one partial ##############
################ decreasing separation distance ##############
X.sample <- rnorm(120, mean = 500, sd = 120)
Y.sample <- A + B*X.sample + CC*(X.sample^2) + rnorm(120, mean = 0, sd = .25)
X2.sample <- rnorm(90, mean = 520, sd = 150)
for(j in 1:90){ if (X2.sample[j] >= 530) X2.sample[j] <- (2*520 - X2.sample[j] - 60) }
for(ii in 1:II){
  Y2.sample[ii, ] <- (A + 2 + .2*ii) + B*X2.sample + CC*(X2.sample^2) + rnorm(90, mean = 0, sd = .3)
  g1s.temp <- cbind(rep(1, 120))
  x1s.temp <- cbind(X.sample)
  y1s.temp <- cbind(Y.sample)
  sig1.data <- cbind(g1s.temp, y1s.temp, x1s.temp)
  g2s.temp <- cbind(rep(2, 90))
  x2s.temp <- cbind(X2.sample)
  y2s.temp <- cbind(Y2.sample[ii, ])
  sig2.data <- cbind(g2s.temp, y2s.temp, x2s.temp)
  sig.data <- rbind(sig1.data, sig2.data)
  g.temp <- matrix(0, 50, 1)
  x.temp <- matrix(0, 50, 1)
  y.temp <- matrix(0, 50, 1)
  plot(X.sample, Y.sample, pch = "1", xlim = c(0, 1000), ylim = c(0, 10),
       main = paste(samp22[ii], "20x", ii, sep = " "))
  points(X2.sample, Y2.sample[ii, ], pch = "2")
  points(Nx, Ny, pch = "*", cex = 1.3)
  #browser()
  simul.data <- matrix(0, 210, 3)
  simul.data.noise <- matrix(0, 50, 3)
  g.temp <- cbind(rep(3, 50))
  x.temp <- cbind(Nx[1:50])
  y.temp <- cbind(Ny[1:50])
  simul.data.noise <- cbind(g.temp, y.temp, x.temp)
  simul.data <- rbind(sig.data, simul.data.noise)
  assign(paste("sep2", ii, sep = "."), simul.data)
}
#################### simulating 2 signals: two full #######################
#################### decreasing separation distance #######################
X. sample<-matrix(0,1,120) #Arrays for signal data
Y. sample<-matrix(0,1,120)
X2.sample<-matrix(0,1,90) #Arrays for signal data
Y2.sample<-matrix(0,11,90)
X. sample<- rnorm(120,mean=500,sd=120)
Y. sample<- (A) + (B *X.sample) + (CC (X.sampled))+
rnorm(120,mean=0,sd=.25)
X2 ,sample<- rnorm(90,mean=520,sd=150)
for(ii in 1:11){
Y2.sample[ii,] <- (A+2+.2*ii) + (B*X2.sample) + (CC*(X2.sample^2)) +
rnorm(90,mean=0,sd=.3)
g1s.temp <- cbind(rep(1,120))
x1s.temp <- cbind(X.sample)
y1s.temp <- cbind(Y.sample)
sig1.data <- cbind(g1s.temp,y1s.temp,x1s.temp)
g2s.temp <- cbind(rep(2,90))
x2s.temp <- cbind(X2.sample)
y2s.temp <- cbind(Y2.sample[ii,])
sig2.data <- cbind(g2s.temp,y2s.temp,x2s.temp)
sig.data <- rbind(sig1.data,sig2.data)
g.temp <- matrix(0,50,1)
x.temp <- matrix(0,50,1)
y.temp <- matrix(0,50,1)
plot(X.sample,Y.sample,pch="1",xlim=c(0,1000),ylim=c(0,10),
main=paste(samp22[k],"20x",ii,sep=" "))
points(X2.sample,Y2.sample[ii,],pch="2")
points(Nx,Ny,pch="*",cex=1.3)
#browser()
simul.data <- matrix(0,210,3)
simul.data.noise <- matrix(0,50,3)
g.temp <- cbind(rep(3,50))
x.temp <- cbind(Nx[1:50])
y.temp <- cbind(Ny[1:50])
simul.data.noise <- cbind(g.temp,y.temp,x.temp)
simul.data <- rbind(sig.data,simul.data.noise)
assign(paste("sepfull",ii,sep="."),simul.data)
}
################### simulating 2 signals: two partial #####################
################### decreasing separation distance #####################
X.sample <- rnorm(120,mean=500,sd=120)
# fold points beyond x = 535 back leftward so signal 1 is only partial
for(j in 1:120){if (X.sample[j] >= 535) X.sample[j] <- (2*525 - X.sample[j] - 60)}
Y.sample <- (A) + (B*X.sample) + (CC*(X.sample^2)) +
rnorm(120,mean=0,sd=.25)
X2.sample <- rnorm(90,mean=520,sd=150)
for(j in 1:90){if (X2.sample[j] >= 530) X2.sample[j] <- (2*520 - X2.sample[j] - 60)}
for(ii in 1:11){
Y2.sample[ii,] <- (A+2+.2*ii) + (B*X2.sample) + (CC*(X2.sample^2)) +
rnorm(90,mean=0,sd=.3)
g1s.temp <- cbind(rep(1,120))
x1s.temp <- cbind(X.sample)
y1s.temp <- cbind(Y.sample)
sig1.data <- cbind(g1s.temp,y1s.temp,x1s.temp)
g2s.temp <- cbind(rep(2,90))
x2s.temp <- cbind(X2.sample)
y2s.temp <- cbind(Y2.sample[ii,])
sig2.data <- cbind(g2s.temp,y2s.temp,x2s.temp)
sig.data <- rbind(sig1.data,sig2.data)
g.temp <- matrix(0,50,1)
x.temp <- matrix(0,50,1)
y.temp <- matrix(0,50,1)
plot(X.sample,Y.sample,pch="1",xlim=c(0,1000),ylim=c(0,10),
main=paste(samp22[k],"20x",ii,sep=" "))
points(X2.sample,Y2.sample[ii,],pch="2")
points(Nx,Ny,pch="*",cex=1.3)
#browser()
simul.data <- matrix(0,210,3)
simul.data.noise <- matrix(0,50,3)
g.temp <- cbind(rep(3,50))
x.temp <- cbind(Nx[1:50])
y.temp <- cbind(Ny[1:50])
simul.data.noise <- cbind(g.temp,y.temp,x.temp)
simul.data <- rbind(sig.data,simul.data.noise)
assign(paste("seppart",ii,sep="."),simul.data)
}
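### Hypothetical inspection (not in the original code) of one generated set:
### each sep2.*, sepfull.*, and seppart.* object is a 260 x 3 matrix whose
### columns are group label (1 = signal 1, 2 = signal 2, 3 = noise), the
### TEC-like response y, and the time-like covariate x.
tempd <- get("seppart.1")
print(dim(tempd)) # 260 rows: 120 + 90 signal points plus 50 noise points
print(table(tempd[,1])) # counts of points per group label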
###################
###The following function is called during the cluster fitting process
###It has been modified to accommodate simulated data
sim.reg.w <- function(x, y, xin, w.opt = T, pl = 1)
{
reg.w.crit <- function(x)
{
a <- x[1]
b <- x[2]
cc <- exp(x[3])
sig <- exp(x[4])
if(w.opt) {
w <- x[5]
}
else {
w <- win
}
ei <- y - a - b*X - cc*X^2
# L2E criterion for a normal error model with cluster weight w
w^2/(2 * sqrt(pi) * sig) - 2 * w * mean(dnorm(ei, 0, sig))
}
plot(x, y, pch = paste(data[temp.use,1]))
assign("w.opt, w.opt)#, f = 1)
assign("X", x)#, f = 1)
assign("y", y)#, f = 1)
if(missing(xin)) {
xin <- c(0, 0, 0.5, 0.4, .5)
}
assign("win", xin[5])#, f = 1)
if(xin[3] <= 0)
stop("input cc parameter must be positive")
x0 <- c(xin[1:2], log(xin[3:4]))
if(w.opt) {
x0 <- c(x0, win)
}
ans <- nlm(reg.w.crit, x0, iterlim = 100, #max.fcal = 150,
print.level = pl)
a <- ans$estim[1]
b <- ans$estim[2]
cc <- exp(ans$estim[3])
sig <- exp(ans$estim[4])
xx <- seq(min(x), max(x), length = 100)
yp <- a + b * xx + cc * xx^2
lines(xx, yp)
if(w.opt) {
w <- ans$estim[5]
}
else {
w <- win
}
print(round(c(a, b, cc, sig, w), 4))
c(a, b, cc, sig, w) # return fitted (a, b, cc, sigma, w) for the driver loops
}
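### A minimal, hypothetical one-off call to sim.reg.w (the driver loops below
### do this systematically); the function reads the globals data and temp.use
### for its plot, so those must be set first.
data <- get("sepfull.1")
data[,2:3] <- scale(data[,2:3]) # scale TEC and time as the drivers do
temp.use <- 1:nrow(data)
fit <- sim.reg.w(x=data[temp.use,3], y=data[temp.use,2], pl=0)
# fit holds c(a, b, cc, sigma, w): quadratic coefficients, residual scale,
# and the estimated proportion of points belonging to the fitted cluster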
###################
###The following code uses the L2E method for curve fitting
###It has been modified to accommodate simulated data
source("A:/smiles/sim.reg.w.txt")
L<-3
do.list <- 1:15
sum.ana<-matrix(0,length(do.list), 3)
for(ii in 1:length(do.list)){
k <- do.list[ii]
print(k)
browser()
temp <- get(paste("partfull,k,L,sep="."))
data <- temp
cut1 <- 2
# cut1 <- 2
#cut2 <- .1 # mse -- what was threshold
cut2 <- .25 # mse -- what was threshold
min.temp <- 10
data[,2:3]<-scale(data[,2:3])
nfeatpts <- 0
featpts <- vector("listn)
featfits <- vector("list")
continue <- T
temp.use <- 1:nrow(data)
i <- 1
tempw <- 0
while(continue) {
cat("k=",k,"\n")
temp <- sim.reg.w(x=data[temp.use,3],y=data[temp.use,2])
# tempw <- c(tempw,temp[5])
temp.fit <- temp[1] + temp[2]*data[temp.use,3] +
temp[3]*data[temp.use,3]^2
# evaluates fit at all remaining data
temp.resids <- (data[temp.use,2] - temp.fit)/temp[4]
tempw <- c(tempw,temp[5],sum(abs(temp.resids) <
cut1)/length(temp.resids),length(temp.resids))
# sigma in temp[4] and w in temp[5] # SAVE this w
if(sum(abs(temp.resids) < cut1) < 5 || length(temp.use) < 5) {
continue <- F
}
tempa <- qnorm(p=seq(0.001,0.999,length=length(temp.resids)))
tempb <- cor(tempa,sort(temp.resids))
#if(tempb < .9) { continue <- F}
# accept no more than w percent:
browser()
temp.cutw <- (sort(abs(temp.resids)))[temp[5]*length(temp.use)] # smallest w resids
temp.cutw <- min(temp.cutw,cut1)
temp.feature <- (1:nrow(data[temp.use,]))[abs(temp.resids) < temp.cutw]
if(length(temp.feature) < 5) {continue <- F}
temp.feature.originaldata <- (1:nrow(data))[temp.use][abs(temp.resids) <
cut1]
#remaining data in current feature
temp.index <- temp.use[-temp.feature]
nfeatpts <- c(nfeatpts,length(temp.feature))
featpts[[i]] <- temp.feature.originaldata
featfits[[i]] <- temp
i <- i + 1
cat("i=",i,"\n")
if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 7) {continue <- F}
#if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 1) {continue <- F}
temp.use <- (1:nrow(data[temp.use,]))[-temp.feature] # note this step
if(length(temp.use) < 5) {continue <- F}
print(temp.use)
}
p <- length(featpts)
guess <- numeric(nrow(data))
if (p > 0) {
for(i in 1:p) {
guess[featpts[[i]]] <- i
}
}
guess[guess==0] <- p+1
print(k)
assign(paste("partfull",k,L,"12e.result2a",sep="."),guess) # max of 1
assign(paste("partfull",k,L,"12e.result2a.w", sep=". ) ,tempw[-l]) # max of
1
sum.ana[k,]<-cbind(k,sum(guess==l),length(guess))
}
for(ii in 1:length(do.list)){
k <- do.list[ii]
# for(k in 1:30) {
print(k)
temp <- get(paste("partfull,k,L,sep=".))
data <- temp
cutl <- 2.2
cut2 <- .20 # mse -- what was threshold
min.temp <- 10
data[ ,2:3]<-scale(data[,2:3])
nfeatpts <- 0
featpts <- vector("list)
featfits <- vector("list")
continue <- T
temp.use <- l:nrow(data)
i <- 1
tempw <- 0
while(continue) {
print(k)
temp <- sim.reg.w(x=data[temp.use,3],y=data[temp.use,2])
# tempw <- c(tempw,temp[5])
temp.fit <- temp[1] + temp[2]*data[temp.use,3] +
temp[3]*data[temp.use,3]^2
# evaluates fit at all remaining data
temp.resids <- (data[temp.use,2] - temp.fit)/temp[4]
tempw <- c(tempw,temp[5],sum(abs(temp.resids) <
cut1)/length(temp.resids),length(temp.resids))
if(sum(abs(temp.resids) < cut1) < 5 || length(temp.use) < 5) {
continue <- F
}
tempa <- qnorm(p=seq(0.001,0.999,length=length(abs(temp.resids))))
tempb <- cor(tempa,sort(temp.resids))
#if(tempb < .9) { continue <- F}
temp.cutw <- (sort(abs(temp.resids)))[temp[5]*length(temp.use)] # smallest w resids
temp.cutw <- min(temp.cutw,cut1)
temp.feature <- (1:nrow(data[temp.use,]))[abs(temp.resids) < temp.cutw]
if(length(temp.feature) < 5) {continue <- F}
temp.feature.originaldata <- (1:nrow(data))[temp.use][abs(temp.resids) <
cut1]
#remaining data in current feature
temp.index <- temp.use[-temp.feature]
nfeatpts <- c(nfeatpts,length(temp.feature))
featpts[[i]] <- temp.feature.originaldata
featfits[[i]] <- temp
i <- i + 1
if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 6) {continue <- F}
#if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 1) {continue <- F}
temp.use <- (1:nrow(data[temp.use,]))[-temp.feature] # note this step
if(length(temp.use) < 5) {continue <- F}
}
p <- length(featpts)
guess <- numeric(nrow(data))
if(p > 0) {
for(i in 1:p) {
guess[featpts[[i]]] <- i
}
}
guess[guess==0] <- p+1
print(k)
assign(paste("partfull", k, L,"12e.result2b",sep=".") assign(paste("partfull",k,L,"12e,result2b.w",sep="."),tempw[-l])
}
Appendix C. Cited S-Plus Code
###################
### Indices used to read in data records
1 158 433 573 1016 1443 1631 2033 2230 2470 2616 2787 2926 3097 3215
3456 3669 4009 4272 4373 4677 4834 5147 5524 5798 6126 6292 6567 6701 6887
7073
###################
###Code supplied to read in the data records
# You may need to change path to point to the correct directory
path = "A:/smiles/"
data.all = matrix(scan(paste("E:/Is.txt")),byrow=T,ncol=3)
tempi = scan(paste(path,"data.indices.txt",sep="/"))
indices.low <- tempi
indices.high <- tempi[2:31]-1
for(i in 1:30) {
temp <- data.all[indices.low[i]:indices.high[i],]
assign(paste("D",i,sep=""),temp)
}
# data is then in D1, D2, D3, etc.
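### Hypothetical sanity check (not in the original code): each Di should be
### a three-column matrix holding the group label, TEC, and time of each point.
print(dim(D1))
print(D1[1:5,]) # first five points of the first record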
###################
###Function called in cluster analysis; non-linear minimization called
reg.w <- function(x, y, xin, w.opt = T, pl = 1)
{
reg.w.crit <- function(x)
{
a <- x[1]
b <- x[2]
cc <- exp(x[3])
sig <- exp(x[4])
if(w.opt) {
w <- x[5]
}
else {
w <- win
}
ei <- y - a - b*X - cc*X^2
# L2E criterion as in sim.reg.w, plus a .7*sig penalty on the residual scale
w^2/(2 * sqrt(pi) * sig) - 2 * w * mean(dnorm(ei, 0, sig)) + .7*sig
}
plot(x, y, pch =
paste(data[temp.use,1]),xlab="Time(scaled)',ylab="TEC(scaled)")
browser()
plot(x, y)
assign("w.opt", w.opt)#, f = 1)
assign("X", x)#, f = 1)
assign("y", y)#, f = 1)
if(missing(xin)) {
xin <- c(0, 0, 0.5, 0.2,1)
}
assign("win", xin[5])#, f = 1)
if(xin[3] <= 0)
stop("input cc parameter must be positive")
x0 <- c(xin[1:2], log(xin[3:4]))
if(w.opt) {
x0 <- c(x0, win)
}
ans <- nlm(reg.w.crit, x0, iterlim = 100, #max.fcal = 150,
print.level = pl)
a <- ans$estim[1]
b <- ans$estim[2]
cc <- exp(ans$estim[3])
sig <- exp(ans$estim[4])
xx <- seq(min(x), max(x), length = 100)
yp <- a + b * xx + cc * xx^2
lines(xx, yp)
browser()
if(w.opt) {
w <- ans$estim[5]
}
else {
w <- win
}
print(round(c(a, b, cc, sig, w), 4))
c(a, b, cc, sig, w) # return fitted (a, b, cc, sigma, w) for the driver loops
}
###################
###The following code uses the L2E method in curve fitting
source("A:/smiles/reg.w.txt")
#par(mfrow=c(3,2))
#do.list <- c(1,2,29)
do.list <- 1:1
for(ii in 1:length(do.list)){
k <- do.list[ii]
print(k)
#browser()
temp <- get(paste("D,k,sep=""))
data <- temp
#cut1 <- 1
cut1 <- 2
#cut2 <- .1 # mse -- what was threshold
cut2 <- .25 # mse -- what was threshold
min.temp <- 10
data[,2:3]<-scale(data[,2:3])
nfeatpts <- 0
featpts <- vector("list")
featfits <- vector("list")
continue <- T
temp.use <- 1:nrow(data)
i <- 1
tempw <- 0
while(continue) {
cat("k=",k,"\n")
temp <- reg.w(x=data[temp.use,3],y=data[temp.use,2])
# tempw <- c(tempw,temp[5])
temp.fit <- temp[1] + temp[2]*data[temp.use,3] +
temp[3]*data[temp.use,3]^2
# evaluates fit at all remaining data
temp.resids <- (data[temp.use,2] - temp.fit)/temp[4]
tempw <- c(tempw,temp[5],sum(abs(temp.resids) <
cut1)/length(temp.resids),length(temp.resids))
# sigma in temp[4] and w in temp[5] # SAVE this w
if(sum(abs(temp.resids) < cut1) < 5 || length(temp.use) < 5) {
continue <- F
}
tempa <- qnorm(p=seq(0.001,0.999,length=length(temp.resids)))
tempb <- cor(tempa,sort(temp.resids))
#if(tempb < .9) { continue <- F}
# accept no more than w percent:
browser()
temp.cutw <- (sort(abs(temp.resids)))[temp[5]*length(temp.use)] # smallest w resids
print("temp.cutw1")
print(temp.cutw)
temp.cutw <- min(temp.cutw,cut1)
print("temp.cutw2")
print(temp.cutw)
temp.feature <- (1:nrow(data[temp.use,]))[abs(temp.resids) < temp.cutw]
print(temp.feature)
if(length(temp.feature) < 5) {continue <- F}
temp.feature.originaldata <- (1:nrow(data))[temp.use][abs(temp.resids) <
temp.cutw]
print(temp.feature.originaldata)
#remaining data in current feature
temp.index <- temp.use[-temp.feature]
nfeatpts <- c(nfeatpts,length(temp.feature))
featpts[[i]] <- temp.feature.originaldata
featfits[[i]] <- temp
i <- i + 1
cat("i=",i,"\n")
if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 7) {continue <- F}
#if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 1) {continue <- F}
temp.use <- (1:nrow(data[temp.use,]))[-temp.feature] # note this step
if(length(temp.use) < 5) {continue <- F}
print(temp.use)
}
p <- length(featpts)
guess <- numeric(nrow(data))
if(p > 0) {
for(i in 1:p) {
guess[featpts[[i]]] <- i
}
}
guess[guess==0] <- p+1
print(k)
assign(paste("D",k,12e.result2a,sep=""),guess) # max of 1
assign(paste("D",k,".12e.result2a.w",sep=""), tempw[-l]) # max of 1
}
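### Hypothetical summary (not in the original code) of the labels produced
### above: table() counts the points assigned to each fitted cluster, with
### label p+1 collecting the points left in no cluster.
print(table(D1.l2e.result2a))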
for(ii in 1:length(do.list)){
k <- do.list[ii]
# for(k in 1:30) {
print(k)
temp <- get(paste("D",k,sep="))
data <- temp
cut1 <- 2.2
cut2 <- .20 # mse -- what was threshold
min.temp <- 10
data[,2:3] <- scale(data[,2:3])
nfeatpts <- 0
featpts <- vector("list")
featfits <- vector("list")
continue <- T
temp.use <- 1:nrow(data)
i <- 1
tempw <- 0
while(continue) {
print(k)
temp <- reg.w(x=data[temp.use,3],y=data[temp.use,2])
# tempw <- c(tempw,temp[5])
temp.fit <- temp[1] + temp[2]*data[temp.use,3] +
temp[3]*data[temp.use,3]^2
# evaluates fit at all remaining data
temp.resids <- (data[temp.use,2] - temp.fit)/temp[4]
tempw <- c(tempw,temp[5],sum(abs(temp.resids) <
cut1)/length(temp.resids),length(temp.resids))
if(sum(abs(temp.resids) < cut1) < 5 || length(temp.use) < 5) {
continue <- F
}
tempa <- qnorm(p=seq(0.001,0.999,length=length(abs(temp.resids))))
tempb <- cor(tempa,sort(temp.resids))
#if(tempb < .9) { continue <- F}
temp.cutw <- (sort(abs(temp.resids)))[temp[5]*length(temp.use)] # smallest w resids
temp.cutw <- min(temp.cutw,cut1)
temp.feature <- (1:nrow(data[temp.use,]))[abs(temp.resids) < temp.cutw]
if(length(temp.feature) < 5) {continue <- F}
temp.feature.originaldata <- (1:nrow(data))[temp.use][abs(temp.resids) <
cut1]
#remaining data in current feature
temp.index <- temp.use[-temp.feature]
nfeatpts <- c(nfeatpts,length(temp.feature))
featpts[[i]] <- temp.feature.originaldata
featfits[[i]] <- temp
i <- i + 1
if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 6) {continue <- F}
#if(length(temp.feature) < min.temp || temp[4] > cut2 || i > 1) {continue <- F}
temp.use <- (1:nrow(data[temp.use,]))[-temp.feature] # note this step
if(length(temp.use) < 5) {continue <- F}
}
p <- length(featpts)
guess <- numeric(nrow(data))
if(p > 0) {
for(i in 1:p) {
guess[featpts[[i]]] <- i
}
}
guess[guess==0] <- p+1
print(k)
assign(paste("D",k,".12e.result2b,sep=""),guess) # max of 1
assign(paste("D",k,".12e.result2b.w",sep="),tempw[-l])
DD1 <- cbind(D1[,1],scale(D1[,2:3]))
a1x <- DD1[D1.l2e.result2b>1,3]
a1y <- DD1[D1.l2e.result2b>1,2]
a2x <- DD1[D1.l2e.result2b>2,3]
a2y <- DD1[D1.l2e.result2b>2,2]
k1 <- featfits[[1]]
k2 <- featfits[[2]]
xx <- seq(min(DD1[,3]),max(DD1[,3]),length=100)
y1 <- k1[1]+k1[2]*xx+k1[3]*xx^2
y2 <- k2[1]+k2[2]*xx+k2[3]*xx^2
x1 <- seq(min(DD1[D1.l2e.result2b==1,3]),max(DD1[D1.l2e.result2b==1,3]),length=100)
yy1 <- k1[1]+k1[2]*x1+k1[3]*x1^2
x2 <- seq(min(DD1[D1.l2e.result2b==2,3]),max(DD1[D1.l2e.result2b==2,3]),length=100)
yy2 <- k2[1]+k2[2]*x2+k2[3]*x2^2
plot(DD1[D1.l2e.result2b==1,3],DD1[D1.l2e.result2b==1,2],xlab="Time(scaled)",
ylab="TEC(scaled)",xlim=c(min(DD1[,3]),max(DD1[,3])),
ylim=c(min(DD1[,2]),max(DD1[,2])),pch="1")
points(DD1[D1.l2e.result2b==2,3],DD1[D1.l2e.result2b==2,2],pch="2")
points(DD1[D1.l2e.result2b==3,3],DD1[D1.l2e.result2b==3,2],pch="3")
lines(x1,yy1)
lines(x2,yy2)
}
# then, do comparison and then do final diagnostic check and final result in
# Dx.l2e.result2c
# as fair comparison to our brute force method.
REFERENCES
Burr, T., Mielke, A., and Jacobson, A. (2001). Identifying Storms in Noisy Radio Frequency Data
via Satellite: an Application of Density Estimation and Cluster Analysis. Proceedings of the
US Army Conference on Applied Statistics, Santa Fe, NM. LAUR-01-4874.
Dennis, J. E. and Schnabel, R. B. (1983). Numerical Methods for Unconstrained Optimization and
Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall.
Jacobson, A., Knox, S., Franz, R., and Enemark, D. (1999). FORTE Observations of Lightning
Radio-Frequency Signatures: Capabilities and Basic Results. Radio Sci 34(2): 337-354. See
also http://nis-www.lanl.gov/nis-projects/forte_science/html.
R Development Core Team. (2003). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-00-3. See also
http://www.R-project.org.
Schnabel, R. B., Koontz, J. E. and Weiss, B. E. (1985). A modular system of algorithms for
unconstrained minimization. ACM Trans. Math. Software 11: 419-440.
Scott, D. (2000). Parametric Statistical Modeling by Minimum Integrated Square Error.
Technometrics 43(3): 274-285.
Scott, D., and Szewczyk, W. (2000). From Kernels to Mixtures. Technometrics 43(3): 323-335.
Splus5 for Linux. (1999). Seattle, Washington: Insightful Corporation.