Citation
Comparing the generalized likelihood ratio and variance tests for binomial homogeneity with observation of Bayesian adjustment

Material Information

Title:
Comparing the generalized likelihood ratio and variance tests for binomial homogeneity with observation of Bayesian adjustment
Creator:
Ruelle, Alissa
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
1996
Language:
English
Physical Description:
xiii, 62 leaves : illustrations ; 29 cm

Subjects

Subjects / Keywords:
Bayesian statistical decision theory ( lcsh )
Mathematical statistics ( lcsh )
Probabilities ( lcsh )
Bayesian statistical decision theory ( fast )
Mathematical statistics ( fast )
Probabilities ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaf 62).
Thesis:
Submitted in partial fulfillment of the requirements for the degree, Master of Science, Applied Mathematics
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Alissa Ruelle.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
37306904 ( OCLC )
ocm37306904
Classification:
LD1190.L622 1996m .R84 ( lcc )

Full Text
COMPARING THE GENERALIZED LIKELIHOOD RATIO
AND VARIANCE TESTS FOR BINOMIAL HOMOGENEITY
WITH OBSERVATION OF BAYESIAN ADJUSTMENT
by
Alissa Ruelle
B.S., University of Michigan, 1993
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Applied Mathematics
1996


© 1996 by Alissa Ruelle
All rights reserved.


This thesis for the Master of Science
degree by
Alissa Ruelle
has been approved
by
Date


Ruelle, Alissa (M.S., Applied Mathematics)
Comparing the Generalized Likelihood Ratio and Variance Tests for Binomial
Homogeneity with Observation of Bayesian Adjustment
Thesis directed by Assistant Professor James Koehler
ABSTRACT
$$H_0\colon p_1 = p_2 = \cdots = p_c = p$$
$$H_A\colon p_i \neq p_j \text{ for some } i \neq j$$
Two testing procedures for homogeneity of c proportions are commonly used: the generalized likelihood ratio and the variance tests for binomial homogeneity. Both testing procedures produce statistics that asymptotically follow a $\chi^2$ distribution with c-1 degrees of freedom. The asymptotic case, i.e., having a large number of Bernoulli trials (n) behind each proportion, poses few problems, if any, for either testing procedure. With small n, though, problems exist with respect to assuming that the test statistics follow the appropriate $\chi^2$ distribution.
Using S-Plus, proportions were generated and tested, using both testing procedures, to estimate the test statistics' true distributions. The power of each test was established with these generated distributions and their respective critical regions. Comparison of the power functions followed.
IV


In addition to having problems with small n, the variance test cannot be used
when one or more proportions equal 1 or 0. As a plausible remedy, an adjustment to
the variance test was proposed, incorporating Bayesian mathematics. This proposed
testing procedure was compared back to the generalized likelihood ratio and variance
tests.
This abstract accurately represents the content of the candidate's thesis. I recommend
its publication.
Signed
v


In gratitude to Danielle, whose spirit offered me the energy,
to Karen, whose constant inquiry gave the energy direction,
and to Mom, who provided everything else.
vi


CONTENTS
Chapter
1. Introduction...................................................................1
1.1 Introduction: The Generalized Likelihood Ratio and Traditional Tests.......1
1.2 A Bayesian Approach........................................................4
2. The Generalized Likelihood Ratio Test..........................................6
2.1 The Test Statistic and Parameter Spaces of Hypotheses......................6
2.2 Type 1 Error...............................................................8
3. The Traditional Test..........................................................20
3.1 The Test Statistic........................................................20
3.2 Assessing the Probability of Failure of the Traditional Test Statistic....21
3.3 Type 1 Error..............................................................24
4. Power Comparison..............................................................32
4.1 Power Comparison: Sufficiency.............................................32
4.2 Power Comparison: Type 1 Error............................................34
4.3 The Final Power Comparison: Results.......................................34
5. Bayesian-Adjustment Approach..................................................42
5.1 Bayesian Theory...........................................................42
5.2 Results...................................................................45
5.3 Power Comparison with GLRT and Traditional................................52
6. Conclusion....................................................................54
6.1 Remarks...................................................................54
vii


Appendix
A. Splus code used to assess probability of useful set of ten proportions, given n and p....55
B. Splus code used to establish true critical regions of test statistics, given n, c, p.56
C. Splus code used to establish power of two tests, given true critical regions...........57
D. Splus code used to establish distribution of Bayesian-adjusted test statistics.........59
E. Critical values for generalized likelihood ratio test for homogeneity of proportions....60
Bibliography..............................................................................62
viii


DIAGRAMS
Diagram
2.2.1 Distribution of GLRT statistics (n=30,c=25,p=0.2)............................9
2.2.2 Distribution of GLRT statistics (n=30,c=25,p=0.5)............................9
2.2.3 Distribution of GLRT statistics (n=30,c=25,p=0.8)...........................10
2.2.4 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.2)..............................10
2.2.5 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.5)..............................11
2.2.6 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.8)..............................11
2.2.7 Distribution of GLRT statistics (n=10,c=25,p=0.8)...........................12
2.2.8 Distribution of GLRT statistics (n=10,c=10,p=0.8)...........................12
2.2.9 Cumulative distribution of GLRT statistic (n=30,c=25,p=0.8).................14
2.2.10 GLRT critical values c=10..................................................16
2.2.11 GLRT critical values c=15...................................................16
2.2.12 Modeling GLRT critical values: residuals vs. fitted.........................17
2.2.13 Modeling GLRT critical values: Q-Q norm, plot...............................17
2.2.14 Modeling GLRT critical values: fitted vs. observed..........................18
3.2.1 Probability of ability to use traditional test c=10........................23
3.2.2 Contour plot of surface.....................................................23
3.3.1 Distribution of trad. Test statistics (n=30,c=25,p=0.2).....................25
3.3.2 Distribution of trad. Test statistics (n=30,c=25,p=0.5).....................25
3.3.3 Distribution of trad. Test statistics (n=30,c=25,p=0.8).....................26
3.3.4 Traditional critical values c=10............................................29
IX


Diagram (continued)
3.3.5 Traditional critical values c=15..............................................29
4.3.1 Power comparison (n=10,c=5,p=0.2).............................................35
4.3.2 Power comparison (n=10,c=5,p=0.4).............................................36
4.3.3 Power comparison (n=10,c=5,p=0.5)........................................... 36
4.3.4 Power comparison (n=10,c=5,p=0.6).............................................37
4.3.5 Power comparison (n=30,c=5,p=0.2).............................................38
4.3.6 Power comparison (n=30,c=5,p=0.4).............................................38
4.3.7 Power comparison (n=30,c=5,p=0.5).............................................39
4.3.8 Power comparison (n=30,c=5,p=0.6).............................................39
4.3.9 Power comparison (n=30,c=25,p=0.2)............................................40
4.3.10 Power comparison (n=30,c=25,p=0.4)............................................40
4.3.11 Power comparison (n=30,c=25,p=0.5)............................................41
4.3.12 Power comparison (n=30,c=25,p=0.6)............................................41
5.2.1 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.2)...........46
5.2.2 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.5)...........47
5.2.3 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.8)...........47
5.2.4 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.1)............48
5.2.5 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.2)............49
5.2.6 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.3)............49
5.2.7 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.5)............49
5.2.8 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.7)............50
5.2.9 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.8)............50
x


Diagram (continued)
5.2.10 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.9)............50
5.2.11 Distribution of traditional test statistics (n=10,c=10,p=0.2).................51
5.2.12 Distribution of traditional test statistics (n=10,c=10,p=0.8).................51
5.3.1 Power comparison (n=10,c=5,p=0.2).............................................53
XI


TABLES
Table
1.1 ASPN: Referral care rates.........................................................2
3.3 Critical region for traditional chi-squared test for homogeneity of proportions 27
xii


ACKNOWLEDGMENTS
Dr. Jim Koehler for being my dedicated, sincerely interested, and caring educator, mentor,
and friend.
Dr. Karen Kafadar and Dr. Ram Shanmugam for participating on my graduate committee
and for their timely review of my thesis, followed by interesting and educating inquiry.
Biostatistics Division of AMC Cancer Research Center for providing employment and
invaluable experience throughout my graduate studies.
Ambulatory Sentinel Practice Network, specifically Jim Werner, for supplying data to
analyze.
Fellow students/friends at University of Colorado at Denver for offering relentless support
during my research and at all other times.
xiii


1.
Introduction
1.1 Introduction: The Generalized Likelihood Ratio and Traditional Tests
Two testing procedures for homogeneity of proportions are used commonly: the generalized
likelihood ratio test (GLRT) and the traditional testing procedure (Fleiss, 1973; Hedges and Olkin,
1985). This traditional test is of similar format to a test termed the variance test of binomial
homogeneity (Snedecor and Cochran, 1967). The traditional testing procedure appeals to the asymptotic nature of a sum of squared standard normal variates, which is Chi-squared with degrees of freedom equal to the number of proportions minus one; the GLRT utilizes the ratio of the joint likelihood functions for the sample of proportions under the null and alternative hypotheses' parameter spaces. This is also, asymptotically, a Chi-squared variate with degrees of freedom again equal to the number of proportions minus one. The performance of both tests is dependent upon 1) n, the size of the classes from which the proportions come, 2) p, the expected value of the hypothesized distribution of the proportions, and 3) c, the number of proportions at hand.
An example of the necessity for a homogeneity test for proportions was found in the Ambulatory Sentinel Practice Network (ASPN) database. ASPN's question concerned referral rates for follow-up care, the percentage of times that patients were referred to a specialist after being treated at an emergency-care level. To test the homogeneity of referral care rates, classes must be established. As case-mixing most likely existed within each individual practice (as a hypothetical example, within Practice H, 74% of the emergency care patients were treated for respiratory complications; at Practice K, only 13%), it was necessary to designate one type of patient, ailing from a particular illness. After the type of patient was
1


selected, the class (practice) was used; thus a referral rate was calculated for each medical practice. A listing of referral rates for ailments identified as respiratory complications follows in Table 1.1. The data of Table 1.1 will be referenced throughout this document.
Table 1.1 ASPN: Referral Care Rates
P = Practice code   N = Number of Patients   RR = Referral Rate

 P   N    RR   |  P   N    RR   |  P   N    RR
 1   10  1.000 | 24    8  0.875 | 47   20  0.900
 2   16  0.875 | 25    6  0.833 | 48    9  1.000
 3    8  0.250 | 26   11  1.000 | 49   55  0.818
 4   20  0.900 | 27   23  0.913 | 50    4  1.000
 5   11  0.818 | 28   16  0.563 | 51   29  0.897
 6    1  1.000 | 29   23  0.957 | 52   11  0.909
 7    6  0.833 | 30   16  1.000 | 53    6  0.500
 8   10  0.800 | 31   15  0.667 | 54    6  0.167
 9    6  0.667 | 32    7  0.286 | 55    4  0.750
10   15  1.000 | 33   15  0.933 | 56   19  0.842
11    6  1.000 | 34   23  0.957 | 57    4  1.000
12    5  1.000 | 35   36  0.833 | 58   18  1.000
13   36  0.889 | 36    3  1.000 | 59    6  0.667
14    2  1.000 | 37   11  0.909 | 60   33  0.848
15   10  0.900 | 38   31  0.710 | 61   13  0.769
16   22  0.864 | 39    2  0.000 | 62    6  0.833
17    8  0.500 | 40   11  0.636 | 63    5  0.600
18   40  0.825 | 41   25  0.800 | 64    8  0.875
19   21  0.952 | 42    2  1.000 | 65   17  0.764
20   12  0.500 | 43   17  0.941 | 66    5  1.000
21   13  0.769 | 44    7  0.857 | 67    5  0.800
22   40  0.425 | 45   20  0.950 | 68    4  1.000
23    9  0.667 | 46    9  0.556 |
2


Using the traditional test for homogeneity of proportions, we calculate a proportion for each class, followed by an additional formulation of a representative proportion for the c classes, which is a weighted average of the individual proportions, using weights inversely proportional to the variance of each proportion. The characteristics of the binomial distribution lead us to the weighting schema used in the traditional test statistic, and it will be shown in Chapter 3 that when the observed proportions are near 0 or 1, the weights tend towards infinity. In fact, the traditional test statistic falters when we utilize some specific combinations of n, p, and c. Of interest is how the traditional test statistic behaves within a region of usable n, p, and c.
Since each patient in each class can be viewed as a Bernoulli trial, each with hypothesized parameter $p_i$, each proportion is a function of a binomial variate with parameters $(n_i, p_i)$. With this traditional testing mechanism, we test the hypotheses, assuming c classes:
$$H_0\colon p_1 = p_2 = \cdots = p_c = p \qquad \text{vs.} \qquad H_A\colon p_i \neq p_j \text{ for some } i \neq j$$
This same hypothesis is also tested through the generalized likelihood ratio test, which utilizes the joint distribution of the observed proportions, the $\hat{p}_i$'s, under the parameter spaces of the null and alternative hypotheses. Unbiased and attaining minimum variance, the best estimator for an individual hypothesized $p_i$ is the observed proportion of Bernoulli successes, which is the maximum likelihood estimator:
$$\hat{p}_i = \frac{\text{Number of successful Bernoulli trials of Class } i}{\text{Number of Bernoulli trials of Class } i} = \frac{\text{Number of Patients Referred to a Specialist within Practice } i}{\text{Number of Patients Treated by Practice } i}$$
3


It will be shown that the traditional and generalized likelihood ratio tests are not of the same size when using critical regions of the hypothesized Chi-squared distribution. To accommodate the inequality of size using those hypothesized regions, the estimated true critical regions of the traditional test statistic, for particular combinations of the number of Bernoulli trials per class (n), the hypothesized expected value of the set of $p_i$'s (p), and the number of classes (c), will be provided. These estimated true critical regions are obtained from the empirical distribution via simulation and are used throughout any hypothesis testing within the thesis research.
In further analysis of these two testing procedures, it is shown in Chapter 4 that both test statistics depend on sufficient statistics. Using estimated critical values, the Type 1 errors were forced to be equal for the purpose of power comparison. For the power comparisons, inhomogeneity was forced into simulated proportions, with the proportions as realizations from differing distributions, and the power of each test statistic was established. Power is graphed as a function of the deviation from the null hypothesis, per testing procedure, granting pictorial comparison.
1.2 A Bayesian Approach
In Chapter 5, we entertain a totally different avenue in examining smaller and larger proportions, as with proportions equal to 0 or 1 we are unable to test homogeneity adequately by the traditional chi-squared test. We take a look at our ability to adjust these problematic proportions by an amount determined largely by the size of the sample the proportion describes. This adjustment alleviates the problem of failure of the traditional test, as we then have no weights approaching infinity. Bayesian theory is used, recognizing that the hypothesized p could be considered a variate from its own distribution, called a prior distribution.
4


An estimate of the critical region for this Bayesian-adjusted test statistic is formulated, using procedures similar to those used in Section 1.1. Estimation of power, again graphed as a function of the deviation from the null hypothesis, followed, and was compared back to that of the traditional test and GLRT.
5


2.
The Generalized Likelihood Ratio Test
2.1 The Test Statistic and Parameter Spaces of Hypotheses
In testing for homogeneity of proportions, we test the null hypothesis $H_0\colon p_1 = p_2 = \cdots = p_c = p$ against the alternative $H_A\colon p_i \neq p_j$ for some $i \neq j$. The distribution of an observed $\hat{p}_i$ is derived from the distribution of the number of successful Bernoulli trials within the $i$th class, termed $X_i$. The distribution of $X_i$ is binomial with parameters $n_i$ and $p_i$. We write the cumulative distribution function of $X_i$ as $P(X_i \le x) = F_{X_i}(x)$, so, using a change of variables, we see that the cumulative distribution function of $\hat{p}_i$ is:
$$F_{\hat{p}}(\hat{p}_i) = P(\hat{p} \le \hat{p}_i) = P\!\left(\frac{X_i}{n_i} \le \hat{p}_i\right) = P(X_i \le n_i \hat{p}_i) = F_{X_i}(n_i \hat{p}_i)$$
By the chain rule, we can write
$$f_{\hat{p}}(\hat{p}_i) = \frac{\partial F_{X_i}}{\partial \hat{p}_i}(n_i \hat{p}_i) = n_i\, f_{X_i}(n_i \hat{p}_i) \qquad \text{Eqn. 2.1.1}$$
and the likelihood function of the individual hypothesized proportion is then
$$L(p_i) = n_i \binom{n_i}{n_i \hat{p}_i}\, p_i^{\,n_i \hat{p}_i} (1-p_i)^{\,n_i(1-\hat{p}_i)} \qquad \text{Eqns. 2.1.2, 2.1.3}$$
The generalized likelihood ratio test utilizes the distributions of the individual $\hat{p}_i$'s to form likelihood functions over the two parameter spaces
6


$$\omega = \{p_i : 0 \le p_1 = p_2 = \cdots = p_c \le 1\} \qquad\qquad \Omega = \{p_i : 0 \le p_i \le 1,\; i = 1, \ldots, c\}$$
hypothesized and with no constraints, respectively. The generalized likelihood ratio test statistic (Larsen and Marx, 1986), using the joint likelihood function of the individual hypothesized $p_i$'s, is thus $-2$ multiplied by the natural logarithm of
$$\Lambda = \frac{\sup_{\omega} L(p)}{\sup_{\Omega} L(p)} = \frac{\prod_{i=1}^{c} n_i \binom{n_i}{n_i \hat{p}_i}\, \hat{p}_{MLE}^{\;n_i \hat{p}_i} (1-\hat{p}_{MLE})^{\,n_i(1-\hat{p}_i)}}{\prod_{i=1}^{c} n_i \binom{n_i}{n_i \hat{p}_i}\, \hat{p}_i^{\;n_i \hat{p}_i} (1-\hat{p}_i)^{\,n_i(1-\hat{p}_i)}} \qquad \text{Eqn. 2.1.4}$$
where $\hat{p}_{MLE}$ is the maximum likelihood estimator for p, which can be calculated easily by maximizing the natural log of the joint distribution of the $\hat{p}_i$'s. The estimator $\hat{p}_{MLE}$ is exactly the number of Bernoulli successes divided by the number of Bernoulli trials, over all c classes:
$$\hat{p}_{MLE} = \frac{\sum_{i=1}^{c} n_i \hat{p}_i}{\sum_{i=1}^{c} n_i} \qquad \text{Eqn. 2.1.5}$$
This statistic is a function of the individual $\hat{p}_i$'s, weighted by their corresponding sample sizes, and is simply the proportion of successful Bernoulli trials over all c classes. With it, we formulate a representative statistic that attains small variance, and it can be checked that it attains the Cramer-Rao Lower Bound (CRLB) for variance ($n_i = n$, $i = 1, 2, \ldots, c$):
$$(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_c) \sim f_{\hat{p}}(\hat{p}) = n f_X(n\hat{p})$$
7


where X is a binomial variate representing the number of Bernoulli successes within a class. Thus $f_{\hat{p}}(\hat{p})$ can be expressed as
$$f_{\hat{p}}(\hat{p}) = n \binom{n}{n\hat{p}}\, p^{\,n\hat{p}} (1-p)^{\,n(1-\hat{p})}$$
so that
$$\frac{\partial}{\partial p}\bigl[\ln f_{\hat{p}}(\hat{p}; p)\bigr] = \frac{(\hat{p}-p)}{p(1-p)}\, n.$$
The CRLB can thus be written as
$$\frac{1}{E_p\!\left[\left(\dfrac{(\hat{p}-p)\, n}{p(1-p)}\right)^{2}\right]} = \frac{1}{\dfrac{n^2}{p^2(1-p)^2}\, E_p\!\left[(\hat{p}-p)^2\right]} = \frac{p(1-p)}{n},$$
which is exactly the estimated variance of the $\hat{p}_i$'s under the null hypothesis.
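As a concrete illustration of Eqns. 2.1.4 and 2.1.5 (a sketch with a hypothetical function name, not the thesis's appendix code), the GLRT statistic can be computed from the class counts as follows:

    # Sketch: GLRT statistic for homogeneity of c binomial proportions.
    # x[i] = number of successes in class i, n[i] = number of Bernoulli trials in class i.
    glrt.stat <- function(x, n)
    {
        phat  <- x / n                      # individual MLEs, p-hat_i
        p.mle <- sum(x) / sum(n)            # pooled MLE under H0, Eqn. 2.1.5
        # log-likelihoods under the restricted (H0) and unrestricted parameter spaces;
        # classes with phat equal to 0 or 1 contribute 0 to the corresponding terms.
        ll0 <- sum(x * log(p.mle) + (n - x) * log(1 - p.mle))
        ll1 <- sum(ifelse(x > 0, x * log(phat), 0) +
                   ifelse(x < n, (n - x) * log(1 - phat), 0))
        -2 * (ll0 - ll1)                    # -2 log(Lambda), compared with chi-squared(c-1)
    }

A statistic computed this way is then compared with an estimated true critical value (Section 2.2) rather than with the asymptotic chi-squared quantile when n is small.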
2.2 Type 1 Error
The code of Appendix B generated 4,000 GLRT statistics. This simulation was performed for 175 combinations of p, n, and c. Using the simulations for the parameter combinations (n=30, c=25, p=[0.2, 0.5, 0.8]), we witness that the GLRT statistics do follow the assumed $\chi^2$ distribution with 24 degrees of freedom when p = 0.2, 0.5, and 0.8. Pictorially, the distribution of the GLRT statistics, for each parameter combination, is demonstrated by the following histograms, representing the probability density function. [Diagrams 2.2.1-2.2.3] Superimposed on the histograms is the Chi-squared probability density function, with 24 degrees of freedom, the asymptotic distribution of the test statistic.
8


Diagram 2.2.1: Distribution of GLRT Statistics (n=30,c=25,p=0.2)
Diagram 2.2.2: Distribution of GLRT Statistics (n=30,c=25,p=0.5)
9


Diagram 2.2.3: Distribution of GLRT Statistics (n=30,c=25,p=0.8)
When comparing the quantiles of the above distributions to those of the $\chi^2_{24}$, we see that the GLRT produces test statistics that deviate slightly from the $\chi^2_{24}$, but it seems that the $\chi^2_{24}$ fits the GLRT statistics' distribution quite well, in accordance with the empirical p.d.f. results. See Diagrams 2.2.4-2.2.6.
Diagram 2.2.4: Q-Q Plot for GLRT Statistics (n=30,c=25,p=0.2)
10


Diagram 2.2.5: Q-Q Plot for GLRT Statistics (n=30,c=25,p=0.5)
Diagram 2.2.6: Q-Q Plot for GLRT Statistics (n=30,c=25,p=0.8)
Also simulated was the distribution for the parameter combination (n=10, c=25, p=0.8). Varying the size of the class from 30 to 10 yielded an empirical p.d.f. to the right of the $\chi^2_{24}$. Thus, as we move into
11


smaller-sized classes (smaller n), the distribution appears to deviate from the assumed distribution. See Diagrams 2.2.7 and 2.2.8. This directly reflects that it is only asymptotically, i.e., as n grows large, that the GLRT produces statistics that are chi-squared.
Diagram 2.2.7: Distribution of GLRT Statistics (n=10,c=25,p=0.8)
Diagram 2.2.8: Distribution of GLRT Statistics (n=10,c=10,p=0.8)
12


The simulated distributions from the parameter combinations (n=10, c=10, p=[0.2, 0.5, 0.8]) provided evidence that the closer p is to 0.5, the less the deviation from the $\chi^2_9$. The small deviation from the assumed distribution when p=0.8 is thus similar to the observed deviation of the parameter combination (n=10, c=25, p=0.8). Decreasing the number of classes, c, and holding n and p constant, had little effect on the distribution of the test statistics.
From these small pieces of pictorial information, we could draw a loose conclusion that the GLRT produces test statistics that are drawn from a distribution similar to $\chi^2_{c-1}$, if the parameter combinations involve large n. Altering the number of classes (the number of proportions to be tested) created little to no difference in the test statistics' deviation in distribution from the asymptotic $\chi^2_{c-1}$. But most important to note is that the GLRT performs quite well, with respect to maintaining a $\chi^2$-like distribution, except for small n.
To move beyond the pictorial representation of the distributions' difference, using as an example the distribution simulated with (n=30, c=25, p=0.8), the distribution-free test statistic of the Kolmogorov-Smirnov Goodness-of-Fit test was used to test whether the GLRT statistics follow a $\chi^2_{24}$ distribution.
Theorem 1 (Mood et al., p. 508). Let $X_1, X_2, \ldots, X_n, \ldots$ be independent, identically distributed random variables having common continuous c.d.f. $F_X(\cdot) = F(\cdot)$. Define
$$D_n = d_n(X_1, \ldots, X_n) = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|,$$
where $F_n(x)$ is the sample c.d.f. Then
$$\lim_{n \to \infty} P\bigl[\sqrt{n}\, D_n \le x\bigr] = \Bigl[1 - 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 x^2}\Bigr] I_{(0,\infty)}(x) = H(x).$$
13


As discussed by Mood et al., the above equation does not depend on the c.d.f. from which the sample was drawn (other than it being continuous); that is, the limiting distribution of $\sqrt{n}\, D_n$ is distribution-free.
As an illustration of the test statistic used in the Kolmogorov-Smirnov Goodness-of-Fit test (KSGFT), the empirical cumulative distribution function (c.d.f.) of the simulated GLRT statistics and the c.d.f. of the $\chi^2_{24}$ are provided in Diagram 2.2.9. The maximal vertical distance between the two aforementioned c.d.f.'s, multiplied by $\sqrt{n}$, is the KSGFT statistic. This KSGFT statistic is compared with the Critical Values for the Kolmogorov-Smirnov Test of Goodness of Fit table of the Handbook of Tables for Probability and Statistics, 1966.
Diagram 2.2.9: Cumulative Distribution of GLRT Statistic (n=30,c=25,p=0.8)
Calculation of the above K-S test statistic involved sorting the 4,000 GLRT statistics to determine the empirical cumulative probability distribution function. For example, the 132nd order statistic would yield, for the 4,000 simulated GLRT statistics, a cumulative probability of 132/4,000 = 0.033. Using the 132nd variate of the sorted GLRT statistics, let us call it $\mathrm{GLRT}_{(132)}$, $F_X(\mathrm{GLRT}_{(132)})$ is calculated by referring to the cumulative distribution function of the hypothesized distribution. The absolute difference, e.g. $|F_X(\mathrm{GLRT}_{(132)}) - 0.033|$, is taken over all variates; the supremum of these differences, multiplied by the square root of the number of variates used, gives the statistic. The S-Plus code that generated these cumulative probabilities can be found in Appendix C. The KSGFT value was 2.44 for the GLRT. This KSGFT value led to rejection of the null hypothesis at the 0.05 significance level, but the magnitude of the test statistic was most likely determined by the size of the sample, 4,000.
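A minimal sketch of this calculation (the object names are hypothetical and not from the thesis's Appendix C code):

    # Sketch: Kolmogorov-Smirnov statistic comparing simulated GLRT statistics
    # with the hypothesized chi-squared(24) distribution.
    glrt.sorted <- sort(glrt.stats)              # glrt.stats: vector of 4,000 simulated statistics
    m           <- length(glrt.sorted)
    emp.cdf     <- (1:m) / m                     # empirical cumulative probabilities
    hyp.cdf     <- pchisq(glrt.sorted, df = 24)  # hypothesized c.d.f. at each order statistic
    ks.stat     <- sqrt(m) * max(abs(hyp.cdf - emp.cdf))
    ks.stat                                      # compared with tabulated K-S critical values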
The estimate of the true distribution yielded a critical region, a region that was used to conclude rejection or failure to reject the null hypothesis. By establishing this estimate of the critical region, it can safely be stated that the test is of size α=0.05. This is of great importance when comparing the power of two testing procedures, as will be discussed in Chapter 4. The estimated true critical value is the 95th quantile of the test statistics, per combination of n, p, and c. (See Appendix B for the S-Plus code used.) Thus, the probability of rejecting the hypothesis of homogeneity, given that there is actually homogeneity, is 0.05.
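In the spirit of the Appendix B simulations (a sketch under assumed settings, not the thesis code itself), the estimated true critical value for a given (n, p, c) is the 95th quantile of statistics simulated under the null hypothesis:

    # Sketch: estimate the true 5% critical value of the GLRT under H0
    # for class size n, common proportion p, and c classes.
    glrt.crit <- function(n, p, c, nsim = 4000)
    {
        stats <- numeric(nsim)
        for (s in 1:nsim) {
            x <- rbinom(c, n, p)                  # c homogeneous binomial classes
            stats[s] <- glrt.stat(x, rep(n, c))   # GLRT statistic from the earlier sketch
        }
        quantile(stats, 0.95)                     # estimated true critical value
    }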
Diagrams 2.2.10 and 2.2.11 display the critical values of the estimated true distribution, graphed as a function of n and p, holding c constant. From diagrams such as these, an indication of some quadratic nature to the response surface was observed. To try to explain the variability of the critical values with the covariates n, c, and p, a polynomial regression of degree two was used as a first step, using orthogonal polynomials to minimize multicollinearity. Any error in the model is a combination of lack-of-fit and simulation noise and is thus, conservatively, an estimate of the true variability of the critical values. A quadratic model for the response surface of the GLRT's critical values proved to be quite sufficient.
15


Diagram 2.2.10: GLRT Critical Values c=10
Diagram 2.2.11: GLRT Critical Values c=15
16


Diagram 2.2.12: Modeling GLRT Critical Values: Residuals vs. Fitted
Diagram 2.2.13: Modeling GLRT Critical Values: Q-Q Norm. Plot
17


Diagram 2.2.14: Modeling GLRT Critical Values: Fitted vs. Observed
Yielding no strong evidence of non-constant variance [Diagram 2.2.12] and having a distribution only slightly skewed right of N(0,1) [Diagram 2.2.13], the errors provided more evidence that the model was justified. The MSE of the polynomial regression was 0.5378. The precision of the prediction equation for the critical values can be witnessed through the graph of the fitted values versus the observed values. [Diagram 2.2.14]
Since the model of the GLRT critical values fits so well, the MSE can, conservatively, be seen as simulation noise. A margin of error can be established by moving two standard deviations away from the observed critical value; doing so would yield an approximate probability of 95% of encompassing the true critical value. It is suggested at this point to use the upper bound of this estimating window to remain conservative in a hypothesis test. In conclusion, each simulated critical value should estimate the true critical value to within approximately ±1.5.
18


The polynomial regression used for this model is:
$$Y = 8.094 - 17.972p + 17.829p^2 - 0.232n + 0.006n^2 + 1.739c - 0.008c^2 - 0.006\,nc + 0.004\,pn + 0.03\,pc + \epsilon$$
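A sketch of how such a response surface might be fit (the data frame and variable names are hypothetical; the thesis used orthogonal polynomials in S-Plus):

    # Sketch: quadratic response-surface model for simulated critical values.
    # cv.data is assumed to hold one row per (n, p, c) combination, with the
    # estimated 95th-percentile critical value in column crit.
    fit <- lm(crit ~ poly(p, 2) + poly(n, 2) + poly(c, 2) + n:c + p:n + p:c,
              data = cv.data)
    summary(fit)                    # residual variance estimates lack of fit plus simulation noise
    plot(fitted(fit), resid(fit))   # residuals vs. fitted, as in Diagram 2.2.12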
A less conservative approach to estimating the simulation noise is to estimate the variance between critical values that should be the same. Due to the symmetry of the GLRT across the hypothesized p, i.e., we would expect the same critical region for (n, c, p=0.2) and (n, c, p=0.8), we can utilize the two assumed-similar critical values and determine a variance between those two points. A sample variance can be established between the two numbers utilizing the function
$$S^2 = \sum_{i=1}^{2} (Y_i - \bar{Y})^2,$$
where the $Y_i$'s are the assumed-similar critical values. The expected value of $S^2$ is the variance of the two $Y_i$'s, as we treated them as replicates.
As expected from the limiting $\chi^2_{c-1}$ distribution, the GLRT critical values increase as the number of classes, c, increases. It was thus chosen to estimate the variance associated with the critical values per each level of c. As we had 75 replicated critical values and 5 levels of c, 15 estimates of the variance were available for each level of c: 15 degrees of freedom in estimating the true variance for each level. For each level, the 15 estimates of the true variance were averaged. These averages were used to obtain 95% confidence windows, defined as plus or minus two standard deviations. These windows follow, designated by the level of c: c=5, ±0.42; c=10, ±0.43; c=15, ±0.52; c=20, ±0.84; c=25, ±0.71. Note that these are narrower confidence windows than those obtained through usage of the error of the polynomial model. See Appendix E for the GLRT critical values.
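A small sketch of this replicate-based estimate (cv.low and cv.high are hypothetical vectors holding the paired critical values for p and 1-p at a fixed level of c):

    # Sketch: simulation-noise window from paired critical values that should agree.
    s2  <- (cv.low - cv.high)^2 / 2    # sample variance of each pair: sum((Y_i - Ybar)^2)
    win <- 2 * sqrt(mean(s2))          # +/- two standard deviations, averaging the pair variances
    win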
19


3.
The Traditional Test
3.1 The Test Statistic
The traditional test for homogeneity of proportions involves a weighting schema using weights inversely proportional to the variance of the observed proportion. Hypothesized as a variate coming from a binomial distribution, the observed proportion $\hat{p}_i$ has corresponding estimated variance $\hat{p}_i(1-\hat{p}_i)/n_i$, where $n_i$ is the number of Bernoulli trials within the class (e.g., the number of emergency care patients treated by Practice i), and
$$\hat{p}_i = \frac{\text{Number of successful Bernoulli trials of Class } i}{\text{Number of Bernoulli trials of Class } i} = \frac{\text{Number of Patients Referred to a Specialist within Practice } i}{\text{Number of Patients Treated by Practice } i}$$
To obtain a representative statistic for the set of proportions, the weight for the estimate of each $p_i$ is
$$w_i = \frac{1}{\hat{\sigma}^2(\hat{p}_i)} = \frac{n_i}{\hat{p}_i(1-\hat{p}_i)} \qquad \text{Eqn. 3.1.1}$$
and the aforementioned representative proportion, utilizing these established weights, is thus
$$\hat{p}_+ = \frac{\sum_{i=1}^{c} w_i \hat{p}_i}{\sum_{i=1}^{c} w_i} \qquad \text{Eqn. 3.1.2}$$
20


Using the computations above, we formulate the traditional test for homogeneity:
$$\chi^2_{trad} = \sum_{i=1}^{c} \frac{(\hat{p}_i - \hat{p}_+)^2}{\hat{\sigma}^2(\hat{p}_i)} \qquad \text{Eqn. 3.1.3}$$
which is asymptotically a Chi-squared variate with c-1 degrees of freedom, as the $n_i$'s tend towards infinity.
At casual observation, it should be seen that the closer $\hat{p}_i$ is to 1 or 0, the more weight it has in the representative statistic. With the example of the ASPN practices in Table 1.1, the rate at which the patients afflicted with respiratory problems were referred to pulmonary specialists within Practice 9 was 0.667; within Practice 4, the rate was 0.900. The representative proportion ($\hat{p}_+$) is determined more heavily by the referral rate of Practice 4 than by the rate of Practice 9, even if, which is not the case with these selected two, the practices treat the same number of emergency care patients with this particular condition.
Snedecor and Cochran (1967) present $\hat{p}$ as the maximum likelihood estimator derived in Section 2.1. Using the $\hat{p}_+$ described in Equation 3.1.2 minimizes Equation 3.1.3, thus providing smaller values of the traditional test statistic. This then yields a more conservative testing procedure than that of Snedecor and Cochran's.
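A sketch of the corresponding computation (hypothetical function name; the guard against proportions equal to 0 or 1, discussed in Section 3.2, is left to the caller):

    # Sketch: traditional (variance) test statistic for homogeneity of proportions.
    # x[i] = successes in class i, n[i] = Bernoulli trials in class i.
    trad.stat <- function(x, n)
    {
        phat  <- x / n
        w     <- n / (phat * (1 - phat))    # weights, Eqn. 3.1.1 (infinite if phat is 0 or 1)
        pplus <- sum(w * phat) / sum(w)     # weighted representative proportion, Eqn. 3.1.2
        sum(w * (phat - pplus)^2)           # Eqn. 3.1.3, compared with chi-squared(c-1)
    }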
3.2 Assessing the Probability of Failure of the Traditional Test Statistic
The traditional test statistic cannot be used when an observed $\hat{p}_i$ is equal to 0 or 1. Observing the inverse relationship between the proposed weights and the estimated variances of the proportions, it is concluded that the weight corresponding to such a small or large proportion tends to infinity, thus enlarging the


test statistic and leading to a false rejection of the null hypothesis. Note: since the referral rates of Table 1.1 include 0 and 1, this traditional testing procedure may not be used there.
But how small, or large, can the hypothesized p be in order to properly use the traditional testing procedure? This question can only be answered after obtaining information about the other two parameters, n and c.
What needs special consideration at this point is the region where the traditional test statistic can be utilized with low probability of failure, to compare to the GLRT statistic. The GLRT statistic falters nowhere, so we need to make sure, for comparison's sake, that the traditional test statistic can be used as well.
The probability associated with a useful combination of $\hat{p}_i$'s, given the hypothesized p, the number of Bernoulli trials per class, and the number of classes, is calculated by:
$$P[\text{Useful combination of } \hat{p}_i\text{'s}] = \prod_{i=1}^{c} P[\hat{p}_i \text{ useful}] = \prod_{i=1}^{c} \bigl(1 - P[\hat{p}_i \text{ not useful}]\bigr) = \prod_{i=1}^{c} \Bigl(1 - \Bigl[\binom{n_i}{0} p^{0} (1-p)^{n_i} + \binom{n_i}{n_i} p^{n_i} (1-p)^{0}\Bigr]\Bigr) = \prod_{i=1}^{c} \bigl(1 - (1-p)^{n_i} - p^{n_i}\bigr) \qquad \text{Eqn. 3.2.1}$$
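A sketch of this calculation (in the spirit of the Appendix A S-Plus program, which is not reproduced here; equal class sizes are assumed):

    # Sketch: probability that all c observed proportions fall strictly between 0 and 1,
    # so that the traditional test statistic can be computed (Eqn. 3.2.1, with n_i = n).
    prob.useful <- function(n, p, c)
    {
        (1 - (1 - p)^n - p^n)^c
    }
    prob.useful(20, 0.5, 10)    # a point in the flat region of Diagram 3.2.1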
An S-Plus program to compute this probability is given in Appendix A. This probability is graphed as a function of n and p in Diagram 3.2.1, for c=10, where $n_i = n$, i = 1, ..., 10. Easily identified by this surface is the region where we are safe to utilize the traditional test statistic without failure. This region is indicated by the flatness of the surface. Using the contour plot of Diagram 3.2.2, it can be seen that the usage of 10 proportions coming from a binomial distribution of hypothesized p between 0.4 and 0.6, with each class size of magnitude greater than 20, is warranted with risk < 0.001 of failure.
Diagram 3.2.1: Probability of Ability to Use Traditional Test c=10
Diagram 3.2.2: Contour Plot of Surface
23


Diagram 3.2.2 displays the ineffectiveness of the traditional test statistic when the magnitude (number of Bernoulli trials) of the average class is minimal (less than 10) and when we have a
relatively low/high hypothesized number of successes per class. The issue of these problematic
combinations of average class size and p will be addressed in Chapter 5, where a Bayesian modification to
testing homogeneity of proportions is introduced.
3.3 Type 1 Error
As a first glance at the behavior of the traditional test statistic's distribution, the parameter combination (n=30, c=25, p=0.8) was used. The S-Plus code of Appendix B generated 3,872 test statistics, formulated under the aforementioned parameter combination. [Note at this point that a simulation of 4,000 traditional test statistics was attempted. Through generation of 25 binomial variates per iteration of the simulation, only 3,872 iterations had a usable set of variates. This was because a proportion of 0 or 1 was generated by one (or more) of the binomial variates, rendering the traditional test useless. See Section 3.2.] For this parameter combination, the distribution's deviation from the $\chi^2_{24}$ is one of longer-tailedness; the values observed in the traditional testing procedure are larger than what we would expect, assuming a $\chi^2_{24}$ distribution. Diagram 3.3.3 displays the histogram representing the empirical p.d.f. of the traditional test statistic, for this specific parameter combination, and superimposed on it is the asymptotic distribution, $\chi^2_{24}$. The simulation of the traditional test statistic's distribution was also performed for the parameter combinations (n=30, c=25, p=0.2, 0.5). With p=0.2, we witnessed results similar to those of p=0.8, which was to be expected because of the symmetric nature of the traditional test statistic across the p's. With p=0.5, the distribution of the test statistics followed the $\chi^2_{24}$ quite nicely. Diagrams 3.3.1-3.3.3 display these observations.
24


Diagram 3.3.1: Distribution of Trad. Test Statistics (n=30,c=25,p=0.2)
Diagram 3.3.2: Distribution of Trad. Test Statistics (n=30,c=25,p=0.5)
Diagram 3.3.3: Distribution of Trad. Test Statistics (n=30,c=25,p=0.8)
To see how the traditional test statistic behaves for smaller n, the parameter combination (n=10, c=25, p=0.8) was used to simulate, in attempt, 4,000 traditional test statistics, again using the S-Plus code of Appendix B. The simulated distribution was shifted left, drastically, from the hypothesized $\chi^2_{24}$. Thus, as we decreased n, the size of each class, the traditional test statistic's distribution deviated greatly from the asymptotic distribution. Similar results followed, with deviation of the same magnitude and direction, from simulations using the parameter combinations (n=10, c=10, p=0.2, 0.8). Thus altering the number of classes had little effect on the behavior of the distribution of the traditional test statistic. Important to note is that with the combination (n=10, c=10, p=0.5) the distribution was left-skewed relative to the $\chi^2_9$. The skewness was in the opposite direction (right) from that of the distributions for (n=10, c=10, p=0.2, 0.8), and it was more chi-squared-like. So, in conclusion, when we have smaller n and p close to 0.5, the traditional test statistic does behave somewhat like a $\chi^2_9$ variate.
26


Table 3.3 Critical Values for the Traditional Chi-Squared Test for
Homogeneity of Proportions ($\chi^2_T$), $P[\chi^2_{trad} \ge \chi^2_T] = 0.05$
(n = size of each class; c = number of classes)

 p    n      c=5        c=10       c=15       c=20       c=25
0.2  10    6.773239   12.316631  17.626680  20.658656  26.620682
     15    9.044598   15.824357  22.940475  29.432538  34.901556
     20   10.555826   18.911261  27.256803  34.383117  41.181982
     25   11.493894   21.372230  29.182222  37.904865  45.730485
     30   11.742557   23.168593  32.206702  40.693759  48.385532
0.3  10    9.797422   19.118012  26.593878  35.007753  40.746125
     15   12.215306   22.105102  31.485198  39.774807  48.17185
     20   12.47389    24.82931   31.98462   41.96835   50.10555
     25   12.50450    24.01396   33.98946   46.23746   54.97637
     30   11.18971    21.67130   30.68705   38.70966   46.47078
0.4  10   13.64638    24.71230   35.53848   45.10311   54.50412
     15   12.63613    24.13758   34.86871   44.95899   52.40975
     20   11.52461    20.90910   29.24025   37.97424   46.85223
     25   10.93307    19.89903   28.05386   35.70133   42.94649
     30   10.53307    19.76447   27.48838   34.20329   41.38230
0.5  10   15.33199    27.40924   38.65350   49.20346   59.85152
     15   11.75673    21.88062   31.26730   39.91737   48.42441
     20   11.02656    19.91688   28.00401   35.43433   42.81818
     25   10.59926    18.83105   26.62084   35.04673   40.16616
     30   10.21207    18.31242   26.46031   32.08912   39.81960
0.6  10   13.79942    24.91524   34.59076   44.44536   54.67369
     15   13.12923    24.82879   35.24545   44.09692   51.91096
     20   11.80682    21.51856   30.39668   39.56237   46.86581
     25   11.29377    20.38845   28.09596   36.90418   44.86007
     30   11.13045    19.70270   27.75258   35.01562   42.91455
27


Table 3.3 (continued) Critical Values for the Traditional Chi-Squared Test for
Homogeneity of Proportions ($\chi^2_T$), $P[\chi^2_{trad} \ge \chi^2_T] = 0.05$

 p    n      c=5        c=10       c=15       c=20       c=25
0.7  10   10.20962    18.66921   26.38649   34.60401   40.57658
     15   11.66760    22.16062   31.09343   38.68213   47.36754
     20   12.80923    23.251257  32.881049  41.977614  50.680730
     25   11.735456   22.452635  33.076669  41.548290  52.587338
     30   11.614164   21.271008  29.835853  38.850912  46.406711
0.8  10    7.006548   12.859265  17.775812  22.612319  27.200953
     15    9.190978   15.711640  22.653064  29.143503  34.710637
     20   10.380357   18.983298  26.678808  34.173714  41.685613
     25   11.527602   21.193220  29.861222  37.794547  46.478983
     30   12.14935    22.17020   31.60790   40.75586   49.01912
As was done with the GLRT, the distribution-free KSGFT was used to test whether the traditional test statistics, for the parameter combination (n=30, c=25, p=0.8), follow a $\chi^2_{24}$. Pictorially, we observed a longer-tailed empirical p.d.f. for this parameter combination [Diagram 3.3.3], and as a result, the KSGFT statistic was observed to be 13.09. This led to rejection of the hypothesis that the two distributions were the same, at significance level 0.05.
Since we witnessed strong evidence that the traditional test statistic's distribution was not the same as the assumed asymptotic distribution, $\chi^2_{c-1}$, better estimated critical regions were established. Through each of the 175 simulations of the true distributions, the critical value was formulated, defined as the point above which 5% of the test statistics lie. By designating this critical region, we can adequately say the test is of size α=0.05. These critical values are provided in Table 3.3. As examples, Diagrams 3.3.4 and 3.3.5 present graphs of these critical values as functions of n, c, and p.
28


Diagram 3.3.4: Traditional Critical Values c=10
Diagram 3.3.5: Traditional Critical Values c=15
29


Unfortunately, Table 3.3 can be used only for these convenient settings of (n, p, c); analysts usually have at hand different sizes of classes, a number of classes not equal to a multiple of 5, and no concrete prior indication of p. In hopes of interpolating between the cells of this convenient table, a polynomial regression was performed to observe a possible functional relationship between the critical values and the three parameters n, c, and p.
With no transformation of the critical values, the mean squared error, the point estimate for the variance of the model, proved to be rather large (2.55); thus traversing two standard deviations away from the observed critical value would yield an estimation window of approximate width 6.4, which is not very precise. It was also shown, through a scatter plot of the residuals versus the fitted values of the model, that the variance was increasing. With this evidence of the inappropriateness of the model, a variance-stabilizing transformation was required. The log and square root transformations were tried. Though plausible for the purpose of stabilizing the variance, the transformations proved to yield large MSEs as well, although smaller than that of the model with no transformation of the critical values. It was concluded that the model of the traditional test's critical values had more error than would be tolerated in obtaining predicted values for a given (n, p, c) and that the model was not flexible enough to fit the critical values. Thus the best that could be done was to present a margin of error for each of the critical values within Table 3.3; this is the error created by the simulation of the traditional test's critical values. The formulation of this error follows.
As was performed for the critical values of the GLRT, the simulation noise was estimated, per level of c, by utilizing two critical values that should be the same. With these two replicates, a sample variance was established, whose expected value is the variance of the two critical values. Per each level of c, 15 estimates of the true variance were averaged; these averages provided 95% confidence windows, defined by plus or minus two standard deviations. These windows are presented by level of c:
30


c=5, ±0.57; c=10, ±1.02; c=15, ±0.91; c=20, ±2.10; c=25, ±1.39. With the size of these confidence windows, it isn't quite clear that the behavior of the critical values is predictable, but it is the best the simulations have to offer.
31


4.
Power Comparison
4.1 Power Comparison: Sufficiency
In comparing two testing procedures, it is important to assess the probability of rejecting the null hypothesis when, in fact, it is justified to reject it, given the alternative hypothesis; this defines the power of a test statistic. Before comparing the power of the two test statistics, it is necessary to comply with some assumptions.
Test statistics to be involved in such a comparison should be dependent on sufficient statistics only; thus it complements our study to analyze the sufficiency of the statistics on which our test statistics depend. Sufficient statistics are statistics that condense the parameter space under the alternative hypothesis, $\Omega$, in such a way that no information about $\Omega$ is lost (Mood et al., p. 301), which implies that all information about the hypothesized parameter space can be found within the sample.
Definition 15: Sufficient statistic (Mood et al.). Let $X_1, X_2, \ldots, X_n$ be a random sample from the density $f(\cdot\,; \theta)$, where $\theta$ may be a vector. A statistic $S = s(X_1, X_2, \ldots, X_n)$ is defined to be a sufficient statistic if and only if the conditional distribution of $X_1, X_2, \ldots, X_n$ given $S = s$ does not depend on $\theta$ for any value of $s$.
32


Sufficiency can be proven using the factorization method (Mood et al., p. 408), since the observed $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_c$, later referred to together as the vector $\hat{p}$, are independent and binomially distributed, as addressed earlier. Without loss of generality, the number of Bernoulli trials per class was set constant, so that $n_i = n$ for all $i \in \{1, \ldots, c\}$. Since each trial (j) within a class (i) is a Bernoulli trial,
$$Y_{ij} = \begin{cases} 0 & \text{if failure} \\ 1 & \text{if success} \end{cases}$$
with $f_Y(y) = p^{\,y} (1-p)^{\,1-y}\, I_{\{0,1\}}(y)$. The joint density of the $Y_{ij}$ can thus be configured, letting $X_i = \sum_{j=1}^{n} Y_{ij}$:
$$f_Y(y) = \prod_{i=1}^{c} \prod_{j=1}^{n} p_i^{\,y_{ij}} (1-p_i)^{\,1-y_{ij}} = \prod_{i=1}^{c} p_i^{\,x_i} (1-p_i)^{\,n-x_i} = \prod_{i=1}^{c} e^{\,x_i \ln(p_i) + (n-x_i)\ln(1-p_i)} = \prod_{i=1}^{c} e^{\,n \frac{x_i}{n} \ln\!\left(\frac{p_i}{1-p_i}\right) + n \ln(1-p_i)} \qquad \text{Eqn. 4.1.1}$$
where $x_i/n$ is simply the observed proportion, $\hat{p}_i$, of the n Bernoulli trials that were successful. In accordance with the factorization criterion for sufficiency, the joint distribution of the variates $Y_{ij}$ can be


factored into a product of two functions: one positive and solely dependent upon the data; the other dependent upon both the hypothesized $p_i$'s and the data, but dependent on the data only through the sufficient statistics. This proves that $\hat{p}_i$ is sufficient for $p_i$. In conclusion, both the traditional test statistic and the GLRT statistic depend on these individual $\hat{p}_i$'s, so we have chosen adequate test statistics to compare with regard to sufficiency.
4.2 Power Comparison: Type 1 Error
For the two test statistics to be compared, their Type 1 errors, α, must be equivalent in order to deem one or the other the more powerful test of size α. "The justification for fixing the size of the Type 1 error to be α ... seems to arise from those testing situations where the two hypotheses are formulated in such a way that one type of error is more serious than the other. The hypotheses are stated so that the Type 1 error is the more serious, and hence one wants to be certain that it is small." (Mood et al., p. 411) Logically, if one test statistic has a Type 1 error of α′, in accordance with a critical value c, we could choose a test statistic whose distribution dictates a larger Type 1 error α″, imposing a smaller Type 2 error. This would yield a false comparison of the two test statistics.
4.3 The Final Power Comparison: Results
Using these critical values established in Section 4.2 and given specific deviations from the null
hypothesis, power of the two tests can be established. This is the probability of rejecting the hypothesis of
homogeneity when, in fact, there exists some heterogeneity; using a metric to measure the magnitude of the
deviation from the null hypothesis, the power can be graphed as a function of this deviation, for graphical
comparison.
34


Simulations were used to estimate the power of the traditional and GLR tests, for a given departure from the null hypothesis. A deviation from the null hypothesis was characterized as a fluctuation of one of the five hypothesized $p_i$'s. That is, one of the observed $\hat{p}_i$'s comes from a different distribution than do the other $\hat{p}_i$'s. The deviation from the null hypothesis is measured by the magnitude of the deviation of that sole hypothesized $p_i$ from the others. The deviation ranged from 0.00 to 0.20, in increments of 0.04. This yielded 6 points of analysis for a given parameter combination.
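A sketch of how each power point might be estimated (hypothetical function; glrt.stat, trad.stat, and the estimated critical values come from the earlier sketches and simulations; an unusable set, one with a proportion of 0 or 1, is simply counted here as a non-rejection for the traditional test, which is one possible convention rather than the thesis's exact handling):

    # Sketch: estimated power of both tests at a given deviation delta,
    # with one of the c classes generated from Binomial(n, p + delta).
    power.at <- function(n, p, c, delta, crit.glrt, crit.trad, nsim = 1000)
    {
        rej.glrt <- 0
        rej.trad <- 0
        for (s in 1:nsim) {
            x    <- rbinom(c, n, p)            # c classes under H0 ...
            x[c] <- rbinom(1, n, p + delta)    # ... with one class deviating by delta
            if (glrt.stat(x, rep(n, c)) > crit.glrt) rej.glrt <- rej.glrt + 1
            if (all(x > 0 & x < n) && trad.stat(x, rep(n, c)) > crit.trad)
                rej.trad <- rej.trad + 1
        }
        list(glrt = rej.glrt / nsim, trad = rej.trad / nsim)   # estimated power
    }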
Diagrams 4.3.1-4.3.4 display the two estimated power functions, dependent on deviation, for the parameter combinations n=10, c=5, and p=[0.2, 0.4, 0.5, 0.6].
Diagram 4.3.1 displays the complete domination of the traditional test's power over that of the GLRT. Diagrams 4.3.2 and 4.3.3 indicate that when n=10, c=5, and p=0.5, the GLRT is the more powerful test as deviation grows, but when p=0.4, the roles reverse. With the power comparison for p=0.4 and deviation=0.00, the graphs do not meet. We had assumed for this power comparison that the tests were of the same size. This equality of size should have been represented by the meeting of the graphs, as deviation=0.00 implies that we, in fact, have all proportions coming from the same distribution
Diagram 4.3.1: Power Comparison (n=10,c=5,p=0.2)
35


Diagram 4.3.2: Power Comparison (n=10,c=5,p=0.4)
Diagram 4.3.3: Power Comparison (n=10,c=5,p=0.5)
36


Diagram 4.3.4: Power Comparison (n=10,c=5,p=0.6)
(power and size should be equal). This difference in size could be attributed to simulation noise in establishing the critical values of these two test statistics, as discussed in Section 3.3. The unexpected change of direction of the power seen in Diagram 4.3.3 is speculated to be a result of simulation noise; we would expect that power would only increase as deviation grew.
Also produced were similar power functions for the parameter combinations (n=30, c=5) and (n=30, c=25), with p=0.2, 0.4, 0.5, 0.6. From the distributions for (n=30, c=5) [Diagrams 4.3.5-4.3.8], we see that the GLRT and the traditional test perform in the same fashion with respect to power, particularly for p near 0.5. This similarity could be the product of the asymptotic nature of both tests, as we are using a relatively large n (30). The greater difference between the power curves for p=0.2 is interesting, leading to the conclusion that the tests perform differently when we have different values of p, namely smaller or larger. This greater difference in power curves for the smaller p was witnessed also with (n=30, c=25) [Diagrams 4.3.9-4.3.12], relative to those power curves for p=0.4, 0.5, 0.6. Similar to the results of Section 2.2, decreasing c yielded no difference in the behavior of the test statistic, this example
37


showing little difference in power. From these graphs we are unable to conclude that one test is consistently better than the other.
Diagram 4.3.5: Power Comparison (n=30,c=5,p=0.2)
Diagram 4.3.6: Power Comparison (n=30,c=5,p=0.4)
38


Diagram 4.3.7: Power Comparison (n=30,c=5,p=0.5)
Diagram 4.3.8: Power Comparison (n=30,c=5,p=0.6)
39


Diagram 4.3.9: Power Comparison (n=30,c=25,p=0.2)
Diagram 4.3.10: Power Comparison (n=30,c=25,p=0.4)
40


Diagram 4.3.11: Power Comparison (n=30,c=25,p=0.5)
Diagram 4.3.12: Power Comparison (n=30,c=25,p=0.6)
41


5.
Bayesian-Adjustment Approach
5.1 Bayesian Theory
Widespread familiarity with the traditional testing procedure indicates that analysts will continue using it, correctly or falsely. Assuming that analysts will continue to use it, and knowing that there is a possibility of false usage (see Section 3.2), let us introduce an adjustment for the $\hat{p}_i$'s so that the test will always be usable.
The methods addressed in the previous chapter depended on the parameters being fixed. For this chapter, let us assume that these parameters, specifically p, are actually variates from their own distribution. This is called a prior distribution.
The nature of p dictates a range from 0 to 1, as there are never more Bernoulli successes than there are trials, and there is certainly never a negative number. These constraints are very similar to the defined domain of the beta distribution [Potthoff and Whittinghill; Rice], so that would be an excellent place to begin with our guesses of how p is distributed. We write p's prior distribution as:
$$\pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{\,a-1} (1-p)^{\,b-1} \qquad \text{Eqn. 5.1.1}$$
where $E(p) = \dfrac{a}{a+b}$ and $\mathrm{var}(p) = \dfrac{ab}{(a+b)^2 (a+b+1)}$.
42


Continuing to assume that our data are distributed binomially, with parameters n and p, we write the conditional distribution of our data, dependent on those two parameters, per usual, as:
$$\pi(x \mid p) = \binom{n}{x} p^{x} (1-p)^{n-x} \qquad \text{Eqn. 5.1.2}$$
The joint distribution between the data and p follows:
$$\pi(x, p) = \pi(x \mid p)\,\pi(p) = \binom{n}{x} p^{x} (1-p)^{n-x}\, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{\,a-1} (1-p)^{\,b-1} \qquad \text{Eqn. 5.1.3}$$
Following Bayes' rule, the posterior distribution of p, the distribution after observing X, is:
$$\pi(p \mid x) = \frac{\pi(x \mid p)\,\pi(p)}{\int_0^1 \pi(x \mid p)\,\pi(p)\, dp} \qquad \text{Eqn. 5.1.4}$$
Using Equations 5.1.2 and 5.1.3 in Equation 5.1.4, we see that:
43


$$\pi(p \mid x) = \frac{\binom{n}{x} p^{x}(1-p)^{n-x}\, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{\,a-1}(1-p)^{\,b-1}}{\int_0^1 \binom{n}{x} p^{x}(1-p)^{n-x}\, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, p^{\,a-1}(1-p)^{\,b-1}\, dp} = \frac{p^{\,x+a-1}(1-p)^{\,n-x+b-1}}{\int_0^1 p^{\,x+a-1}(1-p)^{\,n-x+b-1}\, dp} = \frac{\Gamma(n+a+b)}{\Gamma(x+a)\,\Gamma(n-x+b)}\, p^{\,x+a-1}(1-p)^{\,n-x+b-1} \qquad \text{Eqn. 5.1.5}$$
which is the probability density function (p.d.f.) of a beta distribution with parameters (x+a, n-x+b). Note that when p's prior distribution is uniform, its p.d.f.'s parameters are (a=1, b=1). A prior distribution of uniform nature is most conservative, stating that there exists no indication of the size of p.
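As a small numerical illustration of Eqn. 5.1.5 (a sketch; the variable names are not from the thesis), the posterior under a uniform prior is Beta(x+1, n-x+1):

    # Sketch: beta posterior for p after observing x successes in n Bernoulli trials,
    # starting from a uniform, Beta(1,1), prior.
    x <- 8; n <- 10; a <- 1; b <- 1
    post.mean <- (x + a) / (n + a + b)            # posterior mean, Eqn. 5.1.6 with a = b = 1
    p.grid    <- seq(0, 1, length = 101)
    post.dens <- dbeta(p.grid, x + a, n - x + b)  # posterior density, Eqn. 5.1.5
    post.mean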
With the assumed beta prior distribution, it is shown that the expected value of p, given the data, is a function of the observed proportion of Bernoulli successes, $\hat{p}$, the number of Bernoulli trials, n, and the distribution's two parameters, a and b, through the following formulation:
$$E(p \mid x) = \frac{x+a}{(x+a)+(n-x+b)} = \frac{x+a}{a+n+b} = \frac{a+b}{a+b+n}\left(\frac{a}{a+b}\right) + \frac{n}{a+b+n}\left(\frac{x}{n}\right) \qquad \text{Eqn. 5.1.6}$$
44


and with our first conservative step towards specifying the distribution's parameters (a=1, b=1), we see that $E(p \mid x) = \frac{2}{2+n}(0.5) + \frac{n}{2+n}(\hat{p})$, where 0.5 is the expected value of the uniform distribution on the interval [0,1]. It is important to note at this point that:
$$E(p \mid \hat{p}=0) = \frac{1}{2+n} \qquad \text{and} \qquad E(p \mid \hat{p}=1) = \frac{n+1}{2+n} \qquad \text{Eqn. 5.1.7}$$
The beauty of this derivation is that the $\hat{p}$'s of Eqn. 5.1.7 are exactly those that cause problems within the use of the traditional testing procedure, as the corresponding weights are infinite (Section 3.1). Justification can be sought, then, for adjusting these problematic $\hat{p}$'s so that the traditional testing procedure can be used freely, without violation.
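A sketch of the adjusted statistic (a hypothetical function in the spirit of the Appendix D code): each observed proportion is replaced by its posterior mean under the uniform prior before the traditional weights are formed, so no weight can be infinite.

    # Sketch: Bayesian-adjusted traditional test statistic.
    # Each p-hat_i is replaced by E(p_i | x_i) = (x_i + a)/(n_i + a + b), here with a = b = 1.
    bayes.trad.stat <- function(x, n, a = 1, b = 1)
    {
        padj  <- (x + a) / (n + a + b)      # posterior means, Eqn. 5.1.6
        w     <- n / (padj * (1 - padj))    # weights of Eqn. 3.1.1, now always finite
        pplus <- sum(w * padj) / sum(w)     # weighted representative proportion
        sum(w * (padj - pplus)^2)           # adjusted analogue of Eqn. 3.1.3
    }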
5.2 Results
Four thousand traditional test statistics were obtained for each of the three combinations (n, p, c) = {(30, 0.2, 25), (30, 0.5, 25), (30, 0.8, 25)}, similar to those used in the Section 3.3 analyses. Generated through simulation [see Appendix D for the S-Plus code used], these test statistics are those constructed using the expected values of the hypothesized p, given the observed data, as described in Section 5.1. These expected values simply replaced the observed $\hat{p}_i$'s within the weights of the traditional test statistic.
What we would ideally like to see is that the distribution of the test statistics is similar to a known distribution; for convenience, let us compare the distribution to the Chi-squared with c-1 degrees of freedom. This is the asymptotic distribution of the traditional test statistic, which was shown to be significantly different from the estimated true distribution, as discussed in Section 3.3. Comparison of
45


these distributions have been done both pictorially and through the non-parametric Kolmogorov-Smimov
Goodness-of-Fit test
Diagrams 5.2.1, 5.2.2, and 5.2.3 display the estimated probability density functions (p.d.f.s) of the test statistics for parameter combinations (n,p,c) = {(30,0.2,25), (30,0.5,25), (30,0.8,25)}, respectively; each is indicated by a histogram. The observed p.d.f.s are compared to the p.d.f. of χ²₂₄, indicated by the solid black line.
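A minimal S-Plus sketch of how such a comparison can be drawn, assuming the 4,000 Bayesian-adjusted statistics are stored in a vector named bayes.stats (an illustrative name, not from the thesis code):

hist(bayes.stats, probability = T, nclass = 30,
     xlab = "Bayesian-Adjusted Test Statistic")
xs <- seq(0, max(bayes.stats), length = 200)
lines(xs, dchisq(xs, 24))      # chi-squared p.d.f. with c - 1 = 24 degrees of freedom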


Diagram 5.2.3: Distribution of Bayesian-Adjusted Test Statistics (n=30, c=25, p=0.8)


As indicated by Diagrams 5.2.1-5.2.3, the estimated true distribution of the Bayesian-adjusted traditional test statistic, for relatively large n, fits the hypothesized χ²₂₄ quite well, and better than the distributions in the respective Diagrams 3.3.1-3.3.3 for the non-adjusted traditional test.
More interesting was the behavior of this Bayesian-adjustment technique in the problematic areas discussed in Section 3.2. For this analysis, Diagram 3.2.1 was used again to identify the problematic regions corresponding to n=10, c=10, and p=[0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9]. This analysis also required another generation of 4,000 test statistics using the Bayesian adjustment for each of the generated proportions, per each combination of n, c, and p. Diagrams 5.2.4-5.2.10 present the histograms representing the p.d.f.s of each parameter combination described above. These histograms demonstrate the Chi-squared-like nature of the Bayesian-adjusted test statistics, except for the smallest/largest proportions (p=0.1, 0.9). The reader should keep in mind that this demonstration involves relatively small n. For further research, it would be necessary again to establish a critical value for these parameter combinations.


Diagrams 5.2.4-5.2.10: Distributions of Bayesian-Adjusted Test Statistics (n=10, c=10; p = 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9)
The reader should note at this point the extreme difference between the distributions of the traditional test statistics for parameter combinations (n=10, c=10, p=[0.2, 0.8]) [Diagrams 5.2.11, 5.2.12] and the distributions corresponding to the same parameter combinations of the Bayesian-adjusted test statistics [Diagrams 5.2.5, 5.2.9].


Diagrams 5.2.5 and 5.2.9 show that the Bayesian adjustment created a distribution closer to χ²₉ than the distribution of the non-adjusted statistic. This is a remarkable observation. With the traditional testing procedure, we were only able to utilize approximately 31.5% of the test statistics where p=0.2, 0.8, and 98% for p=0.5. Because of the low probability of usability for the smaller/larger p̂s, the true distribution of the test statistic may have been underrepresented. With the Bayesian adjustment, each generated proportion posed no problem when incorporated into the test statistic; thus 100% of the test statistics were used to determine their distribution. With the ASPN data of Table 1.1, the traditional testing procedure could not be used due to problematic proportions, but we could test for homogeneity with the Bayesian adjustment. The null hypothesis of homogeneity was rejected, using this Bayesian technique, with p-value < 0.001.
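A minimal sketch of how such a test can be carried out in S-Plus follows, assuming the referral counts x and class sizes n are available as vectors. The function name is illustrative and not from the thesis code, and the chi-squared p-value shown is only an approximation based on the asymptotic χ² reference with c-1 degrees of freedom; the analyses in this thesis otherwise rely on simulated critical regions.

bayes.homog.f <- function(x, n) {
    ophat <- x/n                                  # observed proportions
    phat <- (1/(2 + n)) + (n/(2 + n)) * ophat     # Bayesian-adjusted proportions (a = b = 1)
    w <- n/(phat * (1 - phat))                    # weights are finite, since 0 < phat < 1
    pplus <- sum(w * ophat)/sum(w)                # weighted representative proportion
    stat <- sum(w * (ophat - pplus)^2)            # Bayesian-adjusted traditional statistic
    c(statistic = stat, approx.p.value = 1 - pchisq(stat, length(x) - 1))
}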
5.3 Power Comparison with GLRT and Traditional
A comparison of the power of the Bayesian-adjusted traditional, the non-adjusted traditional, and the GLRT testing procedures was performed. The simulation used only the parameter combination (n=10, c=5, p=0.2). Diagram 5.3.1 displays this power function comparison, with power as a function of the deviation from the null hypothesis of homogeneity. The deviation is defined in the fashion described in Section 4.3, with one of the generated proportions as a variate from a different binomial distribution than that of the others; the difference in the distributions is measured by the difference between the hypothesized pᵢs. We see from Diagram 5.3.1 that the Bayesian-adjusted traditional test performs similarly to the unadjusted traditional test statistic with respect to power. Both tests perform better than the GLRT, as indicated by the vertical shifts in the graphs. This observation is coupled with the fact that the Bayesian-adjusted test statistic may be used for all proportions.
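For concreteness, a minimal S-Plus sketch of how the deviating class can be generated under the alternative; the particular values below are illustrative only and mirror the approach of Section 4.3 and Appendix C.

n <- rep(10, 5); p <- 0.2; dev <- 0.3
pr <- c(rep(p, length(n) - 1), p + dev)      # the last class deviates from the null by 'dev'
phat <- rbinom(length(n), n, pr)/n           # generated proportions under the alternative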


Diagram 5.3.1: Power Comparison (n=10, c=5, p=0.2); power is plotted against the deviation from the null hypothesis.


6. Conclusion
6.1 Remarks
In our ability to compare the generalized likelihood ratio and the traditional tests, we stumbled upon information leading us to disbelieve their assumed hypothesized distribution, χ² with c-1 degrees of freedom. In Chapters 2 and 3, the true size of the tests was established, thus yielding a critical region to be utilized for both tests. Chapter 4 presented the power for each of the size-α tests, represented as a function of the deviation from the null hypothesis. It was witnessed that the traditional and generalized likelihood ratio testing procedures performed equally for particular combinations of n, c, and p when the established true regions of rejection were utilized, but with other parameter combinations the domination of one testing procedure over the other varied, leading to inconclusive results. One lingering problem with the traditional testing procedure was the inability to use it when faced with a problematic combination of proportions: those that rendered the test statistic useless, as one or more of them were equal to 1 or 0.
To make the unusable usable, a Bayesian adjustment was proposed in Chapter 5. For specific combinations of the parameters n, p, and c, the distributions of 4,000 test statistics generated through usage of this Bayesian adjustment were displayed. It was concluded that the distribution for each combination, where p=0.2-0.8, was similar to the hypothesized distribution, χ² with c-1 degrees of freedom, and closer to it than the distribution of those generated without the adjustment. The Bayesian-adjusted test statistic also retained the power of the non-adjusted traditional test statistic for a specific parameter combination. It is suggested that this Bayesian technique be used as a testing procedure for binomial homogeneity.


Appendix A Splus code used to assess probability
of useful set of 10 proportions, given
n and p
prmat2 <- NULL
for(m in 1:10) {
    for(k in 1:99) {
        hyp.p <- k/100                                               # hypothesized p
        classiz <- 5 * m                                             # size of each class (n)
        prob.p <- (1 - ((hyp.p)^classiz + (1 - hyp.p)^classiz))^10   # Eqn. 3.2.1 with c = 10
        prmat2 <- rbind(prmat2, c(hyp.p, classiz, prob.p))
    }
}
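For instance, an illustrative query of the resulting matrix, whose columns are the hypothesized p, the class size, and the probability of a usable set:

# probability of a usable set of 10 proportions when n = 20 and p = 0.5
prmat2[prmat2[, 2] == 20 & prmat2[, 1] == 0.5, 3]    # close to 1, in line with Diagram 3.2.2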


Appendix B Splus code used to establish true
critical regions of test statistics,
given n, c and p
typelerr.f <- function(n, pr, numclass) {
    For(times = 1:4000, {
        if(times == 1) {
            testmat <- numeric()          # rows of c(traditional statistic, GLRT statistic)
        }
        phat <- rbinom(length(n), n, pr)/n      # generate the c observed proportions
        pmle <- sum(phat)/length(n)             # MLE of the common p under the null
        lnpmle <- log(pmle)
        lnlmmle <- log(1 - pmle)
        sum1 <- sum(n * phat * lnpmle)
        sum2 <- sum(n * (1 - phat) * lnlmmle)
        sum3 <- (-1) * sum((n * phat)[phat != 0] * log(phat[phat != 0]))
        sum4 <- (-1) * sum((n * (1 - phat))[phat != 1] * log(1 - phat[phat != 1]))
        glrt <- (sum1 + sum2 + sum3 + sum4) * (-2)
        tradtest <- NA                          # remains NA when any phat equals 0 or 1
        if(sum(bad <- phat == 1 | phat == 0) == 0) {
            w <- n/(phat * (1 - phat))
            sumwts <- sum(w)
            pplus <- sum(w * phat)/sumwts
            tradtest <- sum((phat - pplus) * (phat - pplus) * w)
        }
        testmat <- rbind(testmat, c(tradtest, glrt))
    })
    degfree <- numclass - 1
    ok1 <- !is.na(testmat[, 1])
    sumtrad <- sum(testmat[ok1, 1] >= qchisq(.95, degfree))
    sumglrt <- sum(testmat[, 2] >= qchisq(.95, degfree))
    pchiglrt <- sumglrt/4000
    pchitrad <- sumtrad/(4000 - sum(!ok1))
    crittrad <- quantile(testmat[, 1], probs = c(.95), na.rm = T)
    critglrt <- quantile(testmat[, 2], probs = c(.95), na.rm = T)
    c(crittrad, critglrt, qchisq(.95, degfree), pchitrad, pchiglrt,
      sum(!ok1), pr[1], n[1], numclass)
}
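A hypothetical call, for illustration only, corresponding to one cell of the simulation grid (n = 30 trials per class, c = 25 classes, hypothesized p = 0.2):

typelerr.f(n = rep(30, 25), pr = rep(0.2, 25), numclass = 25)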


Appendix C Splus code used to establish power of
two tests, given true critical regions
powerdev.f <- function(alpha, n, pr, pwrdev, chisum, tradsum, devation, hypothp) {
    # look up the simulated (true) critical values for this (c, n, p) combination;
    # crtr and crglr hold the critical values produced by the Appendix B simulations
    crtrad <- crtr[round(length(n)/5), round(n[1]/5), round(10 * pr[2] - 1)]
    crglrt <- crglr[round(length(n)/5), round(n[1]/5), round(10 * pr[2] - 1)]
    For(times = 1:4000, {
        if(times == 1) {
            pwrmat <- numeric()
        }
        phat <- rbinom(length(n), n, pr)/n    # pr carries the (possibly inhomogeneous) class probabilities
        pmle <- sum(phat)/length(n)
        lnpmle <- log(pmle)
        lnlmmle <- log(1 - pmle)
        sum1 <- sum(n * phat * lnpmle)
        sum2 <- sum(n * (1 - phat) * lnlmmle)
        sum3 <- (-1) * sum((n * phat)[phat != 0] * log(phat[phat != 0]))
        sum4 <- (-1) * sum((n * (1 - phat))[phat != 1] * log(1 - phat[phat != 1]))
        glrt <- (sum1 + sum2 + sum3 + sum4) * (-2)
        tradtest <- NA
        if(sum(bad <- phat == 1 | phat == 0) == 0) {
            w <- n/(phat * (1 - phat))
            sumwts <- sum(w)
            pplus <- sum(w * phat)/sumwts
            tradtest <- sum((phat - pplus) * (phat - pplus) * w)
        }
        pwrmat <- rbind(pwrmat, c(tradtest, glrt))
    })
    degfree <- length(n) - 1
    ok1 <- !is.na(pwrmat[, 1])
    sumtrad <- sum(pwrmat[ok1, 1] >= crtrad)       # compare to the simulated critical values
    sumglrt <- sum(pwrmat[, 2] >= crglrt)
    pchiglrt <- sumglrt/4000
    pchitrad <- sumtrad/(4000 - sum(!ok1))
    c(crtrad, crglrt, qchisq(0.95, degfree), pchitrad, pchiglrt,
      sum(!ok1), pr[2], n[1], length(n), devation)
}
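A hypothetical call, for illustration only; it assumes the critical-value arrays crtr and crglr from the Appendix B simulations already exist, and here the fifth class deviates from p = 0.2 by 0.3:

powerdev.f(alpha = 0.05, n = rep(10, 5), pr = c(rep(0.2, 4), 0.5),
           pwrdev = NULL, chisum = NULL, tradsum = NULL,
           devation = 0.3, hypothp = 0.2)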


Appendix D Splus code used to establish
distribution of Bayesian-adjusted test
statistics
bayes210.f <- function(n, pr, numclass) {
    For(times = 1:4000, {
        if(times == 1) {
            bayes210 <- numeric()
        }
        ophat <- rbinom(numclass, n, pr)/n              # observed proportions
        phat <- (1/(2 + n)) + (n/(2 + n)) * ophat       # Bayesian-adjusted proportions (a = b = 1)
        tradtest <- NA
        w <- n/(phat * (1 - phat))                      # weights built from the adjusted proportions
        sumwts <- sum(w)
        pplus <- sum(w * ophat)/sumwts
        tradtest <- sum((ophat - pplus) * (ophat - pplus) * w)
        bayes210 <- rbind(bayes210, tradtest)
    })
    degfree <- numclass - 1
    ok1 <- !is.na(bayes210)
    sumtrad <- sum(bayes210[ok1] >= qchisq(.95, degfree))
    pchitrad <- sumtrad/(4000 - sum(!ok1))
    crittrad <- quantile(bayes210, probs = c(.95), na.rm = T)
    c(crittrad, qchisq(.95, degfree), pchitrad, sum(!ok1), pr[1], n[1], numclass)
}
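A hypothetical call, for illustration only, for the parameter combination (n = 10, c = 10, p = 0.2):

bayes210.f(n = 10, pr = 0.2, numclass = 10)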


Appendix E
Critical Values for Generalized Likelihood Ratio Test for
Homogeneity of Proportions (χ²crit), P[χ² ≥ χ²crit] = 0.05

                                        Number of Classes (c)
  p     Size of Each Class (n)       5         10        15        20        25
 0.2            10               10.5532   18.8285   26.5557   33.4652   40.3242
                15               10.7294   18.6883   26.1684   33.3299   39.7710
                20               10.4266   18.0017   25.8823   32.1944   38.9742
                25                9.8970   18.0641   24.7155   31.8633   38.6316
                30                9.8148   17.9903   24.3759   31.4190   37.5747
 0.3            10               10.5264   18.5832   26.7902   33.2966   40.1662
                15               10.4127   18.2024   25.5416   32.3912   38.2954
                20               10.2386   18.0654   23.8820   30.9469   37.7438
                25               10.0308   17.9569   24.7506   31.4586   38.0930
                30                9.6962   17.6057   24.1002   31.0309   36.6498
 0.4            10               10.6821   18.2016   25.4461   32.8585   39.0980
                15                9.8004   17.4470   24.9343   31.6513   38.1341
                20                9.6003   16.7973   23.2409   29.7820   36.3268
                25                9.3750   16.8195   23.3728   29.8742   35.8151
                30                9.2906   17.0800   23.9814   29.6477   35.6353
 0.5            10               10.5346   18.2338   25.2350   32.3394   39.1347
                15                9.3004   16.2619   23.5884   29.3358   35.8746
                20                9.2580   16.3133   22.9261   29.2166   35.2785
                25                9.3183   16.3440   22.9175   30.1496   34.9830
                30                9.1651   16.2428   23.1305   28.4271   35.4041
 0.6            10               10.7047   18.3665   25.6470   33.1218   39.4052
                15               10.1305   17.7842   24.5145   31.1136   37.6647
                20                9.8078   17.1956   23.6380   30.8959   36.4060
                25                9.6516   17.2464   23.4157   30.2821   36.7025
                30                9.8854   17.1970   23.8971   30.1222   36.7249
 0.7            10               10.5355   18.5832   25.8818   33.3677   40.0863
                15                9.9993   18.0193   25.3698   31.4183   38.652$
                20               10.4775   17.6339   24.2437   31.3704   38.3753
                25                9.639J   17.3387   24.9628   30.6490   37.5943
                30               10.0031   17.7484   24.2586   30.0357   36.7276
 0.8            10               10.52641  18.7236   26.5587   33.8362   40.5709
                15               10.6803   18.8908   26.2571   33.0036   39.5356
                20               10.0553   18.1507   25.2396   32.5497   39.1177
                25               10.1826   18.0175   25.1527   31.5435   37.9807
                30                9.9308   17.5195   24.4648   31.1350   37.9958


Bibliography
Bardsley, W. E. "An algorithm for calculating an upper bound to χ² for testing binomial homogeneity with some observations uncategorized." Computers and Geosciences, Volume 6 (1980); pp. 315-319.
Beyer, William H. CRC handbook of tables for probability and statistics. Cleveland, Ohio: Chemical Rubber Company; 1966.
Fleiss, J. L. Statistical methods for rates and proportions. New York, New York: John Wiley and Sons; 1973.
Green, Larry, et al. "How representative of typical practice are practice-based networks?" Arch. Family Medicine, Volume 2 (September 1993); pp. 939-949.
Hedges, L.; Olkin, I. Statistical methods for meta-analysis. San Diego, California: Academic Press, Inc.; 1985.
Larsen, Richard J.; Marx, Morris L. Introduction to mathematical statistics and its applications. Englewood Cliffs, New Jersey: Prentice Hall; 1986.
Mood, A. M.; Graybill, F. A.; Boes, D. C. Introduction to the theory of statistics. McGraw-Hill; 1974.
Potthoff, R. F.; Whittinghill, M. "Testing for homogeneity." Biometrika, Volume 53 (1966); pp. 167-182.
Rice, John A. Mathematical statistics and data analysis. Belmont, California: International Thomson Publishing; 1995.
Snedecor, G. W.; Cochran, W. G. Statistical methods. Ames, Iowa: Iowa State University Press; 1967.
62


Full Text

PAGE 1

COMPARING THE GENERALIZED LIKELlliOOD RATIO AND V ARlANCE TESTS FOR BINOMIAL HOMOGENEITY WITH OBSERVATION OF BAYESIAN ADWSTMENT by Alissa Ruelle B.S., University of Michigan, 1993 A thesis submitted to the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Master of Science Applied Mathematics 1996

PAGE 2

1996 by Alissa Ruelle All rights reserved.

PAGE 3

This thesis for the Master of Science degree by Alissa Ruelle has been approved by

PAGE 4

Ruelle, Alissa (M.S., Applied Mathematics) Comparing the Generalized Likelihood Ratio and Variance Tests for Binomial Homogeneity with Observation of Bayesian Adjustment Thesis directed by Assistant Professor James Koehler ABSTRACT Ho: P1 =p2 = =pc =p HA: pi* pi for some i*j Two testing procedures for homogeneity of c proportions are used commonly: the generalized likelihood ratio and the variance tests for binomial homogeneity. Both testing procedures produce statistics that asymptotically follow a x2distribution, with c-1 degrees of freedom. The asymptotic case, i.e. having a large number ofBernoulli trials (n) behind each proportion, poses few problems, if any, for both testing procedures. With small n, though, problems exist with respect to assuming that the test statistics follow the appropriate x2 distribution. Using Splus, proportions were generated and tested, using both testing procedures, to estimate the test statistics' true distributions. The power of each test was established with these generated distributions and their respective critical regions. Comparison of the power functions followed. IV

PAGE 5

In addition to having problems with small n, the variance test cannot be used when one or more proportions equal I or 0. As a plausible remedy, an adjustment to the variance test was proposed, incorporating Bayesian mathematics. This proposed testing procedure was compared back to the generalized likelihood ratio and variance tests. This abstract accurately represents the content of the candidate's thesis. I recommend its publication. Signed v

PAGE 6

In gratitude to Danielle, whose spirit offered me the energy, to Karen, whose constant inquiry gave the energy direction, and to Mom, who provided everything else. VI

PAGE 7

CONTENTS Chapter Introduction ........................................................................................................... ... 1 1. 1.1 Introduction: The Generalized Likelihood Ratio and Traditional Tests........ ....... 1 1.2 A Bayesian Approach ..................................................................................... ...4 2. The Generalized Likelihood Ratio Test .......................................... ... 6 2.1 2.2 The Test Statistic and Parameter Spaces ofHypotheses ......................................... 6 Type 1 Error. ................................................................................................. ... 8 3. The Traditional Test................................................................................................... .20 3.1 The Test Statistic............................................................................................ ..... 20 3.2 Assessing the Probability of Failure of the Traditional Test Statistic............. . .21 3.3 Type 1 Error. .......................................................................................................... 24 4. Power Comparison ............................................................................ ,............. ... 32 4.1 4.2 4.3 Power Comparison: Sufficiency ........................................................... Power Comparison: Type I Error ....................................................... The Final Power Comparison: Results ...................................... .... .32 ..34 ..... 34 5. Bayesian-Adjustment Approach ........................................... ..... ..42 5.1 5.2 5.3 Bayesian Theory ................................................................ Results ........................................................................................... Power Comparison with GLRT and Traditional ................................. 6. Conclusion ............................. 6.1 Remarks ......................... vii .. ..42 ... ..45 .52 ..54 ..... .54

PAGE 8

Appendix A Splus code used to assess probability of useful set often proportions, given nand p .. 55 B. Splus code used to establish true critical regions of test statistics, given n, c, p.. ..56 C. Splus code used to establish power of two tests, given true critical regions.. .......... 57 D. Splus code used to establish distribution ofBayesian-adjusted test statistics. .... 59 E. Critical values for generalized likelihood ratio test for homogeneity of proportions ... 60 Bibliography ............................................................................................... ................. ........ 62 viii

PAGE 9

DIAGRAMS Diagram 2.2.1 Distribution ofGLRT statistics (n=30,c=25,p=0.2)..................................... ..9 2.2.2 2.2.3 2.2.4 2.2.5 2.2.6 2.2.7 2.2.8 Distribution of GLRT statistics (n=30,c=25,p=0.5) .................... ..9 Distribution ofGLRT statistics (n=30,c=25,p=0.8).................. ................ .. ....... I 0 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.2) ....................................................... 10 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.5) .................................. Q-Q plot for GLRT Statistics (n=30,c=25,p=0.8) ............................ Distribution ofGLRT statistics (n=IO,c=25,p=0.8) ........................................... .. Distribution ofGLRT statistics (n=IO,c=IO,p=0.8) .......................................... ..II .. ... 11 ..... 12 ...... 12 2.2.9 Cumulative distribution ofGLRT statistic (n=30,c=25,p=0.8) .................................... 14 2.2.1 0 2.2.ll GLRT critical values c=lO .......................................................................... .. GLRT critical values c=l5 ..................................................... .. ... 16 .. .. 16 2.2.12 Modeling GLRT critical values: residuals vs. fitted ................................................... 17 2.2.13 Modeling GLRT critical values: Q-Q norm. plot... .................................................... 17 2.2.14 Modeling GLRT critical values: fitted vs. observed ...................... .. ...... 18 3.2.1 3.2.2 3.3.1 3.3.2 3.3.3 3.3.4 Probability of ability to use traditional test c= 1 0 .................................................. ..... 23 Distribution oftrad. Test statistics (n=30,c=25,p=0.5) ... Traditional critical values c=l 0 ................................................................................... 29 IX

PAGE 10

Diagram (continued) 3.3.5 4.3.1 4.3.2 Traditional critical values c= I 5 ...................... Power comparison (n= 1 O,c=5,p=0.2) ..................................... Power comparison (n=I O,c=5,p=0.4) ............................. 4.3.3 Power comparison (n=10,c=5,p=0.5) .......................... 4.3.4 Power comparison (n=10,c=5,p=0.6) ........................................ .. 4.3.5 Power comparison (n=30,c=5,p=0.2) ..................................................... 4.3.6 Power comparison (n=30,c=5,p=0.4) ............................................... 4.3.7 Power comparison (n=30,c=5,p=0.5) ....................................... .. 4.3.8 Power comparison (n=30,c=5,p=0.6) ............................. .. 4.3.9 Power comparison (n=30,c=25,p=0.2) .................. .. 4.3.10 Power comparison (n=30,c=25,p=0.4) .. .. 4.3.11 Power comparison (n=30,c=25,p=0.5).............. .. ............................ .. 4.3.12 Power comparison (n=30,c=25,p=0.6) .................................. .. 5.2.1 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.2) .... Distribution ofBayesian-adjusted test statistics (n=30,c=25,p=0.5) .... ......... 29 ....... .35 .... 36 ... 36 ....... .37 .... 38 ............. 38 ...... .39 ............ .39 ..... .40 .. .... .40 ..41 .. ........... .41 ..46 .. .. .47 5.2.2 5.2.3 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.8)............... .. ...... .47 5 .2.4 Distribution of Bayesian-adjusted test statistics (n=1 O,c= 1 O,p=O.l ).... ..48 5.2.5 Distribution of Bayesian-adjusted test statistics (n= I O,c= I O,p=0.2).... .. .... .49 5 .2. 6 Distribution of Bayesian-adjusted test statistics (n= I 0 ,c= 10 ,p=O .3).......... .... .. ....... 4 9 5.2.7 Distribution of Bayesian-adjusted test statistics (n=!O,c=10,p=0.5)............. .. .... .49 5.2.8 Distribution ofBayesian-adjusted test statistics (n=IO,c=IO,p=0.7)........ .. ...... 50 5.2.9 Distribution of Bayesian-adjusted test statistics (n=IO,c=IO,p=0.8)......... .. .... 50 X

PAGE 11

Diagram (continued) 5.2.1 0 Distribution ofBayesiau-adjusted test statistics (n=10,c=1 O,p=0.9).. 5.2.11 Distribution of traditional test statistics (n=10,c=10,p=0.2) .. 5.2.11 5.3.1 Distribution of traditional test statistics (n=10,c=10,p=0.8). Power comparison (n=10,c=5,p=0.2) ................................................. xi ....... 50 ....... .51 ....... 51 .. 53

PAGE 12

Table 1.1 3.3 TABLES ASPN: Referral care rates.................................................................. ..... 2 Critical region for traditional chi-squared test for homogeneity of proportions .... 27 xu

PAGE 13

AC](NOWLEDGMENIS Dr. Jim Koehler for being my dedicated, sincerely interested, and caring educator, mentor, and friend. Dr. Karen Kafadar and Dr. Ram Shanmugarn for participating on my graduate committee and for their timely review of my thesis, followed by interesting and educating inquify. Biostatistics Division of AMC Cancer Research Center for providing employment and invaluable experience throughout my graduate studies. Ambulatory Sentinel Practice Network, specifically Jim Werner, for supplying data to analyze. Fellow students/friends at University of Colorado at Denver for offering relentless support dtrring my research and at all other times. X!ll

PAGE 14

1. Introduction 1.1 Introduction: The Generalized Likelihood Ratio and Traditional Tests Two testing procedures for homogeneity of proportions are used commonly: the generalized likelihood ratio test (GLRT) and the "traditional" testing procedure (Fleiss, 1973; Hedges and Olkin, 1985). This "traditional" test is of similar format to a test termed the "variance test of binomial homogeneity" (Snedecor and Cochran, 1967). The "traditional" testing procedure appeals to the asymptotic nature of sum of squared standard normal variates, which is Chi-squared, with degrees of freedom equal to the number of proportions minus one; the GLRT utilizes the ratio of the joint likelihood functions for the sample of proportions under the null and alternative hypotheses' parameter spaces. This is also, asymptotically, a Chi-squared variate with degrees of freedom again equal to the number of proportions minus 1. The performance of both tests is dependent upon 1) n, the size of the classes from whence the proportions come, 2) p, the expected value of the hypothesized distribution of the proportions, and 3) c, the number of proportions at hand. An example of the necessity for a homogeneity test for proportions was found in the Ambulatory Sentinel Practice Network (ASPN) database. ASPN's question concerned referral rates for follow-up care, the percentage of times that patients were referred to a specialist after being treated at an emergency care level. To test the homogeneity of referral care rates, "classes" must be established. As case-mixing most likely existed within each individual practice, (as a hypothetical example, within Practice H, 74% of the emergency care patients were treated for respiratory complications; at Practice K, only 13%) it was necessary to designate one type of patient, ailing from a particular illness. After type of patient was 1

PAGE 15

selected, the class (practice) was used, thus a referral rate was calculated for each medical practice. A listing of referral rates for ailments identified as respiratory complications follow in Table I. I. The data of Table 1.1 will be sununoned throughout this document. Table 1.1 ASPN: Referral Care Rates P=Practice code N=Number of Patient R=Referral Rate p N RR p N RR 8 0.875 20 0.900 6 0 .833 48 9 1.000 3 8 0.250 11 1.000 49 55 0.818 4 20 0.900 23 0.913 50 4 1.000 5 11 0.818 16 0.563 51 29 0. 897 6 1 1.000 23 0.957 11 0.909 7 6 0.833 16 1.000 6 0.500 8 10 0.800 15 0.667 6 0.167 9 6 0.667 7 0.286 4 0. 7 50 10 15 1. 000 15 0.933 19 0.842 11 6 1. 000 23 0.957 4 1.000 15 10 31 0.710 13 0. 7 69 16 22 0.864 2 0.000 6 0.833 17 8 0.500 11 0.636 5 0.600 18 40 0.825 25 0.800 8 0.875 19 21 0.952 2 1.000 17 0. 7 64 20 0.500 17 0.941 5 1.000 21 13 0. 7 69 7 0.857 5 0.800 40 0.425 20 0.950 23 9 0.667 9 0.556 2

PAGE 16

Using the traditional test for homogeneity of proportions, we calculate for each class a proportion, followed by an additional formulation of a representative proportion for the c classes, which is a weighted average of the individual proportions, using weights inversely proportional to the variance of the proportion. The characteristics of the binomial distribution lead us to the weighting schema used in the traditional test statistic, and it will be shown in Chapter 3 that when the observed proportions near 0 or l, the weights tend towards infinity. In fact, the traditional test statistic falters when we utilize some specific combinations of n, p, and c. Of interest is how the traditional test statistic behaves when within a region of usable n, p, and c. Since each patient in each class can be viewed as a Bernoulli trial, each with hypothesized parameter p,, each proportion is a function of a binomial variate, with parameters ( n,, p). With this traditional testing mechanism, we test the hypotheses, assuming c classes: This same hypothesis is also tested through the generalized likelihood ratio test, which utilizes the joint distribution of the observed proportions, pis, under the parameter spaces of the null and alternative hypotheses. Unbiased and attaining minimum variance, the best estimator for an individual hypothesized P, is the observed proportion of Bernoulli "successes," which is the maximum likelihood estimator,: Pi Number of successful Bernoulli trials of Class i Number of Bernoulli trials of Class i Nwnber of Patients Referred to a Specialist, within Practice i Nwnber of Patients Treated By Practice i 3

PAGE 17

It will be shown that the traditional and generalized likelihood ratio tests are not of the same size, when using critical regions of the hypothesized Chi-squared distribution. To acconunodate for the inequality of size using those hypothesized regions, the estimated true critical regions of the traditional test statistic, for particular combinations of number of Bernoulli trials per class (n), hypothesized expected value of the set of pis (p ), and the number of classes (c), will be provided. These estimated true critical regions are obtained via the empirical distribution via simulation and are used through any hypothesis testing within the thesis research. In further analysis of these two testing procedures, in Chapter 4 it is shown that both test statistics are dependent on sufficient statistics. Using estimated critical values, the Type l errors were forced to be equal for the purpose of power comparison. For the power comparisons, inhomogeneity was forced into simulated proportions, with the proportions as realizations from differing distributions, and power of each test statistic is established. Power is graphed as a function of the deviation from the null hypothesis, per testing procedure, granting pictorial comparison. 1,2 A Bayesian Approach In Chapter 5, we entertain a totally different avenue in examining smaller and larger proportions, as with proportions equal to 0 or 1 we are unable to test homogeneity adequately by the traditional squared test. We take a look at our ability to "tweak" these problematic proportions with an amount basically detennined by the size of the sample the proportion describes. This "tweaking" alleviates tl1e problem offailure of the traditional test, as we have no weights approaching infinity. Bayesian theory is used, recognizing that the hypothesized p could be considered a variate from its own distribution, called a "prior distribution." 4

PAGE 18

An estimate for the critical region for this Bayesian-adjusted test statistic is formulated, using procedures similar to that used in Section I. I. Estimation of power, again graphed as a function of the deviation from the null hypothesis, followed, and was compared back to that of the traditional test and GLRT. 5

PAGE 19

2. The Generalized Likelihood Ratio Test 2.1 The Test Statistic and Parameter Spaces of Hypotheses In testing for homogeneity of proportions, we test the null hypothesis H,: p1 = p2 = ... = p, = p against the alternative HA: P; pi for some i j. The distribution of an observed pi is derived from the distribution of the number of successful Bernoulli trials within the ith class, termed X;. The distribution of X; is binomial with parameters n; and P; We write the cumulative distribution function of X as P(Xs x) = Fx(x), so utilizing change of variables, we see that the cumulative distribution function of P; is: --X--FP(p;) = P(ps p;) = P( n" P;) = P(X s n; *P;l = Fx(n;*P;l By the chain rule, we can write and the likelihood function of the individual hypothesized proportion is then "( .) = *( n; ) n;P;(l .)";(1-p;) P1 nl P1 P, 11i*Pi Eqn. 2.1.1 Eqn. 2.1.2 Eqn. 2.1.3 The generalized likelihood ratio test utilizes the distributions of the individual P; sto form likelihood functions over the two parameters spaces 6

PAGE 20

w = { 0 ,; p1 = p2 = ... = p, = p,; I Q = { 0 < P; < I i = I ... ,c } hypothesized and with no constraints, respectively. The generalized likelihood test statistic (Larsen and Marx, 1986), using the joint likelihood function of the individual hypothesized P; s, is thus multiplied by the natural logarithm of: ). = supw"'(Jljji) sup0 "'Cllill) Eqn. 2.1.4 where pMLE is the maximum likelihood estimator for p, which can be calculated easily through maximizing the natural log of the joint distribution of the Pis. The estimator plvlLE is exactly the number of Bernoulli successes divided hy the number of Bernoulli trials, over all c classes. Eqn. 21.5 This statistic is a function of the individual j)1 s, weighted by their ccmesponding sample sizes, and is simply the proportion of successful Bemou!li trials over all c classes. With it, we formulated arepresentative statistic that attains small variance, and it can be checked that it attains the CramerwRao Lower Bound (CRLB) for variance (n,=n, i =I ,2, ... ,c): ( p" p2 ... P,) fp(P) = n fx(np) 7

PAGE 21

where X is a binomial variate representing the number of Bernoulli successes withm a class. Thus fP(p) can be expressed as n( n) p"P(J-p)n(l-p)-J:..[lnf. (p;p)] = (p-p) np ap P p(l-p) n The CRLB can thus be written as p(l-p) n Which is exactly the estimated variance of the j)1s, under the null hypothesis. 2.2 Type l Error The code of Appendix B generated 4,000 GLRT statistics. This simulation was perfonned for 175 combinations of p, n, and c_ Using the simulation for the parameter combinations (n=30, c=25, p=[0.2,0.5,0.8]), we witness that GLRT statistics do follow the assumed x' distribution with 24 degrees of freedom whenp=0.2, 0.5, and 0.8. Pictorially, the distribution of the GLRT statistics, for each parameter combination, is demonstrated by the following histograms, representing the probability density function. [Diagrams 2.2.1.2.3] Superimposed on the histograms is the Chi-squared probability density function, with 24 degrees of freedom, the asymptotic distribution of test statistic. 8

PAGE 22

Diagram 2.2.1: Distribution of GLRT Statistics (n=30,c=25,p=0.2) ci "' 0 ci c 0 c ?.-
PAGE 23

Diagram 2.2.3: Distribution of GLRT Statistics (n=30,c=25,p=0.8) ci 0 c 0 Cl :0 .0 0 e ci 0. 11 0 0 ci 20 40 60 GLAT When comparing the quantiles of the above distributions to those of the x' 24' we see that the GLRT produces test statistics that deviate slightly x'"' but it seems that tl1ex224 fits the GLRT statistics' distribution quite well, in accordance with the empirical p.d.f. results. See Diagrams 2.2.42.2.6. 0 <0 lil 0 "' ,._ a: ...J 0 "' @ Diagram 2.2.4: Q-Q Plot for GLRT Statistics (n=30,e=25,p=0.2) :------------------.. ----10 20 30 40 Chl-sQ(24) Ouanti!es 10 ,.' ... ... .. 50

PAGE 24

0 v Diagram 2.2.5: QQ-Piot for GLRT Statistics (n=30,c=25,p=0.5) ... ....... ,' ...... .. .. ---10 20 30 40 50 Chi-sq(24) Quantiles Diagram 2.2.6: QQ-Piot for GLRT Statistics (n=30,c=25,p=O.B) .. ... ... .. ........... ...... .... 10 20 30 40 50 Chi-sq(24) Quantiles Also simulated was the distribution for parameter combination (n=l O,c=25,p=0.8). Varying the size of the class from 25 to 1 0 yielded an empirical p.d.f to the right of x' 24. Thus as we move into 11 r i

PAGE 25

smaller-sized classes (smaller n), the distribution appears to deviate from the assumed distribution. See Diagrams 2.2.7 and 2.2.8. This directly reflects that it is with a:.J1!1ptotic nature, i.e. as n grows large, tl1at the GLRT produces statistics that are chi-squared. Diagram 2.2.7: Distribution of GLRT Statistics (n=10,c=25,p=0.8) 10 20 30 40 50 60 Diagram 2.2.8: Distribution of GLRT Statistics (n=10,c=10,p=0.8) 0 ;? 0 "' 0 0 0
PAGE 26

The simulated distributions from parameter combinations (n=JO,c=IO,p=[0.2,0.5,0.8]) provided evidence that the closer pis to 0.5, the less deviation from x',. The small deviation from the assumed distribution, when p=0.8, is thus similar to the observed deviation of parameter combination (n= 1 O,c=25,p=0.8). Decreasing the number of classes, c, and holding n and p constant, had little effect on the distribution of the test statistics. From these small pieces of pictorial information, we could draw a loose conclusion that the GLRT produces test statistics that are drawn from a distribution similar to x',_1 if the parameter combinations involve large n. Altering the number of classes (the number of proportions to be tested) created little to no difference in the test statistic's deviation in distribution from tl1e asymptotic x',_1 But most important to note is that the GLRT performs quite well, with respect to maintaining and x2-like distribution, except for small n. To move beyond the pictorial representation of the distributions' difference, using as an example the distribution simulated with (n=30,c=25,p=0.8), the distribution-free test statistic presented by the Kolmogorov-Smirnov Goodness-of-Fit test was used to test if the GLRT statistics follow a x' 24. Theorem 1 (Mood et al., p.508) LetX1 X,, ... Xn, ... be independent identically distributed random variables having common continuous c. d. f. Fx() =F(). Define Dn =dn(X, ... ,Xn) = sup IFn(x) -F(x)l -
PAGE 27

As discussed by Mood et a!., the above equation "does not depend on the c.d.f. from which the sample was drawn (other than it being continuous); that is the limiting distribution of {riD, is distribution-free_" As an illustration of the test statistics used in the Kolmogorov-Smirnov Goodness-of-Fit test (KSGFT), the following diagrams of the empirical cumulative distribution function ( c.d.f.) of the simulated GLRT and the c.d.f. of x\4 are provided in Diagram 2.2.9. The maximal vertical distance between the two aforementioned c.d.f.s, multiplied by {ri, is the KSGFT statistic. This KSGFT statistic is compared with the "Critical Values for the Kolmogorov-Smirnov Test of Goodness of Fit" table of the Handbook of Tables for Probability and Statistics, 1966. Diagram 2.2.9: Cumulative Distribution of GLRT Statistic (n=3D,e=25,p=0.8) ... .. . m .. / 0 .. .. .. // 0 .... E 0 .. 0 0 0 20 40 60 Calculation of the above K-S test statistic involved sorting the 4,000 GRLT statistics to detennine the empirical cumulative probability distribution function. For example, the l32nd order 14

PAGE 28

statistic would yield, for the 4,000 simulated GRLT statistics, a cumulative probability of 132/4,000 = 0.033. Using the l32nd variate of the sorted GLRT statistics, let's call it GRLT132, Fx( GLRT132)is calculated by referring to the cumulative distribution function of the hypothesized x' 24 distribution. The absolute difference is taken over all variates, i.e. I Fx(GLRT132) -0.033!, multiplied by the square root of the number of variates used, and a supremum is obtained. The S-Plus code that generated these cumulative probabilities can be found in Appendix C. The KSGFT value was 2.44 for the GLR T. This KSGFT value led to rejection ofthe null hypothesis at the 0.05 significance level, but the magnitude of the test statistic was most likely detennined by the size of the sample, 4,000. The estimate of the tme distribution yielded a critical region, a region that was used to conclude rejection or failure of rejection of the null hypotheses. With establishing this estimate of the critical region, it is safely stated that the test is of size n=0.05. This is of great importance when comparing the power of two testing procedures, as will be discussed in Chapter 4. The estimated tme critical value is the 95th quantile of the test statistics, per combination ofn,p, and c. (See Appendix B for S-Plus code used) Thus, the probability of rejecting the hypothesis of homogeneity, given that there is actually homogeneity, is 0.05. Diagrams 2.2.1 0 and 2.2.11 display the critical values ofthe estimated true distribution, graphed as a function ofn andp, holding c constant. Using diagrams such as these, an indication of some quadratic nature to the response surface was viewed. To try to explain the variability of the critical values with covariates n, c, and p, a polynomial regression of degree two was used as a first step, using orthogonal polynomials to minimize multicollinemity. Any error with the model is a combination oflack-ofCfit and simulation noise and is thus an estimate of the true variability of the critical values, conservatively. A quadratic model for the response surface of the GLR T' s critical values proved to be quite -sufficient. 15

PAGE 29

"' .. qt9 Diagram 2.2.1 0: GLRT Critical Values c=1 0 ... ... ... ... ... ... ... ... ... ... ... ... ... Diagram 2.2.11: GLRT Critical Values c=15 ... .... ... .... ... .... ... ... ... .... .... 16 ;. . ... ... ... ... .. ..-; .... ::)

PAGE 30

Diagram 2.2.12: Modeling GLRT Critical Values: Residuals Vs. Fitted . ... \ . ci .. : .. : : .. ... ... .. . .. . . .. : .. . 0 0 ... : . E 15 . "' : 9 0 .. : .. m .. a: ., "1 ., 'l' 10 15 20 25 30 35 40 Fitted Diagram 2.2.13: Modeling GLRT Critical Values: Q-Q Norm. Plot 0 ci 0 0 ci ::; 15 -@ ., 0 9 / " ID a: .. ., .. "1 ., 0 <)i 0 2 Quantiles of Normai(O, 1) 17

PAGE 31

Diagram 2.2.14: Modeling GLRT Critical Values: Fitted vs. Observed 'f "' "' .. "0 g 0 E g "' "' !l' 1l 0 "' "' ;! 10 15 20 25 30 35 40 Observed Crii.Va!ue Yielding no strong evidence of non-constant variance [Diagram 2.2.12] and having distribution only slightly skewed right from N(O, I) [Diagram 2.2.13], the errors provided more evidence tl1at the model was justified. The MSE of the polynomial regression was 0.5378. The precision of the prediction equation for the critical values can be witnessed through the graph of the fitted values versus the observed values. [Diagram 2.2.14] Since the model of the GLRT critical values fits so well, the MSE can be seen as simulation noise, conservatively. This margin of error can be established by moving two standard deviations away from the observed critical value. Moving two standard deviations away from the critical value would yield an approximate probability of 95% of encompassing the true critical value. It is suggested at tl1is point to use an upper bound of this estimating window to remain conservative in a hypothesis test. In conclusion, each simulated critical value should estimate the true critical value 1.5. 18

PAGE 32

The polynomial regression used for this model is: Y = 8.09417.972p + 17.829p2 -0.232n + 0.006n2 + 1.739c0.008c2 -0.006nc + 0.004p +n + 0.03p c + E A less conservative approach to estimating the simulation noise is to estimate the variance between critical values that should be the same value. Due to the symmetry of the GLRT across the hypothesized p, i.e. we would expect the same critical region for (n,c,p=0.2) and (n,c,p=0.8), we can utilize the two assumed similar critical values and detennine a variance between those two points. A 2 -sample variance can be established between the two numbers utilizing the function S 2 = '[ (Y,Y)2 i-1 where Yi s are the assumed similar critical values_ The expected value of S 2 is the variance of the two Yis, as we treated them as replicates. As expected from the limiting x2c-l dishibution, the GLRT critical values increase in size as the number of classes, c, increased. It was thus chosen to estimate the variance associated with the critical values, per each level of c. As we had 7 5 "replicated" critical values and 5 levels of c, 15 estimates for the each Class' variance were available: 15 degrees of freedom in estimating the true variance for each class. For each level, the 15 estimates for the true variance were averaged. These averages were used to obtain 95% confidence "windows,'' which is exactly plus or minus two standard deviations. These "windows" follow, designated by the level of c: c=5, .42; c= I 0, .43; c= 15, .52; c=20, .84; c=25, .71. Note that these are narrower confidence windows than those obtained through usage of the error of the polynomial model. See Appendix E for the GLRT critical values. 19

PAGE 33

3. The Traditional Test 3.1 The Test Statistic The traditional test for homogeneity of proportions involves a weighting schema using weights inversely proportional to the variance of the observed proportion. Hypothesized as a variate coming from a binomial distribution, we see that the observed propmtion P; has corresponding estimated variance where n, is the number of Bernoulli trials within the class (e.g. the number of emergency care patients treated by Practice i ) P; Number of successful Bernoulli trials of Class i Number of Bernoulli trials of Class i Number of Patients Referred to a Specialist, witlun Practice i Nllinber of Patients Treated By Practice i To obtain a representative statistic for the set of proportions, weights for the estimates for tl1e P; s are a function of : =_I_ n, w, cr2(j)) P; (1 P) Eqn. 3. Ll and the aforementioned representative proportion, utilizing these established weights, is thus: Eqn. 3.L2 20

PAGE 34

Using the computations above, we fonnulate the traditional test for homogeneity: 2 XTRAD Eqn. 3.!.3 which is asymptotically a Chi-squared variate, with c-1 degrees of freedom, as the n,s tend towards infinity. At casual observation, it should be seen that the closer the j), is to I or 0, the more weight it has in the representative statistic. With the example of the ASPN practices in Table I. I, the rate at which tl1e patients afflicted with respiratory problems were referred to pulmonary specialists within Practice 9 was 0.667, within Practice 4, the rate was 0.90. The representative propot1ion (pJ is determined more heavily by the referral rate of Practice 4 than by the rate of Practice 9, even if, which is not tl1e case with these selected two, the practices treat the same number of emergency care patients with this pru1icular condition. Snedecor and Cochran ( 1967) use the maximum likelihood estimator, derived in Section 2.1, instead of p+. Using the p+ described in Equation 3.1.2 minimizes Equation 3.!.3. Comparison between this testing procedure and that of Snedecor and Cochran's still needs to be performed. 3,2 Assessing the Probability of Failure of the Traditional Test Statistic The traditional test statistic cannot be when an observed p 1 is equal to 0 or I Observing the inverse relationship between the proposed weights and the estimated variances of the proportions, it is concluded that the weight corresponding to a small or large prop011ion tends to infinity, thus enlarging the 21

PAGE 35

Using the computations above, we formulate the traditional test for homogeneity : 2 t (p,: p+) il o(p,J Eqn. 3.1.3 which is asymptotically a Chi-squared variate, with c-1 degrees of freedom, as the n1s tend towards infinity. At casual observation, it should be seen that the closer the p1 is to 1 or 0, the more weight it has in the representative statistic. With the example of the ASPN practices in Table 1.1, the rate at which the patients afilicted with respiratory problems were referred to puhnonary specialists within Practice 9 was 0.667; within Practice 4, the rate was 0.90. The representative proportion (p.) is determined more heavily by the referral rate of Practice 4 than by the rate of Practice 9, even if, which is not the case with these selected two, the practices treat the same number of emergency care patients with this particular condition. Snedecor and Cochran (1967) present p. as the maximum likelihood estimator derived in Section 2.1. Using the p, described in Equation 3.1.2 minimizes Equation 3.1.3, thus providing smaller values of the traditional test statistic. This yields then a more conservative testing procedure than that of Snede.cor and Cochran's. 3.2 Assessing the Probability of Failure of the Traditional Test Statistic The traditional test statistic cannot be used when an observed p1 is equal to 0 or 1. Observing the inverse relationship between the proposed weights and the estimated variances of the proportions, it is concluded that the weight corresponding to a small or large proportion tends to infinity, thus enlarging the 21

PAGE 36

test statistic leading to a false rejection of the null hypothesis. Note: Since the referral rates of Table 1.1 include 0 and I, this traditional testing procedure may not be used. But how small, or large, can the hypothesized p be in order to properly use the traditional testing procedure? This question can only be answered after obtaining information about the otl1er two parameters n and c. What needs special consideration at this point is the region where the traditional test statistic can be utilized with low probability of failure, to compare to the GLRT statistic. The GLRT statistic falters nowhere, so we need to make sure, for comparison's sake, tl1at the traditional test statistic can be used as well. The probability associated with a useful combination of P;S, given the hypothesized p, tl1e number ofBemoulli trials per class, and the number of classes, is calculated by: P[Usefulcombinationofp;sl =II P[p; useful] i! =II I P [P; not useful] il = fi ] ( ll;) p 0 (1-p)n,-o ( ll;} p n, (1-p)ni-ni ""I 0 ni = II I (1p)"' p n, iJ Eqn. 3.2.1 An S-Plus program to compute this probability is given in Appendix A. This probability is graphed as a function ofn and pin Diagram 3.2.1, for c=! 0, where n; = n, i=I, .. ,IO. Easily identified by this surface is the region where we are "safe" to utilize the traditional test statistic without failure. This region is indicated by the flatness of the surface. Using the contour plot of Diagram 3.2.2, it can be viewed that the 22

PAGE 37

usage of 10 proportions coming from a binomial distribution ofh)pothesized p between .4 and .6, with each class' size of magnitude greater than 20, is warranted with risk <0.001 of failure. Diagram 3.2.1: Probability of Ability to Use Traditional Test C=10 Diagram 3.2.2: Contour Plot of Surface w ] 0 I 0 0 9 .9 10 20 30 40 50 Size of Class 23

PAGE 38

Diagram 3.2.2 displays the ineffectiveness of the traditional test statistic when the magnitude(mnnber of Bernoulli trials) of the average class is minimal (less than 10) and when we have a relatively !ow/high hypothesized number of successes per class. The issue of these problematic combinations of average class size and p will be addressed in Chapter 5, where a Bayesian modification to testing homogeneity of proportions is introduced. 3.3 Type 1 Error As a first glance at the behavior of the traditional test statistic's distribution, the parameter combination (n=30,c=25,p=0.8) was used. The S-Pius code of Appendix B generated 3,872 test statistics, formulated under the aforementioned parameter combination. [Note at this point that a simulation of 4,000 traditional test statistics was attempted. Through generation of25 binomial variates, per each iteration of the simulation, only 3,872 iterations had a useable set of variates. This was because a 0 or I was generated as one (or more) of the binomial variates, rendering the traditional test useless. See Section 3.2] For this parameter combination, the distribution's deviation from x' 24 is one oflonger-tailedness; the values observed in the traditional testing procedure are larger than what we would expect, assuming a x' 24 distribution. Diagram 3.3.3 displays the histogram representing tl1e empirical p.d.f of the traditional test statistic, for this specific parameter combination, and superimposed on it is the a;}'mptotic distribution, x',,. The distribution of the traditional test statistic was also performed for parameter combinations (n=30,c=25,p= 0.2, 0.5). Withp=0.2, we witnessed results similar to those ofp=0.8, which was to be expected because of the symmetric nature oftl1e traditional test statistic across the ps. Withp=0.5, the distribution of the test statistics followed x224 quite nicely. Diagrams 3.3. 1-3.3.3 display these observations. 24

PAGE 39

Diagram 3.3.1: Distribution of Trad. Test Statistics (n=30,c=25,p=0.2) 0 ci ro 0 0 <0 c 0 0 :a
PAGE 40

Diagram 3.3.3: Distribution of Trad. Test Statistics (n=30,c=25,p=0.8) 0 0 "' 0. 0 "' c 0. $ 0 0 :g "' -e 0. 0 a. N 0. 0 0. 0 20 40 60 Traditional To see how the traditional test statistic behaves for smaller n, the parameter combination (n=IO,c=25,p=0.8) was used to simulate, in attempt, 4,000 traditional test statistics, again using the S-Pius code of Appendix B. The simulated distribution was shifted left, drastically. from the hypothesized x'24. Thus, as we decreased n, the size of each class, the traditional test statistic's distribution deviated from the asymrtotic distribution greatly. Similar results followed, with deviation of the same magnitude and direction, with simulations using parameter combinations (n=IO,c=IO,p=0.2,0.8). Thus altering the number of classes had little effect on the behavior of the distribution of the traditional test statistic. Important to note is that with combination (n=l O,c=l O,p=0.5) the distribution was left-skewed from x'9 The skewness was in the opposite direction (right) than those distributions of (n= 1 O,c= 1 O,p=0.2,0.8), and it was more chi-squared-like. So, in conclusion, when we have smaller nand p close to 0.5, the traditional test statistic does behave somewhat like a x' 9 variate. 26

PAGE 41

Table3.3 p Size of E.ch Class (n) 0.2 10 15 20 25 30 0.3 10 15 20 25 30 0.4 10 15 20 25 30 0.5 10 15 20 25 : 30 0.6 10 15 20 25 30 Critical Values for Traditional Chi-Squared Test for HomogeneityofProportions (Xcrn2 ) P[x>x,/l =0.05 Number ofCl3SSes (c) 5 10 15 6.n3239 12.316631 17.626680 9.044598 15.824357 22.940475 10.555826 18.9ll261 27.256803 11.493894 21.372230 29.!82222 11.742557 23.168593 32.206702 9.797422 19.118012 26.593878 12.215306 22.105102 31.485198 12.47389 24.82931 31.98462 12.50450 24.01396 33.98946 11.18971 21.67130 30.68705 13.64638 24.71230 35.53848 12.63613 24.13758 34.86871 11.52461 20.90910 2924025 10.93307 19.89903 28.05386 10.53307 19.76447 27.48838 15.33199 27.40924 38.65350 11.75673 21.88062 31.26730 11.02656 19.91688 28.00401 10.59926 18.83105 26.62084 10.21207 18.31242 26.46031 13.79942 24.91524 34.59076 13.12923 24.82879 35.24545 11.80682 21.51856 30.39668 11.293TI 20.38845 28.09596 11.13045 19.70270 27.75258 27 20 25 20.658656 26.620682 29.432538 34.901556 34.383117 41.181982 37.904865 45.730485 40.693759 48.385532 35.007753 40.746125 39.774807 4!U7185 41.96835 50.10555 46-23746 54.97637 38.70%6 46.47078 45.10311 54.50412 44.95899 52.40975 37.97424 46.85223 35.70133 42.94649 34.20329 41.38230 49.20346 59.85152 39.91737 48.42441 35.4J433 42.81818 35.04673 40.16616 32.08912 39.81960 44.44536 54.67369 44.09692 51.91096 39.56237 46.86581 36.90418 44.86007 35.01562 42.91455

PAGE 42

Table 3.3 (continued) Critical Values for Traditional Chi-Squared Test for Homogeneity of Proportions (Xo,.2 ) P[x = 0.05 0.7 10 10.20962 18.66921 26.38649 34.60401 40.57658 15 IL66760 22.1606:2 31.09343 38.68213 47.36754 20 12.80923 23.251257 32.881049 41.977614 50.680730 25 11.735456 22.452635 33.076669 41.548290 52.587338 30 11.614164 21.271008 29.835853 38.850912 46.406711 0.8 10 7.006548 12.859265 17.775812 22.612319 27.200953 15 9.190978 15.7ll640 22.653064 29.143503 34.710637 20 10.380357 18.983298 26.678808 34.173714 41.685613 25 11.527602 21.193220 29.861222 37.794547 46.478983 30 12.14935 22.17020 31.60790 40.75586 49.01912 As was done with the GLRT, the distribution-free KSGFT was used to test if the traditional test statistics, for parameter combination (n=30,c=25,p=0.8) follow a x' 24. Pictorially, we observed a longertailed empirical p.d.f for this parameter combination [Diagram 3.3.3], and as a result, tire KSGFT statistic was observed to be 13.09. This led to rejection of the hypothesis tiJat the two distributions were the same, at significance level 0.05. Since we witnessed strong evidence that the traditional test statistic's distribution was not the same as the assumed asymptotic distribution, x2,_1 better estimated critical regions were established. Through each of these 175 simulations of tire true distributions, the critical value was fonmulated, defmed as the point where 5% of the test statistics lie above. By designating this critical region, we can adequately say the test is of size a=0.05. These critical values are provided in Table 3.3. As examp}es, Diagrams 3.3.4 and 3.3.5 present graphs of these critical values as functions ofn, c, and p. 28

PAGE 43

"' ... ... .. .. ... .. ... -. ... Diagram c=1 o -.. ... ... ... ... Diagram Values c=15 ....... --'j -. ... ... .... .. 29 ..

PAGE 44

Unfortunately, Table 3.3 can be used only for these convenient settings for (n, p, c), and usually analysts have at hand different sizes of classes, number of classes not equal to a multiple of 5, and have no concrete prior indication of p. In hopes of interpolating between the cells of this convenient table, a polynomial regression was performed to observe a possible functional relationship between the critical values and the three parameters: n, c, andp. With no transformation of the critical values, the mean squared error, the point estimate for the variance of the model, proved to be rather large (2.55), thus traversing two standard deviations away from the observed critical value would yield an estimation window of approximate width 6.4, which is not very precise. It was also shown, through a scatter plot of the residuals versus the fitted values of the model, that the variance was increasing. With this evidence of the inappropriateness of the model, a variance stabilizing transformation was required. The log and square root transformations were tried. Though plausible for the purpose of stabilizing the vmiance, the transformation proved to yield a large MSE as well, but smaller than that of model with no transformation of the critical values. It was concluded that the model of the traditional test's critical values had more error than would be tolerated in obtaining predicted values for given (n,p,c) and that the model was not flexible enough to fit the critical values. Thus the best that could be done was to present a margin of error for each of the critical values within Table 3.3; this is the error created by the simulation of the traditional test's critical values. The fmmulation of this error follows. As was petforrned for the critical values of the GLRT, the simulation noise was estimated, per levels of c, by utilizing two test statistics that should be the same. With these two "replicates," a sample variance was established, whose expected value is the variance of the two critical values. Per each level of c, 15 estimates for the true variance were these averages provided 95% confidence "windows," defined by plus or minus two standard deviations. These "windows" are presented by of c: 30

4. Power Comparison

4.1 Power Comparison: Sufficiency

In comparing two testing procedures, it is important to assess the probability of rejecting the null hypothesis when, in fact, it is justified to reject it, given the alternative hypothesis; this defines the power of a test statistic. Before comparing the power of the two test statistics, it is necessary to comply with some assumptions. Test statistics to be involved in such a comparison should be dependent on sufficient statistics only; thus it complements our study to analyze the sufficiency of the statistics on which our test statistics depend. Sufficient statistics are statistics that condense the sample under the alternative-hypothesis parameter space, Ω, "in such a way that no information about θ is lost" (Mood et al., p. 301), which implies that all information about the hypothesized parameter space can be found within the sample.

Definition 15: Sufficient statistic (Mood et al.)
Let X_1, X_2, ..., X_n be a random sample from the density f(·; θ), where θ may be a vector. A statistic S = s(X_1, X_2, ..., X_n) is defined to be a sufficient statistic if and only if the conditional distribution of X_1, X_2, ..., X_n given S = s does not depend on θ for any value of s.

The sufficiency can be proven using the factorization method (Mood et al., p. 408), since the observed p̂_1, p̂_2, ..., p̂_c, later referred to together as the vector p̂, are independent and binomially distributed, as addressed earlier. Without loss of generality, the number of Bernoulli trials per class was set constant, i.e. n_i = n for all i ∈ [1, ..., c]. Since each trial (j) within a class (i) is a Bernoulli trial:

Y_{ij} = 0 if "failure", = 1 if "success",   with   f_Y(y) = p^{y} (1-p)^{1-y} I_{\{0,1\}}(y)

The joint density of the Y_{ij} can thus be configured, letting X_i = \sum_{j=1}^{n_i} Y_{ij}:

f_Y(y) = \prod_{i=1}^{c} \prod_{j=1}^{n_i} p_i^{y_{ij}} (1-p_i)^{1-y_{ij}} = \prod_{i=1}^{c} p_i^{x_i} (1-p_i)^{n_i - x_i} = \prod_{i=1}^{c} e^{x_i \ln(p_i) + (n-x_i)\ln(1-p_i)},   with WLOG n_i = n, i = 1, ..., c     (Eqn. 4.1.1)

where x_i/n is simply the observed proportion p̂_i of n Bernoulli trials that were successful. In accordance with the factorization criterion for sufficiency, the joint distribution of the variates Y_{ij} can be

factored into a product of two functions: one positive and solely dependent upon the data; the other dependent upon both the hypothesized p_i's and the data, but dependent on the data only through the sufficient statistics. This proves that the vector p̂ is sufficient for p. In conclusion, both the traditional test statistic and the GLRT statistic are dependent on these individual p̂_i's, so we have chosen adequate test statistics to compare with regard to sufficiency.

4.2 Power Comparison: Type I Error

For the two test statistics to be compared, their Type I errors, α, must be equivalent in order to deem one or the other the more powerful test of size α. "The justification for fixing the size of the Type I error to be α ... seems to arise from those testing situations where the two hypotheses are formulated in such a way that one type of error is more serious than the other. The hypotheses are stated so that the Type I error is the more serious, and hence one wants to be certain that it is small." (Mood et al., p. 411) Logically, if one test statistic has a Type I error of α', in accordance with a critical value c, we could choose a test statistic whose distribution dictates a larger Type I error α'', imposing a smaller Type II error. This would yield a false comparison of the two test statistics.

4.3 The Final Power Comparison: Results

Using the critical values established in Chapters 2 and 3 and given specific deviations from the null hypothesis, the power of the two tests can be established. This is the probability of rejecting the hypothesis of homogeneity when, in fact, there exists some heterogeneity; using a metric to measure the magnitude of the deviation from the null hypothesis, the power can be graphed as a function of this deviation, for graphical comparison.
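As an illustration of this procedure, the following S-Plus/R-style sketch (illustrative names only; not the thesis code of Appendix C) estimates the power of the traditional test over a grid of deviations by simulating one class at p plus the deviation and the remaining classes at p, then comparing each statistic with a critical value simulated under the null; the settings n=10, p=0.2, c=5 are taken from the diagrams below.

# Sketch: simulate traditional test statistics when one of nclass classes has
# success probability p + dev (dev = 0 reproduces the null, homogeneous case).
trad.stats <- function(n, p, nclass, dev, nsim = 4000) {
  out <- rep(NA, nsim)
  for (i in 1:nsim) {
    x <- c(rbinom(nclass - 1, n, p), rbinom(1, n, p + dev))
    phat <- x / n
    if (any(phat == 0 | phat == 1)) next      # statistic undefined; leave NA
    w <- n / (phat * (1 - phat))
    pplus <- sum(w * phat) / sum(w)
    out[i] <- sum(w * (phat - pplus)^2)
  }
  out
}

# Simulated critical value from the null run, then estimated power by deviation:
crit.trad <- quantile(trad.stats(10, 0.2, 5, dev = 0), 0.95, na.rm = TRUE)
devs <- seq(0, 0.20, by = 0.04)
power <- sapply(devs, function(d) mean(trad.stats(10, 0.2, 5, d) >= crit.trad,
                                       na.rm = TRUE))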

Power was estimated for both the traditional and GLR tests, for a given departure from the null hypothesis. A deviation from the null hypothesis was characterized as a fluctuation of one of the five hypothesized p_i's. That is, one of the observed p̂_i's comes from a different distribution than do the other p̂_i's. The deviation from the null hypothesis is measured by the magnitude of the deviation of that sole hypothesized p_i from the others. The deviation ranged from 0.00 to 0.20, in increments of 0.04. This yielded 6 points of analysis for a given parameter combination. Diagrams 4.3.1-4.3.4 display the two estimated power functions, dependent on deviation, for the parameter combinations n=10, c=5, and p=[0.2, 0.4, 0.5, 0.6]. Diagram 4.3.1 displays the complete domination of the traditional test's power over that of the GLRT. Diagrams 4.3.2 and 4.3.3 indicate that when n=10, c=5, and p=0.5, the GLRT is the more powerful test as deviation grows, but when p=0.4, the roles reverse. With the power comparison for p=0.4 and deviation=0.00, the graphs do not meet. We had assumed for this power comparison that the tests were of the same size. This equality of size should have been represented by the meeting of the graphs, as deviation=0.00 implies that we, in fact, have all proportions coming from the same distribution (power and size should be equal).

Diagram 4.3.1: Power Comparison (n=10, c=5, p=0.2)

Diagram 4.3.2: Power Comparison (n=10, c=5, p=0.4)
Diagram 4.3.3: Power Comparison (n=10, c=5, p=0.5)

Diagram 4.3.4: Power Comparison (n=10, c=5, p=0.6)

This difference in size could be attributed to simulation noise in establishing the critical values of these two test statistics, as discussed in Section 3.3. The unexpected direction change of power seen in Diagram 4.3.3 is speculated to be a result of simulation noise; we would expect that power would only increase as deviation grew. Also produced were similar power functions for the parameter combinations (n=30, c=5) and (n=30, c=25), with p=0.2, 0.4, 0.5, 0.6. From the distributions for (n=30, c=5) [Diagrams 4.3.5-4.3.8], we see that the GLRT and the traditional test perform in the same fashion with respect to power, particularly for p near 0.5. This similarity could be the product of the asymptotic nature of both tests, as we are using a relatively large n (30). The greater difference between the power curves for p=0.2 is interesting, leading to the conclusion that the tests perform differently when we have different values of p, namely smaller or larger. This greater difference in power curves for the smaller p was witnessed also with (n=30, c=25) [Diagrams 4.3.9-4.3.12], relative to the power curves for p=0.4, 0.5, 0.6. Similar to the results of Section 2.2, decreasing c yielded no difference in the behavior of the test statistic, this example

showing little difference in power. From these graphs we are unable to conclude that one test is consistently better than the other.

Diagram 4.3.5: Power Comparison (n=30, c=5, p=0.2)
Diagram 4.3.6: Power Comparison (n=30, c=5, p=0.4)

Diagram 4.3.7: Power Comparison (n=30, c=5, p=0.5)
Diagram 4.3.8: Power Comparison (n=30, c=5, p=0.6)

Diagram 4.3.9: Power Comparison (n=30, c=25, p=0.2)
Diagram 4.3.10: Power Comparison (n=30, c=25, p=0.4)

Diagram 4.3.11: Power Comparison (n=30, c=25, p=0.5)
Diagram 4.3.12: Power Comparison (n=30, c=25, p=0.6)

5. Bayesian-Adjustment Approach

5.1 Bayesian Theory

Widespread familiarity with the traditional testing procedure indicates that analysts will continue using it, correctly or falsely. Assuming that analysts will continue to use it, and knowing that there is a possibility of false usage (see Section 3.2), let us introduce an adjustment for the p̂_i's so that the test will always be usable. The methods addressed in the previous chapter depended on the parameters being fixed. For this chapter, let us assume that these parameters, specifically p, are actually variates from their own distribution. This is called a prior distribution. The nature of p dictates a range from 0 to 1, as there are never more Bernoulli successes than there are trials, and there are certainly never a negative number. These constraints are very similar to the defined domain of the beta distribution [Potthoff and Whittinghill; Rice], so that would be an excellent place to begin with our guesses of how p is distributed. We write p's prior distribution as:

\pi(p) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p^{a-1} (1-p)^{b-1}     (Eqn. 5.1.1)

where E(p) = \frac{a}{a+b} and var(p) = \frac{ab}{(a+b)^2 (a+b+1)}.
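As a quick numerical check of Eqn. 5.1.1, the following minimal S-Plus/R-style sketch (assuming the uniform Beta(a=1, b=1) prior adopted later in this chapter; the number of draws and n are arbitrary) draws p from the prior and data from the binomial, and reproduces the stated prior mean and variance.

# Sketch: p drawn from a Beta(a, b) prior, data drawn as Binomial(n, p).
a <- 1; b <- 1; n <- 30
p.draws <- rbeta(10000, a, b)
x.draws <- rbinom(10000, n, p.draws)   # one binomial count per prior draw of p
mean(p.draws)   # approximately a / (a + b) = 0.5
var(p.draws)    # approximately ab / ((a + b)^2 (a + b + 1)) = 1/12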

Continuing to assume that our data are distributed binomially, with parameters n and p, we write the conditional distribution of our data, dependent on those two parameters, per usual, as:

\pi(x \mid p) = \binom{n}{x} p^{x} (1-p)^{n-x}     (Eqn. 5.1.2)

The joint distribution between the data and p follows:

\pi(x, p) = \pi(x \mid p)\,\pi(p) = \binom{n}{x} p^{x} (1-p)^{n-x} \cdot \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p^{a-1} (1-p)^{b-1}     (Eqn. 5.1.3)

Following Bayes' rule, the posterior distribution of p, the distribution after observing x, is:

\pi(p \mid x) = \frac{\pi(x \mid p)\,\pi(p)}{\int_0^1 \pi(x \mid p)\,\pi(p)\, dp}     (Eqn. 5.1.4)

Using equations 5.1.2 and 5.1.3 in the above equation 5.1.4, we see that:

\pi(p \mid x) = \frac{p^{x+a-1} (1-p)^{n-x+b-1}}{\int_0^1 p^{x+a-1} (1-p)^{n-x+b-1}\, dp} = \frac{\Gamma(n+a+b)}{\Gamma(x+a)\,\Gamma(n-x+b)}\, p^{x+a-1} (1-p)^{n-x+b-1}     (Eqn. 5.1.5)

which is the probability density function (p.d.f.) of a beta distribution with parameters (x+a, n-x+b). Note that when p's prior distribution is uniform, its p.d.f.'s parameters are (a=1, b=1). A prior distribution of uniform nature is the most conservative, stating that there exists no indication of the size of p. With the assumed beta prior distribution, it is shown that the expected value of p, given the data, is a function of the observed proportion of Bernoulli successes, p̂, the number of Bernoulli trials, n, and the prior's two parameters, a and b, through the following formulation:

E(p \mid x) = \frac{x+a}{(x+a)+(n-x+b)} = \frac{x+a}{a+b+n} = \frac{a+b}{a+b+n}\cdot\frac{a}{a+b} + \frac{n}{a+b+n}\cdot\hat{p}     (Eqn. 5.1.6)

and with our first conservative step towards specifying the prior's parameters (a=1, b=1), we see that

E(p \mid x) = \frac{2}{2+n}(0.5) + \frac{n}{2+n}\,\hat{p}

where 0.5 is the expected value of the uniform distribution on the interval [0, 1]. It is important to note at this point that:

E(p \mid \hat{p}=0) = \frac{1}{2+n},     E(p \mid \hat{p}=1) = \frac{1+n}{2+n}     (Eqn. 5.1.7)

The beauty of this derivation is that these p̂'s of Eqn. 5.1.7 are exactly those that cause problems within the use of the traditional testing procedure, as the corresponding weights are infinite (Section 3.1). Justification can be sought, then, for adjusting these problematic p̂'s so that the traditional testing procedure can be used freely, without violation.

5.2 Results

Four thousand traditional test statistics were obtained for each of the three combinations (n, p, c) = {(30, 0.2, 25), (30, 0.5, 25), (30, 0.8, 25)}, similar to those used in the Section 3.3 analyses. Generated through simulation [see Appendix D for the S-Plus code used], these test statistics are those construed using the expected values of the hypothesized p, given the observed data, as described in Section 5.1. These expected values simply replaced the observed p̂_i's within the weights of the traditional test statistic. What we would ideally like to see is that the distribution of the test statistics is similar to a known distribution; for convenience, let us compare the distribution to the chi-squared with c-1 degrees of freedom. This is the asymptotic distribution of the traditional test statistic, which was shown to be significantly different from the estimated true distribution, as discussed in Section 3.3. Comparison of

these distributions has been done both pictorially and through the nonparametric Kolmogorov-Smirnov Goodness-of-Fit test. Diagrams 5.2.1, 5.2.2, and 5.2.3 display the estimated probability density functions (p.d.f.s) of the test statistics for the parameter combinations (n, p, c) = {(30, 0.2, 25), (30, 0.5, 25), (30, 0.8, 25)}, respectively, each indicated by the histogram. The observed p.d.f.s are compared to the p.d.f. of χ²_{24}, indicated by the solid black line.

Diagram 5.2.1: Distribution of Bayesian-Adjusted Test Statistics (n=30, c=25, p=0.2)
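A comparison of the kind shown in Diagram 5.2.1 can be sketched as follows in S-Plus/R style (illustrative code, not the thesis code of Appendix D; R's ks.test stands in here for the KSGFT routine used in the thesis). The adjusted proportions E(p | x) = (1 + n·p̂)/(2 + n) from Section 5.1 enter only the weights of the traditional statistic, and the 4,000 simulated statistics are then compared with χ²_{24}.

# Sketch: 4,000 Bayesian-adjusted traditional test statistics for (n=30, c=25, p=0.2).
n <- 30; p <- 0.2; nclass <- 25
adj.stats <- numeric(4000)
for (i in 1:4000) {
  ophat <- rbinom(nclass, n, p) / n            # observed proportions (0 and 1 allowed)
  phat <- (1 + n * ophat) / (2 + n)            # E(p | x) with a = b = 1; never 0 or 1
  w <- n / (phat * (1 - phat))                 # adjusted weights are always finite
  pplus <- sum(w * ophat) / sum(w)
  adj.stats[i] <- sum(w * (ophat - pplus)^2)
}
hist(adj.stats, prob = TRUE, breaks = 40)      # empirical p.d.f. of the statistics
curve(dchisq(x, df = nclass - 1), add = TRUE)  # hypothesized chi-squared, 24 d.f.
ks.test(adj.stats, "pchisq", df = nclass - 1)  # goodness-of-fit comparison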

0 ;; "' 0 Diagram 5.2.2: Distribution of Bayesian-Adjusted Test Statistics (n=30,c=25,p=0.5) 20 40 60 20 40 60 47 80 80

As indicated by Diagrams 5.2.1-5.2.3, the estimated true distribution of the Bayesian-adjusted traditional test statistic, for relatively large n, fits the hypothesized χ²_{24} quite well, and better than those of the respective Diagrams 3.3.1-3.3.3 for the non-adjusted traditional test. More interesting was the behavior of this Bayesian-adjustment technique for those problematic areas discussed in Section 3.2. For this analysis, Diagram 3.2.1 was used again to identify the problematic regions corresponding to n=10, c=10, and p=[0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9]. This analysis also required another generation of 4,000 test statistics using this Bayesian adjustment for each of the generated proportions, per each combination of n, c, and p. Diagrams 5.2.4-5.2.10 present the histograms representing the p.d.f.s of each parameter combination described above. These histograms demonstrate the chi-squared-like nature of the Bayesian-adjusted test statistics, except for the small/large proportions (p=0.1, 0.9). The reader should keep in mind that this demonstration involves relatively small n. For further research, it would be necessary again to establish a critical value for these parameter combinations.

Diagram 5.2.4: Distribution of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.1)

Diagram 5.2.5: Distribution of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.2)
Diagram 5.2.6: Distribution of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.3)
Diagram 5.2.7: Distribution of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.5)

Diagram 5.2.8: Distribution of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.7)
Diagram 5.2.9: Distribution of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.8)
Diagram 5.2.10: Distribution of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.9)

The reader should note at this point the extreme difference between the distributions of the traditional test statistics for the parameter combinations (n=10, c=10, p=[0.2, 0.8]) [Diagrams 5.2.11, 5.2.12] and the distributions corresponding to the same parameter combinations of the Bayesian-adjusted test statistics [Diagrams 5.2.5, 5.2.9].

Diagram 5.2.11: Distribution of Traditional Test Statistics (n=10, c=10, p=0.2)
Diagram 5.2.12: Distribution of Traditional Test Statistics (n=10, c=10, p=0.8)

It is shown by Diagrams 5.2.5 and 5.2.9 that the Bayesian adjustment created a distribution that was more of a χ²_{24} nature than the distribution of the non-adjusted statistic. This is a remarkable observation. With the traditional testing procedure, we were able to utilize only approximately 31.5% of the test statistics where p=0.2, 0.8, and 98% for p=0.5. Because of the low probability of usability for the smaller/larger p's, the true distribution of the test statistic may have been under-represented. With this Bayesian adjustment, each generated proportion posed no problem when incorporated into the test statistic; thus 100% of the test statistics were used to determine their distribution. Using the ASPN data of Table 1.1, where the traditional testing procedure could not be used due to problematic proportions, we could test for homogeneity with the Bayesian adjustment. The null hypothesis of homogeneity was rejected using this Bayesian technique.
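The usability percentages quoted above can be checked directly with the same probability formula used in Appendix A; a short S-Plus/R-style sketch (the function name usable.prob is illustrative):

# Probability that all nclass observed proportions are strictly between 0 and 1,
# i.e. that the unadjusted traditional statistic is usable (cf. Appendix A).
usable.prob <- function(n, p, nclass) (1 - p^n - (1 - p)^n)^nclass
usable.prob(10, 0.2, 10)   # about 0.32, close to the roughly 31.5% observed for p = 0.2, 0.8
usable.prob(10, 0.5, 10)   # about 0.98, matching the figure quoted for p = 0.5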
0 "' ci "' ;:; :;; 0 ;:; 0 a. l!l 0 0 0.0 Diagram 5.3.1: Power Comparison (n=10,C=5,p=0.2) ... .... ... .... -.:.. ____ ,... 0.05 Bayesian Traditional GLRT .... .... ... .. --0.10 Deviation 53 .... .. ..... .. 0.15 ... .. / 0.20

6. Conclusion

6.1 Remarks

In our effort to compare the generalized likelihood ratio and the traditional tests, we stumbled upon information leading us to disbelieve their assumed hypothesized distribution, χ²_{c-1}. In Chapters 2 and 3, the true size of the tests was established, thus yielding a critical region to be utilized for both tests. Chapter 4 presented the power for each of the size-α tests, represented as a function of the deviation from the null hypothesis. It was witnessed that the traditional and generalized likelihood ratio testing procedures performed equally for particular combinations of n, c, and p, when the established true regions of rejection were utilized; but with other parameter combinations, the domination of one testing procedure over the other varied, leading to inconclusive results. One lingering problem with the traditional testing procedure was the inability to use it when faced with a problematic combination of proportions: those that rendered the test statistic useless, as one or more of them were equal to 1 or 0. To make usable the unusable, a Bayesian adjustment was proposed in Chapter 5. For specific combinations of the parameters n, p, and c, the distribution of 4,000 test statistics generated through usage of this Bayesian adjustment was displayed. It was concluded that the distribution for each combination where p=0.2-0.8 was similar to the hypothesized distribution, χ²_{c-1}, and closer than the distribution of those generated without the adjustment. The Bayesian-adjusted test statistic also retained the power of the non-adjusted traditional test statistic for a specific parameter combination. It is suggested to use this Bayesian technique as a testing procedure for binomial homogeneity.

Appendix A

Splus code used to assess probability of useful set of 10 proportions, given n and p

pnnat2<-NULL
for (m in 1:10) {
  for (k in 1:99) {
    hyp.p<-k/100                       # hypothesized proportion p
    classiz<-5*m                       # class size n (5, 10, ..., 50)
    # probability that none of the 10 observed proportions equals 0 or 1
    prob.p<-(1-((hyp.p)^classiz+(1-hyp.p)^classiz))^10
    pnnat2<-rbind(pnnat2,c(hyp.p,classiz,prob.p))
  }
}

Appendix B

Splus code used to establish true critical regions of test statistics, given n, c and p

type1err.f<-function(n, pr, numclass) {
  For(times=1:4000, {
    if (times==1) {
      testmat<-numeric()
    }
    # [the remainder of the loop body, which generates the numclass observed
    # proportions and appends the traditional and GLRT statistics to testmat,
    # is not legible in the source]
  })
  degfree<-numclass-1
  pchiglrt<-sumglrt/4000
  pchitrad<-sumtrad/(4000-sum(!ok1))
  crittrad<-quantile(testmat[,1],probs=c(.95),na.rm=T)
  critglrt<-quantile(testmat[,2],probs=c(.95),na.rm=T)
  c(crittrad,critglrt,qchisq(.95,degfree),pchitrad,pchiglrt,
    sum(!ok1),pr[1],n[1],numclass)
}

Appendix C

Splus code used to establish power of two tests, given true critical regions

powerdev.f<-function(alpha, n, pr, pwrdev, chisum, tradsum, devation, hypothp) {
  # [the main body is not legible in the source; the surviving fragments indicate
  # counts of simulated statistics exceeding the critical values, e.g.
  # sum(... >= crtrad), sum(... >= crglrt), and sum(!ok1)]

Appendix C (continued)

Splus code used to establish power of two tests, given true critical regions

  c(crtrad, crglrt, qchisq(0.95, degfree), pchitrad, pchiglrt,
    sum(!ok1), pr[2], n[1], length(n), devation)
}

Appendix D

Splus code used to establish distribution of Bayesian-adjusted test statistics

bayes210.f<-function(n, pr, numclass) {
  For(times=1:4000, {
    if (times==1) {
      bayes210<-numeric()
      phat<-numeric()
      ophat<-numeric()
    }
    ophat<-rbinom(numclass,n,pr)/n            # observed proportions
    phat<-(1/(2+n))+(n/(2+n))*ophat           # Bayesian-adjusted proportions, E(p|x) with a=b=1
    tradtest<-NA
    w<-n/(phat*(1-phat))                      # weights use the adjusted proportions
    sumwts<-sum(w)
    pplus<-sum(w*ophat)/sumwts                # weighted pooled proportion
    tradtest<-sum((ophat-pplus)*(ophat-pplus)*w)   # adjusted traditional statistic
    bayes210<-rbind(bayes210,tradtest)
  })
  degfree<-numclass-1
  ok1<-!is.na(bayes210)
  sumtrad<-sum(bayes210[ok1]>=qchisq(.95,degfree))
  pchitrad<-sumtrad/(4000-sum(!ok1))
  crittrad<-quantile(bayes210,probs=c(.95),na.rm=T)
  c(crittrad,qchisq(.95,degfree),pchitrad,sum(!ok1),pr[1],n[1],numclass)
}
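As a usage illustration (a hypothetical call, not taken from the thesis), one of the Section 5.2 parameter combinations would be run as bayes210.f(30, 0.2, 25); the returned vector holds the simulated 0.95 quantile (critical value), the asymptotic χ²_{0.05} value on 24 degrees of freedom, the proportion of statistics exceeding that asymptotic value, the count of unusable statistics (zero once the adjustment is applied), and the input parameters.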

Appendix E

Critical Values for Generalized Likelihood Ratio Test for Homogeneity of Proportions (χ²_{0.05}), where P[GLRT statistic ≥ χ²_{0.05}] = 0.05

                          Number of Classes (c)
 p    n        5          10         15         20         25
0.2  10    10.5532    18.8285    26.5557    33.4652    40.3242
     15    10.7294    18.6883    26.1684    33.3299    39.7710
     20    10.4266    18.0017    25.8823    32.1944    38.9742
     25     9.8970    18.0641    24.7155    31.8633    38.6316
     30     9.8148    17.9903    24.3759    31.4190    37.5747
0.3  10    10.5264    18.5832    26.7902    33.2966    40.1662
     15    10.4127    18.2024    25.5416    32.3912    38.2954
     20    10.2386    18.0654    23.8820    30.9469    37.7438
     25    10.0308    11.9569    24.7506    31.4586    38.0930
     30     9.6962    17.6057    24.1002    31.0309    36.6498
0.4  10    10.6821    18.2016    25.4461    32.8585    39.0980
     15     9.8004    17.4470    24.9343    31.6513    38.1341
     20     9.6003    16.7973    23.2409    29.7820    36.3268
     25     9.3750    16.8195    23.3728    29.8742    35.8151
     30     9.2906    17.0800    23.9814    29.6477    35.6353
0.5  10    10.5346    18.2338    25.2350    32.3394    39.1347
     15     9.3004    16.2619    23.5884    29.3358    35.8746
     20     9.2580    16.3133    22.9261    29.2166    35.2785
     25     9.3183    16.3440    22.9175    30.1496    34.9830
     30     9.1651    16.2428    23.1305    28.4271    35.4041
0.6  10    10.7047    18.3665    25.6470    33.1218    39.4052
     15    10.1305    17.7842    24.5145    31.1136    37.6647
     20     9.8078    17.1956    23.6380    30.8959    36.4060
     25     9.6516    17.2464    23.4157    30.2821    36.7025
     30     9.8854    17.1970    23.8971    30.1222    36.7249

Appendix E (continued)

Critical Values for Generalized Likelihood Ratio Test for Homogeneity of Proportions (χ²_{0.05}), where P[GLRT statistic ≥ χ²_{0.05}] = 0.05

                          Number of Classes (c)
 p    n        5          10         15         20         25
0.7  10    10.5355    18.5832    25.8818    33.3677       *
     15     9.9993    18.0193    25.3698    31.4183       *
     20    10.4175    17.6339    24.2437    31.3704       *
     25     9.639     17.3387    24.9628    30.6490       *
     30    10.0031    17.7484    24.2586    30.0357       *
0.8  10    10.52641   18.7236    26.5587    33.8362    40.5709
     15    10.6803    18.8908    26.2571    33.0036    39.5356
     20    10.0553    18.1507    25.2396    32.5497    39.1177
     25    10.1826    18.0175    25.1527    31.5435    37.9807
     30     9.9308    17.5195    24.4648    31.1350    37.9958

* Only four of the five c=25 entries for p=0.7 are legible in the source (40.0863, 38.3753, 37.5943, 36.7276); their assignment to particular values of n is not recoverable.

Bibliography

Bardsley, W. E. "An algorithm for calculating an upper bound to χ² for testing binomial homogeneity with some observations uncategorized." Computers and Geosciences, Volume 6 (1980); pp. 315-319.

Beyer, William H. CRC handbook of tables for probability and statistics. Cleveland, Ohio: Chemical Rubber Company; 1966.

Fleiss, J. L. Statistical methods for rates and proportions. New York, New York: John Wiley and Sons; 1973.

Green, Larry, et al. "How representative of typical practice are practice-based networks?" Arch. Family Medicine, Volume 2 (September 1993); pp. 939-949.

Hedges, L.; Olkin, I. Statistical methods for meta-analysis. San Diego, California: Academic Press, Inc.; 1985.

Larsen, Richard J.; Marx, Morris L. Introduction to mathematical statistics and its applications. Englewood Cliffs, New Jersey: Prentice Hall; 1986.

Mood, A. M.; Graybill, F. A.; Boes, D. C. Introduction to the theory of statistics. McGraw-Hill; 1974.

Potthoff, R. F.; Whittinghill, M. "Testing for homogeneity." Biometrika, Volume 53 (1966); pp. 167-182.

Rice, John A. Mathematical statistics and data analysis. Belmont, California: International Thomson Publishing; 1995.

Snedecor, G. W.; Cochran, W. G. Statistical methods. Ames, Iowa: Iowa State University Press; 1967.