COMPARING THE GENERALIZED LIKELIHOOD RATIO
AND VARIANCE TESTS FOR BINOMIAL HOMOGENEITY
WITH OBSERVATION OF BAYESIAN ADJUSTMENT
B.S., University of Michigan, 1993
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Applied Mathematics
by
Alissa Ruelle
1996
© 1996 by Alissa Ruelle
All rights reserved.
This thesis for the Master of Science
degree by
Alissa Ruelle
has been approved
by
Date
Ruelle, Alissa (M.S., Applied Mathematics)
Comparing the Generalized Likelihood Ratio and Variance Tests for Binomial
Homogeneity with Observation of Bayesian Adjustment
Thesis directed by Assistant Professor James Koehler
ABSTRACT
H_0: p_1 = p_2 = ... = p_c = p
H_A: p_i ≠ p_j for some i ≠ j
Two testing procedures for homogeneity of c proportions are used commonly:
the generalized likelihood ratio and the variance tests for binomial homogeneity. Both
testing procedures produce statistics that asymptotically follow a χ² distribution, with
c−1 degrees of freedom. The asymptotic case, i.e. having a large number of Bernoulli
trials (n) behind each proportion, poses few problems, if any, for both testing
procedures. With small n, though, problems exist with respect to assuming that the
test statistics follow the appropriate χ² distribution.
Using Splus, proportions were generated and tested, using both testing
procedures, to estimate the test statistics' true distributions. The power of each test
was established with these generated distributions and their respective critical regions.
Comparison of the power functions followed.
In addition to having problems with small n, the variance test cannot be used
when one or more proportions equal 1 or 0. As a plausible remedy, an adjustment to
the variance test was proposed, incorporating Bayesian mathematics. This proposed
testing procedure was compared back to the generalized likelihood ratio and variance
tests.
This abstract accurately represents the content of the candidate's thesis. I recommend
its publication.
Signed
In gratitude to Danielle, whose spirit offered me the energy,
to Karen, whose constant inquiry gave the energy direction,
and to Mom, who provided everything else.
CONTENTS
Chapter
1. Introduction...................................................................1
1.1 Introduction: The Generalized Likelihood Ratio and Traditional Tests.......1
1.2 A Bayesian Approach........................................................4
2. The Generalized Likelihood Ratio Test..........................................6
2.1 The Test Statistic and Parameter Spaces of Hypotheses......................6
2.2 Type 1 Error...............................................................8
3. The Traditional Test..........................................................20
3.1 The Test Statistic........................................................20
3.2 Assessing the Probability of Failure of the Traditional Test Statistic....21
3.3 Type 1 Error..............................................................24
4. Power Comparison..............................................................32
4.1 Power Comparison: Sufficiency.............................................32
4.2 Power Comparison: Type 1 Error............................................34
4.3 The Final Power Comparison: Results.......................................34
5. Bayesian-Adjustment Approach..................................................42
5.1 Bayesian Theory...........................................................42
5.2 Results...................................................................45
5.3 Power Comparison with GLRT and Traditional................................52
6. Conclusion....................................................................54
6.1 Remarks...................................................................54
Appendix
A. Splus code used to assess probability of useful set of ten proportions, given n and p....55
B. Splus code used to establish true critical regions of test statistics, given n, c, p.56
C. Splus code used to establish power of two tests, given true critical regions...........57
D. Splus code used to establish distribution of Bayesian-adjusted test statistics.........59
E. Critical values for generalized likelihood ratio test for homogeneity of proportions....60
Bibliography..............................................................................62
DIAGRAMS
Diagram
2.2.1 Distribution of GLRT statistics (n=30,c=25,p=0.2)............................9
2.2.2 Distribution of GLRT statistics (n=30,c=25,p=0.5)............................9
2.2.3 Distribution of GLRT statistics (n=30,c=25,p=0.8)...........................10
2.2.4 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.2)..............................10
2.2.5 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.5)..............................11
2.2.6 Q-Q plot for GLRT Statistics (n=30,c=25,p=0.8)..............................11
2.2.7 Distribution of GLRT statistics (n=10,c=25,p=0.8)...........................12
2.2.8 Distribution of GLRT statistics (n=10,c=10,p=0.8)...........................12
2.2.9 Cumulative distribution of GLRT statistic (n=30,c=25,p=0.8).................14
2.2.10 GLRT critical values c=10...................................................16
2.2.11 GLRT critical values c=15...................................................16
2.2.12 Modeling GLRT critical values: residuals vs. fitted.........................17
2.2.13 Modeling GLRT critical values: Q-Q norm, plot...............................17
2.2.14 Modeling GLRT critical values: fitted vs. observed..........................18
3.2.1 Probability of ability to use traditional test c=10........................23
3.2.2 Contour plot of surface.....................................................23
3.3.1 Distribution of trad. test statistics (n=30,c=25,p=0.2).....................25
3.3.2 Distribution of trad. test statistics (n=30,c=25,p=0.5).....................25
3.3.3 Distribution of trad. test statistics (n=30,c=25,p=0.8).....................26
3.3.4 Traditional critical values c=10............................................29
Diagram (continued)
3.3.5 Traditional critical values c=15..............................................29
4.3.1 Power comparison (n=10,c=5,p=0.2).............................................35
4.3.2 Power comparison (n=10,c=5,p=0.4).............................................36
4.3.3 Power comparison (n=10,c=5,p=0.5)........................................... 36
4.3.4 Power comparison (n=10,c=5,p=0.6).............................................37
4.3.5 Power comparison (n=30,c=5,p=0.2).............................................38
4.3.6 Power comparison (n=30,c=5,p=0.4).............................................38
4.3.7 Power comparison (n=30,c=5,p=0.5).............................................39
4.3.8 Power comparison (n=30,c=5,p=0.6).............................................39
4.3.9 Power comparison (n=30,c=25,p=0.2)............................................40
4.3.10 Power comparison (n=30,c=25,p=0.4)............................................40
4.3.11 Power comparison (n=30,c=25,p=0.5)............................................41
4.3.12 Power comparison (n=30,c=25,p=0.6)............................................41
5.2.1 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.2)...........46
5.2.2 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.5)...........47
5.2.3 Distribution of Bayesian-adjusted test statistics (n=30,c=25,p=0.8)...........47
5.2.4 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.1)...........48
5.2.5 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.2)...........49
5.2.6 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.3)...........49
5.2.7 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.5)...........49
5.2.8 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.7)...........50
5.2.9 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.8)...........50
Diagram (continued)
5.2.10 Distribution of Bayesian-adjusted test statistics (n=10,c=10,p=0.9)...........50
5.2.11 Distribution of traditional test statistics (n=10,c=10,p=0.2).................51
5.2.12 Distribution of traditional test statistics (n=10,c=10,p=0.8).................51
5.3.1 Power comparison (n=10,c=5,p=0.2).............................................53
TABLES
Table
1.1 ASPN: Referral care rates.........................................................2
3.3 Critical region for traditional chi-squared test for homogeneity of proportions...27
ACKNOWLEDGMENTS
Dr. Jim Koehler for being my dedicated, sincerely interested, and caring educator, mentor,
and friend.
Dr. Karen Kafadar and Dr. Ram Shanmugam for participating on my graduate committee
and for their timely review of my thesis, followed by interesting and educating inquiry.
Biostatistics Division of AMC Cancer Research Center for providing employment and
invaluable experience throughout my graduate studies.
Ambulatory Sentinel Practice Network, specifically Jim Werner, for supplying data to
analyze.
Fellow students/friends at University of Colorado at Denver for offering relentless support
during my research and at all other times.
1.
Introduction
1.1 Introduction: The Generalized Likelihood Ratio and Traditional Tests
Two testing procedures for homogeneity of proportions are used commonly: the generalized
likelihood ratio test (GLRT) and the traditional testing procedure (Fleiss, 1973; Hedges and Olkin,
1985). This traditional test is of similar format to a test termed the variance test of binomial
homogeneity (Snedecor and Cochran, 1967). The traditional testing procedure appeals to the
asymptotic nature of the sum of squared standard normal variates, which is Chi-squared with degrees of
freedom equal to the number of proportions minus one; the GLRT utilizes the ratio of the joint likelihood
functions for the sample of proportions under the null and alternative hypotheses parameter spaces. This
is also, asymptotically, a Chi-squared variate with degrees of freedom again equal to the number of
proportions minus 1. The performance of both tests is dependent upon 1) n, the size of the classes from
whence the proportions come, 2) p, the expected value of the hypothesized distribution of the proportions,
and 3) c, the number of proportions at hand.
An example of the necessity for a homogeneity test for proportions was found in the Ambulatory
Sentinel Practice Network (ASPN) database. ASPN's question concerned referral rates for follow-up
care, the percentage of times that patients were referred to a specialist after being treated at an emergency-
care level. To test the homogeneity of referral care rates, classes must be established. As case-mixing
most likely existed within each individual practice, (as a hypothetical example, within Practice H, 74% of
the emergency care patients were treated for respiratory complications; at Practice K, only 13%) it was
necessary to designate one type of patient, ailing from a particular illness. After type of patient was
selected, the class (practice) was used, thus a referral rate was calculated for each medical practice. A
listing of referral rates for ailments identified as respiratory complications follows in Table 1.1. The data of
Table 1.1 will be summoned throughout this document.
Table 1.1 ASPN: Referral Care Rates
P = Practice code   N = Number of Patients   RR = Referral Rate

 P    N     RR  |  P    N     RR  |  P    N     RR
 1   10  1.000  | 24    8  0.875  | 47   20  0.900
 2   16  0.875  | 25    6  0.833  | 48    9  1.000
 3    8  0.250  | 26   11  1.000  | 49   55  0.818
 4   20  0.900  | 27   23  0.913  | 50    4  1.000
 5   11  0.818  | 28   16  0.563  | 51   29  0.897
 6    1  1.000  | 29   23  0.957  | 52   11  0.909
 7    6  0.833  | 30   16  1.000  | 53    6  0.500
 8   10  0.800  | 31   15  0.667  | 54    6  0.167
 9    6  0.667  | 32    7  0.286  | 55    4  0.750
10   15  1.000  | 33   15  0.933  | 56   19  0.842
11    6  1.000  | 34   23  0.957  | 57    4  1.000
12    5  1.000  | 35   36  0.833  | 58   18  1.000
13   36  0.889  | 36    3  1.000  | 59    6  0.667
14    2  1.000  | 37   11  0.909  | 60   33  0.848
15   10  0.900  | 38   31  0.710  | 61   13  0.769
16   22  0.864  | 39    2  0.000  | 62    6  0.833
17    8  0.500  | 40   11  0.636  | 63    5  0.600
18   40  0.825  | 41   25  0.800  | 64    8  0.875
19   21  0.952  | 42    2  1.000  | 65   17  0.764
20   12  0.500  | 43   17  0.941  | 66    5  1.000
21   13  0.769  | 44    7  0.857  | 67    5  0.800
22   40  0.425  | 45   20  0.950  | 68    4  1.000
23    9  0.667  | 46    9  0.556  |
Using the traditional test for homogeneity of proportions, we calculate for each class a
proportion, followed by an additional formulation of a representative proportion for the c classes, which is
a weighted average of the individual proportions, using weights inversely proportional to the variance of
the proportion. The characteristics of the binomial distribution lead us to the weighting schema used in the
traditional test statistic, and it will be shown in Chapter 3 that when the observed proportions near 0 or 1,
the weights tend towards infinity. In fact, the traditional test statistic falters when we utilize some specific
combinations of n, p, and c. Of interest is how the traditional test statistic behaves within a region of
usable n, p, and c.
Since each patient in each class can be viewed as a Bernoulli trial, each with hypothesized
parameter p_i, each proportion is a function of a binomial variate with parameters (n_i, p_i). With this
traditional testing mechanism, we test the hypotheses, assuming c classes:

H_0: p_1 = p_2 = ... = p_c = p   vs.   H_A: p_i ≠ p_j for some i ≠ j
This same hypothesis is also tested through the generalized likelihood ratio test, which utilizes the joint
distribution of the observed proportions, the p̂_i's, under the parameter spaces of the null and alternative
hypotheses. Unbiased and attaining minimum variance, the best estimator for an individual hypothesized
p_i is the observed proportion of Bernoulli successes, which is the maximum likelihood estimator:

p̂_i = (Number of successful Bernoulli trials of Class i) / (Number of Bernoulli trials of Class i)
    = (Number of Patients Referred to a Specialist, within Practice i) / (Number of Patients Treated By Practice i)
It will be shown that the traditional and generalized likelihood ratio tests are not of the same size,
when using critical regions of the hypothesized Chi-squared distribution. To accommodate for the
inequality of size using those hypothesized regions, the estimated true critical regions of the traditional test
statistic, for particular combinations of number of Bernoulli trials per class (n), hypothesized expected
value of the set of p_i's (p), and the number of classes (c), will be provided. These estimated true critical
regions are obtained from the empirical distribution produced by simulation and are used throughout all
hypothesis testing within the thesis research.
In further analysis of these two testing procedures, in Chapter 4 it is shown that both test statistics
are dependent on sufficient statistics. Using estimated critical values, the Type 1 errors were forced to be
equal for the purpose of power comparison. For the power comparisons, inhomogeneity was forced into
simulated proportions, with the proportions as realizations from differing distributions, and power of each
test statistic is established. Power is graphed as a function of the deviation from the null hypothesis, per
testing procedure, granting pictorial comparison.
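The power-estimation recipe above can be sketched as follows. This Python stand-in is illustrative only (the thesis uses S-Plus): the range statistic and the particular inhomogeneous alternative below are hypothetical placeholders, not the thesis's tests.

```python
import random

def estimate_power(statistic_fn, draw_data, critical_value, reps, rng):
    """Monte Carlo power: fraction of simulated statistics beyond the critical value."""
    hits = sum(statistic_fn(*draw_data(rng)) > critical_value for _ in range(reps))
    return hits / reps

def toy_stat(successes, trials):
    # placeholder statistic: range of the observed proportions
    phats = [x / n for x, n in zip(successes, trials)]
    return max(phats) - min(phats)

def draw_alt(rng, n=30, ps=(0.2, 0.5, 0.8)):
    # inhomogeneity forced in: each class drawn from a *different* binomial
    return [sum(rng.random() < p for _ in range(n)) for p in ps], [n] * len(ps)

rng = random.Random(42)
power = estimate_power(toy_stat, draw_alt, critical_value=0.3, reps=500, rng=rng)
```

Graphing such estimates against the deviation from the null hypothesis gives exactly the pictorial comparison described above.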
1.2 A Bayesian Approach
In Chapter 5, we entertain a totally different avenue in examining smaller and larger proportions,
as with proportions equal to 0 or 1 we are unable to test homogeneity adequately by the traditional
chi-squared test. We take a look at our ability to tweak these problematic proportions with an amount
basically determined by the size of the sample the proportion describes. This tweaking alleviates the
problem of failure of the traditional test, as we have no weights approaching infinity. Bayesian theory is
used, recognizing that the hypothesized p could be considered a variate from its own distribution, called a
prior distribution.
An estimate for the critical region for this Bayesian-adjusted test statistic is formulated, using
procedures similar to that used in Section 1.1. Estimation of power, again graphed as a function of the
deviation from the null hypothesis, followed, and was compared back to that of the traditional test and
GLRT.
2.
The Generalized Likelihood Ratio Test
2.1 The Test Statistic and Parameter Spaces of Hypotheses
In testing for homogeneity of proportions, we test the null hypothesis H_0: p_1 = p_2 = ... = p_c = p
against the alternative H_A: p_i ≠ p_j for some i ≠ j. The distribution of an observed p̂_i is derived from
the distribution of the number of successful Bernoulli trials within the ith class, termed X_i. The
distribution of X_i is binomial with parameters n_i and p_i. We write the cumulative distribution function
of X as P(X ≤ x) = F_X(x), so utilizing a change of variables, we see that the cumulative distribution
function of p̂_i is:

F_p̂(p̂_i) = P(p̂ ≤ p̂_i) = P(X/n_i ≤ p̂_i) = P(X ≤ n_i p̂_i) = F_X(n_i p̂_i)
By the chain rule, we can write

f_p̂(p̂_i) = (∂F_X/∂p̂_i)(n_i p̂_i) = n_i f_X(n_i p̂_i)        Eqn. 2.1.1

so that

f_p̂(p̂_i) = n_i (n_i choose n_i p̂_i) p_i^(n_i p̂_i) (1 − p_i)^(n_i(1 − p̂_i))        Eqn. 2.1.2

and the likelihood function of the individual hypothesized proportion, viewed as a function of p_i, is then

L(p_i) ∝ p_i^(n_i p̂_i) (1 − p_i)^(n_i(1 − p̂_i))        Eqn. 2.1.3
The generalized likelihood ratio test utilizes the distributions of the individual p̂_i's to form
likelihood functions over the two parameter spaces
ω = {(p_1, ..., p_c) : p_1 = p_2 = ... = p_c = p, 0 ≤ p ≤ 1}
Ω = {(p_1, ..., p_c) : 0 ≤ p_i ≤ 1, i = 1, ..., c}
hypothesized and with no constraints, respectively. The generalized likelihood ratio test statistic (Larsen
and Marx, 1986), using the joint likelihood function of the individual hypothesized p_i's, is thus −2
multiplied by the natural logarithm of:

λ = sup_ω L(p) / sup_Ω L(p)
  = [ Π_{i=1}^{c} p̂_MLE^(n_i p̂_i) (1 − p̂_MLE)^(n_i(1 − p̂_i)) ] / [ Π_{i=1}^{c} p̂_i^(n_i p̂_i) (1 − p̂_i)^(n_i(1 − p̂_i)) ]        Eqn. 2.1.4

(the binomial coefficients appearing in numerator and denominator cancel)
where p̂_MLE is the maximum likelihood estimator for p, which can be calculated easily through
maximizing the natural log of the joint distribution of the p̂_i's. The estimator p̂_MLE is exactly the
number of Bernoulli successes divided by the number of Bernoulli trials, over all c classes:

p̂_MLE = Σ_{i=1}^{c} n_i p̂_i / Σ_{i=1}^{c} n_i        Eqn. 2.1.5
This statistic is a function of the individual p̂_i's, weighted by their corresponding sample sizes, and is
simply the proportion of successful Bernoulli trials over all c classes. With it, we formulated a
representative statistic that attains small variance, and it can be checked that it attains the Cramér-Rao
Lower Bound (CRLB) for variance. Assuming equal class sizes (n_i = n, i = 1, 2, ..., c), each observed
proportion is distributed as (p̂_1, p̂_2, ..., p̂_c) ~ f_p̂(p̂) = n f_X(np̂),
where X is a binomial variate representing the number of Bernoulli successes within a class. Thus f_p̂(p̂)
can be expressed as

f_p̂(p̂) = n (n choose np̂) p^(np̂) (1 − p)^(n(1 − p̂))

so that

d/dp [ln f_p̂(p̂; p)] = n (p̂ − p) / (p(1 − p)).

The CRLB can thus be written as

1 / E_p[(d/dp ln f_p̂(p̂; p))²] = 1 / [(n / (p(1 − p)))² E_p[(p̂ − p)²]] = p(1 − p) / n,

which is exactly the estimated variance of the p̂_i's under the null hypothesis.
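That the CRLB equals p(1 − p)/n, the variance of p̂ under the null, can be checked numerically. This Python sketch (mine, not from the thesis) computes E[(p̂ − p)²] directly from the Binomial(n, p) p.m.f.:

```python
import math

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def var_phat_exact(n, p):
    """E[(p_hat - p)^2], summed exactly over the Binomial(n, p) support."""
    return sum(binom_pmf(n, k, p) * (k / n - p) ** 2 for k in range(n + 1))

n, p = 30, 0.2
# both quantities equal 0.16/30; the CRLB is attained
print(var_phat_exact(n, p), p * (1 - p) / n)
```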
2.2 Type 1 Error
The code of Appendix B generated 4,000 GLRT statistics. This simulation was performed for
175 combinations of p, n, and c. Using the simulations for the parameter combinations (n=30, c=25,
p=[0.2, 0.5, 0.8]), we witness that GLRT statistics do follow the assumed χ² distribution with 24 degrees
of freedom when p = 0.2, 0.5, and 0.8. Pictorially, the distribution of the GLRT statistics, for each
parameter combination, is demonstrated by the following histograms, representing the probability density
function. [Diagrams 2.2.1-2.2.3] Superimposed on the histograms is the Chi-squared probability density
function, with 24 degrees of freedom, the asymptotic distribution of the test statistic.
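As a sketch of what such a simulation computes (written here in Python rather than the thesis's S-Plus; the function names are mine), the GLRT statistic of Eqn. 2.1.4 and a null simulation can be coded as:

```python
import math
import random

def glrt_statistic(successes, trials):
    """-2 ln(lambda) of Eqn. 2.1.4; the binomial coefficients cancel in the ratio."""
    def xlogy(x, y):
        # convention 0 * log(0) = 0, so classes with p_hat = 0 or 1 stay usable
        return 0.0 if x == 0 else x * math.log(y)
    p_mle = sum(successes) / sum(trials)          # pooled MLE, Eqn. 2.1.5
    num = sum(xlogy(x, p_mle) + xlogy(n - x, 1.0 - p_mle)
              for x, n in zip(successes, trials))
    den = sum(xlogy(x, x / n) + xlogy(n - x, 1.0 - x / n)
              for x, n in zip(successes, trials))
    return -2.0 * (num - den)

def simulate_glrt(reps, n, c, p, rng):
    """Null simulation: c homogeneous Binomial(n, p) classes per replicate."""
    stats = []
    for _ in range(reps):
        xs = [sum(rng.random() < p for _ in range(n)) for _ in range(c)]
        stats.append(glrt_statistic(xs, [n] * c))
    return stats

rng = random.Random(0)
stats = simulate_glrt(4000, n=30, c=25, p=0.2, rng=rng)
# under H0 these should look roughly chi-square with c-1 = 24 degrees of freedom
```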
Diagram 2.2.1: Distribution of GLRT Statistics (n=30,c=25,p=0.2)
Diagram 2.2.2: Distribution of GLRT Statistics (n=30,c=25,p=0.5)
Diagram 2.2.3: Distribution of GLRT Statistics (n=30,c=25,p=0.8)
When comparing the quantiles of the above distributions to those of the χ²₂₄, we see the GLRT
produces test statistics that deviate slightly from χ²₂₄, but it seems that the χ²₂₄ fits the GLRT statistics'
distribution quite well, in accordance with the empirical p.d.f. results. See Diagrams 2.2.4-2.2.6.
Diagram 2.2.4: Q-Q Plot for GLRT Statistics (n=30,c=25,p=0.2)
Diagram 2.2.5: QQ-Plot for GLRT Statistics (n=30,c=25,p=0.5)
Diagram 2.2.6: QQ-Plot for GLRT Statistics (n=30,c=25,p=0.8)
Also simulated was the distribution for parameter combination (n=10, c=25, p=0.8). Varying the
size of the class from 30 to 10 yielded an empirical p.d.f. to the right of χ²₂₄. Thus as we move into
smaller-sized classes (smaller n), the distribution appears to deviate from the assumed distribution. See
Diagrams 2.2.7 and 2.2.8. This directly reflects that it is only asymptotically, i.e. as n grows large, that
the GLRT produces statistics that are chi-squared.
Diagram 2.2.7: Distribution of GLRT Statistics (n=10,c=25,p=0.8)
Diagram 2.2.8: Distribution of GLRT Statistics (n=10,c=10,p=0.8)
The simulated distributions from parameter combinations (n=10, c=10, p=[0.2, 0.5, 0.8]) provided
evidence that the closer p is to 0.5, the less deviation from χ²₉. The small deviation from the assumed
distribution, when p=0.8, is thus similar to the observed deviation of parameter combination
(n=10, c=25, p=0.8). Decreasing the number of classes, c, and holding n and p constant, had little effect
on the distribution of the test statistics.
From these small pieces of pictorial information, we could draw a loose conclusion that the
GLRT produces test statistics that are drawn from a distribution similar to χ²_{c−1}, if the parameter
combinations involve large n. Altering the number of classes (the number of proportions to be tested)
created little to no difference in the test statistics' deviation in distribution from the asymptotic χ²_{c−1}.
But most important to note is that the GLRT performs quite well, with respect to maintaining a χ²-like
distribution, except for small n.
To move beyond the pictorial representation of the distributions' difference, using as an example
the distribution simulated with (n=30, c=25, p=0.8), the distribution-free test statistic presented by the
Kolmogorov-Smirnov Goodness-of-Fit test was used to test if the GLRT statistics follow a χ²₂₄ distribution.
Theorem 1 (Mood et al., p. 508). Let X_1, X_2, ..., X_n, ... be independent identically
distributed random variables having common continuous c.d.f. F_X(·) = F(·). Define

D_n = d_n(X_1, ..., X_n) = sup_{−∞ < x < ∞} |F_n(x) − F(x)|,

where F_n(x) is the sample c.d.f. Then

lim_{n→∞} P(√n D_n ≤ x) = [1 − 2 Σ_{j=1}^{∞} (−1)^{j−1} e^{−2j²x²}] I_{(0,∞)}(x) = H(x)
As discussed by Mood et al., the above equation does not depend on the c.d.f. from which the
sample was drawn (other than it being continuous); that is, the limiting distribution of √n D_n is
distribution-free.
As an illustration of the test statistic used in the Kolmogorov-Smirnov Goodness-of-Fit test
(KSGFT), the empirical cumulative distribution function (c.d.f.) of the simulated
GLRT statistics and the c.d.f. of χ²₂₄ are provided in Diagram 2.2.9. The maximal vertical distance
between the two aforementioned c.d.f.'s, multiplied by √n, is the KSGFT statistic. This KSGFT statistic
is compared with the "Critical Values for the Kolmogorov-Smirnov Test of Goodness of Fit" table of the
Handbook of Tables for Probability and Statistics, 1966.
Diagram 2.2.9: Cumulative Distribution of GLRT Statistic (n=30,c=25,p=0.8)
Calculation of the above K-S test statistic involved sorting the 4,000 GLRT statistics to
determine the empirical cumulative probability distribution function. For example, the 132nd order
statistic would yield, for the 4,000 simulated GLRT statistics, a cumulative probability of 132/4,000 =
0.033. Using the 132nd variate of the sorted GLRT statistics, let's call it GLRT_132, F_X(GLRT_132) is
calculated by referring to the cumulative distribution function of the hypothesized distribution. The
absolute difference is taken over all variates, i.e. |F_X(GLRT_132) − 0.033|, the supremum is obtained, and
it is multiplied by the square root of the number of variates used. The S-Plus code that generated these
cumulative probabilities can be found in Appendix C. The KSGFT value was 2.44 for the GLRT. This
KSGFT value led to rejection of the null hypothesis at the 0.05 significance level, but the magnitude of the
test statistic was most likely determined by the size of the sample, 4,000.
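A Python sketch of this K-S computation (my own stand-in for the Appendix C S-Plus code; note the closed-form chi-square c.d.f. used here is valid only for even degrees of freedom):

```python
import math

def chi2_cdf_even_df(x, df):
    """Chi-square c.d.f. for even df, via the Erlang closed form:
    F(x) = 1 - exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    k = df // 2
    s = sum((x / 2.0) ** i / math.factorial(i) for i in range(k))
    return 1.0 - math.exp(-x / 2.0) * s

def ks_statistic(sample, cdf):
    """sqrt(n) * sup |F_n(x) - F(x)|, checking F_n just before and at each jump."""
    xs = sorted(sample)
    n = len(xs)
    d = max(max(abs(i / n - cdf(x)), abs((i - 1) / n - cdf(x)))
            for i, x in enumerate(xs, 1))
    return math.sqrt(n) * d
```

For 4,000 sorted GLRT statistics, `ks_statistic(stats, lambda x: chi2_cdf_even_df(x, 24))` reproduces the computation described above.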
The estimate of the true distribution yielded a critical region, a region that was used to conclude
rejection or failure to reject the null hypothesis. By establishing this estimate of the critical
region, it is safely stated that the test is of size α = 0.05. This is of great importance when comparing the
power of two testing procedures, as will be discussed in Chapter 4. The estimated true critical value is the
95th quantile of the test statistics, per combination of n, p, and c. (See Appendix B for the S-Plus code used.)
Thus, the probability of rejecting the hypothesis of homogeneity, given that there is actually homogeneity,
is 0.05.
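A Python stand-in for that quantile extraction (the thesis's Appendix B code is S-Plus; the chi-square draws below are only a placeholder for actual simulated GLRT statistics):

```python
import math
import random

def simulate_null_stats(reps, df, rng):
    """Placeholder null simulation: chi-square(df) draws as sums of squared normals."""
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(reps)]

def empirical_critical_value(stats, alpha=0.05):
    """(1 - alpha) quantile of the simulated statistics = estimated true critical value."""
    xs = sorted(stats)
    return xs[math.ceil((1 - alpha) * len(xs)) - 1]

rng = random.Random(1996)
null_stats = simulate_null_stats(4000, 24, rng)
crit = empirical_critical_value(null_stats)  # compare with the chi-sq(24) 95th percentile, ~36.4
```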
Diagrams 2.2.10 and 2.2.11 display the critical values of the estimated true distribution, graphed
as a function of n and p, holding c constant. Using diagrams such as these, an indication of some quadratic
nature to the response surface was observed. To try to explain the variability of the critical values with
covariates n, c, and p, a polynomial regression of degree two was used as a first step, using orthogonal
polynomials to minimize multicollinearity. Any error with the model is a combination of lack-of-fit and
simulation noise and is thus a conservative estimate of the true variability of the critical values. A
quadratic model for the response surface of the GLRT's critical values proved to be quite sufficient.
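A degree-two fit of this kind can be sketched as follows. This Python stand-in fits a quadratic in a single covariate via the raw normal equations, whereas the thesis models all three covariates n, c, and p with orthogonal polynomials; it is an illustration of the idea, not the thesis's fit.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c2 in range(col, n + 1):
                M[r][c2] -= f * M[col][c2]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    X = [[1.0, x, x * x] for x in xs]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
    Xty = [sum(X[r][i] * ys[r] for r in range(len(X))) for i in range(3)]
    return solve(XtX, Xty)

# noiseless data on y = 1 + x^2 recovers coefficients ~ [1, 0, 1]
coeffs = fit_quadratic([0, 1, 2, 3, 4], [1, 2, 5, 10, 17])
```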
Diagram 2.2.10: GLRT Critical Values c=10
Diagram 2.2.11: GLRT Critical Values c=15
Diagram 2.2.12: Modeling GLRT Critical Values: Residuals Vs. Fitted
Diagram 2.2.13: Modeling GLRT Critical Values: Q-Q Norm. Plot
Diagram 2.2.14: Modeling GLRT Critical Values: Fitted vs. Observed
Yielding no strong evidence of non-constant variance [Diagram 2.2.12] and having a distribution only
slightly skewed right from N(0,1) [Diagram 2.2.13], the errors provided more evidence that the model was
justified. The MSE of the polynomial regression was 0.5378. The precision of the prediction equation
for the critical values can be witnessed through the graph of the fitted values versus the observed values.
[Diagram 2.2.14]
Since the model of the GLRT critical values fits so well, the MSE can conservatively be seen as
simulation noise. This margin of error can be established by moving two standard deviations away
from the observed critical value. Moving two standard deviations away from the critical value would yield
an approximate probability of 95% of encompassing the true critical value. It is suggested at this point to
use an upper bound of this estimating window to remain conservative in a hypothesis test. In conclusion,
each simulated critical value should estimate the true critical value to within about ±1.5.
The polynomial regression used for this model is:

Y = 8.094 − 17.972p + 17.829p² − 0.232n + 0.006n² + 1.739c − 0.008c²
    − 0.006n·c + 0.004p·n + 0.03p·c + ε
A less conservative approach to estimating the simulation noise is to estimate the variance
between critical values that should be the same value. Due to the symmetry of the GLRT across the
hypothesized p, i.e. we would expect the same critical region for (n, c, p=0.2) and (n, c, p=0.8), we can
utilize the two assumed-similar critical values and determine a variance between those two points. A
sample variance can be established between the two numbers utilizing the function

S² = Σ_{i=1}^{2} (Y_i − Ȳ)²,

where the Y_i's are the assumed-similar critical values. The expected value of S² is the variance of the two
Y_i's, as we treated them as replicates.
As expected from the limiting χ²_{c−1} distribution, the GLRT critical values increase in size as the
number of classes, c, increases. It was thus chosen to estimate the variance associated with the critical
values, per each level of c. As we had 75 replicated critical values and 5 levels of c, 15 estimates of the
variance were available for each class: 15 degrees of freedom in estimating the true variance for each class.
For each level, the 15 estimates for the true variance were averaged. These averages were used to obtain
95% confidence windows, which is exactly plus or minus two standard deviations. These windows
follow, designated by the level of c: c=5, ±0.42; c=10, ±0.43; c=15, ±0.52; c=20, ±0.84; c=25, ±0.71.
Note that these are narrower confidence windows than those obtained through usage of the error of the
polynomial model. See Appendix E for the GLRT critical values.
3.
The Traditional Test
3.1 The Test Statistic
The traditional test for homogeneity of proportions involves a weighting schema using weights
inversely proportional to the variance of the observed proportion. Hypothesized as a variate coming from
a binomial distribution, we see that the observed proportion p̂_i has corresponding estimated variance
p̂_i(1 − p̂_i)/n_i, where n_i is the number of Bernoulli trials within the class (e.g. the number of
emergency care patients treated by Practice i):

p̂_i = (Number of successful Bernoulli trials of Class i) / (Number of Bernoulli trials of Class i)
    = (Number of Patients Referred to a Specialist, within Practice i) / (Number of Patients Treated By Practice i)
To obtain a representative statistic for the set of proportions, the weights for the estimates of the p_i's are:

w_i = 1/σ̂²(p̂_i) = n_i / (p̂_i(1 − p̂_i))        Eqn. 3.1.1

and the aforementioned representative proportion, utilizing these established weights, is thus:

p̂_+ = Σ_{i=1}^{c} w_i p̂_i / Σ_{i=1}^{c} w_i        Eqn. 3.1.2
Using the computations above, we formulate the traditional test for homogeneity:

χ²_trad = Σ_{i=1}^{c} (p̂_i − p̂_+)² / σ̂²(p̂_i)        Eqn. 3.1.3

which is asymptotically a Chi-squared variate, with c−1 degrees of freedom, as the n_i's tend towards
infinity.
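In Python (a sketch of Eqns. 3.1.1-3.1.3 assuming counts are supplied as successes and trials; the thesis's own code is S-Plus), the statistic combines as:

```python
def traditional_statistic(successes, trials):
    """Variance-test statistic (Eqns. 3.1.1-3.1.3); undefined if any p_hat is 0 or 1."""
    phats = [x / n for x, n in zip(successes, trials)]
    if any(p in (0.0, 1.0) for p in phats):
        raise ValueError("weights are infinite when a proportion equals 0 or 1")
    weights = [n / (p * (1 - p)) for p, n in zip(phats, trials)]        # Eqn. 3.1.1
    p_plus = sum(w * p for w, p in zip(weights, phats)) / sum(weights)  # Eqn. 3.1.2
    return sum(w * (p - p_plus) ** 2 for w, p in zip(weights, phats))   # Eqn. 3.1.3
```

Note how the `ValueError` branch mirrors the failure mode discussed in Section 3.2.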
At casual observation, it should be seen that the closer p̂_i is to 1 or 0, the more weight it has
in the representative statistic. With the example of the ASPN practices in Table 1.1, the rate at which the
patients afflicted with respiratory problems were referred to pulmonary specialists within Practice 9 was
0.667; within Practice 4, the rate was 0.900. The representative proportion (p̂_+) is determined more
heavily by the referral rate of Practice 4 than by the rate of Practice 9, even if, which is not the case with
these selected two, the practices treat the same number of emergency care patients with this particular
condition.
Snedecor and Cochran (1967) use, in place of p̂_+, the maximum likelihood estimator p̂ derived in Section 2.1. Using the p̂_+ described in Equation 3.1.2 minimizes Equation 3.1.3, thus providing smaller values of the traditional test statistic; this yields a more conservative testing procedure than that of Snedecor and Cochran's.
3.2 Assessing the Probability of Failure of the Traditional Test Statistic
The traditional test statistic cannot be used when an observed p̂_i is equal to 0 or 1. Observing the inverse relationship between the proposed weights and the estimated variances of the proportions, it follows that the weight corresponding to a proportion near 0 or 1 tends to infinity, enlarging the test statistic and leading to a false rejection of the null hypothesis; at exactly 0 or 1, the weight is undefined. Note: since the referral rates of Table 1.1 include 0 and 1, this traditional testing procedure may not be used.
But how small, or large, can the hypothesized p be in order to properly use the traditional testing
procedure? This question can only be answered after obtaining information about the other two
parameters n and c.
What needs special consideration at this point is the region where the traditional test statistic can be utilized with low probability of failure, to compare to the GLRT statistic. The GLRT statistic falters nowhere, so we need to make sure, for comparison's sake, that the traditional test statistic can be used as well.
The probability associated with a useful combination of p̂_i's, given the hypothesized p, the number of Bernoulli trials per class, and the number of classes, is calculated by:
P[useful combination of p̂_i's] = Π_{i=1}^{c} P[p̂_i useful]
= Π_{i=1}^{c} (1 - P[p̂_i not useful])
= Π_{i=1}^{c} [1 - C(n_i,0) p⁰ (1-p)^{n_i} - C(n_i,n_i) p^{n_i} (1-p)⁰]
= Π_{i=1}^{c} [1 - (1-p)^{n_i} - p^{n_i}]
Eqn. 3.2.1
An S-Plus program to compute this probability is given in Appendix A. This probability is graphed as a function of n and p in Diagram 3.2.1, for c=10, where n_i = n, i=1,...,10. Easily identified by this surface is the region where we are safe to utilize the traditional test statistic without failure. This region is indicated by the flatness of the surface. Using the contour plot of Diagram 3.2.2, it can be seen that the usage of 10 proportions coming from a binomial distribution with hypothesized p between 0.4 and 0.6, with each class size of magnitude greater than 20, is warranted with risk < 0.001 of failure.
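With equal class sizes (n_i = n), Eqn. 3.2.1 collapses to [1 - (1-p)^n - p^n]^c. A Python check of this reduced form (the thesis's own S-Plus version appears in Appendix A):

```python
def prob_usable(n, p, c):
    """Eqn. 3.2.1 with equal class sizes: probability that none of the
    c observed proportions equals 0 or 1, each based on n Bernoulli(p)
    trials."""
    return (1.0 - (1.0 - p) ** n - p ** n) ** c

# The flat region of Diagram 3.2.1: for c = 10 classes, p between 0.4
# and 0.6 with class sizes above 20 keeps the failure risk below 0.001.
print(prob_usable(21, 0.4, 10))   # > 0.999
print(prob_usable(5, 0.9, 10))    # tiny: small classes, extreme p
```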
Diagram 3.2.1: Probability of Ability to Use Traditional Test, c=10
Diagram 3.2.2: Contour Plot of Surface (size of class vs. hypothesized p)
Diagram 3.2.2 displays the ineffectiveness of the traditional test statistic when the magnitude (number of Bernoulli trials) of the average class is minimal (less than 10) and when we have a relatively low or high hypothesized number of successes per class. These problematic combinations of average class size and p will be addressed in Chapter 5, where a Bayesian modification to testing homogeneity of proportions is introduced.
3.3 Type 1 Error
As a first glance at the behavior of the traditional test statistic's distribution, the parameter combination (n=30, c=25, p=0.8) was used. The S-Plus code of Appendix B generated 3,872 test statistics, formulated under the aforementioned parameter combination. [Note at this point that a simulation of 4,000 traditional test statistics was attempted. Through generation of 25 binomial variates per iteration of the simulation, only 3,872 iterations had a usable set of variates; in the remainder, a 0 or 1 was generated as one (or more) of the binomial variates, rendering the traditional test useless. See Section 3.2.] For this parameter combination, the distribution's deviation from χ²_24 is one of longer-tailedness; the values observed in the traditional testing procedure are larger than what we would expect, assuming a χ²_24 distribution. Diagram 3.3.3 displays the histogram representing the empirical p.d.f. of the traditional test statistic, for this specific parameter combination, and superimposed on it is the asymptotic distribution, χ²_24. The distribution of the traditional test statistic was also simulated for parameter combinations (n=30, c=25, p=0.2, 0.5). With p=0.2, we witnessed results similar to those of p=0.8, which was to be expected because of the symmetric nature of the traditional test statistic across the p's. With p=0.5, the distribution of the test statistics followed χ²_24 quite nicely. Diagrams 3.3.1-3.3.3 display these observations.
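The simulation just described can be sketched as follows, in Python rather than the Appendix B S-Plus; the seed and the quoted constant 36.415 (the 0.95 quantile of χ²_24) are assumptions of this sketch, not values from the thesis:

```python
import random

def trad(phat, n):
    """Traditional test statistic (Eqn. 3.1.3) for equal class sizes."""
    w = [n / (q * (1 - q)) for q in phat]
    pplus = sum(wi * qi for wi, qi in zip(w, phat)) / sum(w)
    return sum(wi * (qi - pplus) ** 2 for wi, qi in zip(w, phat))

def simulate_crit(n, c, p, iters=4000, seed=1):
    """Estimate the 0.95 quantile of the statistic's true distribution,
    discarding iterations where any phat equals 0 or 1 (cf. the 3,872
    of 4,000 usable iterations reported above)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(iters):
        phat = [sum(rng.random() < p for _ in range(n)) / n for _ in range(c)]
        if any(q in (0.0, 1.0) for q in phat):
            continue  # unusable set of variates
        stats.append(trad(phat, n))
    stats.sort()
    return stats[int(0.95 * len(stats))]

crit = simulate_crit(30, 25, 0.8)
# The empirical critical value sits well above 36.415, the 0.95 quantile
# of chi-squared with 24 df: the longer tail seen in Diagram 3.3.3.
print(crit)
```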
24
Probability Density Probability Density
Diagram 3.3.1: Distribution of Trad. Test Statistics (n=30,c=25,p=0.2)
Traditional
Diagram 3.3.2: Distribution of Trad. Test Statistics (n=30,c=25,p=0.5)
25
Diagram 3.3.3: Distribution of Trad. Test Statistics (n=30,c=25,p=0.8)
O
o
CO
o
o
20 40 60
Traditional
To see how the traditional test statistic behaves for smaller n, the parameter combination (n=10, c=25, p=0.8) was used to simulate, in attempt, 4,000 traditional test statistics, again using the S-Plus code of Appendix B. The simulated distribution was shifted left, drastically, from the hypothesized χ²_24. Thus, as we decreased n, the size of each class, the traditional test statistic's distribution deviated greatly from the asymptotic distribution. Similar results followed, with deviation of the same magnitude and direction, for simulations using parameter combinations (n=10, c=10, p=0.2, 0.8); thus altering the number of classes had little effect on the behavior of the distribution of the traditional test statistic. Important to note is that with combination (n=10, c=10, p=0.5) the distribution was skewed from χ²_9 in the opposite direction (right) of the distributions for (n=10, c=10, p=0.2, 0.8), and it was more chi-squared-like. So, in conclusion, when we have smaller n and p close to 0.5, the traditional test statistic does behave somewhat like a χ²_9 variate.
Table 3.3: Critical Values for Traditional Chi-Squared Test for
Homogeneity of Proportions (x₀²), P[X²_trad ≥ x₀²] = 0.05
p Size of Each Class (n) Number of Classes (c)
5 10 15 20 25
0.2 10 6.773239 12.316631 17.626680 20.658656 26.620682
15 9.044598 15.824357 22.940475 29.432538 34.901556
20 10.555826 18.911261 27.256803 34.383117 41.181982
25 11.493894 21.372230 29.182222 37.904865 45.730485
30 11.742557 23.168593 32.206702 40.693759 48.385532
0.3 10 9.797422 19.118012 26.593878 35.007753 40.746125
15 12.215306 22.105102 31.485198 39.774807 48.17185
20 12.47389 24.82931 31.98462 41.96835 50.10555
25 12.50450 24.01396 33.98946 46.23746 54.97637
30 11.18971 21.67130 30.68705 38.70966 46.47078
0.4 10 13.64638 24.71230 35.53848 45.10311 54.50412
15 12.63613 24.13758 34.86871 44.95899 52.40975
20 11.52461 20.90910 29.24025 37.97424 46.85223
25 10.93307 19.89903 28.05386 35.70133 42.94649
30 10.53307 19.76447 27.48838 34.20329 41.38230
0.5 10 15.33199 27.40924 38.65350 49.20346 59.85152
15 11.75673 21.88062 31.26730 39.91737 48.42441
20 11.02656 19.91688 28.00401 35.43433 42.81818
25 10.59926 18.83105 26.62084 35.04673 40.16616
30 10.21207 18.31242 26.46031 32.08912 39.81960
0.6 10 13.79942 24.91524 34.59076 44.44536 54.67369
15 13.12923 24.82879 35.24545 44.09692 51.91096
20 11.80682 21.51856 30.39668 39.56237 46.86581
25 11.29377 20.38845 28.09596 36.90418 44.86007
30 11.13045 19.70270 27.75258 35.01562 42.91455
Table 3.3 (continued): Critical Values for Traditional Chi-Squared Test for
Homogeneity of Proportions (x₀²), P[X²_trad ≥ x₀²] = 0.05
0.7 10 10.20962 18.66921 26.38649 34.60401 40.57658
15 11.66760 22.16062 31.09343 38.68213 47.36754
20 12.80923 23.251257 32.881049 41.977614 50.680730
25 11.735456 22.452635 33.076669 41.548290 52.587338
30 11.614164 21.271008 29.835853 38.850912 46.406711
0.8 10 7.006548 12.859265 17.775812 22.612319 27.200953
15 9.190978 15.711640 22.653064 29.143503 34.710637
20 10.380357 18.983298 26.678808 34.173714 41.685613
25 11.527602 21.193220 29.861222 37.794547 46.478983
30 12.14935 22.17020 31.60790 40.75586 49.01912
As was done with the GLRT, the distribution-free KSGFT was used to test whether the traditional test statistics, for parameter combination (n=30, c=25, p=0.8), follow a χ²_24. Pictorially, we observed a longer-tailed empirical p.d.f. for this parameter combination [Diagram 3.3.3], and accordingly the KSGFT statistic was observed to be 13.09. This led to rejection of the hypothesis that the two distributions were the same, at significance level 0.05.
Since we witnessed strong evidence that the traditional test statistic's distribution was not the same as the assumed asymptotic distribution, χ²_(c-1), better-estimated critical regions were established. Through each of these 175 simulations of the true distributions, the critical value was formulated, defined as the point above which 5% of the test statistics lie. By designating this critical region, we can adequately say the test is of size α=0.05. These critical values are provided in Table 3.3. As examples, Diagrams 3.3.4 and 3.3.5 present graphs of these critical values as functions of n, c, and p.
Unfortunately, Table 3.3 can be used only for these convenient settings of (n, p, c), and analysts usually have at hand different class sizes, a number of classes not equal to a multiple of 5, and no concrete prior indication of p. In hopes of interpolating between the cells of this convenient table, a polynomial regression was performed to observe a possible functional relationship between the critical values and the three parameters n, c, and p.
With no transformation of the critical values, the mean squared error, the point estimate for the variance of the model, proved to be rather large (2.55); thus traversing two standard deviations away from the observed critical value would yield an estimation window of approximate width 6.4, which is not very precise. It was also shown, through a scatter plot of the residuals versus the fitted values of the model, that the variance was increasing. With this evidence of the inappropriateness of the model, a variance-stabilizing transformation was required. The log and square root transformations were tried. Though plausible for the purpose of stabilizing the variance, each transformation proved to yield a large MSE as well, if smaller than that of the model with no transformation of the critical values. It was concluded that the model of the traditional test's critical values had more error than would be tolerated in obtaining predicted values for given (n, p, c) and that the model was not flexible enough to fit the critical values. Thus the best that could be done was to present a margin of error for each of the critical values within Table 3.3; this is the error created by the simulation of the traditional test's critical values. The formulation of this error follows.
As was performed for the critical values of the GLRT, the simulation noise was estimated, per level of c, by utilizing two test statistics that should be the same. With these two replicates, a sample variance was established, whose expected value is the variance of the two critical values. For each level of c, 15 estimates of the true variance were averaged; these averages provided 95% confidence windows, defined by plus or minus two standard deviations. These windows are presented by level of c: c=5, 0.57; c=10, 1.02; c=15, 0.91; c=20, 2.10; c=25, 1.39. With the size of these confidence windows, it isn't quite clear that the behavior of the critical values is predictable, but it is the best the simulations have to offer.
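The replicate-pair computation above can be sketched as follows (Python; the pair values shown are hypothetical, not the thesis's simulated critical values):

```python
def sim_window(pairs):
    """95% simulation-noise window (plus or minus two standard deviations)
    from replicate pairs of critical values simulated under identical
    settings.

    The sample variance of a pair (a, b) is (a - b)^2 / 2, an unbiased
    estimate of the simulation variance; the per-pair estimates are
    averaged before taking the square root."""
    var_est = sum((a - b) ** 2 / 2.0 for a, b in pairs) / len(pairs)
    return 2.0 * var_est ** 0.5

# Hypothetical replicate pairs (not the thesis's simulated values):
print(sim_window([(11.2, 11.5), (18.9, 18.6), (26.5, 26.9)]))
```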
4.
Power Comparison
4.1 Power Comparison: Sufficiency
In comparing two testing procedures, it is important to assess the probability of rejecting the null hypothesis when, in fact, it is justified to reject it, given the alternative hypothesis; this defines the power of a test statistic. Before comparing the power of the two test statistics, it is necessary to comply with some assumptions.
Test statistics to be involved in such a comparison should depend on sufficient statistics only; thus it complements our study to analyze the sufficiency of the statistics on which our test statistics depend. Sufficient statistics are statistics that "condense the parameter space under the alternative hypothesis, Ω, in such a way that no information about Ω is lost" (Mood et al., p. 301), which implies that all information about the hypothesized parameter space can be found within the sample.
Definition 15: Sufficient statistic (Mood et al.)
Let X_1, X_2, ..., X_n be a random sample from the density f(·; θ), where θ may be a vector. A statistic S = s(X_1, X_2, ..., X_n) is defined to be a sufficient statistic if and only if the conditional distribution of X_1, X_2, ..., X_n given S = s does not depend on θ for any value s of S.
The sufficiency can be proven using the factorization method (Mood et al., p. 408), since the observed p̂_1, p̂_2, ..., p̂_c, later referred to together as the vector p̂, are independent and binomially distributed, as addressed earlier. Without loss of generality, the number of Bernoulli trials per class was set constant, i.e. n_i = n for all i in [1,...,c]. Since each trial (j) within a class (i) is a Bernoulli trial:
Y_ij = 0 if "failure"
     = 1 if "success"
with f_Y(y) = p^y (1-p)^(1-y) I_{0,1}(y). The joint density of the Y_ij can thus be configured, letting X_i = Σ_{j=1}^{n} Y_ij:
f_Y(y) = Π_{i=1}^{c} Π_{j=1}^{n} p_i^{y_ij} (1 - p_i)^{1 - y_ij}
= Π_{i=1}^{c} p_i^{x_i} (1 - p_i)^{n - x_i}
= Π_{i=1}^{c} e^{x_i ln(p_i) + (n - x_i) ln(1 - p_i)}
= exp( Σ_{i=1}^{c} [ x_i ln(p_i / (1 - p_i)) + n ln(1 - p_i) ] )
Eqn. 4.1.1
where x_i / n is simply the observed proportion p̂_i of the n Bernoulli trials that were successful. In
accordance with the factorization criterion for sufficiency, the joint distribution of the variates Y_ij can be factored into a product of two functions: one positive and solely dependent upon the data; the other dependent upon both the hypothesized p_i's and the data, but dependent on the data only through the sufficient statistics. This proves that p̂_i is sufficient for p_i. In conclusion, both the traditional test statistic and the GLRT statistic depend on these individual p̂_i's, so we have chosen adequate test statistics to compare with regard to sufficiency.
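A quick numerical illustration of this factorization (Python; the sequences and p value are hypothetical): two samples with different orderings but the same success count x_i have identical likelihoods, so nothing beyond p̂_i = x_i/n informs inference about p_i.

```python
import math

def loglik_seq(seq, p):
    """Log-likelihood of an ordered 0/1 Bernoulli sequence, term by term."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in seq)

# Two hypothetical orderings with the same success count x = 3 of n = 5:
a = [1, 1, 1, 0, 0]
b = [0, 1, 0, 1, 1]
# The likelihood depends on the data only through x = sum(seq),
# i.e. through phat = x/n -- exactly the factorization claim.
print(abs(loglik_seq(a, 0.4) - loglik_seq(b, 0.4)) < 1e-12)  # True
```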
4.2 Power Comparison: Type 1 Error
For the two test statistics to be compared, their Type 1 errors α must be equivalent in order to deem one or the other the more powerful test of size α. "The justification for fixing the size of the Type 1 error to be α ... seems to arise from those testing situations where the two hypotheses are formulated in such a way that one type of error is more serious than the other. The hypotheses are stated so that the Type 1 error is the more serious, and hence one wants to be certain that it is small." (Mood et al., p. 411) Logically, if one test statistic has a Type 1 error of α' in accordance with some critical value, we could choose a test statistic whose distribution dictates a larger Type 1 error α'', imposing a smaller Type 2 error. This would yield a false comparison of the two test statistics.
4.3 The Final Power Comparison: Results
Using the critical values established earlier, so that both tests are of the common size discussed in Section 4.2, and given specific deviations from the null hypothesis, the power of the two tests can be established. This is the probability of rejecting the hypothesis of homogeneity when, in fact, there exists some heterogeneity; using a metric for the magnitude of the deviation from the null hypothesis, the power can be graphed as a function of this deviation, for graphical comparison.
Power was estimated for the traditional and GLR tests, for a given departure from the null hypothesis. A deviation from the null hypothesis was characterized as a fluctuation of one of the five hypothesized p_i's; that is, one of the observed p̂_i's comes from a different distribution than do the other p̂_i's. The deviation from the null hypothesis is measured by the magnitude of the deviation of that sole hypothesized p_i from the others. The deviation ranged from 0.00 to 0.20, in increments of 0.04. This yielded 6 points of analysis for a given parameter combination. Diagrams 4.3.1-4.3.4 display the two estimated power functions, dependent on deviation, for the parameter combinations n=10, c=5, and p=[0.2, 0.4, 0.5, 0.6].
Diagram 4.3.1 displays the complete domination of the traditional test's power over that of the GLRT. Diagrams 4.3.2 and 4.3.3 indicate that when n=10, c=5, and p=0.5, the GLRT is the more powerful test as deviation grows, but when p=0.4, the roles reverse. With the power comparison for p=0.4 and deviation=0.00, the graphs do not meet. We had assumed for this power comparison that the tests were of the same size. This equality of size should have been represented by the meeting of the graphs, as deviation=0.00 implies that we, in fact, have all proportions coming from the same distribution
Diagram 4.3.1: Power Comparison (n=10,c=5,p=0.2)
Diagram 4.3.2: Power Comparison (n=10,c=5,p=0.4)
Diagram 4.3.3: Power Comparison (n=10,c=5,p=0.5)
Diagram 4.3.4: Power Comparison (n=10,c=5,p=0.6)
(power and size should be equal). This difference in size could be attributed to simulation noise in establishing the critical values of these two test statistics, as discussed in Section 3.3. The unexpected direction change of power seen in Diagram 4.3.3 is also speculated to be a result of simulation noise; we would expect power only to increase as deviation grew.
Also produced were similar power functions for parameter combinations (n=30, c=5) and (n=30, c=25), p=0.2, 0.4, 0.5, 0.6. Examining the power functions for (n=30, c=5) [Diagrams 4.3.5-4.3.8], we see that the GLRT and the traditional test perform in the same fashion, with respect to power, particularly for p near 0.5. This similarity could be the product of the asymptotic nature of both tests, as we are using a relatively large n (30). The greater difference between the power curves for p=0.2 is interesting, leading to the conclusion that the tests perform differently when we have smaller or larger values of p. This greater difference in power curves for the smaller p was witnessed also with (n=30, c=25) [Diagrams 4.3.9-4.3.12], relative to the power curves for p=0.4, 0.5, 0.6. Similar to the results of Section 2.2, decreasing c yielded no difference in the behavior of the test statistics, this example showing little difference in power. From these graphs we are unable to conclude that one test is consistently better than the other.
Diagram 4.3.5: Power Comparison (n=30,c=5,p=0.2)
Diagram 4.3.6: Power Comparison (n=30,c=5,p=0.4)
Diagram 4.3.7: Power Comparison (n=30,c=5,p=0.5)
Diagram 4.3.8: Power Comparison (n=30,c=5,p=0.6)
Diagram 4.3.9: Power Comparison (n=30,c=25,p=0.2)
Diagram 4.3.10: Power Comparison (n=30,c=25,p=0.4)
Diagram 4.3.11: Power Comparison (n=30,c=25,p=0.5)
Diagram 4.3.12: Power Comparison (n=30,c=25,p=0.6)
5.
Bayesian-Adjustment Approach
5.1 Bayesian Theory
Widespread familiarity with the traditional testing procedure indicates that analysts will continue using it, correctly or falsely. Assuming that analysts will continue to use it, and knowing that there is a possibility of false usage (see Section 3.2), let us introduce an adjustment for the p̂_i's so that the test will always be usable.
The methods addressed in the previous chapter depended on the parameters being fixed. For this chapter, let us assume that these parameters, specifically p, are actually variates from their own distribution, called a prior distribution.
The nature of p dictates a range from 0 to 1, as there are never more Bernoulli successes than there are trials, and certainly never a negative number of successes. These constraints match the defined domain of the beta distribution [Potthoff and Whittinghill, Rice], so that is an excellent place to begin with our guesses of how p is distributed. We write p's prior distribution as:
π(p) = [Γ(a+b) / (Γ(a)Γ(b))] p^(a-1) (1-p)^(b-1)
Eqn. 5.1.1
where E(p) = a/(a+b) and var(p) = ab / [(a+b)²(a+b+1)].
Continuing to assume that our data are distributed binomially, with parameters n and p, we write the conditional distribution of our data given those two parameters, per usual, as:
π(x|p) = C(n,x) p^x (1-p)^(n-x)
Eqn. 5.1.2
The joint distribution between the data and p follows:
π(x,p) = π(x|p) π(p) = C(n,x) p^x (1-p)^(n-x) · [Γ(a+b) / (Γ(a)Γ(b))] p^(a-1) (1-p)^(b-1)
Eqn. 5.1.3
Following Bayes' rule, the posterior distribution of p, the distribution after observing X, is:
π(p|x) = π(x|p) π(p) / ∫₀¹ π(x|p) π(p) dp
Eqn. 5.1.4
Using equations 5.1.2 and 5.1.3 in the above equation 5.1.4, we see that:
π(p|x) = [C(n,x) p^x (1-p)^(n-x) · (Γ(a+b)/(Γ(a)Γ(b))) p^(a-1) (1-p)^(b-1)] / [∫₀¹ C(n,x) p^x (1-p)^(n-x) · (Γ(a+b)/(Γ(a)Γ(b))) p^(a-1) (1-p)^(b-1) dp]
= p^(x+a-1) (1-p)^(n-x+b-1) / ∫₀¹ p^(x+a-1) (1-p)^(n-x+b-1) dp
= [Γ(n+a+b) / (Γ(x+a) Γ(n-x+b))] p^(x+a-1) (1-p)^(n-x+b-1)
Eqn. 5.1.5
which is the probability density function (p.d.f.) of a beta with parameters (x+a, n-x+b). Note that when p's prior distribution is uniform, its p.d.f.'s parameters are (a=1, b=1). A prior distribution of uniform nature is most conservative, stating that there exists no indication of the size of p.
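The normalizing constant of Eqn. 5.1.5 can be checked numerically; a Python sketch with hypothetical data (x = 7 successes in n = 10 trials) and the uniform prior a = b = 1:

```python
import math

def beta_posterior_pdf(p, x, n, a=1, b=1):
    """Posterior density of Eqn. 5.1.5: Beta(x + a, n - x + b)."""
    lognorm = (math.lgamma(n + a + b)
               - math.lgamma(x + a) - math.lgamma(n - x + b))
    return math.exp(lognorm
                    + (x + a - 1) * math.log(p)
                    + (n - x + b - 1) * math.log(1.0 - p))

# Midpoint-rule check that the density integrates to about 1.
m = 100000
total = sum(beta_posterior_pdf((k + 0.5) / m, 7, 10) for k in range(m)) / m
print(round(total, 4))  # 1.0
```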
With the assumed beta prior distribution, the expected value of p, given the data, is a function of the observed proportion of Bernoulli successes p̂, the number of Bernoulli trials n, and the distribution's two parameters a and b, through the following formulation:
E(p|x) = (x+a) / [(x+a) + (n-x+b)] = (x+a) / (a+n+b) = [(a+b)/(a+b+n)] · [a/(a+b)] + [n/(a+b+n)] · (x/n)
Eqn. 5.1.6
and with our first conservative step towards specifying the distribution's parameters (a=1, b=1), we see that E(p|x) = [2/(2+n)](0.5) + [n/(2+n)](p̂), where 0.5 is the expected value of the uniform distribution on the interval [0,1].
It is important to note at this point that:
E(p | p̂=0) = 1/(2+n)
Eqn. 5.1.7
(and, symmetrically, E(p | p̂=1) = (1+n)/(2+n)). The beauty of this derivation is that these p̂'s of Eqn. 5.1.7 are exactly those that cause problems within the use of the traditional testing procedure, as the corresponding weights are infinite (Section 3.1). Justification can be sought, then, for adjusting these problematic p̂'s so that the traditional testing procedure can be used freely, without violation.
5.2 Results
Four thousand traditional test statistics were obtained for each of the three combinations (n, p, c) = {(30, 0.2, 25), (30, 0.5, 25), (30, 0.8, 25)}, similar to those used in the Section 3.3 analyses. Generated through simulation [see Appendix D for the S-Plus code used], these test statistics are those construed using the expected values of the hypothesized p, given the observed data, as described in Section 5.1. These expected values simply replaced the observed p̂_i's within the weights of the traditional test statistic. What we would ideally like to see is that the distribution of the test statistics is similar to a known distribution; for convenience, let us compare the distribution to the Chi-squared with c-1 degrees of freedom. This is the asymptotic distribution of the traditional test statistic, which was shown to be significantly different from the estimated true distribution, as discussed in Section 3.3. Comparison of
these distributions has been done both pictorially and through the non-parametric Kolmogorov-Smirnov Goodness-of-Fit test.
Diagrams 5.2.1, 5.2.2, and 5.2.3 display the estimated probability density functions (p.d.f.'s) of the test statistics for parameter combinations (n, p, c) = {(30, 0.2, 25), (30, 0.5, 25), (30, 0.8, 25)}, respectively, each indicated by a histogram. The observed p.d.f.'s are compared to the p.d.f. of χ²_24, indicated by the solid black line.
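The adjustment used for these simulations can be sketched in Python (mirroring the Appendix D S-Plus, where the adjusted proportions enter only through the weights); the data values below are hypothetical:

```python
def bayes_adjust(phat, n):
    """Posterior mean of p under a uniform prior (a = b = 1), Eqn. 5.1.6:
    E(p | x) = 1/(2+n) + (n/(2+n)) * phat, never exactly 0 or 1."""
    return 1.0 / (2 + n) + (n / (2 + n)) * phat

def bayes_trad_stat(ophat, n):
    """Traditional statistic with the adjusted proportions used only
    inside the weights, as in Section 5.2; ophat are the observed
    (possibly 0 or 1) proportions."""
    adj = [bayes_adjust(p, n) for p in ophat]
    w = [n / (q * (1 - q)) for q in adj]   # weights can no longer blow up
    pplus = sum(wi * pi for wi, pi in zip(w, ophat)) / sum(w)
    return sum(wi * (pi - pplus) ** 2 for wi, pi in zip(w, ophat))

print(bayes_adjust(0.0, 10))                 # 1/12, as in Eqn. 5.1.7
print(bayes_trad_stat([0.0, 0.5, 1.0], 10))  # finite, despite the 0 and 1
```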
Diagram 5.2.1: Distribution of Bayesian-Adjusted Test Statistics (n=30,c=25,p=0.2)
Diagram 5.2.2: Distribution of Bayesian-Adjusted Test Statistics (n=30,c=25,p=0.5)
Diagram 5.2.3: Distribution of Bayesian-Adjusted Test Statistics (n=30,c=25,p=0.8)
As indicated by Diagrams 5.2.1-5.2.3, the estimated true distribution of the Bayesian-adjusted traditional test statistic, for relatively large n, fits the hypothesized χ²_24 quite well, and better than those of the respective Diagrams 3.3.1-3.3.3 for the non-adjusted traditional test.
More interesting was the behavior of this Bayesian-adjustment technique for those problematic areas discussed in Section 3.2. For this analysis, Diagram 3.2.1 was used again to identify the problematic regions corresponding to n=10, c=10, and p=[0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9]. This analysis also required another generation of 4,000 test statistics using this Bayesian adjustment for each of the generated proportions, per each combination of n, c, and p. Diagrams 5.2.4-5.2.10 present the histograms representing the p.d.f.'s for each parameter combination described above. These histograms demonstrate the Chi-squared-like nature of the Bayesian-adjusted test statistics, except for the small/large proportions (p=0.1, 0.9). The reader should keep in mind that this demonstration involves relatively small n. For further research, it would be necessary again to establish critical values for these parameter combinations.
Diagrams 5.2.4-5.2.10: Distributions of Bayesian-Adjusted Test Statistics (n=10, c=10, p=0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9)
The reader should note at this point the extreme difference between the distributions of the traditional test statistics for parameter combinations (n=10, c=10, p=[0.2, 0.8]) [Diagrams 5.2.11, 5.2.12] and the distributions corresponding to the same parameter combinations of the Bayesian-adjusted test statistics [Diagrams 5.2.5, 5.2.9].
It is shown by Diagrams 5.2.5 and 5.2.9 that the Bayesian adjustment created a distribution that was more of a χ²_9 nature than the distribution of the non-adjusted statistic. This is a remarkable observation. With the traditional testing procedure, we were only able to utilize approximately 31.5% of the test statistics where p=0.2, 0.8, and 98% for p=0.5. Because of the low probability of usability for the smaller/larger p's, the true distribution of the test statistic may have been under-represented. With the Bayesian adjustment, each generated proportion posed no problem when incorporated into the test statistic; thus 100% of the test statistics were used to determine their distribution. With the ASPN data of Table 1.1, where the traditional testing procedure could not be used due to problematic proportions, we could test for homogeneity with the Bayesian adjustment. The null hypothesis of homogeneity was rejected, using this Bayesian technique, with p-value < 0.001.
5.3 Power Comparison with GLRT and Traditional
Comparison of the power of the Bayesian-adjusted traditional testing procedure, the non-adjusted traditional procedure, and the GLRT procedure was performed. The simulation used only the parameter combination (n=10, c=5, p=0.2). Diagram 5.3.1 displays this power function comparison, with power as a function of the deviation from the null hypothesis of homogeneity. The deviation is defined in the fashion described in Section 4.3, with one of the generated proportions a variate from a different binomial distribution than that of the others; the difference in distribution is measured by the difference between the hypothesized p_i's. We see from Diagram 5.3.1 that the Bayesian-adjusted traditional test performs similarly to the unadjusted traditional test statistic, with respect to power. Both tests perform better than the GLRT, as indicated by the vertical shifts in the graphs. This observation is coupled with the fact that the Bayesian-adjusted test statistic may be used for all proportions.
Diagram 5.3.1: Power Comparison (n=10,c=5,p=0.2)
6.
Conclusion
6.1 Remarks
In our ability to compare the generalized likelihood ratio and the traditional tests, we stumbled upon information leading us to disbelieve their assumed hypothesized distribution, χ²_(c-1). In Chapters 2 and 3, the true size of the tests was established, thus yielding a critical region to be utilized for both tests. Chapter 4 presented the power for each of the size-α tests, represented as a function of the deviation from the null hypothesis. It was witnessed that the traditional and generalized likelihood ratio testing procedures performed equally for particular combinations of n, c, and p, when the established true regions of rejection were utilized; with other parameter combinations, the domination of one testing procedure over the other varied, leading to inconclusive results. One lingering problem with the traditional testing procedure was the inability to use it when faced with a problematic combination of proportions: those that rendered the test statistic useless, as one or more of them were equal to 1 or 0.
To make usable the unusable, a Bayesian adjustment was proposed in Chapter 5. For specific combinations of the parameters n, p, and c, the distributions of 4,000 test statistics generated through usage of this Bayesian adjustment were displayed. It was concluded that the distribution for each combination, where p=0.2-0.8, was similar to the hypothesized distribution, χ²_(c-1), and closer than the distribution of those generated without the adjustment. The Bayesian-adjusted test statistic also retained the power of the non-adjusted traditional test statistic for a specific parameter combination. It is suggested to use this Bayesian technique as a testing procedure for binomial homogeneity.
Appendix A Splus code used to assess probability
of useful set of 10 proportions, given
n and p
prmat2 <- NULL
for (m in 1:10) {
  for (k in 1:99) {
    hyp.p <- k/100
    classiz <- 5*m
    prob.p <- (1 - ((hyp.p)^classiz + (1-hyp.p)^classiz))^10
    prmat2 <- rbind(prmat2, c(hyp.p, classiz, prob.p))
  }
}
Appendix B Splus code used to establish true
critical regions of test statistics,
given n, c and p
type1err.f <- function(n, pr, numclass) {
  for (times in 1:4000) {
    if (times == 1) {
      testmat <- numeric()
      phat <- numeric()
    }
    phat <- rbinom(length(n), n, pr)/n
    pmle <- sum(phat)/length(n)
    lnpmle <- log(pmle)
    ln1mmle <- log(1-pmle)
    sum1 <- sum(n*phat*lnpmle)
    sum2 <- sum(n*(1-phat)*ln1mmle)
    sum3 <- (-1)*sum((n*phat)[phat != 0]*log(phat[phat != 0]))
    sum4 <- (-1)*sum((n*(1-phat))[phat != 1]*log(1-phat[phat != 1]))
    glrt <- (sum1+sum2+sum3+sum4)*(-2)
    tradtest <- NA
    if (sum(bad <- phat == 1 | phat == 0) == 0) {
      w <- n/(phat*(1-phat))
      sumwts <- sum(w)
      pplus <- sum(w*phat)/sumwts
      tradtest <- sum((phat-pplus)*(phat-pplus)*w)
    }
    testmat <- rbind(testmat, c(tradtest, glrt))
  }
  degfree <- numclass-1
  ok1 <- !is.na(testmat[,1])
  sumtrad <- sum(testmat[ok1,1] >= qchisq(.95, degfree))
  sumglrt <- sum(testmat[,2] >= qchisq(.95, degfree))
  pchiglrt <- sumglrt/4000
  pchitrad <- sumtrad/(4000-sum(!ok1))
  crittrad <- quantile(testmat[,1], probs=c(.95), na.rm=T)
  critglrt <- quantile(testmat[,2], probs=c(.95), na.rm=T)
  c(crittrad, critglrt, qchisq(.95, degfree), pchitrad, pchiglrt,
    sum(!ok1), pr[1], n[1], numclass)
}
Appendix C Splus code used to establish power of
two tests, given true critical regions
powerdev.fc-function(alpha, n, pr, pwrdev, chisum,
tradsum, devation, hypothp) {
crtrad <- crtr[round(length(n)/5), round(n[1]/5) ,
round(10 pr[2] 1)]
crglrt <- crglr[round(length(n)/5),
round(n[1]/5), round(10 pr[2] 1)]
For(times = 1:4000, (
if(times == 1) {
phat <- numeric()
pwrmat <- numeric()
phat <- rbinom(length(n), n, pr)/n
pmle <- sum(phat)/length(n)
lnpmle <- log(pmle)
. lnlmmle <- log(l pmle)
suml <- sum(n phat lnpmle)
sum2 <- sum(n (1 phat) lnlmmle)
sum3 <- (-1) sum((n phat)[phat != 0] *
log (phat [phat != 0])
)
sum4 <- (-1) sum((n (1 phat)) [phat =
1] log(l phat[
phat != l]))
girt <- (suml + sum2 + sum3 + sum4) (-2)
tradtest <- NA
if(sum(bad <- phat == 1 | phat == 0) == 0)
w <- n/(phat (1 phat))
sumwts <- sum(w)
* pplus <- sum(w phat)/sumwts
tradtest <- sum( (phat pplus) (phat
- pplus) w)
}
pwrmat <- rbind(pwrmat, c(tradtest, girt))
)
degfree <- length(n) 1
okl <- !is.na(pwrmat[, 1])
sumtrad <- sum(pwrmat[okl, 1] >= crtrad)
sumglrt <- sum(pwrmat[, 2] >= crglrt)
pchiglrt <- sumglrt/4000
pchitrad <- sumtrad/(4000 sum(!okl))
57
c(crtrad, crglrt, qchisq(0.95,
degfree), pchitrad, pchiglrt,
sum(!okl), pr[2], n[1], length(n), devation)
}
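The power routine above follows a simple Monte Carlo recipe: draw 4000 data sets under an alternative, compute each statistic, and report the fraction exceeding its empirical critical value (excluding, for the variance test, replicates where the statistic is undefined). A minimal Python sketch of that recipe, with illustrative names of my own choosing (`mc_power`, and whatever statistic function is passed in):

```python
import numpy as np

def mc_power(stat, crit, n, p, reps=4000, rng=None):
    """Estimate P(stat >= crit) when class i is Binomial(n[i], p[i]).

    Replicates where the statistic is undefined (NaN) are dropped from
    the denominator, mirroring the handling of okl in the Splus code."""
    rng = np.random.default_rng(rng)
    nf = np.asarray(n, dtype=float)
    hits = 0
    valid = 0
    for _ in range(reps):
        x = rng.binomial(n, p)           # one simulated data set
        s = stat(x, nf)
        if np.isnan(s):                  # e.g. variance test with phat of 0 or 1
            continue
        valid += 1
        hits += s >= crit
    return hits / valid if valid else float("nan")
```

Passing the empirical 95th-percentile critical value from the null simulation, rather than the chi-square quantile, gives the "true critical region" power the thesis compares.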
Appendix D Splus code used to establish
distribution of Bayesian-adjusted test
statistics
bayes210.f<-function(n,pr,numclass) {
For(times=1:4000, {
if(times==1) {
bayes210<-numeric()
phat<-numeric()
ophat<-numeric()
}
ophat<-rbinom(numclass,n,pr)/n
phat<-(1/(2+n))+(n/(2+n))*ophat
tradtest<-NA
w<-n/(phat*(1-phat))
sumwts<-sum(w)
pplus<-sum(w*ophat)/sumwts
tradtest<-sum((ophat-pplus)*(ophat-pplus)*w)
bayes210<-rbind(bayes210,tradtest)
})
degfree<-numclass-1
okl<-!is.na(bayes210)
sumtrad<-sum(bayes210[okl]>=qchisq(.95,degfree))
pchitrad<-sumtrad/(4000-sum(!okl))
crittrad<-quantile(bayes210,probs=c(.95),na.rm=T)
c(crittrad,qchisq(.95,degfree),pchitrad,sum(!okl),pr[1],n[1],numclass)
}
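The adjustment in the code above replaces each weight's proportion with (1 + n*ophat)/(2 + n) = (x + 1)/(n + 2), the posterior mean under a uniform Beta(1, 1) prior, so the weights are always finite even when a raw proportion is 0 or 1. A hedged Python sketch (my own illustration, not the thesis code; the name `bayes_variance_stat` is invented):

```python
import numpy as np

def bayes_variance_stat(x, n):
    """Bayesian-adjusted variance test statistic.

    The weights use the shrunken estimate (x + 1)/(n + 2), which can never
    be exactly 0 or 1, while the deviations themselves still use the raw
    proportions, as in the Splus routine above."""
    ophat = x / n                 # raw sample proportions
    phat = (x + 1) / (n + 2)      # Beta(1,1) posterior means
    w = n / (phat * (1 - phat))   # finite even when ophat is 0 or 1
    pplus = np.sum(w * ophat) / np.sum(w)
    return float(np.sum(w * (ophat - pplus) ** 2))
```

Unlike the unadjusted variance test, this statistic is defined for every possible sample, which is what allows all 4000 replicates to contribute to the estimated distribution.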
Appendix E
Critical Values for Generalized Likelihood Ratio Test for
Homogeneity of Proportions (x0^2), P[x^2 >= x0^2] = 0.05
p    Size of Each      Number of Classes (c)
     Class (n)      5        10       15       20       25
0.2 10 10.5532 18.8285 26.5557 33.4652 40.3242
15 10.7294 18.6883 26.1684 33.3299 39.7710
20 10.4266 18.0017 25.8823 32.1944 38.9742
25 9.8970 18.0641 24.7155 31.8633 38.6316
30 9.8148 17.9903 24.3759 31.4190 37.5747
0.3 10 10.5264 18.5832 26.7902 33.2966 40.1662
15 10.4127 18.2024 25.5416 32.3912 38.2954
20 10.2386 18.0654 23.8820 30.9469 37.7438
25 10.0308 17.9569 24.7506 31.4586 38.0930
30 9.6962 17.6057 24.1002 31.0309 36.6498
0.4 10 10.6821 18.2016 25.4461 32.8585 39.0980
15 9.8004 17.4470 24.9343 31.6513 38.1341
20 9.6003 16.7973 23.2409 29.7820 36.3268
25 9.3750 16.8195 23.3728 29.8742 35.8151
30 9.2906 17.0800 23.9814 29.6477 35.6353
0.5 10 10.5346 18.2338 25.2350 32.3394 39.1347
15 9.3004 16.2619 23.5884 29.3358 35.8746
20 9.2580 16.3133 22.9261 29.2166 35.2785
25 9.3183 16.3440 22.9175 30.1496 34.9830
30 9.1651 16.2428 23.1305 28.4271 35.4041
0.6 10 10.7047 18.3665 25.6470 33.1218 39.4052
15 10.1305 17.7842 24.5145 31.1136 37.6647
20 9.8078 17.1956 23.6380 30.8959 36.4060
25 9.6516 17.2464 23.4157 30.2821 36.7025
30 9.8854 17.1970 23.8971 30.1222 36.7249
0.7 10 10.5355 18.5832 25.8818 33.3677 40.0863
15 9.9993 18.0193 25.3698 31.4183 38.6525
20 10.4775 17.6339 24.2437 31.3704 38.3753
25 9.6391 17.3387 24.9628 30.6490 37.5943
30 10.0031 17.7484 24.2586 30.0357 36.7276
0.8 10 10.5264 18.7236 26.5587 33.8362 40.5709
15 10.6803 18.8908 26.2571 33.0036 39.5356
20 10.0553 18.1507 25.2396 32.5497 39.1177
25 10.1826 18.0175 25.1527 31.5435 37.9807
30 9.9308 17.5195 24.4648 31.1350 37.9958
Bibliography
Bardsley, W. E. "An algorithm for calculating an upper
bound to x2 for testing binomial homogeneity with
some observations uncategorized." Computers and
Geosciences Volume 6 (1980); pp. 315-319.
Beyer, William H. CRC handbook of tables for probability
and statistics. Cleveland, Ohio: Chemical Rubber
Company, 1966.
Fleiss, J. L. Statistical methods for rates and
proportions. New York, New York: John Wiley and
Sons; 1973.
Green, Larry et al. "How representative of typical
practice are practice-based networks?" Arch. Family
Medicine Volume 2 (September, 1993); pp. 939-949.
Hedges, L.; Olkin, I. Statistical methods for meta-
analysis. San Diego, California: Academic Press,
Inc., 1985.
Larsen, Richard J.; Marx, Morris L. Introduction to
mathematical statistics and its applications.
Englewood Cliffs, New Jersey: Prentice Hall, 1986.
Mood, A. M.; Graybill, F. A.; Boes, D. C. Introduction to
the theory of statistics. McGraw-Hill; 1974.
Potthoff, R. F.; Whittinghill, M. "Testing for
homogeneity." Biometrika Volume 53 (1966); pp. 167-182.
Rice, John A. Mathematical statistics and data analysis.
Belmont, California: International Thomson
Publishing; 1995.
Snedecor, G. W.; Cochran, W. G. Statistical methods.
Ames, Iowa: Iowa State University Press; 1967.