THE ASSESSMENT OF TEST BIAS: APPRAISAL OF
THE HOMOGENEITY OF SUBGROUP ERROR
VARIANCE ASSUMPTION AND ALTERNATIVES TO
MODERATED MULTIPLE REGRESSION
by
Scott Alan Petersen
B.S., United States Military Academy at West Point, 1989
A thesis submitted to the University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Arts
Psychology
1998
This thesis for the Master of Arts
degree by
Scott Alan Petersen
has been approved
by
Date
William Wolfe
Petersen, Scott A. (M.A., Psychology)
The Assessment of Test Bias: Appraisal of the Homogeneity of Subgroup Error
Variance Assumption and Alternatives to Moderated Multiple Regression
Thesis directed by Professor Herman Aguinis
ABSTRACT
This thesis investigates problems associated with using moderated multiple
regression (MMR) for the assessment of test bias (i.e., differential prediction).
Specifically, the assumption that within-subgroup (e.g., gender, race) error variances
are equal is a necessary, but often misunderstood and violated, assumption in MMR
analysis. A review of the literature in the employment testing, legal, and statistical
domains regarding this subject is presented. First, the psychometric definition of test
bias is differentiated from the sociological definitions of fairness. Second, the
legal requirements and implications for assessing test bias are examined. Third, the
popular MMR procedure and the impact of violating this assumption are described,
demonstrating that test evaluators should be concerned about drawing erroneous
conclusions using this method. That is, the probability of wrongly concluding that a
test is unbiased can be increased by violation of this assumption. Thus, methods
of assessing compliance with the homogeneity of error variance assumption are
evaluated, as well as appropriate alternative statistics to be employed when the
assumption is violated (e.g., James's second-order approximation and Alexander's
normalized-t approximation). Fourth, a user-friendly (Java) computer program to
assess the homogeneity assumption and compute alternative statistics for most
personal computer users is presented. The utility of this program is demonstrated by
applying it to four test bias studies published in leading journals. Finally,
implications and recommendations for test bias research are presented.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
ACKNOWLEDGMENTS
The author wishes to thank the Graduate Council of the CU-Denver Graduate School
for financial support in completing this research through a grant from the Graduate
Research Opportunities Program (GROP). Additional sincere thanks are extended
to Dr. Richard DeShon for providing the SAS computer program for reference as
well as his useful comments regarding the validation of calculations. Also of great
assistance was John Callender in providing the March 1997 Draft and Equal
Employment Advisory Council comments regarding the new Standards for
Psychological and Educational Testing. Several people were also of great
assistance in helping me learn and use Java: Arun Giddu sent me some sample
code to round numbers (which I used almost verbatim); John Pezullo explained the
JavaScript code on his website (http://members.aol.com/johnp71/pdfs.html) to
reference the chi-square distribution (which I adapted to Java); Ford McKinstry of
Microsoft responded to my question regarding what files I need to (and can legally)
distribute with my program; G. Walters of IBM explained how to use the
JEXEGEN application to create a stand-alone package over the experts exchange
website (http://www.expertsexchange.com); Andrew Bono at CU-Denver helped
me to understand the basic structure of a Java program, and provided some useful
places for me to find programming references (e.g., http://javasoft.com); and last
but not least, Stephen Davis's book, Learn Java Now (part of the Microsoft Visual
J++ software package), made it possible for me to do just that. Most importantly, I
thank my family, Laura, Brandon, and Bryce, for their patience, support, love, and
understanding in allowing me to complete this work. I am also very grateful to Dr.
Kurt Kraiger, who provided many helpful suggestions to make the computer
program more user-friendly, and effective revisions to several sections. Last but
not least, my sincere thanks to Dr. Herman Aguinis, for without his patience,
expertise, and assistance, I would not have learned so much about this topic.
CONTENTS
Figures
Tables
CHAPTER
I. STATEMENT OF THE PROBLEM
II. REVIEW OF THE LITERATURE
    Test Bias
        What Is Test Bias?
        Fairness vs. Bias
        Pervasiveness of Test Bias
        Legal Interpretations
    The Assessment of Bias: Moderated Multiple Regression
        Statistical Analysis Procedure
        Type I and II Error
    The Homogeneity of Error Variance Assumption
        The Impact of Violation
        Estimation of Homogeneity of Error Variance
    Alternative Statistics for Assessment of Bias
        Chi-square Test (χ²)
        Welch-Aspin Approximation (F*)
        James's Second-Order Approximation (J)
        Alexander's Normalized-t Approximation (A)
        Comparison of the Alternative Statistics
III. METHOD
    Computer Program for Statistical Assessment of Test Bias
        Check for Homogeneity of Variance
        Calculation of James's and Alexander's Statistics
        Accuracy Checks
IV. RESULTS
    The Program
    Sample Test Bias Analyses
        Qualls and Ansley, 1995
        Hattrup and Schmitt, 1990
        Halpin, Simpson, and Martin, 1990
        Zeidner, 1987
V. DISCUSSION
    Implications of the Findings
    Utility of the Program
    Recommendations for Future Research
APPENDIX
A. TABLES
B. CALCULATIONS FOR BARTLETT'S TEST
C. THE JAMES'S SECOND-ORDER APPROXIMATION: CALCULATION OF THE J STATISTIC
D. ALEXANDER ET AL.'S NORMALIZED-T APPROXIMATION: CALCULATION OF THE A STATISTIC
E. PROGRAM PACKAGE DESCRIPTION AND CODE
F. ACCURACY CHECK COMPUTATIONS
REFERENCES
FIGURES
Figure
1 Illustration of Gender-Based Test Bias
2 Scatterplot of Hypothetical Homoscedastic Data
3 Scatterplot of Male Subgroup Homoscedastic Data
4 Scatterplot of Female Subgroup Homoscedastic Data
5 Illustration of Gender-Based Test Bias Due to Intercept Differences
6 Sample Output
7 Program Flowchart
TABLES
Table
1 Summary of Journal Review, January 1987 through March 1998
2 Summary of Studies Using MMR to Assess Differential Prediction
  in the Journal of Applied Psychology, Personnel Psychology,
  and Educational and Psychological Measurement (January 1987 to
  March 1998)
3 Analysis of Qualls and Ansley (1995)
4 Analysis of Hattrup and Schmitt (1990)
5 Analysis of Halpin et al. (1990)
6 Analysis of Zeidner's Study (1987)
7 Browser Program Package Files
8 Comparison of Results from Alexander and DeShon's (1994a) SAS
  Program to the Results of the New Java Program
CHAPTER 1
STATEMENT OF THE PROBLEM
A major problem facing industrial-organizational psychologists is the
elimination of test bias. Personnel decisions, to be legally defensible and fair, must
be free of bias against groups protected by the Civil Rights Acts of 1964 and 1991,
and other employment laws. Personnel decisions are often based on aptitude tests,
skill tests, personality tests, employment interviews, and biodata instruments. To
protect employers from costly adverse impact lawsuits, it must be demonstrated that
the instrument does not predict employment success differentially for demographic
groups (e.g., males vs. females, Caucasian vs. African-American). Moderated
multiple regression (MMR) is a highly recommended and frequently used statistical
method for detecting such differential prediction. However, the prerequisite
assumptions of MMR are often misunderstood and violated, with potentially
dramatic consequences.
When using MMR to assess test bias, the researcher is trying to determine whether a
demographic variable influences how a test score predicts performance. If the
test is not biased, but the researcher concludes that it is (because the demographic
variable interacts significantly with the test scores), he or she is committing a Type I
error. More importantly, if a test is biased and the researcher concludes it is not
because the interaction term is not statistically significant, a Type II error is made.
As with any statistical method, it is necessary to understand the underlying
assumptions of the method and how they affect potential error. One such
assumption required for using MMR is homogeneity of error variance across
subgroups. This assumption, specific to the assessment of categorical moderator
variables, means that the variance remaining in the criterion after predicting it from a
specific variable (e.g., a test score) is equal across the moderator-based subgroups.
When this assumption is violated, the chance of either Type I or Type II error can be
radically increased. Additionally, this assumption is sometimes confused with the
least-squares regression assumption of homoscedasticity, and consequently not
assessed. Thus, one general problem is that conclusions regarding test bias may
frequently be drawn under conditions of far less certainty than presumed.
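To make the assumption concrete, the following minimal Python sketch computes the error variance within each of two subgroups as the criterion variance multiplied by one minus the squared predictor-criterion correlation. The data, the subgroup labels, and the function names are hypothetical and serve only as an illustration; the sketch is not the program presented in this thesis, and it performs no significance test.

```python
from statistics import mean, variance


def correlation(x, y):
    """Pearson correlation between predictor x and criterion y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def error_variance(x, y):
    """Criterion variance left over after predicting y from x."""
    r = correlation(x, y)
    return variance(y) * (1 - r ** 2)


# Hypothetical test scores and performance ratings for two subgroups.
scores_a = [10, 12, 14, 16, 18, 20]
perf_a = [3.0, 3.4, 4.1, 4.4, 5.1, 5.4]
scores_b = [10, 12, 14, 16, 18, 20]
perf_b = [2.0, 4.5, 3.0, 5.5, 4.0, 6.5]

var_a = error_variance(scores_a, perf_a)
var_b = error_variance(scores_b, perf_b)
print("Subgroup A error variance:", round(var_a, 4))
print("Subgroup B error variance:", round(var_b, 4))
print("Ratio (B / A):", round(var_b / var_a, 2))
```

A ratio far from 1 in data of this kind would suggest that the homogeneity assumption deserves a formal test before the MMR results are interpreted.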
There are two plausible reasons for researchers and test evaluators failing to
assess compliance with the homogeneity of error variance assumption: 1) a lack of
knowledge of the appropriate tests to be used to assess compliance, and 2)
unawareness of alternative methods to use when the validation sample violates this
assumption. Commonly used commercial statistical programming packages do not
currently offer tests for assessing homogeneity of error variance, nor do they easily
enable the user to compute recently studied alternative statistics (e.g., James's
second-order approximation and Alexander's normalized t-distribution test). Thus, even if
researchers know that other methods of analysis may be more appropriate in
conditions of heterogeneous error variance, it is not a simple procedure to complete
that analysis. Knowing if the assumption is met or violated then becomes
inconsequential.
The goals of this study are threefold. The first is to comprehensively review
the existing literature in the employment testing domain to clarify the importance of
the homogeneity of error variance assumption. The second is to evaluate the
appropriateness of various alternative statistics when the collected data violate
the assumption. The third is to introduce a computer program that can be used on
multiple personal computer (PC) platforms to aid researchers in both assessing
compliance with the homogeneity of error variance assumption and
computing the appropriate alternative statistic(s).
CHAPTER 2
REVIEW OF THE LITERATURE
Test Bias
What is Test Bias?
An employment selection test is biased if it systematically affords an
advantage to one subgroup over another that is explained by group membership
alone. In the context of bias against a minority group, Arvey and Faley (1988)
explained:
...Bias is said to exist when members of a minority group have lower
probabilities of being selected for a job when, in fact, if they had been
selected, their probabilities of performing successfully in a job would
have been equal to those of nonminority group members (p. 7).
This definition provides an important distinction that was further clarified by the
Society for Industrial and Organizational Psychology's (SIOP) Principles for the
validation and use of personnel selection procedures (1987). That is, a difference in
average predicted scores between subgroups does not, by itself, constitute bias.
Such differences must be accompanied by the fact that a subgroup is rated
consistently and spuriously high (or low) as compared to other groups (p. 10). This
is a specific case of differential prediction (American Psychological Association,
1985). Used synonymously with test bias in this thesis, differential prediction exists
when the predicted level of performance is statistically different for individuals from
two different demographic groups who have the same score on the test.
Differential prediction can be assessed through statistical evaluation of the
subgroup predicted values on the selection criterion (e.g., job performance) given
predictor (e.g., selection test) scores. When differential prediction exists and multiple
regression is used to determine subgroup predicted scores, the regression lines will
have significantly different slopes for those subgroups. Moderated multiple regression
involves the assessment of an interaction term created from a subgroup variable
(i.e., moderator) and a test score variable. Throughout this study, the terms test bias,
differential prediction, and interaction of categorical moderators refer to the same
procedure: the analysis of subgroup regression slopes for differences. This procedure
is discussed in detail in a later section.
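The comparison of subgroup regression slopes can be sketched in a few lines of Python. With a moderator dummy coded 0 and 1 and a product term in the model, the least squares interaction coefficient equals the difference between the two subgroup slopes, so the sketch simply fits one line per subgroup. All data and function names below are hypothetical, and the sketch reports only the size of the slope difference; it does not test its statistical significance.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def subgroup_lines(scores, performance, group):
    """Fit a separate regression line for each dummy-coded subgroup (0 or 1)."""
    lines = {}
    for code in (0, 1):
        xs = [x for x, g in zip(scores, group) if g == code]
        ys = [y for y, g in zip(performance, group) if g == code]
        lines[code] = fit_line(xs, ys)
    return lines


# Hypothetical test scores, performance ratings, and subgroup codes.
scores = [10, 20, 30, 40, 10, 20, 30, 40]
performance = [2.0, 3.1, 3.9, 5.0, 2.5, 2.9, 3.6, 4.0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

lines = subgroup_lines(scores, performance, group)
slope_difference = lines[1][1] - lines[0][1]          # interaction coefficient
intercept_difference = lines[1][0] - lines[0][0]      # moderator coefficient
print("Slope difference (group 1 minus group 0):", round(slope_difference, 4))
print("Intercept difference (group 1 minus group 0):", round(intercept_difference, 4))
```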
Fairness vs. Bias
In much of the legal and testing literature, fairness and the absence of bias are
treated as identical characteristics of tests. However, they are not the same: fairness
is associated with individual and organizational values, whereas bias is essentially a
statistical concept. In fact, an unbiased test may not be considered fair because of
conflicting values. Thus, further clarification of these terms will prevent
misunderstanding of what this study addresses.
Several models of fair selection have been developed, each representing a
different, often contradictory, ethical position (Linn, 1994). In a recent issue of the
American Psychologist devoted to intelligence testing, Wagner (1997) recognized
this distinction, stating: "...the fact that an assessment device meets the psychometric
definition of an unbiased test does not necessarily mean that use of the test is fair
with regard to adverse impact" (p. 1063). Arvey and Faley (1988) listed twelve
fairness models. Thus, a comprehensive review of the different models of test
fairness is beyond the scope of this work. In short, many of these models resemble
the Cleary (1968) model (Maxwell & Arvey, 1993), which is the fundamental basis
for using the regression model to assess test bias. Other models for choosing fair
selection procedures often result in conflicting conclusions. Such varying results are
largely due to different utilities or weights assigned to the factors being considered.
Essentially, evaluation of fairness often comes down to a political, rather than a
logical, resolution of conflicting values (Arvey & Faley, 1988; Linn, 1994).
Conversely, bias has a more consensual psychometric definition. In the
American Psychological Association's (APA) Standards for Educational and
Psychological Testing (1985), a precise definition of bias was presented:
The accepted technical definition of predictive bias implies that no
bias exists if the predictive relationship of two groups being compared
can be adequately described by a common algorithm (e.g., regression
line)...If different regression slopes, intercepts, or standard errors of
estimate are found among different groups, selection decisions will be
biased when the same interpretation is made of a given score without
regard to the group from which the person comes. Differing
regression slopes or intercepts are taken to indicate that a test is
differentially predictive for the groups at hand (pp. 12-13).
The SIOP Principles (1987) similarly advocate a more singular definition:
"Although other definitions of bias have been introduced, only those based upon the
regression model, ... have been found to be internally consistent" (p. 18). Hence,
MMR analysis of test bias is the implicitly recommended but not required method.
The precise statistical test of slope, intercept, or standard error differences is not
stated. For example, it is implied that a t test of regression slopes relative to the
standard error (i.e., the test for evaluation of the significance of the moderator term in
MMR) is no more accepted than referencing Marascuilo's (1966) U (which
incorporates the error variance terms) to a chi-square distribution.
Pervasiveness of Test Bias
Given this precise definition of bias (i.e., differential prediction), one might
ask how frequently it is found in tests used for making personnel decisions. There are
widely differing opinions regarding this question. In the evaluation of
criterion-related test validity, the 1985 APA Standards only conditionally requires
investigation of differential prediction when previous research has established a
substantial prior probability of differential prediction for the particular kind of test in
question (p. 17). Similarly, the March 1997 Draft of these standards recommends
that special attention be given to assessment of extraneous sources of variance
that might bias the criterion for or against identifiable groups (p. 20). However, in
a letter providing comments regarding this Draft, the Equal Employment Advisory
Council (EEAC) recommends the former, less prescriptive requirement, stating:
This subject has been exhaustively studied for over the last 30 years.
It is clear that such studies are fairly conclusive in the finding that
differential validity and differential prediction studies need not be
done for some major groups for cognitive ability tests. In the field of
industrial psychology, the Journal of Applied Psychology stopped
accepting manuscripts on possible differences between
Anglo-Americans and African-Americans in 1977 (p. 12).
Many (e.g., SIOP, 1987; Wagner, 1997) have cited Hunter, Schmidt, and
Rauschenberger (1984) as justification to disregard the assessment of racial
differential prediction. After evaluating numerous studies, Hunter et al. (1984)
concluded:
The evidence from all these studies is clear. The regression lines for
composite [cognitive ability] predictors are identical for white, black,
and Hispanic workers. There is no underprediction of minority
performance. The hypothesis of test bias against minority members is
disconfirmed (p. 54).
However, this statement may not be as conclusive as it appears. Some
studies cited by Hunter et al. did find differential prediction (e.g., Gordon & Rudert,
1979; Jensen, 1980; Campbell, Crooks, Mahoney, & Rock, 1973, as cited), but found that
job performance was overpredicted for minorities and underpredicted for whites.
Linn and Werts (1971) were cited as explaining that this overprediction was due to
test unreliability. Secondly, note that the assertion addressed composite ability
predictors. Overprediction was evident in the cited study (Powers, 1977) for
individual ability measures, and the law requires that each component of a battery
be free of adverse impact (legal considerations are discussed further in a
subsequent section).
Finally, the Journal of Applied Psychology has published several studies
since 1977 that assessed ethnicity-based differential prediction (e.g., Chan & Schmitt,
1997; Hitt & Bair, 1989; Mael, 1995; Oppler, Campbell, Pulakos, & Borman, 1992;
Sackett, DuBois, Cathy, & Noe, 1991; Stone & Stone, 1987; Waldman & Bruce,
1991). This indicates that many researchers have not dismissed potential
ethnicity-based test bias as Hunter et al. (1984) have advocated.
In order to identify the frequency of MMR use in differential prediction
studies, a review of articles published from January 1987 to March 1998 in the
Journal of Applied Psychology, Personnel Psychology, Educational and
Psychological Measurement, Academy of Management Journal, Journal of
Management, and Organizational Behavior and Human Decision Processes was
conducted. Studies using categorical MMR in the latter three journals were quite
sparse, although MMR for assessing continuous moderators was common. As
illustrated in Table 1 (below), several recent studies used MMR to assess test bias in
the first three journals listed. In this review, a study was considered to assess true
test bias if either gender or race (and in one case, age group) was entered in a
hierarchical regression model as an interaction term with another predictor of the
criterion of job, training, or scholastic performance. Of fourteen test bias studies
identified and reviewed, six found one or more significant interactions of a
demographic variable with the test scores in prediction of a performance criterion.
Table 2 (see Appendix A) provides the specific authors, criterion, predictors, and
subgroup sample sizes assessed in these studies. These fourteen test bias studies
were a subset of 35 published articles that used MMR to assess the moderating effect
of a categorical demographic variable on any criterion (e.g., intention to quit, union
commitment, extra-role behavior). Across all of these 35 studies, 134 regression
models with a categorical moderator were evaluated and reported. Of these 134
models, only 36 were reported to include statistically significant interactions with
these categorical moderators.
TABLE 1
Summary of Journal Review, January 1987 through March 1998

                                                       JAP(1)  PP(2)  EPM(3)  Total
Total studies using MMR with categorical moderators      50     10      9      69
Studies with race or gender as moderator(4)              20      7      8      35
Studies with race or gender as a moderator on
  performance criterion(5)                                4      4      7      15

Notes: (1) Journal of Applied Psychology
       (2) Personnel Psychology
       (3) Educational and Psychological Measurement
       (4) Any criterion (e.g., job satisfaction, union commitment)
       (5) True test bias study, i.e., gender or race interaction with test
           predicting performance
These observations suggest that either (a) most tests do not predict
performance (or other criterion) differently for varying demographic groups, or (b)
many predictor measures are biased, but MMR does not provide statistical power to
conclude that they are biased. Again, the latter possibility is a central concern of this
thesis, and the issue will be addressed further in subsequent sections. First, however,
a brief review of the legal environment of employment testing is appropriate to gain
an improved perspective on the issue.
Legal Interpretations
The origin of legal scrutiny of selection tests is Title VII of the Civil Rights
Acts of 1964 and 1991, and case law (Sackett & Wilk, 1994). This section reviews
the applicable sections of these laws and interpretations regarding selection test bias
to date.
The 1964 Civil Rights Act states that an employer is forbidden to limit,
segregate, or classify his employees or applicants for employment in any way which
would deprive... any individual of employment opportunities... because of such
individual's race, color, religion, sex, or national origin (Section 703a). The only
explicit reference to testing was found in section 703b, known as the Tower
Amendment. This section allows employers to administer and use the results of
professionally developed ability tests, provided that such a test, its administration or
action upon the result is not designed, intended, or used to discriminate because of
race, color, religion, sex or ethnic origin (Sackett & Wilk, 1994, pp. 940-941). Initial
interpretations of this section focused on intentional discrimination directed toward
individuals.
The pivotal event in testing litigation came with the development of adverse
impact theory. As declared in Griggs v. Duke Power Co. (1971), illegal adverse impact
is evident if members of a racial subgroup are selected at a higher rate than another
subgroup as a result of the test, unless the test can be shown to be job-related
(Sackett & Wilk, 1994). Thus, Griggs set several important precedents: 1) it
allowed selection rates to be compared to the job applicant pool, 2) it
allowed organizations to use job-relatedness as a defense to discrimination suits,
and 3) it extended application of discrimination law to allow class action suits on
behalf of groups or individuals. It is because of the latter precedent that statistical
evidence for a test can be crucial to defending that test. Additionally, adverse impact
illegality was extended to include sex discrimination (Dothard v. Rawlinson, 1977)
and, in 1988, subjective employment testing methods (e.g., interviews)
(Watson v. Fort Worth Bank and Trust, as cited in Donohue & Siegelman, 1991).
However, under the Griggs interpretation, the use of a valid test with differential
prediction is illegal only if it results in a higher proportion of one subgroup being
selected. It is possible that, because of the demographics of the applicant pool,
adverse impact would not occur even with a differentially predictive test. That is,
the four-fifths rule (that the selection rate for any group is at least 80% of the rate
for the group with the highest selection rate) presented in the Uniform Guidelines on
Employee Selection Procedures (1979) may not be violated, and a prima facie case of
discrimination could be dismissed.
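The arithmetic of the four-fifths rule can be sketched briefly in Python. The applicant and selection counts, the group labels, and the function names below are hypothetical; the sketch only compares each group's selection rate with the highest rate and is not a substitute for a legal analysis.

```python
def selection_rates(selected, applicants):
    """Proportion of applicants selected in each group."""
    return {g: selected[g] / applicants[g] for g in applicants}


def impact_ratios(selected, applicants):
    """Each group's selection rate divided by the highest group rate."""
    rates = selection_rates(selected, applicants)
    highest = max(rates.values())
    return {g: rate / highest for g, rate in rates.items()}


def meets_four_fifths(selected, applicants, threshold=0.8):
    """True when every group's ratio is at least the 80% threshold."""
    ratios = impact_ratios(selected, applicants)
    return all(ratio >= threshold for ratio in ratios.values())


# Hypothetical applicant pool and selection counts.
applicants = {"group_a": 100, "group_b": 100}
selected = {"group_a": 50, "group_b": 30}

print(impact_ratios(selected, applicants))
print("Meets four-fifths rule:", meets_four_fifths(selected, applicants))
```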
However, this interpretation did not come to fruition because the Uniform
Guidelines adopted a definition of differential prediction as the definition of fairness:
When members of one race, sex, or ethnic group characteristically
obtain lower scores on a selection procedure than members of another
group, and the differences in scores are not reflected in differences in
a measure of job performance, use of the selection procedure may
unfairly deny opportunities to members of the group that obtains
lower scores... (Section 14, B., 8. (a)).
The Supreme Court later recognized the distinction in Connecticut v. Teal (1982). In
that case, adherence to the four-fifths rule of thumb was no defense to this prima
facie case, because the selection test used had a lower passing rate for some
minorities (Player, Shoben, & Lieberwitz, 1995).
Conversely, empirical evidence that a test is not differentially predictive of
job performance is a defense in discrimination litigation. Thus, when a test's scores
show disparate impact, the employer may counter with proof that there is no
predictive bias (Tenopyr, 1996, p. 195). In Albemarle Paper Co. v. Moody (1975),
the Supreme Court upheld the district court decision that Albemarle Paper had
sufficiently validated a selection test, despite disparate impact. That is, research and
statistical evidence that the test was not differentially valid (i.e., validity coefficients
did not differ significantly across subgroups), and that it served a legitimate business
purpose of selecting the most qualified applicants, was a sufficient defense.
It should be noted that differential validity involves a less stringent analysis
than test bias assessment. That is, differential prediction evaluation involves an
assessment of intercept and slope differences (Bartlett, Bobko, Mosier, & Hannan,
1978), and these differences can be a function of the differential correlation and/or
variances (Bobko & Russell, 1994, p. 196). Put differently, an employment test
could be equally valid for two groups (i.e., a similar amount of variance in job
performance is explained by the test for both groups), but the slopes of a regression
model used for job performance prediction for each subgroup could differ. If these
subgroup slopes differ, but a common regression model is used for predicting
performance of both subgroups, job performance is underpredicted for one group
and overpredicted for another across a certain range of scores, even though the test is
not differentially valid. Since personnel selection decisions are often made from
considering scores in the middle and higher ranges of test scores (Aguinis & Pierce,
in press(a)), such a situation can result in favorable outcomes for one group at the
expense of the other. Thus, it seems a reasonable assumption that a validated test
that is also free of bias (i.e., not differentially predictive) would be a more than
adequate defense to an adverse impact suit.
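The point that equal validity does not guarantee equal slopes follows from the standard relation between a simple regression slope and the correlation. In the worked example below, the subscripts A and B denote two hypothetical subgroups, and all numerical values are assumed for illustration only.

```latex
b_g = r_g \, \frac{s_{Y,g}}{s_{X,g}}, \qquad
b_A = .50 \times \frac{1.0}{1.0} = .50, \qquad
b_B = .50 \times \frac{0.5}{1.0} = .25
```

Both subgroups have the same validity coefficient (r = .50), yet the subgroup slopes differ because the criterion standard deviations differ.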
A final topic that denotes the importance of test bias assessment needs
attention: burden of proof. Based on the 1991 Civil Rights Act and a U.S. District
Court case, Legault v. Arusso, the burden of proving (or disproving) adverse impact
shifts from the employee to the employer after the allegation is made. That is, the
employee (plaintiff) first must produce evidence that a particular testing procedure
excluded him or her based on protected group membership. Once that is established,
the employer (defendant) then bears the full burden of proving the business necessity
of the test (Player et al., 1995; Kravitz et al., 1996), that is, of providing evidence that
the test is a valid predictor of relevant job performance. Furthermore, (as established
in Albemarle Paper Co. v. Moody) even if the employer meets this burden, the
plaintiff may still establish discrimination by showing that the employer refused to
adopt a readily available, nondiscriminatory alternative to the challenged practice
(Player et al., 1995, p. 258). The implication here is that when an employer uses
a test that is biased (or has failed to assess bias appropriately), regardless of the
impact, a plaintiff could present an alternative, unbiased test that the employer could
have used and win the discrimination case.
In summary, discrimination litigation can certainly be costly to organizations.
In 1991, the Equal Employment Opportunity Commission (EEOC) won $188 million
in benefits for discrimination plaintiffs (Cascio, 1995). Thus, it is economically, if
not ethically, prudent for employers to thoroughly assess test bias and avoid costly
legal expenses. Most importantly, test researchers and evaluators can take the lead in
preventing future bias by striving for the higher standard of assessment.
The Assessment of Bias: Moderated Multiple Regression
MMR has become the method of choice for evaluating test bias for several
reasons. As noted, it was implicitly recommended by the APA Standards (1985) and
by SIOP in 1987. In fact, the 1997 Draft of Standards for Psychological and
Educational Testing also includes the following standard (in a different chapter, titled
Fairness in Testing and Test Use). Standard 1.21 reads:
When studies of differential prediction are conducted, the reports
should include regression equations (or an appropriate equivalent)
computed separately for each group, job, or treatment under
consideration or an analysis in which the group, job or treatment
variables are entered as moderators (APA, 1985, p. 17).
Secondly, regression is a widely known procedure for many applications in
organizations (e.g., sales forecasting, investment portfolio risk analysis), and the
terminology used in the procedure is thus familiar (e.g., beta weights, R²). It does
not require a large conceptual leap to understand the testing of significant differences
for predicted values across groups. The following two sections explain the accepted
procedure for assessing differential prediction using MMR, and the documented
weaknesses of that procedure.
Statistical Analysis Procedure
MMR involves hierarchical regression that first tests the relationship of the
predictors of interest (e.g., test score and gender) with the criterion variable (e.g., job
performance), and secondly tests the relationship of a term that carries information
about both predictors (the interaction term). The categorical predictor (in this
example, gender) is typically dummy coded 0 for one sex and 1 for the other. The
interaction term can then be computed for each subject by multiplying the two
predictors such that the resulting regression equation is in the form:
$$\hat{Y} = a + b_1 X + b_2 Z + b_3 (X \cdot Z), \qquad (1)$$
where $\hat{Y}$ is the predicted value for Y (job performance), a is the least squares
intercept, $b_1$ is the least squares estimate of the population regression coefficient for
X (test score), $b_2$ is the least squares estimate of the population regression coefficient
for Z (gender), and $b_3$ is the least squares estimate of the population regression
coefficient for the interaction between X and Z (Cohen & Cohen, 1983).
Rejecting the null hypothesis that $\beta_3 = 0$ indicates the presence of an interaction or
moderating effect. Stated differently for this example, the derived regression model
will predict a different level of job performance for males and females with the same
score for all but one score (the score at which the regression lines for the two groups
intersect). Figure 1 illustrates how the resulting regression lines for males, females, and
the common line would differ in the case of differential prediction or gender-based
test bias. Note that, in this specific illustration, the line representing predicted values
for females has a steeper slope than that for males. In that case, beyond the point where
the lines intersect, a common regression line overpredicts criterion scores for males
and underpredicts them for females.
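The form of equation (1) can be illustrated with a minimal sketch. It relies on the fact that, for a 0/1 dummy-coded Z, the least squares coefficients of the full interaction model equal those implied by fitting a separate line in each subgroup. The function names (`ols_line`, `mmr_coefficients`, `predict`) and the noise-free data are hypothetical and chosen only for illustration. The sketch reproduces coefficient estimates only, not the MMR significance test.

```python
from statistics import mean

def ols_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def mmr_coefficients(x, y, z):
    """Coefficients (a, b1, b2, b3) of Y-hat = a + b1*X + b2*Z + b3*X*Z
    for a 0/1 dummy-coded Z, obtained from the two subgroup lines."""
    g0 = [(xi, yi) for xi, yi, zi in zip(x, y, z) if zi == 0]
    g1 = [(xi, yi) for xi, yi, zi in zip(x, y, z) if zi == 1]
    a0, s0 = ols_line(*zip(*g0))
    a1, s1 = ols_line(*zip(*g1))
    # b2 is the intercept difference, b3 the slope difference (group 1 minus group 0)
    return a0, s0, a1 - a0, s1 - s0

def predict(coefs, x, z):
    """Predicted criterion value from equation (1)."""
    a, b1, b2, b3 = coefs
    return a + b1 * x + b2 * z + b3 * x * z

# Hypothetical, noise-free data: group 0 follows Y = 1 + 0.5X, group 1 follows Y = 1.0X
x = [2, 4, 6, 8, 2, 4, 6, 8]
z = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1 + 0.5 * xi if zi == 0 else 1.0 * xi for xi, zi in zip(x, z)]
coefs = mmr_coefficients(x, y, z)
```

With these illustrative data the two subgroup lines cross at X = 2, so the predictions for the two groups agree only at that score and diverge increasingly above it.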
Because the X·Z term carries information about both the test and the subgroup,
MMR is similar to testing for slope differences for separate subgroup
regression models. For the latter type of analysis, separate models are derived in the
form $\hat{Y} = a + b_1 X$ and $\hat{Y} = a + b_2 X$ for corresponding subgroups. The null
hypothesis then tested is that the corresponding parameters of $b_1$ and $b_2$ are equal.
However, when $\beta_3$ is statistically different from zero for the MMR
model, this can be due to slope or intercept differences. A test of the null hypothesis
in the second method will only test for differences in slopes.
Figure 1. Illustration of Gender-based Test Bias
The hierarchical form of regression indicates that predictors are not entered
into the regression equation simultaneously, but in a logical order.
Typically, the continuous predictor (e.g., test score, X) and the polychotomous
predictor (e.g., race, Z, dummy coded) are entered in the first step, and the interaction
term (X·Z) is entered in the second step. However, research has concluded that the only
unacceptable sequence of entering variables is when the interaction term, X·Z, is
entered into the regression as the first step by itself. Entering the predictors and
interaction term simultaneously in a single step is acceptable and yields the same
results as entering noninteraction terms first (Stone & Hollenbeck, 1984).
Aiken and West (1991) presented a thorough discussion of appropriate
methods for using and interpreting interactions using MMR. One such
recommendation involves centering of predictor variables (i.e., putting them in deviation
score form so that their means are zero) before they are entered into the regression. This
procedure, although it has no impact on the slope of the interaction term, minimizes
problems associated with predictor multicollinearity, and eases interpretation of the
nonproduct terms in the final regression model. However, a recent study
demonstrated empirically that centering of means in hierarchical regression analysis
does not reduce multicollinearity problems, and is not necessary (Kromrey &
Foster-Johnson, 1998). Thus, this technique seems only to improve interpretability.
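The statement that centering leaves the interaction slope unchanged can be checked algebraically. The short derivation below assumes that only the continuous predictor X is centered (written $X_c = X - \bar{X}$, a symbol introduced here for illustration) while the dummy-coded Z is left as is. Substituting $X = X_c + \bar{X}$ into equation (1) gives:

```latex
\begin{aligned}
\hat{Y} &= a + b_1 (X_c + \bar{X}) + b_2 Z + b_3 (X_c + \bar{X}) Z \\
        &= (a + b_1 \bar{X}) + b_1 X_c + (b_2 + b_3 \bar{X}) Z + b_3 X_c Z
\end{aligned}
```

Under this assumption only the intercept and the coefficient for Z change. The coefficients $b_1$ and $b_3$ are identical in both forms, while the coefficient for Z becomes the predicted subgroup difference at the mean of X rather than at X = 0.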
Type I and II Error
Researchers and practitioners are certainly concerned with their ability to
convince others that their conclusions are correct. Absolute certainty is only possible
with data for the entire population, and the ability of a statistical test to support the
inference that sample characteristics reflect the true population is commonly referred to as
generalizability. Two types of error are possible in testing a generalized inference.
Type I error consists of concluding that the null hypothesis is false when it is
actually true. Type II error refers to failing to reject the null hypothesis when, in fact, it is
false in the population. In test bias assessment using MMR, the null hypothesis is
that the slope of the interaction term, $\beta_3$, is equal to zero. In most research
applications, Type I error is the main concern, and is thus directly controlled (e.g., $\alpha$ =
.05 or .01 for a 5% or 1% level of significance, respectively). Although sometimes
presumed by researchers, the alpha level does not mean that there is a 5% or 1%
chance of committing this type of error. In effect, setting alpha at .05 merely places
a condition on how different from zero the estimate of $\beta_3$ must be before concluding that the null
hypothesis is false (Schmidt, 1992). Nonetheless, researchers are concerned with
Type I error because they do not wish to lose credibility by declaring a nonexistent
finding. However, in the case of MMR, the null hypothesis is that the interaction
coefficient ($b_3$ in our example) is zero in the population, or that bias does not exist.
Thus, the probability of failing to reject this hypothesis when bias does actually exist
(i.e., committing a Type II error) is represented by the Type II error rate, also called $\beta$. The Type II
error rate is not directly controlled like Type I error, and this is an issue of statistical
power (i.e., $1 - \beta$). Given the potential impact of using a biased test in employment
settings, this would seem to be of greatest concern to test bias evaluators.
Because of the pervasive use of MMR, researchers have examined several
study design factors and sample characteristics that have been shown to impact the power
of MMR to detect interactions. Through Monte Carlo simulation, Stone-Romero,
Alliger, and Aguinis (1994) demonstrated that total sample size, unequal subgroup
sample size, and within-group correlation coefficients substantially affected
statistical power. In fact, the results of this simulation led them to call into question
the findings of no test bias in several recent studies (e.g., Hattrup & Schmitt, 1990)
based on the characteristics of the sample used. In even more recent studies
published in the Journal of Applied Psychology and Journal of Management,
additional factors were found to profoundly influence the power of MMR to detect
interactions. These include (1) predictor range restriction, (2) predictor and criterion
reliability, (3) criterion scale coarseness, (4) artificial dichotomization or
polychotomization, and (5) magnitude of the moderating effect (Aguinis, 1995;
Aguinis & Stone-Romero, 1997).
Finally, transformations of data have been examined as alternative techniques
when assumptions of MMR are violated (e.g., Hsu, 1994; Bobko & Russell, 1990;
Linn & Hastings, 1984). However, transformations are two-edged swords: while
they reduce apparent violations of assumptions, it is not clear whether resulting
interactions (or lack thereof) are artificial byproducts of the fact that interactive
terms are not invariant to nonlinear transformations of the data (Bobko & Russell,
1994, p. 198). Aguinis and Pierce (in press(a)) made it clear that Hsu's (1994)
recommended transformation can actually eliminate the moderating effect when it is
present.
Despite these weaknesses, it should be noted that MMR has been
demonstrated to be more statistically powerful than some other techniques for
detecting interactions that include categorical moderators. For example, Stone-Romero
and Anderson (1994) found MMR superior to testing the equality of subgroup-based
correlation coefficients (SCC), which uses a chi-square-based test. Similarly, Anderson,
Stone-Romero, and Tisak (1996) demonstrated that MMR is more powerful than
errors-in-variables regression (EIVR) when sample size or predictor reliabilities are low.
Finally, MMR is a test for both differences in subgroup slopes and intercepts.
Nonetheless, a prudent analyst obviously must be cognizant of the profound
impact of sample characteristics and MMR assumption violation on statistical power.
Thus, another important assumption of MMR, the primary focus of this paper, is
discussed next.
The Homogeneity of Error Variance Assumption
In addition to the usual assumptions for regression analysis of continuous
predictor/moderator interactions (e.g., linearity and additivity, homoscedasticity), the
test for equality of subgroup regression slopes requires the additional assumption of
homogeneity of residual (error) variance across groups (i.e., $\sigma^2_{e(1)} = \dots = \sigma^2_{e(k)}$) (Aguinis
& Pierce, in press(a); Dretzke, Levin, & Serlin, 1982; Stone & Hollenbeck, 1989),
where
$$\sigma^2_{e(i)} = \sigma^2_{Y(i)}\left(1 - \rho^2_{XY(i)}\right). \qquad (2)$$
That is, just as the analysis of variance (ANOVA) test on the means of Y assumes
that the variance in the dependent variable ($\sigma^2_Y$) is equal across groups, the
regression slopes test assumes that the variance in Y that remains after predicting Y
from X is equal across groups. In each group, this value is estimated by the mean
squared residual from the regression of Y on X (Alexander & DeShon, 1994).
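A brief worked example may help show how two subgroups with the same criterion variance can still differ in error variance. The values below are hypothetical and chosen only for illustration: both subgroups have a criterion standard deviation of 10, but the test-criterion correlations are .30 and .60.

```latex
\begin{aligned}
\sigma^2_{e(1)} &= \sigma^2_{Y(1)}\left(1 - \rho^2_{XY(1)}\right) = 100\,(1 - .30^2) = 91 \\
\sigma^2_{e(2)} &= \sigma^2_{Y(2)}\left(1 - \rho^2_{XY(2)}\right) = 100\,(1 - .60^2) = 64
\end{aligned}
```

In this illustration the ANOVA-type assumption of equal criterion variances holds, yet the error variances that matter for the regression slopes test are unequal.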
This assumption is different from, but often confused with, the assumption of
homoscedasticity (that residual scores are similarly distributed across various points
of the X scale). Although these terms are equivalent for MMR models with
continuous moderators, Aguinis and Pierce (in press(a)) clarified that the
homogeneity of subgroup error variance assumption applies only to MMR models
where one variable in the interaction term (e.g., group membership) is
polychotomous. In fact, it is possible to violate one assumption without violating the
other. Aguinis and Pierce's (in press(a)) graphical representation of this is useful in
promoting understanding of the distinction between homoscedasticity and
homogeneity of error variance, and is thus imitated here. Figure 2 (below) is a
simple scatterplot of the predictor (X) and criterion (Y) scores for a hypothetical
data set that includes all group members. Note that the points are distributed
similarly about the regression line, indicating compliance with the homoscedasticity
assumption. Figures 3 and 4 are the scatterplots for each subgroup (i.e., Figure 3 for
all male subjects, Figure 4 for females). These figures illustrate that the error variances of
the two subgroups are markedly different, or heterogeneous. That is, the criterion
scores for males are further from the regression line than the criterion scores for
females. However, observe that within each subgroup the scores are similarly distributed
around the regression line, that is, homoscedastic. Thus, it is plain to see how one
assumption is satisfied while the other is not.
Figure 2. Scatterplot of Hypothetical Homoscedastic Data.
Figure 3. Scatterplot of Male Subgroup Homoscedastic Data
Figure 4. Scatterplot of Female Subgroup Homoscedastic Data (horizontal axis: Predictor (X)).
Despite this readily observed difference in the assumptions, it appears that
when researchers do treat the two synonymously (e.g., Stone & Hollenbeck, 1989),
the better known homoscedasticity assumption is assessed (usually via an inspection
of a scatterplot of residuals on the predicted values as above) while the homogeneity
of error variance assumption is ignored (Aguinis & Pierce, in press(a)). This is
particularly disturbing because violation of the homogeneity of error variance
assumption often leads to Type II errors, that is, concluding that a test is not biased
when it actually is biased. The accumulated evidence for increased Type II errors is
discussed in detail in a later section.
Although Dretzke et al. (1982) contended that this assumption is usually
assessed in Aptitude × Treatment interaction research, there is little evidence that
researchers in the employment testing domain have consistently assessed this
assumption. Of the 69 studies using MMR to assess interactions with categorical
predictors in the Journal of Applied Psychology and Personnel Psychology described
in Table 1, only one study (i.e., Stewart, Carson, & Cardy, 1997) noted such
assessment, and only 18 (26%) provided the sample descriptive statistics (e.g., subgroup
standard deviations on the criterion and predictor) necessary for the reader to
independently assess compliance with the assumption. It should be noted that
because many of these studies (e.g., Grover & Crooker, 1995; Campion, Pursell, &
Brown, 1988) did not find significant interactions, they reported only subgroup
sample sizes, and no additional subgroup descriptive data.
The nonreporting of analysis of this assumption may be explained by brevity
concerns. However, it may indicate ignorance of the problems associated with
violation of regression assumptions in general. Weinzimmer, Mone, and Alwan
(1994) found that in 201 regression-based studies in the Academy of Management
Journal and Administrative Science Quarterly from 1986 to 1990, only 9.5 percent
reported diagnosis of any of the regression assumptions. Even more astounding,
none of the studies reported the full range of diagnostic tools necessary to verify
that a regression model was appropriate for the given data (p. 182).
DeShon and Alexander (1994b) conducted a similar review to assess
compliance with the homogeneity of error variance assumption. Interestingly, this
review and analysis of 20 differential prediction studies (that included 405
differential prediction tests) in the Journal of Applied Psychology, Personnel
Psychology, and the validity database published in the Journal of Business and
Psychology from 1980 to 1993 showed that the assumption was violated in over 9%
(39 of 405) of models evaluated in these studies. These 20 studies were the only
ones that reported or made available the necessary information to calculate within-group
regression coefficients and error variances (the total number of differential
prediction studies in these journals was not reported).
There has also been some more recent research to determine how commonly
the assumption is violated in practice. In a poster to be presented at the 1998
conference of SIOP, Oswald, Saad, and Sackett examined some large landmark
selection and classification databases for compliance with the assumption: the U.S.
Army Project A data and a validity study of the General Aptitude Test Battery
(GATB). For the GATB data, the homogeneity assumption was generally met.
However, similar to DeShon and Alexander (1994b), they found that for about 10%
of the 420 validity coefficient comparisons, the assumption of equal error variance
across subgroups was violated. That is, the ratio of the error variances, calculated
using equation (3) below, exceeded 1.5 to 1 (this rule of thumb was empirically
derived in DeShon and Alexander (1996), and is discussed further in a subsequent
section).
$$\frac{\sigma^2_{Y(1)}\left(1 - \rho^2_{XY(1)}\right)}{\sigma^2_{Y(2)}\left(1 - \rho^2_{XY(2)}\right)} \qquad (3)$$
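The error variance ratio and the 1.5 rule of thumb can be sketched in a few lines. This is a minimal illustration, not a published program: the function names are hypothetical, sample standard deviations and correlations stand in for the population values, and the ratio is ordered so that the larger error variance is always in the numerator.

```python
def error_variance(sd_y, r_xy):
    """Subgroup error variance: criterion variance times (1 - r squared)."""
    return sd_y ** 2 * (1.0 - r_xy ** 2)

def error_variance_ratio(sd_y1, r1, sd_y2, r2):
    """Ratio of the larger to the smaller subgroup error variance (always >= 1)."""
    e1 = error_variance(sd_y1, r1)
    e2 = error_variance(sd_y2, r2)
    return max(e1, e2) / min(e1, e2)

def violates_rule_of_thumb(ratio, threshold=1.5):
    """True when the ratio exceeds the 1.5 rule of thumb cited in the text."""
    return ratio > threshold

# Hypothetical subgroup descriptive statistics (illustrative values only)
ratio_a = error_variance_ratio(12.0, 0.20, 8.0, 0.50)   # 138.24 / 48.0
ratio_b = error_variance_ratio(10.0, 0.30, 10.0, 0.40)  # 91.0 / 84.0
```

Only subgroup standard deviations on the criterion and subgroup validity coefficients are needed, which is why the reporting of these descriptive statistics matters for readers who wish to check the assumption independently.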
One might suggest that a 9% or 10% violation rate is not very alarming.
However, assuming that violation of this assumption results in erroneous conclusions
(discussed in the next section), nine percent of the tests may be biased.
Furthermore, with more than sixty percent of the projected workforce in the year
2000 consisting of women and ethnic minorities (Cascio, 1991), nine percent
represents an extraordinary number of people who could be potentially affected by
inadequate conclusions regarding employment tests. Finally, the prudent researcher
desires to make the strongest possible argument for his or her conclusions, and a
thorough understanding of the statistical analysis options is necessary to do so.
Because of this crucial role of data analysis in the interpretation of research, issues
that may improve methodology are important (Kromrey & Foster-Johnson, 1998).
Consequently, a detailed examination of the potential impact of violating the
homogeneity assumption is warranted. This is the topic of the
next section.
The Impact of Violation
Given the inattention to assessing the homogeneity of error variance
assumption, one might conclude that it must not be of appreciable importance.
Unfortunately, that does not seem to be the case. Using MMR to assess test bias
when the subgroup error variance is heterogeneous can easily lead to erroneous
conclusions because it impacts both Type I and II error rates (Aguinis & Pierce, in
press(a)).
Dretzke et al. (1982) and DeShon and Alexander (1996) examined the impact
of violating the assumption on Type I error rates through Monte Carlo simulations.
In general, the degree of impact was found to be largely a function of subgroup sample
size and the degree of variance across the predictors. Dretzke et al. manipulated subgroup
and total sample size and the nominal $\alpha$ across conditions of equal and
unequal error variance. For equal subgroup sample sizes, actual Type I error rates
of the F test for interaction did not differ significantly from the nominal $\alpha$ for either
case of assumption violation/compliance, but were consistently more liberal than the
nominal $\alpha$. However, equal subgroup size is rarely the case in differential
prediction research (cf. Mael, 1995; Pulakos & Schmitt, 1995; Weekley & Jones,
1997). In the more realistic case of unequal subgroup sample sizes, actual $\alpha$s were
outside the expected range and consistently too low (i.e., overly conservative) when
the homogeneity assumption was violated. The actual $\alpha$ was found to be attenuated
the most when the subgroup with the smallest $\rho_{XY}$ (or highest $\sigma_e$) was paired with
the smaller subgroup sample size. This aspect was similarly found in Robinson and
Dunlap's (1997) simulations assessing the impact of assumption violation in
ANCOVA applications. DeShon and Alexander (1996) also replicated these results
and extended the analysis to manipulate the variance of the predictor variable (X) as
well. Dretzke et al. (1982) created the unequal variance condition by manipulating
$\rho_{XY}$, which results in changes in variance on the criterion (Y) only. It was found that
unequal X variance also leads to overly conservative $\alpha$ rates, even in the case of
equal size subgroups. Thus, violation of the homogeneity of error variance
assumption has the effect of significantly decreasing the probability of incorrectly
concluding that a test is biased for the typical unequal group sample size situation.
Since Type I and Type II error rates are inversely related, these conditions can lead to
inflated Type II error rates, that is, insufficient statistical power, which is
discussed next.
As mentioned, the chance of concluding a test is not biased, when it actually
is, should be of paramount concern to differential prediction researchers and
practitioners. In conditions of low statistical power, this chance is greater.
Alexander and DeShon (1994) examined the impact of assumption violation on
statistical power through multiple Monte Carlo simulations. Total and subgroup
sample sizes, number of subgroups (2 and 3), and validity (and consequently error
variance) magnitudes were manipulated. The error variance conditions were also
alternately paired with the larger and smaller subgroup ns. Given the accepted
convention that statistical power should be .80 or above (Cohen, 1988), the F test for
equality of regression slopes in MMR performed dismally in Alexander and
DeShon's analysis. In 10,000 simulations of 40 different condition combinations,
the power of the F test was compared to that of the chi-square test. The chi-square test was
empirically demonstrated to be immune to the effect of violating the homogeneity of
error variance assumption; however, it is not an acceptable substitute for the F test in
MMR, since it suffers from low statistical power in general. The empirical rejection
rate (power) of the chi-square test was largely a function of sample size, and only
reached acceptable levels (greater than 80% correct conclusions that the test is
biased) for total sample sizes of 300. However, the F test did not exceed the .80 power
level until total sample size reached 800 for two-subgroup simulations (e.g., male vs.
female) and 1200 when comparing three subgroups (e.g., white, African-American,
Hispanic).
The evidence from Alexander and DeShon's (1994) simulations is striking in
two additional ways. First, the power of the F test was consistently and considerably
lower than the chi-square power when subgroup samples were unequal, regardless
of which group (higher or lower error variance) had the larger n. Of the 15 studies
listed in Table 2 (see Appendix A), only three (i.e., Sackett et al., 1991; Qualls &
Ansley, 1995; Young, 1994) used samples with close to equal subgroup
sizes (i.e., better than a 60%-40% split). Given the statistical power problem, it is
then not surprising that fewer than half of these studies found tests with significant
interactions. Secondly, the power of the F test was also lower when the larger subgroup
sample size was paired with the smaller correlation coefficient. This has been
documented as precisely the prevalent situation in differential prediction settings.
That is, the majority group (i.e., larger n) typically has the larger error
variance associated with the regression (Aguinis & Pierce, in press(a); Hunter,
Schmidt, & Hunter, 1979; cf. Hattrup & Schmitt, 1990; Melamed, Ben-Avi, &
Green, 1995).
In summary, it is clear that the F test used in MMR is not robust to violation
of the homogeneity of error variance assumption. That is, researchers are more
likely to incorrectly conclude that bias exists (Type I error) when the smaller subgroup
sample also has the smaller correlation coefficient. However, in the more
common setting (when subgroup sample sizes are unequal and the larger subgroup
has the smaller correlation coefficient associated with it), committing a Type II error
may be more likely, and frequent, than committing a Type I error... (Aguinis &
Pierce, in press(a)). It bears repeating that this indicates a higher probability that a
test declared to be unbiased is actually differentially predictive across subgroups.
This problem of statistical power in the presence of heterogeneous error variance was
presented many years ago (cf. Bartlett & O'Leary, 1969; Gulliksen & Wilks, 1950).
In fact, Gulliksen and Wilks (1950) recognized this, and suggested that slope
differences should not even be assessed with MMR in the presence of error variance
heterogeneity. Thus, the next section addresses how to determine if this condition
exists for a validation sample at hand, followed by alternatives for dealing with the
seemingly unavoidable presence of error variance heterogeneity.
Estimation of Homogeneity of Error Variance
A plethora of variance heterogeneity tests have been developed and
examined in the ANOVA literature. Based on error rate comparisons of several of
these tests by Gartside (1972) and Games, Winkler, and Probert (1972), DeShon and
Alexander (1996) concluded that "...Bartlett's (1937) test is one of the most flexible
and powerful tests available" (p. 271). Both Gartside and Games et al. found that in
simulation conditions applicable to test bias settings (e.g., three or fewer subgroups,
unequal subgroup sample size), Bartlett's test adhered the most closely to nominal
Type I error rates, and demonstrated the highest statistical power rates. However, as
DeShon and Alexander noted, these two studies also found that Bartlett's test was
outperformed when the variances being compared deviated from normality.
Similarly, Gartside as well as Games et al. found that the more conservative Log
ANOVA test (Bartlett & Kendall, 1946, as cited) maintained rates closest to the nominal
$\alpha$ under various nonnormally distributed variance conditions.
Given these findings, Bartlett's test seems to be the method of choice,
accompanied by an empirically derived rule of thumb to account for deviations from
normality (Aguinis & Pierce, in press(a); DeShon & Alexander, 1996). That is,
DeShon and Alexander (1996) also simulated thousands of conditions and found that
the F test does not become adversely affected until the error variance of one subgroup
is approximately 1.5 times larger than the other subgroup's error variance.
This is useful assessment information, considering that normally distributed samples
are rare (Micceri, 1989).
The Bartlett test is easily adapted to MMR applications by substituting
subgroup error variances (i.e., $\sigma^2_{e(i)}$) for unconditional subgroup variances (i.e., $s_i^2$)
(DeShon & Alexander, 1996), as presented in Appendix B. However,
adaptation of the Log ANOVA test to MMR applications is not as straightforward.
This test assumes that subgroup sample sizes are equal, and requires each subgroup
sample to be broken up into an ambiguous number of even smaller samples for
analysis (Games et al., 1972). Given that the performance of this test was found to
vary based on the number of sub-subgroup samples chosen (Gartside, 1972), this
test would be of little practical value for interpretation. Thus, calculation of the
Bartlett statistic in conjunction with DeShon and Alexander's rule of thumb (i.e., where
1.5 is the maximum acceptable ratio of one subgroup error variance to another)
appears to be the best means of assessing compliance with the homogeneity of error
variance assumption.
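The adaptation of Bartlett's test described above can be sketched as follows. This is a minimal illustration based on the standard form of Bartlett's statistic, not a reproduction of Appendix B. It assumes that each subgroup error variance carries $n_i - 2$ degrees of freedom (one slope and one intercept estimated per subgroup), and the function name is hypothetical.

```python
import math

def bartlett_error_variance(ns, error_vars):
    """Bartlett-type statistic for k subgroup error variances.

    ns         -- subgroup sample sizes
    error_vars -- subgroup mean squared residuals from regressing Y on X
    Assumes n_i - 2 degrees of freedom per subgroup (an assumption of this sketch).
    Returns (statistic, degrees of freedom); the statistic is referred to a
    chi-square distribution with k - 1 degrees of freedom.
    """
    k = len(ns)
    dfs = [n - 2 for n in ns]
    df_total = sum(dfs)
    # Pooled error variance, weighted by subgroup degrees of freedom
    pooled = sum(v * s2 for v, s2 in zip(dfs, error_vars)) / df_total
    numerator = df_total * math.log(pooled) - sum(
        v * math.log(s2) for v, s2 in zip(dfs, error_vars)
    )
    correction = 1.0 + (sum(1.0 / v for v in dfs) - 1.0 / df_total) / (3.0 * (k - 1))
    return numerator / correction, k - 1

# Hypothetical two-subgroup example: error variances of 4.0 and 1.0
m_stat, m_df = bartlett_error_variance([100, 50], [4.0, 1.0])
# With 1 degree of freedom, the .05 chi-square critical value is about 3.841
```

In practice the statistic would be read alongside the 1.5 ratio rule of thumb, since the text notes that Bartlett's test is sensitive to departures from normality.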
Alternative Statistics for Assessment of Bias
The next logical topic to address is: What should be done when the actual
validation sample does not comply with the assumption? Several alternative
statistics for evaluating regression slope differences have been promoted in the
literature, including nonparametric tests that do not require the assumption and
parametric tests that adjust the degrees of freedom associated with the more
common t and F tests (Aguinis & Pierce, in press(a)). Again, each approach has
advantages and disadvantages that depend on the distribution of the sample and the
experimental design, and none can be used to assess intercept differences. The
following is a description and comparison of the four most prominent test bias
assessment alternatives: the nonparametric chi-square test (U), the Welch-Aspin F*
approximation, James's second-order approximation (J), and Alexander's
normalized-t approximation (A) (Aguinis & Pierce, in press(a); Alexander &
Govern, 1994; DeShon & Alexander, 1994a; Dretzke et al., 1982).
Chi-square Test ($\chi^2$)
Dretzke et al. (1982) examined the nonparametric alternative test for equality
of slopes, Marascuilo's (1966) U statistic. This test is similar to the ordinary F test
for comparing two regression coefficients; however, it incorporates the individual
error variances of each subgroup to control for the problems associated with
violation of the homogeneity assumption. Also, the U statistic is asymptotically
distributed as chi-square with $k - 1$ degrees of freedom.
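A minimal sketch of a U-type statistic follows. It assumes the commonly cited weighted form, in which each subgroup slope is weighted by the inverse of its own sampling variance and that variance is estimated from the subgroup's own error variance. The function names and the exact variance estimator are assumptions of this sketch, not a transcription of Marascuilo's (1966) formulas.

```python
def slope_sampling_variance(n, var_x, error_var):
    """Estimated sampling variance of a subgroup slope: the subgroup error
    variance divided by the sum of squared deviations of X, (n - 1) * var_x."""
    return error_var / ((n - 1) * var_x)

def u_statistic(slopes, slope_vars):
    """Weighted sum of squared deviations of subgroup slopes from their
    variance-weighted mean. Returns (U, degrees of freedom = k - 1)."""
    weights = [1.0 / v for v in slope_vars]
    b_bar = sum(w * b for w, b in zip(weights, slopes)) / sum(weights)
    u = sum(w * (b - b_bar) ** 2 for w, b in zip(weights, slopes))
    return u, len(slopes) - 1

# Hypothetical subgroup slopes and slope sampling variances
u_value, u_df = u_statistic([0.5, 0.9], [0.01, 0.03])
```

Because each subgroup contributes its own variance estimate, no pooled error variance is needed, which is the feature that frees the statistic from the homogeneity assumption.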
Welch-Aspin Approximation (F*)
This statistic (an extension of Welch's (1947) two-sample test of means) was
also examined by Dretzke et al. (1982), and can be viewed as a compromise between
the F and the U statistics. That is, the F* uses separate residual variance estimators
like the U, but the test for significance is based upon a more appropriate number of
degrees of freedom (p. 378) associated with the F test. The F* is a relatively easy
statistic to calculate, and is in fact a test included in both SAS and BMDP statistical
software packages (Algina, Oshima, & Lin, 1994; DeShon & Alexander, 1994a), but
not SPSS (version 7).
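For the two-subgroup case, the idea behind F* can be sketched with the familiar Welch-Satterthwaite adjustment applied to slopes instead of means. This is an illustrative sketch under that assumption, with a hypothetical function name. It does not cover the general k-subgroup formula or reproduce the SAS or BMDP implementations.

```python
def welch_slope_test(b1, var_b1, df1, b2, var_b2, df2):
    """Two-subgroup Welch-type test of equal slopes.

    b1, b2         -- subgroup slopes
    var_b1, var_b2 -- their estimated sampling variances (separate estimators)
    df1, df2       -- residual degrees of freedom in each subgroup (e.g., n - 2)
    Returns (F_star, approximate denominator degrees of freedom); F_star is
    referred to an F distribution with 1 and the returned degrees of freedom.
    """
    f_star = (b1 - b2) ** 2 / (var_b1 + var_b2)
    approx_df = (var_b1 + var_b2) ** 2 / (var_b1 ** 2 / df1 + var_b2 ** 2 / df2)
    return f_star, approx_df

# Hypothetical values: slopes of .5 and .9 from subgroups of 100 and 50
f_star, f_df = welch_slope_test(0.5, 0.01, 98, 0.9, 0.03, 48)
```

The numerator uses separate variance estimators, as U does, while the adjusted denominator degrees of freedom always fall between the smaller subgroup's degrees of freedom and the sum of the two.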
James's Second-Order Approximation (J)
The J statistic was developed in 1951 to correct for the infinite degrees of
freedom used to reference U to the chi-square distribution (DeShon & Alexander,
1994a). Because of the computational complexity of J, examination of this statistic
was not extensive until recently. Since this statistic was found to control for Type I
and Type II error rates over a wide range of conditions compared to the other
approximations (discussed below), DeShon and Alexander (1994a) developed a
program for SAS to assess regression slope differences when the user inputs group
sample sizes, regression weights, and variances for the dependent and independent
variables. Appendix C presents the calculations for J.
Alexander's Normalized-t Approximation (A)
The A statistic was derived by Alexander and Govern (1994) as a simpler
alternative to the J statistic. Calculation of this statistic (presented in Appendix D)
involves a normalizing transformation for the t statistic for each subgroup. The
resulting statistic is referenced to the chi-square distribution with uncomplicated (i.e.,
$k - 1$) degrees of freedom determination.
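The normalizing transformation can be sketched as below. The sketch follows the published form of the Alexander and Govern (1994) normalization as it is usually stated for means, applied here to slopes with $n_i - 2$ degrees of freedom per subgroup. That adaptation, and the function names, are assumptions of this sketch and not a transcription of Appendix D.

```python
import math

def normalized_t(t, df):
    """Transform a t value with df degrees of freedom into an approximately
    standard normal deviate (the normalization used by the A statistic)."""
    a = df - 0.5
    b = 48.0 * a ** 2
    c = math.sqrt(a * math.log(1.0 + t ** 2 / df))
    z = (c + (c ** 3 + 3.0 * c) / b
         - (4.0 * c ** 7 + 33.0 * c ** 5 + 240.0 * c ** 3 + 855.0 * c)
         / (10.0 * b ** 2 + 8.0 * b * c ** 4 + 1000.0 * b))
    return math.copysign(z, t)  # keep the sign of t; only z squared enters A

def a_statistic(slopes, slope_vars, dfs):
    """Sum of squared normalized deviates of subgroup slopes from their
    variance-weighted mean. Returns (A, degrees of freedom = k - 1)."""
    weights = [1.0 / v for v in slope_vars]
    b_bar = sum(w * b for w, b in zip(weights, slopes)) / sum(weights)
    zs = [normalized_t((b - b_bar) / math.sqrt(v), df)
          for b, v, df in zip(slopes, slope_vars, dfs)]
    return sum(z * z for z in zs), len(slopes) - 1

# Hypothetical values: the same slopes and variances used for the U sketch would give
a_value, a_df = a_statistic([0.5, 0.9], [0.01, 0.03], [98, 48])
```

With large degrees of freedom the transformation leaves t almost unchanged, so A approaches the weighted sum of squared t values; with small subgroups it shrinks each t toward zero, which is what keeps the chi-square reference distribution appropriate.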
Comparison of the Alternative Statistics
Each of the above alternative tests has been assessed via several Monte Carlo
simulations. While all solidly outperform the F test of MMR in terms of statistical
power in the presence of heterogeneous error variance, these tests also are susceptible
to some conditions that limit their power. Accordingly, each of these simulation
studies will now be briefly discussed.
Dretzke et al. (1982) assessed Type I error rates in the presence of
heterogeneous error variances for the F, F*, and the U tests of regression slopes.
Through manipulation of subgroup ns, nominal $\alpha$ rates, and variance heterogeneity
conditions, Dretzke et al. found that only Welch's F* consistently maintained Type I
error probabilities within 95% confidence limits of the nominal rate. The nonparametric
U test was overly liberal in the majority of the simulation conditions. Similarly,
Algina et al. (1994) examined Type I error rates only for the F* and the J statistics as
compared to the independent samples t test. In this study, distribution skewness,
variance homogeneity, and total/subgroup sample sizes were manipulated. Both the
F* and the J were found to adequately control Type I error rates close to nominal
levels, even at sample sizes as small as 60 combined with 2:1 ratios of subgroup
variances. However, DeShon and Alexander (1994a) compared the J, F*, and F Type
I error rates in a manner similar to Algina et al., but increased the number of subgroups
to as high as eight. The F* statistic became slightly overliberal when k was
greater than 3 under certain conditions, and this became more pronounced as k
increased. The J statistic maintained empirical rejection rates very close to the
nominal level.
Three simulation studies (i.e., Alexander & Govern, 1994; DeShon &
Alexander, 1996; Robinson & Dunlap, 1997) assessed both Type I and II error rates.
Nearly identical rank-order results were obtained in each of these studies, but only
DeShon and Alexander (1996) assessed the A, F*, and J statistics simultaneously and
considered departure from normality of Y as well as variance heterogeneity in both X
and Y. Therefore, these findings present the most comprehensive evidence, and the
primary findings are discussed here.
First, Type I error rates were within acceptable ranges for all statistics (except
F) in the two-subgroup conditions with heterogeneous error variances. Each also
had nearly equivalent power, with acceptable rates (i.e., > .80) achieved for equal
subgroup sample sizes above 125. However, J had a slight power advantage when
sample sizes were small and subgroups were unequal, and A ranked second.
Conversely, A was slightly more robust to deviations from normality in Y.
However, both were quite overliberal with respect to Type I error when X and Y were
nonnormal in both groups.
In summary, both the A and J statistics are strong alternatives to the F test in
conditions of heterogeneous error variances for evaluating subgroup slope
differences. Even though the U statistic is not as susceptible to nonnormality as the
other parametric tests, it has little utility. This test performs inadequately for small
or moderately large sample sizes, and requires that intercepts across the (maximum
of 2) subgroups are equal (i.e., it can only test the equality of rs across subgroups, or
differential validity) (DeShon & Alexander, 1996; Wilcox, 1988). The F* has no
advantage over the A or J statistics, and is the most susceptible to deviations from
normality. The A statistic is simpler to compute and is slightly more robust to
deviations from normality. However, one important advantage of J is its statistical
power for small samples.
A researcher could quite simply look at the descriptive statistics for his or her
sample and choose the best method. However, calculation of these alternative
statistics seems unnecessary if the homogeneity assumption is not violated. When
this assumption and the other regression assumptions are met, MMR is an
acceptable test bias assessment technique. Since there is evidence to indicate that
heterogeneity of error variance is common enough to warrant concern, it would be
quite useful for test researchers and evaluators to have the ability to quickly evaluate
compliance and compute these statistics if the assumption is not met. With an easily
accessed computer program, the complexity of computation becomes a moot point,
and both the A and J statistics can be calculated for comparison. Such a tool could
help prevent organizations from unwittingly implementing a selection tool that is
biased by gender or race factors.
CHAPTER 3
METHOD
Computer Program for Statistical Assessment of Test Bias
In order to provide such a tool for test bias assessment, a new computer
program was developed. Although one program (i.e., DeShon and Alexander's
(1994a) SAS program) exists for computing J and A, it is not readily accessible for
all PC users, and it does not assess error variance homogeneity. DeShon and
Alexander's program requires SAS, a commercial statistical package that not all
researchers have access to. Therefore, several programming languages (e.g., Java,
C++, Pascal, Visual Basic) were considered for creating a program to compute
DeShon and Alexander's (1996) error variance rule of thumb, as well as Bartlett's,
James's, and Alexander's statistics. Although any of these languages could be used
for such computations, Java is an increasingly popular programming language
because of its flexibility for World Wide Web applications. Since most PCs today
have a web browser preinstalled, a Java applet has the potential to reach the most
PC users regardless of operating system platform (e.g., Windows, Macintosh, OS/2).
Additionally, Java applets can be executed as stand-alone applications when the user
has the necessary supporting files installed on the PC, and these files are part of the
popular Microsoft Internet Explorer web browser. Thus, the program for assessing
test bias was developed using the Java programming language (with Microsoft
Visual J++, version 1.1). The main functions of the program are discussed below.
Check for Homogeneity of Variance
Input. Assuming that the user has another statistical package available (e.g.,
SPSS, Minitab, SAS) to obtain descriptive statistics, this program prompts the user
for the necessary data in two general steps. First, the user inputs the number of
subgroups to be compared and selects the alpha level (this is only necessary for
James's statistic, as precise p-values are calculated for the other statistics). Second,
a new input window is displayed for each subgroup's input. To minimize user
requirements, only minimal data entries are requested (e.g., correlation coefficients
and standard deviations as opposed to raw data). More specifically, to conduct
Bartlett's test, the user must provide the number of subgroups (k), the information
required to estimate each subgroup error variance ($\sigma^2_{e(j)}$) (i.e., $s_{Y(j)}$ and
$r_{XY(j)}$, discussed in the following section), and the subgroup sample sizes (i.e.,
$n_j$, which is necessary to compute the degrees of freedom).
Computations. Although the subgroup error variances can be estimated from
the mean square residual terms after reversing the regression equation (e.g., job
performance becomes the predictor, test scores the criterion), they can also be estimated
from the variance in the actual criterion (e.g., job performance, or $s^2_{Y(j)}$) and the
correlation of the test score and the criterion (i.e., $r_{XY(j)}$). Since these data are also
required for computation of the alternative statistics, obtaining this input minimizes
user requirements. However, the user must then conduct two separate operations for
each subgroup before using the program. That is, he or she must obtain the standard
deviation, $s_{Y(j)}$, for each subgroup, and the bivariate correlation of the group's test
scores with the criterion, $r_{XY(j)}$. For this option, $s^2_{Y(j)}$ is an estimate of the population
variance, $\sigma^2_{Y(j)}$, and $r_{XY(j)}$ is an estimate of the correlation in the population,
$\rho_{XY(j)}$, for each subgroup. Thus the error variance for each subgroup can be estimated using
the equation:
$$\hat{\sigma}^2_{e(j)} = s^2_{Y(j)} \left(1 - r^2_{XY(j)}\right) \qquad (4)$$
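As a minimal sketch of equation (4), the following Java fragment estimates a subgroup error variance from the two summary statistics the user supplies. The class and method names are hypothetical and are not taken from the program in Appendix E; the example inputs are the subgroup values listed for Halpin et al. (1990) in Table 5.

```java
// Sketch of equation (4). Names are hypothetical, not taken from Appendix E.
final class ErrorVarianceSketch {

    // sY:  subgroup standard deviation of the criterion.
    // rXY: subgroup correlation between test score and criterion.
    static double errorVariance(double sY, double rXY) {
        return sY * sY * (1.0 - rXY * rXY);
    }

    public static void main(String[] args) {
        // Subgroup summary statistics as listed in Table 5.
        System.out.println(errorVariance(13.7933, 0.48));
        System.out.println(errorVariance(8.8971, 0.42));
    }
}
```

Because only a standard deviation and a correlation are needed, the same two inputs can later be reused for the slope statistics without asking the user for raw data.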
The degrees of freedom, $v_j$, for these estimates are calculated as $n_j - 1$. Thus,
Bartlett's statistic (M) is calculated using the equations in Appendix B by computing
and storing the necessary variables in arrays. The size of these arrays is determined
by the number of subgroups to be compared. After M is determined, the precise
p-value for the test of the null hypothesis that the variances are equal is calculated. The
values of M and the degrees of freedom are used by a separate algorithm to
determine this p-value.
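The array-based computation described above might be sketched as follows. This is the textbook form of Bartlett's statistic with $v_j = n_j - 1$, offered as an illustration only; Appendix B remains the authoritative statement of the equations, and the class and method names are hypothetical. The p-value step is omitted because it requires a chi-square distribution routine. With the Table 5 inputs, the unscaled M below comes out close to the 4.8467 listed there. Bartlett's scaling constant C is computed separately, since this chapter does not say at which step the program applies it.

```java
// Sketch of Bartlett's M from subgroup error variances and sample sizes.
// Textbook form, assumed here; not copied from Appendix B or Appendix E.
final class BartlettSketch {

    // Unscaled statistic:
    // M = (sum v_j) * ln(pooled variance) - sum(v_j * ln(variance_j)), v_j = n_j - 1.
    static double bartlettM(double[] errorVar, int[] n) {
        double sumV = 0.0;
        double weightedVar = 0.0;
        double sumVLog = 0.0;
        for (int j = 0; j < errorVar.length; j++) {
            double v = n[j] - 1;
            sumV += v;
            weightedVar += v * errorVar[j];
            sumVLog += v * Math.log(errorVar[j]);
        }
        double pooled = weightedVar / sumV;
        return sumV * Math.log(pooled) - sumVLog;
    }

    // Bartlett's scaling constant C; M / C is referred to a chi-square
    // distribution with k - 1 degrees of freedom.
    static double correction(int[] n) {
        int k = n.length;
        double sumV = 0.0;
        double sumInverse = 0.0;
        for (int j = 0; j < k; j++) {
            double v = n[j] - 1;
            sumV += v;
            sumInverse += 1.0 / v;
        }
        return 1.0 + (sumInverse - 1.0 / sumV) / (3.0 * (k - 1));
    }

    public static void main(String[] args) {
        // Error variances from equation (4) for the two Table 5 subgroups.
        double[] errorVar = {
            13.7933 * 13.7933 * (1.0 - 0.48 * 0.48),
            8.8971 * 8.8971 * (1.0 - 0.42 * 0.42)
        };
        int[] n = {49, 26};
        System.out.println("M = " + bartlettM(errorVar, n));
        System.out.println("C = " + correction(n));
    }
}
```

Accumulating the three sums in one pass mirrors the array approach described above and extends to any number of subgroups without further changes.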
The ratio of error variances is also computed for comparison to the 1:1.5 rule
of thumb. When more than two subgroups are evaluated, the highest error variance
ratio is selected from all possible ratios. For example, in a four-subgroup
comparison (i.e., groups 1 through 4), there are six possible ratios: 1:2, 1:3, 1:4, 2:3,
2:4, 3:4. An algorithm first determines the number of possible ratios, and later
computes each ratio and stores it in an array of that size. A simple loop is then
used to choose the highest ratio in that array.
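The ratio selection just described can be sketched in a few lines. The names below are hypothetical, and taking each ratio as the larger variance over the smaller is an assumption made so that every ratio is at least 1 and can be compared directly with 1.5.

```java
// Sketch of the highest-error-variance-ratio step. Names are hypothetical.
final class VarianceRatioSketch {

    // Number of distinct subgroup pairs, k(k - 1)/2 (six pairs when k = 4).
    static int pairCount(int k) {
        return k * (k - 1) / 2;
    }

    // Stores every pairwise ratio (larger variance over smaller), then
    // scans the array for the highest one.
    static double highestRatio(double[] errorVar) {
        double[] ratios = new double[pairCount(errorVar.length)];
        int next = 0;
        for (int a = 0; a < errorVar.length; a++) {
            for (int b = a + 1; b < errorVar.length; b++) {
                double larger = Math.max(errorVar[a], errorVar[b]);
                double smaller = Math.min(errorVar[a], errorVar[b]);
                ratios[next++] = larger / smaller;
            }
        }
        double highest = 1.0; // larger over smaller is never below 1
        for (double ratio : ratios) {
            if (ratio > highest) {
                highest = ratio;
            }
        }
        return highest;
    }

    // True when the highest ratio exceeds the 1:1.5 rule of thumb.
    static boolean exceedsRuleOfThumb(double[] errorVar) {
        return highestRatio(errorVar) > 1.5;
    }

    public static void main(String[] args) {
        // Error variances from equation (4) for the four age groups in Table 6.
        double[] errorVar = {109.1113, 116.0849, 143.8542, 148.3727};
        System.out.println(highestRatio(errorVar));
        System.out.println(exceedsRuleOfThumb(errorVar));
    }
}
```

The highest ratio always equals the largest variance divided by the smallest, so the pairwise array could be replaced by a single pass that tracks the minimum and maximum; the array version is shown only because it follows the description above.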
Output. After computations are complete, three statistics regarding the
homogeneity of error variance for the subgroups are displayed: (1) the (highest)
ratio of error variance, (2) the value of Bartlett's statistic (M), and (3) the p-value
associated with that M. Additionally, conditional statements in the program provide
explanations of these values. That is, the program displays whether or not the ratio
meets DeShon and Alexander's (1996) rule of thumb, and whether Bartlett's test
indicates homogeneity or heterogeneity.
Calculation of James's and Alexander's Statistics
Input. Since $k$, $n_j$, $s_{Y(j)}$, and $r_{XY(j)}$ were already used to calculate Bartlett's
statistic, the only additional data required are the subgroup standard deviations of
test scores ($s_{X(j)}$). However, for efficiency reasons, this input is requested along with
the other input.
Computations. The J and A statistics are calculated from the equations in
Appendices C and D, respectively. These calculations are again completed using
array manipulation to facilitate simple calculation of the sum and product terms in
these equations. Note that these equations use unstandardized regression weights (b),
not correlations. Regression weights are easily calculated using equation (5) for the
input data:
$$b_j = r_{XY(j)} \frac{s_{Y(j)}}{s_{X(j)}} \qquad (5)$$
Additionally, because of the complexity of the equations, separate lines of code are
used to compute pieces of the larger equations (e.g., numerator and denominator).
This simplified debugging. Finally, note that a precise p-value cannot
be calculated using James's equations. The thrust of this approximation is in
adjusting the critical value of the chi-square distribution to account for the finite
degrees of freedom of the sample (DeShon & Alexander, 1994a). Thus, the value of J is not
referenced to the chi-square distribution directly; the adjusted critical value is
calculated from an initial critical value based on the actual degrees of freedom for the
sample. That is what the large equation for h(α) in Appendix C does.
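The slope, James, and Alexander computations can be sketched from summary statistics as follows. The formulas are the forms commonly given in the literature for James's statistic and for Alexander and Govern's normalizing transformation. They are assumed here rather than copied from Appendices C and D, which remain authoritative, and all names are hypothetical. James's adjusted critical value h(α) and the p-value for A are deliberately left out. With the Table 5 inputs, the sketch gives values close to the 0.1446 and 0.1419 listed there.

```java
// Sketch of the slope-equality statistics. Formulas are assumed from the
// literature, not copied from Appendices C and D; names are hypothetical.
final class SlopeTestsSketch {

    // Unstandardized slope from a correlation and two standard deviations.
    static double slope(double r, double sX, double sY) {
        return r * sY / sX;
    }

    // Squared standard error of the slope (residual variance on n - 2 df).
    static double slopeVar(int n, double r, double sX, double sY) {
        return sY * sY * (1.0 - r * r) / ((n - 2) * sX * sX);
    }

    // Variance-weighted common slope, b+.
    static double commonSlope(double[] b, double[] se2) {
        double sumW = 0.0;
        double sumWb = 0.0;
        for (int j = 0; j < b.length; j++) {
            double w = 1.0 / se2[j];
            sumW += w;
            sumWb += w * b[j];
        }
        return sumWb / sumW;
    }

    // James's statistic: weighted squared deviations of the slopes from b+.
    static double jamesU(double[] b, double[] se2) {
        double bPlus = commonSlope(b, se2);
        double u = 0.0;
        for (int j = 0; j < b.length; j++) {
            u += (b[j] - bPlus) * (b[j] - bPlus) / se2[j];
        }
        return u;
    }

    // Alexander's A: each t_j is normalized to z_j, then A = sum of z_j^2.
    static double alexanderA(double[] b, double[] se2, int[] n) {
        double bPlus = commonSlope(b, se2);
        double sum = 0.0;
        for (int j = 0; j < b.length; j++) {
            double t = (b[j] - bPlus) / Math.sqrt(se2[j]);
            double v = n[j] - 2;
            double a = v - 0.5;
            double bb = 48.0 * a * a;
            double c = Math.sqrt(a * Math.log(1.0 + t * t / v));
            double z = c + (c * c * c + 3.0 * c) / bb
                    - (4.0 * Math.pow(c, 7) + 33.0 * Math.pow(c, 5)
                       + 240.0 * Math.pow(c, 3) + 855.0 * c)
                      / (10.0 * bb * bb + 8.0 * bb * Math.pow(c, 4) + 1000.0 * bb);
            sum += z * z; // the sign of t is irrelevant once z is squared
        }
        return sum;
    }

    public static void main(String[] args) {
        // Subgroup summary statistics as listed in Table 5.
        int[] n = {49, 26};
        double[] b = {slope(0.48, 19.3348, 13.7933), slope(0.42, 13.1788, 8.8971)};
        double[] se2 = {slopeVar(49, 0.48, 19.3348, 13.7933),
                        slopeVar(26, 0.42, 13.1788, 8.8971)};
        System.out.println("James: " + jamesU(b, se2));
        System.out.println("Alexander: " + alexanderA(b, se2, n));
    }
}
```

Splitting the work into small methods follows the practice described above of computing pieces of the larger equations separately, which makes intermediate values such as b+ easy to check by hand.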
Output. Along with the output for homogeneity, the user is provided with
each statistic's value, the adjusted critical value associated with J, and the p-value
associated with the A statistic in a readable format. Additionally, hypothesis
rejection information for these tests is displayed. The data that the user entered are
shown at the top of the display for verification. Finally, instructions for saving the
output and/or executing the program again using different data are displayed. Figure 7
in Appendix E is a graphical flowchart representing the sequence the program
follows, and Figure 6 shows what the output looks like when saved as a text file.
Accuracy Checks
In order to assess the accuracy of the program calculations, hypothetical data
were entered in the Java program and DeShon and Alexander's (1994a) SAS program.
The SAS program was validated using hand calculations and Monte Carlo study data
(R. P. DeShon, personal communication, November 15, 1997). To evaluate the
Bartlett test accuracy (not part of DeShon and Alexander's program), hand
calculations using the input for iteration A in Table 8 (Appendix F) were compared
to computer output. During program development, the value for each calculated
variable (e.g., b+, standard errors, logarithm of error variances, as defined in
Appendix B) was output and compared to the hand-calculated values. Except for
extremely small differences attributable to rounding, the program is accurate.
Appendix F provides a more detailed presentation of the checks performed.
CHAPTER 4
RESULTS
The Program
Executable files used in the Java programming language are called classes,
and Appendix E contains the Java code for the main class file of the program. Other
class files are also part of the program package, but the purpose of these files is
simply to construct and manage the dialog windows that interface with the user to get
the input. The code for these files was generated using resource templates in
the J++ programming package, and is not included here for the sake of brevity. All
of the statistical calculations and output functions are included in the code found in
Appendix E, in addition to a sample output text file. In all, the program
package consists of this code (when compiled) and seven additional files of varying
length that serve to provide the graphical user interface. All of these files (called an
applet in Java terminology), plus a hypertext markup language (HTML) file to
contain the applet, are required to execute the program using a Java-capable web
browser (e.g., Netscape Navigator version 3.0 or later), and fit on one floppy disk as
a self-extracting zip (compressed) file. Another compatible browser, Microsoft
Internet Explorer version 4.0 (IE 4.0), is available free of charge to Windows 95 and
Windows NT users. It is also available for Windows 3.1, UNIX systems, and
Macintosh users for a modest fee.
However, execution of the package as an applet does not permit saving or
printing the generated output. Since applets are mainly intended for execution over
the Internet, this capability is limited for security reasons (i.e., Internet users would
not want web pages to be able to modify the contents of their PCs). Thus, a second,
stand-alone version was developed to provide this capability. This version simply
has a few more lines of code to enable the user to save the output as a text file for
printing or future reference. To do this, the Java application generator utility in
Microsoft's Software Development Kit for Java (version 2.01) was used to combine
the eight class files into one MS-DOS executable program. This file can be executed
on any MS-DOS based personal computer as long as the Microsoft Java Virtual
Machine (JVM) files are also installed on the computer. The JVM files are part of
Internet Explorer (version 4.0), but can also be downloaded free of charge at
Microsoft's website at http://www.microsoft.com/java/. The JVM, even when
compressed, requires over 3.7 megabytes of disk space, and thus must span three
floppy disks for distribution. If an interested user does not have Internet access, but
does have an unzipping (decompressing) program available, these files could be
distributed via floppy disks in addition to the small (39 KB) stand-alone application
file. Currently, the entire program can be executed on the World Wide Web at this
address: http://www.members.aol.com/imsap/testbias.html. The files to install the
program are also available for download from another site linked to this page.
In summary, two versions of the program were developed: one without
output preservation capability for use with an up-to-date web browser, and a
stand-alone version that can save output as a text file, but requires some large files to
support it. Appendix E contains a brief summary of the files required for each
version, and the Java code for the stand-alone version. Since a compatible web
browser and the required files to support the stand-alone version are available free of
charge from the Internet, most computer owners interested in using this program will
be able to do so without any additional expense. In fact, based on data from Technologies
Research Group (TRG), an independent marketing research firm, 83 percent of the
ISPs [Internet service providers] in the United States have chosen to provide Internet
Explorer to their new customers in the past six months (PRNewswire, 1998a), and
nearly 57% of small and medium-sized businesses have adopted IE 4.0
(PRNewswire, 1998b).
Sample Test Bias Analyses
In order to evaluate the utility of the program, data from four published
studies (i.e., Qualls & Ansley, 1995; Hattrup & Schmitt, 1990; Halpin, Simpson, &
Martin, 1990; Zeidner, 1987) were analyzed using the newly developed program.
These studies were selected because they were the only test bias studies of the fifteen
reviewed (see Table 1) that provided all the descriptive statistics necessary to check
compliance with the homogeneity of subgroup error variance assumption. A brief
summary of the studies and the (often contradictory) results of these analyses are
presented below.
Qualls and Ansley, 1995.
Although this study did not examine employment testing, it did examine
some widely used academic achievement tests for gender bias. Specifically, the
predictive bias of the Iowa Tests of Basic Skills (ITBS), the Iowa Tests of
Educational Development (ITED), and the American College Test (ACT) was
evaluated in terms of predicting both high-school and first-year college grade point
averages (GPA). These tests are used for both college admissions selection and
school curriculum development. MMR was used to evaluate test bias: both
high-school and first-year college GPA were regressed hierarchically on each
composite test score, gender, and the interaction term (gender X test score) for fairly
large samples of men and women. Three of eleven interactions were significant (p <
.05) in predicting high-school GPA, but no significant interactions were found to
indicate gender bias in college GPA. For each test and test subscale (108 analyses)
the study reported subgroup sizes, predictor and criterion standard deviations, and
predictor-criterion correlations. Thus, the program developed here could be used to
assess compliance with the homogeneity of error variance assumption and calculate
James's and Alexander's alternative statistics.
Table 3 (below) presents the findings for 12 of these 108 analyses (composite
scores predicting college GPA, plus one subscale score) using the program.
DeShon and Alexander's (1996) rule of thumb for error variance homogeneity (1 :
1.5) was met for each of these samples, but Bartlett's statistic indicated that all of the
error variances were heterogeneous (for eleven, p < .05; for the one subscale
evaluated, p < .01). For the composite ACT and ITBS scores, James's and
Alexander's tests were not significant. However, both statistics indicated that the
subgroup slopes for the 12th grade ITED composite score predicting college GPA
were unequal (p < .05). Additionally, the Quantitative Thinking subscale of this
test is the only portion that predicts college GPA differently for males and females,
according to the J and A statistics (p < .01).
TABLE 3
Analysis of Qualls & Ansley, 1995
Predictor        Subgroups    n1   sx1   sy1   rxy1   n2   sx2   sy2   rxy2   Bias?  Err Var Ratio  Bartlett's M  James's   Alexander's
ACT Read         female-male  526  4.98  0.66  0.2    539  5.16  0.72  0.144  N      1.214          4.9828*       0.6076    0.6069
ACT English      female-male  526  4.20  0.66  0.298  539  3.98  0.72  0.21   N      1.249          6.5235*       0.772     0.771
ACT Math         female-male  526  3.70  0.66  0.296  539  3.94  0.72  0.255  N      1.220          5.221*        0.3387    0.3383
ACT Composite    female-male  526  3.70  0.66  0.299  539  3.39  0.72  0.236  N      1.234          5.8637*       0.0766    0.0765
ITBS grade 4     female-male  406  0.96  0.66  0.224  415  0.94  0.72  0.139  N      1.229          4.3326*       0.9021    0.9005
ITBS grade 6     female-male  421  1.11  0.66  0.274  427  1.05  0.72  0.173  N      1.248          5.1832*       1.0582    1.0563
ITBS grade 8     female-male  443  1.18  0.66  0.326  448  1.11  0.72  0.209  N      1.273          6.4739*       1.424     1.4213
ITED grade 9     female-male  404  4.28  0.66  0.302  398  4.79  0.72  0.191  N      1.262          5.3961*       2.9337    2.9247
ITED grade 10    female-male  387  4.33  0.66  0.328  406  4.30  0.72  0.219  N      1.270          5.6063*       1.4807    1.4774
ITED grade 11    female-male  455  4.98  0.66  0.334  456  5.42  0.72  0.258  N      1.250          5.6597*       1.4116    1.4089
ITED grade 12    female-male  312  5.26  0.66  0.352  317  6.52  0.72  0.195  N      1.307          5.5887*       6.2675*   6.226*
ITED (Quant) 12  female-male  319  5.83  0.66  0.379  323  5.59  0.72  0.141  n/a    1.362          7.6035**      7.1802**  7.126**
* p < .05
** p < .01
Hattrup and Schmitt, 1990.
This study used MMR to evaluate two employee training program selection
test batteries, and reported no predictive gender or racial bias on the two criterion
measures of performance. The first test consisted of a battery of subscales from the
Differential Aptitude Test (DAT) and the Employee Aptitude Survey (EAS). An
alternative battery of tests was developed from job analysis data, and included two
subscales measuring eye-hand coordination constructs. Additionally, two separate
criterion measures of performance were developed from job samples, and used to
assess the validity of these test batteries. Two forms of these task performance
measures (TPM), an unweighted composite and a unit/rationally weighted
composite, were used.
As presented in Table 4 (below), four of the eight moderated regressions were
conducted under conditions of error variance heterogeneity (i.e., both DeShon and
Alexander's rule of thumb and Bartlett's test indicated this). Nonetheless, neither
James's nor Alexander's tests indicated significantly different subgroup slopes for
any error variance condition. In fact, the p-values for all the tests exceeded .10.
TABLE 4
Analysis of Hattrup & Schmitt, 1990
Test            Subgroups       n1  sx1    sy1     rxy1  n2   sx2    sy2    rxy2  Bias?  Err Var Ratio  Bartlett's M  James's  Alexander's
DAT-EAS (TPM)   minority-white  50  53.56  9.78    0.45  245  49.03  7.02   0.34  N      1.76           7.2033**      1.7842   1.7425
DAT-EAS (WTPM)  minority-white  50  53.56  101.3   0.49  245  49.03  75.75  0.35  N      1.55           4.2979*       2.285    2.2243
DAT-EAS (TPM)   male-female     30  64.85  8.31    0.53  267  50.54  7.59   0.37  N      1.00           0             0.308    0.3021
DAT-EAS (WTPM)  male-female     30  64.85  84.44   0.55  267  50.54  81.36  0.38  N      1.14           0.2134        0.2155   0.2116
Alt (TPM)       minority-white  50  107.8  9.78    0.33  245  53.12  7.02   0.39  N      2.04           12.0451**     2.1831   2.1403
Alt (WTPM)      minority-white  50  107.8  101.32  0.36  245  53.12  75.75  0.38  N      1.82           8.3106**      1.7871   1.7571
Alt (TPM)       male-female     30  91.01  8.31    0.46  267  67.05  7.59   0.38  N      1.10           0.1328        0.0037   0.0037
Alt (WTPM)      male-female     30  91.01  84.44   0.47  267  67.05  81.36  0.39  N      1.01           0.0014        0.0482   0.0474
* p < .05
** p < .01
Halpin, Simpson, and Martin, 1990.
This study examined the predictive equality of another widely used
educational assessment test, the Peabody Picture Vocabulary Test-Revised
(PPVT-R). Specifically, the authors used MMR to assess racial bias of this test in predicting
an unbiased criterion score on the Wechsler Intelligence Scale for Children-Revised
(WISC-R). With a small sample of school-aged children (N = 75), the test
was concluded to be biased because the race X PPVT-R interaction term was
significant in a step-down regression analysis, even though subgroup slopes were
not significantly different. That is, the test was biased in predicting WISC-R scores
because separate subgroup regression equations had significantly different
intercepts.
Table 5 (below) presents the analysis of these data using the test bias
assessment program. Again, both homogeneity assessment tests indicated that the
error variance assumption was violated. However, contrary to the report of bias,
both James's and Alexander's tests indicate no bias (p > .05).
TABLE 5
Analysis of Halpin et al.'s Study
Scale   Subgroups  n   Sx       Sy       r     Bias?  Err Var Ratio  Bartlett's M  James's  Alexander's
PPVT-R  Whites     49  19.3348  13.7933  0.48  Y      1 : 2.2459     4.8467        0.1446   0.1419
        Blacks     26  13.1788  8.8971   0.42
* p < .05 (James's and Alexander's tests were not significant)
Zeidner, 1987.
The purpose of this study was to assess age bias in the predictive validity
of the Scholastic Aptitude Test (SAT). The moderating variable, age group, was
derived by artificially polychotomizing the age of 795 Israeli college students into
four groups, and again the performance criterion was first-year college GPA. Using
MMR, Zeidner concluded that there was evidence of age bias where college
performance was underpredicted for the 18-21 and 30+ age groups. The program
indicated that error variance was homogeneous for these subgroups (both the rule of
thumb and Bartlett's test show this), and James's and Alexander's statistics confirm
that the slopes differ across subgroups (p < .01). A summary of the analyses using
the new program is presented in Table 6, below.
TABLE 6
Analysis of Zeidner's Study
Scale          Subgroups     n    Sx     Sy     r     Bias?  Err Var Ratio  Bartlett's M  James's    Alexander's
SAT Composite  4 age groups                           Y      1:1.3598       5.5494        13.9476**  13.5138**
               18-21         288  8.93   10.95  0.30
               22-25         314  9.8    11.02  0.21
               26-29         72   10.49  13.15  0.41
               30+           121  8.42   12.22  0.08
** p < .01
In summary, the Java program was used here to examine error variance
homogeneity and subgroup regression slope differences for 22 employment and
education selection tests presented in professional journals. Both DeShon and
Alexander's rule of thumb and Bartlett's test indicated that five tests (in Hattrup &
Schmitt, 1990; Halpin et al., 1990) were originally evaluated with MMR in conditions of
heterogeneous error variance. However, Bartlett's test indicated that 12 other
samples (in Qualls & Ansley, 1995) had heterogeneous error variance when the rule
of thumb indicated otherwise. There were no cases of a rule of thumb violation
accompanied by nonsignificant Bartlett's test results. Alexander's and James's tests
indicated significant test bias for three samples. In one case (i.e., Zeidner, 1987), this
was consistent with the MMR finding, and the error variance appeared to be
homogeneous. In the other two cases (i.e., Qualls & Ansley, 1995), James's and
Alexander's tests were in contradiction with the MMR assessment, and only
Bartlett's test indicated heterogeneous error variance. Possible explanations for these
findings are discussed next.
CHAPTER 5
DISCUSSION
Implications of the Findings
The pervasiveness of homogeneity violation for the studies examined above
was even more prominent than in previous research. Recall that previous researchers
(e.g., Oswald et al., 1998; DeShon & Alexander, 1994b) found violation of the
homogeneity assumption in about 10% of nearly 500 test bias assessments. Here,
DeShon and Alexander's (1996) rule of thumb indicated that 22.7% (5 of 22) of the
tests analyzed violated the assumption, while Bartlett's test indicated that 77.3% (17
of 22) were in violation. The obvious question is, which is correct? Earlier,
Bartlett's test was described as the tool of choice in comparison to other statistical
tests of error variance homogeneity. However, DeShon and Alexander (1996)
derived the empirical rule of thumb because Bartlett's test is adversely influenced by
conditions of nonnormality. In the studies above, no normality tests or descriptive
statistics (i.e., skewness or kurtosis) for the sample distribution were reported.
Although normality deviation (especially if present in both the dependent and
independent variables) could explain the disagreement between the two homogeneity
estimates, nonnormality is not easily assessed. Hopkins and Weeks (1990)
described three general methods of assessing normality: (a) omnibus inferential tests
(e.g., chi-square goodness of fit and the Shapiro-Wilk test), (b) descriptive measures
and inferential tests of skewness and kurtosis, and (c) examination of graphical
representations of data distribution.
Type I error rates for Bartlett's test were found to be inflated to .10 from a
nominal alpha of .05 for a highly nonnormal distribution (i.e., skewness ($\gamma_1$) = .90,
kurtosis ($\gamma_2$) = .27) (Hopkins & Weeks, 1990). Similarly, for small samples ($n_k$ = 3,
5, and 10) with more skewness and kurtosis ($\gamma_1$ = 1.0, $\gamma_2$ = 1.2), Type I error rates
were inflated to .072, .079 and .102, respectively (Weston & Hopkins, 1998). These
findings provide a general sense of the importance of normality to Bartlett's test.
However, the methods for the omnibus normality tests and calculation of $\gamma_1$ and $\gamma_2$
require either raw data or several additional descriptive statistics from the samples
(e.g., z scores, modes, medians). Considering this, and the fact that precise
guidelines for acceptable ranges for these parameters are not currently available,
normality assessment was not included in the Java program developed here.
However, Hopkins and Weeks (1990) also pointed out several techniques for
evaluating nonnormality using common statistical packages (e.g., SAS, SPSS,
BMDP, and SYSTAT). Since Micceri (1989) contended that true normality exists
only rarely in most research applications, reliance on Bartlett's test alone without
some ancillary assessment of normality is ambiguous to say the least. DeShon and
Alexander's rule of thumb was derived from thousands of simulations where the
effects on F tests (used in MMR) were examined. Thus, it seems that one can be quite
sure that error variances are heterogeneous when both of these methods indicate this
(i.e., the ratio is higher than 1 : 1.5, and Bartlett's test is statistically significant), or
when the rule of thumb is not met. Judging from the results here, reliance on
Bartlett's test alone would be the most conservative interpretation, especially when
we consider the effect of error variance heterogeneity on slope difference detection
in these analyses.
Contradictory results for MMR and the James's/Alexander's analysis
occurred in only 2 of the 22 tests analyzed. These two tests were not evaluated in
clear conditions of heterogeneous error variance; only Bartlett's test was significant.
Does this mean that heterogeneity of error variance does not matter? Drawing such a
conclusion from a small sample of tests would be as erroneous as validating a test using
22 subjects. One could also argue that the only researchers who reported
detailed subgroup statistics also had superior designs to others whose studies could not be
analyzed. Nonetheless, one obviously cannot conclude that MMR conducted in
conditions of heterogeneous error variance will always result in incorrect inferences.
MMR and the alternative statistics led to equivalent conclusions (of no test bias) in
the four clearly heterogeneous cases.
The more important consideration, however, is the main difference between
MMR and these alternative tests. That is, MMR is a test for both significantly
different slopes and intercepts, whereas the alternative tests examine only slope
differences. The result of the analysis of Halpin et al. (1990) is a good illustration of
this crucial difference. Recall that MMR analysis indicated the test was biased, but
the alternative statistics were not significant. Figure 5 (below) is an illustration
similar to Halpin et al.'s of how a test can be biased when slopes are equal. Note that
in this illustration, a common line that represents the prediction equation for both
subgroups is inadequate when the subgroup equations differ dramatically in
intercept. In this case, the same score (X) will always be an underprediction for
males, and an overprediction for females compared to using the (impermissible)
separate subgroup prediction equations. Thus, although MMR can suffer in
heterogeneous error variance and other conditions (e.g., range restriction, sample
size, multicollinearity), it can be more sensitive to test bias when one includes
intercept differences as evidence of bias. Although this is not a currently required
standard, a test that is equal across subgroups in terms of slope and intercept is
defensible at a higher level than those that have equal subgroup slopes alone are.
Therefore, it is important for test researchers and evaluators to understand the
strengths and weaknesses of MMR. This includes knowing that MMR can lead to
incorrect conclusions in the conditions of heterogeneous error variance, but not
[...] some conditions, MMR is the superior method of analysis. When
[making] statistical inference regarding the entire population, one can never prove
the null hypothesis (i.e., that there is no test bias); one can only evaluate if the
[evidence] supports that conclusion. The program developed here is only one of many
[tools to aid the] researcher in making these informed judgments.
Figure 5. Illustration of Gender-Based Test Bias Due to Intercept Differences
(lines shown: males, females, common line)
65
Utility of the Program
The Java program developed here is not intended to be a one-stop test bias
assessment tool. Other researchers have examined additional artifacts that affect
MMR analysis and developed computer programs to assess statistical power in
various conditions (e.g., Aguinis, Boik, & Pierce, 1998; Aguinis, Bommer, & Pierce,
1996; Aguinis & Pierce, in press (b); Aguinis, Pierce, & Stone-Romero, 1994).
However, the reason for developing this program was to combine some related
analysis options that are not easily accessed elsewhere to be used in conjunction with
other techniques. Unless test evaluators have SAS and DeShon and Alexander's
program, or the wherewithal to author their own program, this is the only currently
available way to compute James's and Alexander's statistics without complex,
error-prone hand calculations. Java programming, accompanied by extensive accuracy
checks for all the statistics, enables users to be confident of the mathematical results
they obtain from many common personal computer platforms. Additionally, this
program can very easily be added to a network or intranet in large organizations for
easy access, which is a strength of Java. In fact, initial trial versions of the program
were uploaded with minimal effort to a World Wide Web site and tested from several
computers using different web browsers and operating systems. Anyone with basic
web-authoring skills could do the same. However, no Internet experience is required
to run the stand-alone version.
Recommendations for Future Research
Certainly, this research has not answered all the important questions
regarding the assessment of test bias. In fact, other worthy approaches to such
assessment have been offered. One such alternative to traditional statistical
significance testing involves the bootstrap. Instead of relying upon theoretical
assumptions to derive sampling distributions for testing, bootstrapping uses a
technique called resampling to empirically estimate these distributions (Fan &
Jacoby, 1995). Through this process, as Wilcox (1997) pointed out, the homogeneity
of error variance assumption can be set aside. In fact, Wilcox has derived the
equations to bootstrap Alexander's test for comparing means and briefly compared
the error rates of alternative bootstrapping algorithms, each with situational strengths
and weaknesses. Additionally, Fan and Jacoby (1995) wrote an SAS executable program
that uses bootstrap techniques to conduct linear regression, which could also be
replicated for more widespread use and applied to MMR. Although some
researchers have embraced bootstrapping techniques since the late 1970s, the major
statistical software packages (e.g., SAS, SPSS, BMDP) have not yet incorporated
them because of concerns about the underlying theory of the method (Fan & Jacoby,
1995). Nonetheless, further research and simulation studies that compare bootstrap
methods to traditional methods could serve to render this entire thesis moot.
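The resampling idea described above can be illustrated with a minimal Java sketch. It assumes a simple percentile bootstrap of the difference between two subgroup slopes, computed from raw (x, y) pairs. It is not Fan and Jacoby's SAS program or any of Wilcox's algorithms, and the class and method names (SlopeBootstrapSketch, percentileInterval, and so on) are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: percentile bootstrap for the difference between two subgroup slopes.
final class SlopeBootstrapSketch {

    // Ordinary least-squares slope of y on x.
    static double slope(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        return sxy / sxx;
    }

    // Slope computed on a resample of (x, y) pairs drawn with replacement.
    static double resampledSlope(double[] x, double[] y, Random rng) {
        int n = x.length;
        double[] bx = new double[n];
        double[] by = new double[n];
        for (int i = 0; i < n; i++) {
            int pick = rng.nextInt(n);
            bx[i] = x[pick];
            by[i] = y[pick];
        }
        return slope(bx, by);
    }

    // Returns {lower, upper} percentile limits for slope(group 1) - slope(group 2).
    static double[] percentileInterval(double[] x1, double[] y1, double[] x2, double[] y2,
                                       int reps, double alpha, long seed) {
        Random rng = new Random(seed);
        double[] diffs = new double[reps];
        for (int r = 0; r < reps; r++) {
            diffs[r] = resampledSlope(x1, y1, rng) - resampledSlope(x2, y2, rng);
        }
        Arrays.sort(diffs);
        int lo = (int) Math.floor((alpha / 2) * (reps - 1));
        int hi = (int) Math.ceil((1 - alpha / 2) * (reps - 1));
        return new double[] { diffs[lo], diffs[hi] };
    }
}
```

Under this sketch, an interval that excludes zero would suggest different subgroup slopes without appealing to equal error variances. Because each subgroup is resampled separately, no pooled error term is needed.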
If bootstrapping does not solve these problems, there will still be important
areas to address in future research to improve test bias assessment. Concerning the
methods promoted here, one important piece is the assessment of normality and its
effect on Bartlett's and the alternative tests. Many researchers may not be
comfortable with a rule of thumb in making important inferences that can potentially
affect many people. Others may be uneasy about the conflicting results due to this
ambiguity. Clearly, Monte Carlo analysis could provide some useful information for
interpreting conflicting results (by providing additional empirical guidelines), such as
the effect nonnormality has on Bartlett's test versus DeShon and Alexander's (1996)
rule of thumb.
Finally, the issues raised in this thesis should advance concern regarding the
documentation of test bias studies. Failing to report assessment of the homogeneity
of error variance (or at least the descriptive statistics that are necessary to do so) may
imply incomplete consideration of some important factors. This is not to say that
journal editors are now ethically (and certainly not legally) obligated to include
lengthy additional tables of data in every test bias study, but that researchers should
demonstrate awareness of this limitation of MMR analysis, and report what was done
to address that limitation. Such a proactive approach to test bias evaluation would
demonstrate professionalism and forward thinking. This is what most do with other
important limitations, and now researchers can simply say they used this Java
program to assess the assumption, and report the results.
APPENDIX A:
TABLES
TABLE 2
Summary of Studies Using MMR to Assess Test Bias in the Journal of Applied Psychology, Personnel Psychology,
and Educational and Psychological Measurement (January 1987 to March 1998).
# | Author(s) | Yr | Jnl (a) | Criterion | Predictors | Moderator (b) | Significant? *
1 | Weekley & Jones | 97 | PP | performance | video test (empirical) | gender (m=383, f=29) | n
 | | | | performance | video test (empirical) | race (AA=223, W=322) | n
 | | | | performance | video test (rational) | gender (m=383, f=29) | n
 | | | | performance | video test (rational) | race (AA=223, W=322) | n
2 | Harville | 96 | EPM | performance (radio operator) | AFQT | gender (m=79, f=48) | n
 | | | | performance (radio operator) | AFQT | race (AA=35, W=81) | n
 | | | | performance (personnel) | AFQT | gender (m=99, f=80) | n
 | | | | performance (personnel) | AFQT | race (AA=53, W=106) | y
 | | | | performance (aircrew) | AFQT | gender (m=141, f=31) | n
 | | | | performance (aircrew) | AFQT | race (AA=40, W=112) | n
3 | Pulakos & Schmitt | 95 | PP | performance | interview | gender (m=335, f=129) | n
 | | | | performance | interview | race (AA=100, H=97, W=259) | n
4 | Qualls & Ansley | 95 | EPM | College GPA (1st year) | ITED (grade 9)** | gender (m=401, f=409) | n
 | | | | | ITED (grade 10) | gender (m=410, f=389) | n
 | | | | | ITED (grade 11) | gender (m=461, f=454) | n
 | | | | | ITED (grade 12) | gender (m=317, f=312) | n
 | | | | | ITBS (grade 4) | gender (m=415, f=406) | n
 | | | | | ITBS (grade 6) | gender (m=427, f=421) | n
 | | | | | ITBS (grade 8) | gender (m=448, f=443) | n
 | | | | | ACT | gender (m=539, f=526) | n
5 | Young | 94 | EPM | College GPA (1st year) | SAT (verbal) | gender (m=1659, f=2044) | y
 | | | | | SAT (verbal) | ethnicity (ASA=186, AA=211, H=70, PR=70, W=3166) | y
 | | | | | SAT (math) | gender (m=1659, f=2044) | y
 | | | | | SAT (math) | ethnicity (ASA=186, AA=211, H=70, PR=70, W=3166) | y
 | | | | | High School Class Rank | gender (m=1659, f=2044) | y
 | | | | | High School Class Rank | ethnicity (ASA=186, AA=211, H=70, PR=70, W=3166) | n
6 | Kirchner | 93 | EPM | College GPA (1st year) | GRE | gender () | n
7 | Sackett et al. | 91 | JAP | performance | many | race (n's not reported) | n
 | | | | performance | many | gender (m=223, f=160) | n
8 | Waldman & Avolio | 91 | JAP | performance | many | ratee race, rater race (n's not reported) | n
9 | Halpin et al. | 90 | EPM | WISC-R | PPVT | race (AA=26, W=49) | n
10 | Hattrup & Schmitt | 90 | PP | weighted TPM | 2 aptitude measures | gender (m=267, f=30) | n
 | | | | unweighted TPM | 2 aptitude measures | gender (m=267, f=30) | n
 | | | | weighted TPM | 2 aptitude measures | race (AA=50, W=245) | n
 | | | | unweighted TPM | 2 aptitude measures | race (AA=50, W=245) | n
 | | | | weighted TPM | alternate test battery | gender (m=267, f=30) | n
 | | | | unweighted TPM | alternate test battery | gender (m=267, f=30) | n
 | | | | weighted TPM | alternate test battery | race (AA=50, W=245) | n
 | | | | unweighted TPM | alternate test battery | race (AA=50, W=245) | n
11 | Campion et al. | 88 | PP | performance appraisal | 4 tests | gender (m=119, f=30) | n
 | | | | performance appraisal | 4 tests | race (AA=56, W=93) | n
12 | Latack et al. | 87 | JAP | performance | many | gender (m=338, f=41) | y
13 | Stone & Stone | 87 | JAP | success | info mgt strategy, job responsibility | race (n's not reported) | y
 | | | | qualification | info mgt strategy, job responsibility | race (n's not reported) | y
14 | Zeidner | 87 | EPM | GPA | composite aptitude scale | age group (18-21=288, 22-25=314, 26-29=72, 30+=121) | y
15 | Crane et al. | 87 | EPM | SNGPA | High School GPA | ethnicity (ASA=20, AA=34, H=67, W=272) | y
 | | | | | Prerequisite GPA | ethnicity (ASA=20, AA=34, H=67, W=272) | n
 | | | | | CA achievement tests | ethnicity (ASA=20, AA=34, H=67, W=272) | y
 | | | | NCLEX-RN | High School GPA | ethnicity (ASA=20, AA=34, H=67, W=272) | y
 | | | | | Prerequisite GPA | ethnicity (ASA=20, AA=34, H=67, W=272) | n
 | | | | | CA achievement tests | ethnicity (ASA=20, AA=34, H=67, W=272) | y
(a) JAP = Journal of Applied Psychology; PP = Personnel Psychology; EPM = Educational and Psychological Measurement
(b) AA = African American; H = Hispanic; PR = Puerto Rican; ASA = Asian; W = White; m = male; f = female
* y = p < .05
APPENDIX B:
CALCULATIONS FOR BARTLETT'S
TEST
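As presented in Gartside (1972), Bartlett's statistic, M, is approximately
distributed as chi-square with $k - 1$ degrees of freedom when $(n_j - 1) > 3$. Given
the following notation,
$k$ = number of subgroups
$n_j$ = number of observations in each subgroup
$s_j^2$ = subgroup variance on the criterion
$v_j$ = degrees of freedom on which $s_j^2$ is based
Bartlett's statistic is computed as:
$$M = \frac{\left(\sum_{j} v_j\right)\log_e\!\left(\sum_{j} v_j s_j^2 \Big/ \sum_{j} v_j\right) - \sum_{j} v_j \log_e s_j^2}{1 + \dfrac{1}{3(k-1)}\left(\sum_{j}\dfrac{1}{v_j} - \dfrac{1}{\sum_{j} v_j}\right)}$$
for unconditional subgroup variances; substituting $\hat{\sigma}^2_{e(j)}$ (estimated as
described in Chapter 3, Check for Heterogeneity of Error Variance, Computations) for
$s_j^2$ yields:
$$M = \frac{\left(\sum_{j} v_j\right)\log_e\!\left(\sum_{j} v_j \hat{\sigma}^2_{e(j)} \Big/ \sum_{j} v_j\right) - \sum_{j} v_j \log_e \hat{\sigma}^2_{e(j)}}{1 + \dfrac{1}{3(k-1)}\left(\sum_{j}\dfrac{1}{v_j} - \dfrac{1}{\sum_{j} v_j}\right)}$$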
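As an illustration only, the computation of M can be sketched in Java. The sketch assumes that the subgroup error variance is estimated from summary statistics as $s_{Y}^2(1 - r_{XY}^2)$ and that the caller supplies the degrees of freedom. Both are assumptions made here for illustration, and BartlettSketch is a hypothetical class, not part of the program listed in Appendix E.

```java
// Hypothetical sketch of Bartlett's M for k subgroup variances.
final class BartlettSketch {

    // variances[j]: subgroup (error) variance; df[j]: degrees of freedom behind it.
    static double bartlettM(double[] variances, double[] df) {
        int k = variances.length;
        double sumDf = 0, sumDfVar = 0, sumDfLogVar = 0, sumInvDf = 0;
        for (int j = 0; j < k; j++) {
            sumDf += df[j];
            sumDfVar += df[j] * variances[j];
            sumDfLogVar += df[j] * Math.log(variances[j]);
            sumInvDf += 1.0 / df[j];
        }
        // Numerator: pooled log variance versus the weighted sum of log variances.
        double numerator = sumDf * Math.log(sumDfVar / sumDf) - sumDfLogVar;
        // Denominator: small-sample correction factor.
        double correction = 1.0 + (sumInvDf - 1.0 / sumDf) / (3.0 * (k - 1));
        return numerator / correction;
    }

    // Assumed estimator of a subgroup error variance from summary statistics.
    static double errorVariance(double sy, double rxy) {
        return sy * sy * (1.0 - rxy * rxy);
    }
}
```

M is then referred to a chi-square distribution with k - 1 degrees of freedom. Equal subgroup variances give M = 0, and M grows as the variances diverge. Because the degrees-of-freedom convention is supplied by the caller, values from this sketch need not agree to the last digit with output from the thesis program.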
APPENDIX C:
JAMES'S SECOND-ORDER
APPROXIMATION: CALCULATION OF
THE J STATISTIC
To test for different subgroup slopes, U (also referred to as J in some
references) is first calculated according to the equation (cf. Alexander & Govern,
1994; DeShon & Alexander, 1994a, 1996):
$$U = \sum_{j=1}^{k} \frac{(b_j - b^{+})^2}{S_{b_j}^2},$$
which is calculated from the following steps:
1. Determine the squared standard error ($S_{b_j}^2$), where
$$S_{b_j}^2 = \frac{(1 - r_j^2)\,S_{Y_j}^2}{(n_j - 2)\,S_{X_j}^2}$$
2. Define a weight for each regression weight (i.e., $b_j$) such that $\sum_{j} w_j = 1$:
$$w_j = \frac{1/S_{b_j}^2}{\sum_{j'=1}^{k} 1/S_{b_{j'}}^2}$$
3. The variance-weighted estimate of the common regression slope ($b^{+}$) then
becomes:
$$b^{+} = \sum_{j=1}^{k} w_j b_j$$
4. Compute U as indicated above.
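The four steps above can be sketched in Java from subgroup summary statistics. The sketch assumes each subgroup slope is obtained as $b_j = r_j S_{Y_j} / S_{X_j}$, and JamesUSketch with its method names is hypothetical rather than the code of the thesis program.

```java
// Hypothetical sketch of steps 1-4: James's U from subgroup summary statistics.
final class JamesUSketch {

    // Step 1: squared standard error of a subgroup slope.
    static double squaredStdError(double n, double sx, double sy, double rxy) {
        return (1.0 - rxy * rxy) * sy * sy / ((n - 2.0) * sx * sx);
    }

    // Step 2: weights proportional to 1 / S^2, normalized so they sum to 1.
    static double[] weights(double[] seSq) {
        double total = 0;
        for (double s : seSq) total += 1.0 / s;
        double[] w = new double[seSq.length];
        for (int j = 0; j < seSq.length; j++) w[j] = (1.0 / seSq[j]) / total;
        return w;
    }

    // Steps 3 and 4: common slope b+ and the U statistic.
    static double uStatistic(double[] n, double[] sx, double[] sy, double[] rxy) {
        int k = n.length;
        double[] b = new double[k];
        double[] seSq = new double[k];
        for (int j = 0; j < k; j++) {
            b[j] = rxy[j] * sy[j] / sx[j];   // subgroup slope (assumed formula)
            seSq[j] = squaredStdError(n[j], sx[j], sy[j], rxy[j]);
        }
        double[] w = weights(seSq);
        double bPlus = 0;
        for (int j = 0; j < k; j++) bPlus += w[j] * b[j];
        double u = 0;
        for (int j = 0; j < k; j++) u += (b[j] - bPlus) * (b[j] - bPlus) / seSq[j];
        return u;
    }
}
```

With the two subgroups shown in the sample output of Appendix E (n = 5555 and 333, sx = 5.0 and 3.0, sy = 5.0 and 9.0, rxy = 0.5 and 0.3), uStatistic returns approximately 6.4313, which agrees with the U value reported there.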
Next, the adjusted critical value is determined from c, the critical value of a
chi-square distribution with $k - 1$ degrees of freedom at level $\alpha$, as follows:
1. Let $v_j = n_j - 2$.
2. $R_{st} = \sum_{j=1}^{k} \dfrac{w_j^{t}}{v_j^{s}}$ (note that values for s and t are only those required in the
calculation of $h(\alpha)$ below).
3. $\chi_{2s} = \dfrac{c^{s}}{\prod_{q=1}^{s}(k + 2q - 3)}$, where $\prod_{q=1}^{s}$ denotes the product of each term from
1 to s (note that s represents the multiplier to provide the values of $\chi_{2s}$;
i.e., for $\chi_2, \chi_4, \chi_6, \chi_8$, s is 1, 2, 3, 4, respectively).
4. $T = \sum_{j=1}^{k} \dfrac{(1 - w_j)^2}{v_j}$
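5. $h(\alpha)$, adjusted for infinite degrees of freedom, is calculated by:
$$
\begin{aligned}
h(\alpha) = c &+ \tfrac{1}{2}(3\chi_4 + \chi_2)\,T + \tfrac{1}{16}(3\chi_4 + \chi_2)^2\left(1 - \frac{k-3}{c}\right)T^2 \\
&+ \tfrac{1}{2}(3\chi_4 + \chi_2)\Big[(8R_{23} - 10R_{22} + 4R_{21} - 6R_{12}^2 + 8R_{12}R_{11} - 4R_{11}^2) \\
&\qquad + (2R_{23} - 4R_{22} + 2R_{21} - 2R_{12}^2 + 4R_{12}R_{11} - 2R_{11}^2)(\chi_2 - 1) \\
&\qquad + \tfrac{1}{4}(-R_{12}^2 + 4R_{12}R_{11} - 2R_{12}R_{10} - 4R_{11}^2 + 4R_{11}R_{10} - R_{10}^2)(3\chi_4 - 2\chi_2 - 1)\Big] \\
&+ (R_{23} - 3R_{22} + 3R_{21} - R_{20})(5\chi_6 + 2\chi_4 + \chi_2) \\
&+ \tfrac{3}{16}(R_{12}^2 - 4R_{23} + 6R_{22} - 4R_{21} + R_{20})(35\chi_8 + 15\chi_6 + 9\chi_4 + 5\chi_2) \\
&+ \tfrac{1}{16}(-2R_{22} + 4R_{21} - R_{20} + 2R_{12}R_{10} - 4R_{11}R_{10} + R_{10}^2)(9\chi_8 - 3\chi_6 - 5\chi_4 - \chi_2) \\
&+ \tfrac{1}{4}(-R_{22} + R_{11}^2)(27\chi_8 + 3\chi_6 + \chi_4 + \chi_2) \\
&+ \tfrac{1}{4}(R_{23} - R_{12}R_{11})(45\chi_8 + 9\chi_6 + 7\chi_4 + 3\chi_2)
\end{aligned}
$$
The null hypothesis (i.e., $H_0\colon \beta_1 = \dots = \beta_k$) is rejected when $U > h(\alpha)$.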
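Because the expression for h(alpha) is long, a transcription into code may help with checking it term by term. The Java sketch below assumes the caller supplies the normalized weights, the degrees of freedom (n - 2 for each subgroup), and the tabled chi-square critical value c. JamesCriticalValueSketch is a hypothetical class written for illustration, not the implementation in DialogWindow2.java.

```java
// Hypothetical sketch of James's second-order adjusted critical value h(alpha).
final class JamesCriticalValueSketch {

    // R_st = sum over subgroups of w_j^t / v_j^s
    static double r(double[] w, double[] v, int s, int t) {
        double sum = 0;
        for (int j = 0; j < w.length; j++) sum += Math.pow(w[j], t) / Math.pow(v[j], s);
        return sum;
    }

    // chi_{2s} = c^s / product over q = 1..s of (k + 2q - 3); requires k >= 2
    static double chi(double c, int k, int s) {
        double prod = 1;
        for (int q = 1; q <= s; q++) prod *= (k + 2 * q - 3);
        return Math.pow(c, s) / prod;
    }

    // w: normalized weights; v: degrees of freedom; c: chi-square critical value (k - 1 df)
    static double adjustedCriticalValue(double[] w, double[] v, double c) {
        int k = w.length;
        double x2 = chi(c, k, 1), x4 = chi(c, k, 2), x6 = chi(c, k, 3), x8 = chi(c, k, 4);
        double r10 = r(w, v, 1, 0), r11 = r(w, v, 1, 1), r12 = r(w, v, 1, 2);
        double r20 = r(w, v, 2, 0), r21 = r(w, v, 2, 1), r22 = r(w, v, 2, 2), r23 = r(w, v, 2, 3);
        double t = 0;
        for (int j = 0; j < k; j++) t += (1 - w[j]) * (1 - w[j]) / v[j];
        double a = 3 * x4 + x2;
        return c
            + 0.5 * a * t
            + (1.0 / 16) * a * a * (1 - (k - 3) / c) * t * t
            + 0.5 * a * ((8 * r23 - 10 * r22 + 4 * r21 - 6 * r12 * r12 + 8 * r12 * r11 - 4 * r11 * r11)
                + (2 * r23 - 4 * r22 + 2 * r21 - 2 * r12 * r12 + 4 * r12 * r11 - 2 * r11 * r11) * (x2 - 1)
                + 0.25 * (-r12 * r12 + 4 * r12 * r11 - 2 * r12 * r10 - 4 * r11 * r11
                          + 4 * r11 * r10 - r10 * r10) * (3 * x4 - 2 * x2 - 1))
            + (r23 - 3 * r22 + 3 * r21 - r20) * (5 * x6 + 2 * x4 + x2)
            + (3.0 / 16) * (r12 * r12 - 4 * r23 + 6 * r22 - 4 * r21 + r20)
                * (35 * x8 + 15 * x6 + 9 * x4 + 5 * x2)
            + (1.0 / 16) * (-2 * r22 + 4 * r21 - r20 + 2 * r12 * r10 - 4 * r11 * r10 + r10 * r10)
                * (9 * x8 - 3 * x6 - 5 * x4 - x2)
            + 0.25 * (-r22 + r11 * r11) * (27 * x8 + 3 * x6 + x4 + x2)
            + 0.25 * (r23 - r12 * r11) * (45 * x8 + 9 * x6 + 7 * x4 + 3 * x2);
    }
}
```

As the degrees of freedom grow, every correction term shrinks and h(alpha) approaches c, which is a convenient sanity check. For two subgroups at alpha = .05, c is the familiar tabled value of about 3.84. The result depends on the precision of the c supplied, so small differences from program output in the later decimals are to be expected.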
APPENDIX D:
ALEXANDER ET AL.'S NORMALIZED-T
APPROXIMATION: CALCULATION OF
THE A STATISTIC
Calculation of the A statistic is similar to U (or J) described in Appendix C.
To test for differences in slopes of an interaction coefficient, the A statistic is
calculated as follows and referenced to the chi-square distribution with $(k - 1)$
degrees of freedom:
$$A = \sum_{j=1}^{k} z_j^2,$$
where the following calculation steps are required:
1. Determine the squared standard error ($S_{b_j}^2$), define a weight for each
regression weight (i.e., $b_j$), and determine the variance-weighted estimate of the
common regression slope ($b^{+}$) as in steps 1-3 in Appendix C.
2. Define a one-sample t statistic for each subgroup, where:
$$t_j = \frac{b_j - b^{+}}{S_{b_j}}$$
3. Square each t statistic and transform it by calculating $z_j$, where:
$$z_j = c + \frac{c^3 + 3c}{b} - \frac{4c^7 + 33c^5 + 240c^3 + 855c}{10b^2 + 8bc^4 + 1000b},$$
and $a = v_j - 0.5$; $b = 48a^2$; $c = \sqrt{a \ln(1 + t_j^2 / v_j)}$, where $v_j = n_j - 2$.
4. Square these $z_j$ values and sum them to determine A (as initially shown), and
reference A to the chi-square distribution with $k - 1$ degrees of freedom.
5. The null hypothesis (i.e., $H_0\colon \beta_1 = \dots = \beta_k$) is rejected when A exceeds the
critical value of that chi-square distribution at level $\alpha$.
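The steps above can be sketched in Java as follows. The sketch assumes subgroup slopes of the form r * sy / sx and takes the one-sample t statistic to be the deviation of each slope from the common slope divided by its standard error. AlexanderASketch and its method names are hypothetical and are not taken from the thesis program.

```java
// Hypothetical sketch of Alexander's normalized-t statistic A from summary statistics.
final class AlexanderASketch {

    // Step 3: normalizing transformation of a t statistic with v degrees of freedom.
    static double normalize(double t, double v) {
        double a = v - 0.5;
        double b = 48.0 * a * a;
        double c = Math.sqrt(a * Math.log(1.0 + t * t / v));
        return c + (Math.pow(c, 3) + 3 * c) / b
                 - (4 * Math.pow(c, 7) + 33 * Math.pow(c, 5) + 240 * Math.pow(c, 3) + 855 * c)
                   / (10 * b * b + 8 * b * Math.pow(c, 4) + 1000 * b);
    }

    static double aStatistic(double[] n, double[] sx, double[] sy, double[] rxy) {
        int k = n.length;
        double[] b = new double[k];
        double[] seSq = new double[k];
        double sumInv = 0;
        for (int j = 0; j < k; j++) {
            b[j] = rxy[j] * sy[j] / sx[j];                     // subgroup slope (assumed formula)
            seSq[j] = (1.0 - rxy[j] * rxy[j]) * sy[j] * sy[j]
                      / ((n[j] - 2.0) * sx[j] * sx[j]);        // squared standard error
            sumInv += 1.0 / seSq[j];
        }
        double bPlus = 0;                                      // variance-weighted common slope
        for (int j = 0; j < k; j++) bPlus += ((1.0 / seSq[j]) / sumInv) * b[j];
        double aStat = 0;
        for (int j = 0; j < k; j++) {
            double t = (b[j] - bPlus) / Math.sqrt(seSq[j]);    // step 2
            double z = normalize(t, n[j] - 2.0);               // step 3
            aStat += z * z;                                    // step 4
        }
        return aStat;
    }
}
```

For the two subgroups in the sample output of Appendix E, this sketch gives A of about 6.36, in line with the value reported there. Since only z squared enters A, the sign of t can be ignored in the transformation.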
APPENDIX E:
PROGRAM PACKAGE DESCRIPTION
AND CODE
This appendix outlines the necessary files for the two program versions, and
presents sample output for the stand-alone version. The files listed for each version
are included on a diskette for distribution, or can be downloaded from the Internet at:
http://members.aol.com/imsap/fitp.html. Following this outline is the Java code that
is compiled to execute the stand-alone version. The browser version is identical
except for the code used to prompt the user and save the output as a text file.
Browser Version
The program files included in the browser version are listed in Table 7,
below. For distribution, all files except the text files are contained in a
self-extracting zip file named SetupTestBias.exe, and require 253 KB of storage space on
a floppy diskette.
TABLE 7
Browser Package Contents
File Name(s) | Size | Description
testbias.html | 1 KB | The file that contains the command to execute the JAVA class files that contain the program.
api.html | 3 KB | The main web browser file to navigate to other program documentation pages.
contact.html | 2 KB | Supporting browser page that provides my email and postal address.
Bartlett.html | 7 KB | Supporting browser page that explains Bartlett's statistic and purpose.
Homo.html | 6 KB | Supporting browser page that explains the homogeneity of error variance assumption.
Input.html | 4 KB | Supporting browser page that describes the input required from the program user.
James.html | 12 KB | Supporting browser page that explains James's and Alexander's statistics and purpose.
MMR.html | 6 KB | Supporting browser page that reviews moderated multiple regression.
refs.html | 15 KB | Supporting browser page that lists references used for the program development.
system.html | 4 KB | Supporting browser page that outlines hardware and software requirements to execute the program (as well as web sites to obtain some programs).
DialogData.class | 1 KB | Compiled JAVA code that retrieves user input for use in the main program (DialogWindow2.class).
DialogLayout.class | 5 KB | Compiled JAVA code that defines text and graphic positioning in the first input window.
DialogWindow2.class | 15 KB | The main compiled JAVA program code that contains all calls to other classes, statistical calculations, and output formulation.
DialogWindow2Frame.class | 1 KB | Compiled JAVA code to create a frame to contain the main dialog of the program.
OurFrame.class | 2 KB | Compiled JAVA code that defines variables and graphics for the first input frame.
Subgroup1Frame.class | 3 KB | Compiled JAVA code to create a frame to contain the subgroup input dialog.
Subgroupdata.class | 1 KB | Compiled JAVA code to retrieve user-input subgroup data.
Subgroupdata2.class | 1 KB | Compiled JAVA code to convert user-input text to numbers that can be used for calculations.
Subgroup1.class | 3 KB | Compiled JAVA code that defines variables and graphics for the subgroup input frames.
4 help*.class files | 6 KB | Compiled JAVA code to execute help windows associated with input screens.
187 files with .gif extension | 306 KB | Graphic files that support the .html pages above containing nontext items (e.g., equations, Greek symbols).
Install.txt | 1 KB | A text file to instruct the user how to execute the program after installation.
msg.txt | 234 B | A message displayed when the user begins file extraction.
Readme.txt | 633 B | Brief instructions for installing the program from the self-extracting zip file (TestBias.exe).
SetupTestBias.exe | 251 KB | The self-extracting zip file containing all of the nontext files listed above.
Stand-Alone Version
Figure 6. Sample Output
The data you input were...
Subgroup n sx sy rxy
1 5555 5.0 5.0 0.5
2 333 3.0 9.0 0.3
Homogeneity of Subgroup Error Variance Information
DeShon & Alexander's rule of thumb: is NOT met. (The
Error Variance ratio is 1:3.93)
Bartlett's Test indicates heterogeneous error variance (M
= 446.116, p < .00001).
Since both statistics indicate heterogeneous error
variance, a necessary assumption of MMR is NOT met, and
the alternative statistics below are more accurate
indicators of test bias.
Alternative Statistics
James's Test INDICATES TEST BIAS! (p < 0.05) U = 6.4313,
and U(critical) = 3.8688
Alexander's Test INDICATES TEST BIAS! (A = 6.3607, p =
0.0117)
Thus, both statistics indicate that the regression slopes
for the subgroups are different.
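The error variance ratio reported in Figure 6 can be reproduced from the input summary statistics with a short Java sketch. It assumes the subgroup error variance is computed as sy^2 (1 - rxy^2). The cutoff of 1.5 for DeShon and Alexander's rule of thumb is passed in by the caller and is treated here as an assumption rather than a value taken from the program. ErrorVarianceRatioSketch is a hypothetical class.

```java
// Hypothetical sketch: ratio of largest to smallest subgroup error variance.
final class ErrorVarianceRatioSketch {

    // Assumed estimator of a subgroup error variance from summary statistics.
    static double errorVariance(double sy, double rxy) {
        return sy * sy * (1.0 - rxy * rxy);
    }

    // Largest subgroup error variance divided by the smallest.
    static double ratio(double[] sy, double[] rxy) {
        double min = Double.POSITIVE_INFINITY;
        double max = 0.0;
        for (int j = 0; j < sy.length; j++) {
            double e = errorVariance(sy[j], rxy[j]);
            min = Math.min(min, e);
            max = Math.max(max, e);
        }
        return max / min;
    }

    // The rule of thumb is treated as met when the ratio does not exceed the cutoff.
    static boolean ruleOfThumbMet(double ratio, double cutoff) {
        return ratio <= cutoff;
    }
}
```

For the sample input (sy = 5.0 and 9.0, rxy = 0.5 and 0.3) the error variances are 18.75 and 73.71, so the ratio is about 3.93, matching the 1:3.93 shown in the sample output. With an assumed cutoff of 1.5 the rule of thumb is not met.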
Program Flow and Java Code
Figure 7 (below) shows the overall sequence of the stand-alone
program. The option for saving output as a text file is not included in the browser
version.
Figure 7. Program Flowchart
[Flowchart not reproduced; recoverable step labels: Save textfile; Prompt to exit]
The following is the uncompiled source code (a file named
DialogWindow2.java; the compiled code is named DialogWindow2.class). Note that
lines beginning with // are nonexecutable comments that provide brief
descriptions.
// Standard Java packages
import java.applet.*;
import java.awt.*;
import java.io.*;
import DialogWindow2Frame;
import AutoDialog;
import subgrp;
import help1;
import help2;

public class DialogWindow2 extends Applet
{
    // STANDALONE APPLICATION SUPPORT:
    // m_fStandAlone will be set to true if applet is run standalone
    //
    private Button m_button = new Button("Continue");
    private boolean m_fStandAlone = false;
    private OurFrame m_frame = null;
    private Subgroup1Frame m_frame2 = null;
    private int m_k = 0;
    private double m_alpha = .05;
    // define arrays (size will = m_k) for calculations
    private double[] m_narray;
    private double[] m_sxarray;
    private double[] m_syarray;
    private double[] m_rxyarray;
    private double[] m_barray;
    private double[] m_varray;
    private double[] m_sysqrarray;
    private double[] m_sxsqrarray;
    private double[] m_rxysqrarray;
    private double[] m_errvar;
    private double[] m_stderrsqr;
    private double[] m_invstderrsqr;
    private double[] m_weight;
    private double[][] Rst;
    private double[] m_95chi;
    private double[] m_99chi;
    private double[] m_chi; // array for chi-sub-2s
    private double[] grouparray;
    private double[] ratios;
    private String thumb;
    private String bartlett;
    private String james;
    private String meaning;
    private String homo;
    private String homo2;
    double dEVRatio;
    private double[] m_z;
    private double[] m_t;
    private String alex;
    private String Amean;
    private double[] m_stderr;
    int frameshow = 0;
    int homoflag = 0;
    int altflag = 0;

    public static void main(String args[])
    {
        DialogWindow2Frame frame = new DialogWindow2Frame("Test Bias Assessment Program");
        // Must show Frame before we size it so insets() will return valid values
        //
        frame.show();
        frame.hide();
        frame.resize(frame.insets().left + frame.insets().right + 550,
                     frame.insets().top + frame.insets().bottom + 280);
        DialogWindow2 applet_DialogWindow2 = new DialogWindow2();
        frame.add("Center", applet_DialogWindow2);
        applet_DialogWindow2.m_fStandAlone = true;
        applet_DialogWindow2.init();
        frame.show();
    }

    public DialogWindow2()
    {
    }

    public String getAppletInfo()
    {
        return "Name: DialogWindow2\r\n" +
               "Author: Scott A. Petersen\r\n" +
               "Created with Microsoft Visual J++ Version 1.1";
    }

    public void init()
    {
        resize(550, 380);
        add(m_button);
        // create input1 frame
        if (m_k < 1)
        {
            m_frame = new OurFrame(this);
            m_frame.hide();
        }
    }

    public void destroy()
    {
    }

    int group = 0;
    int groupread = 0;
    double gotdata = 0;

    public void paint(Graphics g)
    {
        // compare (==), not assign (=)
        if (gotdata == 100)
        {
