Comparing student scores on two versions of a practical laboratory examination in medical technology

Material Information

Koneman, Philip A
Publication Date:
Physical Description:
xii, 151 leaves : illustrations ; 29 cm


Subjects / Keywords:
Medicine -- Computer-assisted instruction ( lcsh )
Medicine -- Examinations ( lcsh )
Medicine ( fast )
Medicine -- Computer-assisted instruction ( fast )
Examinations. ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 145-150).
General Note:
Submitted in partial fulfillment of the requirements for the degree, Doctor of Philosophy, Educational Leadership and Innovation.
General Note:
School of Education and Human Development
Statement of Responsibility:
by Philip A. Koneman.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
37296518 ( OCLC )
LD1190.E3 1996d .K66 ( lcc )

Full Text
Philip A. Koneman
B.A., University of Colorado at Boulder, 1982
M.A., Denver Seminary, 1990
A thesis submitted to the
Faculty of the Graduate School of the University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Educational Leadership and Innovation

© 1996 by Philip A. Koneman
All rights reserved.

This thesis for the Doctor of Philosophy
degree by
Philip A. Koneman
has been approved
Brent G. Wilson
Laura D Goodwin
Richard Marlar

Koneman, Philip Alan (Ph.D., Educational Leadership and Innovation)
Comparing Student Scores On Two Versions Of A Practical Laboratory Examination
In Medical Technology
Thesis directed by Associate Professor Brent G. Wilson
Recent advances in computer technology have brought about new possibilities for
computer-based assessment. As a result, traditional conceptions of test reliability and
validity have been challenged. Educators who wish to develop computer-based
assessments need a strategy for estimating examination reliability and validity, and a
method for determining how student scores will compare to those resulting from more
traditional forms of assessment.
This study compares scores from medical technology students on two versions of a
practical examination. One examination is a traditional wet bench exam conducted
in the laboratory. The other is a computer-based version of the same examination that
uses high-resolution color photographs and multi-part test items. After taking both
exams, students completed a brief survey assessing their concerns about the
computer-administered version.
Students were randomly assigned to one of two groups. A 2 x 2 x 2 repeated measures
split-plot design was used to compare differences among group means. To counter a
possible carry-over effect, order of examination was introduced as a between-subjects
factor. Due to indications that gender bias may occur in computer-based tests, gender
was also introduced as a second between-subjects factor. These additional factors also
increase the statistical power of the procedure, which is jeopardized due to a small
number of subjects (17).
The analysis of variance procedure tested seven hypotheses. Statistically significant
between-subjects and within-subjects main effects were reported at the 0.05 and 0.01
level (alpha = 0.05), leading to a rejection of three of the hypotheses.

The item analysis and survey data were used to explain the within-groups and
between-groups differences. In this study, environmental, interface design, and image
fidelity issues all contributed to the mean differences. Recommendations are made for
improving the measure, and suggestions for further research are included.
This abstract accurately represents the content of the candidate's dissertation. I
recommend its publication.
Brent G. Wilson

1. INTRODUCTION
Statement of the Problem
Theoretical Perspective
Significance of the Study
Limitations of the Study
2. LITERATURE REVIEW
Classical Test Theory
The True Score Model
Kinds of Reliability
Estimating Internal Reliability
Changing Conceptions of Validity
Emerging Technologies: New Directions in Learning and
Four Generations of Computerized Educational
New Approaches to Learning and Assessment: Knowledge Construction and Alternative Assessments
The Impact of New Approaches to Learning Upon
Performance-Based Assessment and Medical Education
Studies About Computer-based Testing in Medicine
3. METHODOLOGY
Program Requirements
The Clinical Microbiology Course
The Content Validity Estimation
Conducting the Content Validation Study
The Practical Examination
Research Design
Administering the Examinations
Administering the Bench Exam
Administering the Computer Exam
Administering the Student Survey
Scoring the Examination
4. RESULTS
Estimating the Content Validity of the Measure
Student Scores
Comparing Mean Scores: Analysis of Variance
Estimating Examination Reliability: Conducting an Item
Designing the Item Analysis Models
Interpreting the Item Analysis Data,
Post-Examination Survey Results
Environmental Issues
Interface Issues
Fidelity Issues
5. DISCUSSION
The Content Validation Study
Interpreting the Results of the Repeated Measures Analysis of Variance: Assumptions, Violations, and Explanations
Assumptions and Violation of Assumptions for the ANOVA Model
Explaining Between-groups and Within-groups Variability
Between-groups Variance as Random Error
Between-groups Variance As Measurement Error
Explaining Within Group Variability
Within-groups Variability: Qualitative Considerations
Examination Equivalence
Recommendations for Developing Computer-Administered Laboratory Practical Examinations
Developing Computer-based Tests: Environmental Issues
Developing Computer-based Tests: Interface Issues
Developing Computer-based Tests: Fidelity Issues
Implications: Recommendations for Further Research
A. BACTERIOLOGY LECTURE OBJECTIVES
AND RATING WORKSHEET
E. COMPUTER EXAMINATION
H. CONTENT VALIDITY DATA
I. CONTENT VALIDITY SUMMARY
L. SURVEY RESULTS
BENCH AND COMPUTER SCORES

I dedicate this thesis to my wife Tanya for her unfaltering support of this and my other
professional endeavors.

My thanks to Dr. Wilson for his oversight of this thesis, and to Dr. Goodwin, for her
assistance in developing the research design used here. I also wish to acknowledge
Christie Grueser and Jill Hartman of the University of Colorado Health Sciences
Center School of Medical Technology for their assistance in conducting this study.

In 1984 the Panel on the General Professional Education of the American
Medical Colleges made critical recommendations for educational change, including
limiting the amount of factual information students must memorize, reducing passive
learning in favor of active, independent learning and problem solving, and applying
computer technology and information science to medical education (Association of
American Medical Colleges, 1984; Piemme, 1988). Although computers have been
used in medical education since the 1960s, their use in medical technology training
programs for knowledge acquisition and assessment has been limited. The availability
of microcomputers with multimedia capabilities is changing how computers are used
for learning and assessment, and numerous advantages of computer-administered tests
have been noted. Test administration by computer has become an affordable
alternative to pencil-and-paper administration, and with the widespread use of
computers in schools, the technology to administer tests by computer is no longer the
domain of the large test publisher (Linn, 1989). Computers are also viewed as
valuable tools for exploring new forms of assessment, such as alternative
assessments, performance-based assessments, and authentic assessment (Reeves &
Okey, 1996). Finally, computers are recognized as tools that open up new

opportunities for assessment that are more consistent with learning activities that
require active participation and the application of knowledge, such as computer-based
simulations. Ultimately, the use of technology in assessment may bring about
educational reform (Sheingold and Frederiksen, 1994).
There are different types of computer assessments, ranging from the mere
administration and scoring of multiple-choice tests by computer to complex and
sophisticated intelligent measurement knowledge-based systems (Bunderson, Inouye,
and Olsen, 1989). Much of the current interest in computers for learning and
assessment is due to recent technological developments associated with
microcomputers. Advances in digital imaging and CBT authoring systems may
encourage the diffusion of computer-based testing on a wider scale, with individual
departments or faculty members developing their own learning and assessment tools.
Kodak Photo CD technology provides a cost-effective means of converting 35mm
transparency images into digital form, which is significant within medical education
for two reasons. First, many medical educators have collections of 35mm
transparencies that number in the thousands. Second, a variety of microcomputer
applications on multiple computer platforms can import Photo CD images for
immediate computer display. Multimedia authoring tools such as Asymetrix
Toolbook, Allegiant SuperCard, and Microsoft Visual Basic have the capability of
combining text, images, graphics, sound, and video clips in non-linear, user-
accessible formats. The interest of such capabilities for training and assessment is

evidenced by the number of commercial products developed recently that support
learning in medical technology. The Anaerobe Educator, The Gram Stain Tutor,
GermWare, and Microbes in Motion are four computer-based learning systems that
combine text, graphics, digitized images, and other multimedia resources into an
integrated, CD-ROM based learning system. Each of these programs also features a
self-assessment feature for student review. The software tools used to create these
programs offer individual faculty members and departments new options for
designing, developing, testing, and implementing new forms of assessment.
The American Society of Clinical Pathologists (ASCP) has recognized the
value of storing digitized medical images on the computer for purposes of assessment.
Since 1994, the ASCP Board of Registry examinations have been administered solely
by computer (Castleberry and Snyder, 1995). Digitized images are displayed on the
computer screen directly above the text of each question. Previously, the examination
was administered in pencil-and-paper format, accompanied by multiple pages of color
plate images. Student scores are not significantly lower on either form of the test,
although there is some concern among examinees concerning the quality of the
digitized images (Examination Committee of the Board of Registry, 1995). The fact
that this and other licensing examinations are now administered by computer
demonstrates an interest in utilizing computer technology for streamlining the testing

Computers have the potential for revolutionizing assessment. Changes in the
power and distribution of computing resources have wrought irreversible changes in
educational assessment (Bunderson, et al., 1989). With this power, however, comes
great responsibility. Classical test theory provides a theoretical basis for assessment
that has as a primary goal the development of sound educational measures. Regardless of
the type of assessment that is developed and administered by computer, test
developers are responsible for adhering to the standards developed by the American
Educational Research Association, the American Psychological Association, and the
National Council on Measurement in Education (1985) concerning test validity and
reliability. According to Standard 1.1, "Evidence of validity should be presented for
the major types of inferences for which the use of a test is recommended. A rationale
should be provided to support the particular mix of evidence presented for the
intended uses" (p. 13). As we shall see in Chapter 2, a test cannot be considered valid
if it is not reliable. This standard of reliability applies to all measures, whether
standardized examinations or teacher-constructed tests. Unfortunately, the lack of
research studies reporting how teacher-constructed instruments are deemed reliable
and valid suggests that educators do not consistently apply the standards
recommended by these organizations.
Statement of the Problem
I have briefly mentioned advances in microcomputer technology that provide
new opportunities for individual faculty members and departments to design, develop,

and implement computerized assessments. Whether computer-administered
assessments are created de novo as complex problem-solving or simulation scenarios,
or adaptations of current pencil-and-paper measures, test developers are responsible
for providing sufficient evidence for the instrument's reliability and validity. I suspect
that in institutions of higher education, many, if not most, teacher-constructed tests
have not undergone a formal process of validation. As we shall see in Chapter 2,
estimating the validity of a measure is a detailed and time-consuming process, and
many educators may not be aware of the psychometric properties of validity and
reliability and how they are estimated. Therefore, classical test theory continues to
inform new directions in computer-based assessment.
The faculty in the School of Medical Technology at the University of
Colorado Health Sciences Center desire to refine the current procedures for
conducting laboratory practical examinations. Administering the examinations in their
current form is a labor-intensive and time-consuming process. Given the recent
advances of microcomputer technologies, a computer-based approach to testing seems
feasible, so long as any newly-developed measures are reliable and valid, and the
scores reported on new tests are equivalent to those on current measures.
Theoretical Perspective
There are two divergent streams of activity that involve educational
assessment. In the theoretical realm of educational measurement, procedures for

estimating the validity and reliability of an objective measure originate from, and
have traditionally been based upon, classical test theory. In the applied realm of
educational measurement, alternative forms of assessment are being emphasized. The
traditional understanding of validity and reliability is being challenged by the
increased use of portfolios, simulations, and other alternative forms of assessment,
leading to divergent perspectives. New approaches to learning that focus upon
knowledge construction as opposed to knowledge transfer are making alternative
forms of assessment popular. As alternative forms of assessment
challenge the use of measures grounded in classical test theory, the traditional
definitions of reliability and validity are being reformulated. To date, however, no
new standards for assessing the strength of a measure have been developed by a
recognized and authoritative source. Given the lack of new standards to replace the
recommendations set forth in 1985, classical test theory still provides clear and simple
methods for assessing the most important psychometric properties of tests. Concepts
such as internal-consistency reliability and content, criterion-related, and construct
validity continue to provide a framework for estimating the usefulness of an
educational measure.
Significance of the Study
There are numerous studies and recommendations in the literature about
estimating the validity of a variety of measures, including tests used in science

education (Yarroch, 1991), licensing examinations (Smith and Hambleton, 1990),
computer literacy measures (Clements and Carifio, 1995), and student evaluation of
classroom teaching performance (Tagomori and Bishop, 1994). There are also a few
studies comparing pencil-and-paper and computer administrations of a measure
(Cates, 1993; King, Wesley, and Miles, 1995; Vansickle and Kapes, 1993). What is
lacking in the literature, however, is a comparison of teacher-constructed pencil-and-
paper versus computer-based assessments for medical technology training.
The study reported here was designed to assess the feasibility of administering
medical technology laboratory examinations by computer. It compares two modes for
administering a laboratory practical examination: a wet bench examination, versus a
computer-based version of the same material. The examination is currently used at the
School of Medical Technology at the University of Colorado Health Sciences Center.
The results of this study will be of interest to other faculty members or departments
that wish to explore the feasibility of administering practical laboratory examinations
by computer.
My study involved four main steps for estimating the validity, reliability, and
equivalence of the bench and computer examinations. First, the content validity of the
instrument was estimated. Second, the computer version of the examination was
developed, students in the Medical Technology program completed both
administrations of the examination, and group means were analyzed using a repeated
measures analysis of variance. Third, an item analysis was conducted for both modes

of administration. Finally, students completed a brief questionnaire about each
measure and participated in a structured group interview to discuss the advantages
and deficiencies of the computer-based version of the examination.
Limitations of the Study
This study has two primary limitations. First, the examination instrument I
used here had been developed previously and is currently being used in the
microbiology curriculum. Therefore, it was not developed specifically for this study.
Ideally, the content validation process should be applied to a measure in the
development stage so that a pilot study of the measure can be conducted before the
measure is used. Prior to this study, a pilot was not conducted.
Second, the number of subjects in this study is low for conducting a repeated
measures analysis of variance factorial design (17 students). Two between-subjects
factors were introduced into the design. The order factor, which refers to the order in
which students completed the two administrations of the examination, was necessary
to minimize the potential for a practice or carry-over effect. Gender is also of interest,
since it has been suggested that computer software is gender biased (Mangione, 1995;
Beer, 1994). These factors increase the statistical power of the analysis of variance
procedures, and thereby counter the effects of the small number of students
participating in the study.

The purpose of this chapter is to review the literature about the use of
computers in education for learning and assessment. Given the responsibility of test
developers to provide evidence to support the strength of a measure, I begin by
reviewing basic psychometric concepts from classical test theory, noting recent
challenges to the traditional definitions of reliability and validity. I then discuss new
directions in assessment, and how recent advances in technology provide new
opportunities for assessment in medical technology education. I conclude this chapter
by explaining what contribution this study will make to the existing literature.
Classical Test Theory
Classical test theory is the name given to the body of knowledge within the
disciplines of psychology and education that provides a theoretical foundation for the
development of aptitude, achievement, personality, and interest measures (Crocker
and Algina, 1986). The basic psychometric concepts of classical test theory are
attributed to E.L. Thorndike, who published the first textbook on test theory in 1904.
Classical test theory is concerned with measuring individual behavior, observed as
psychological attributes, which are also termed psychological traits or psychological
constructs. An important quality of psychological constructs is that they are not

directly observable. Measurement is the process by which things are differentiated
(Hopkins, Stanley, and Hopkins, 1990, p. 1). The measurement of psychological
constructs occurs when a quantitative value is assigned to a behavioral sample using a
test instrument.
Crocker and Algina (1986, pp. 5-7) note that the process of developing test
instruments to measure psychological constructs is confounded by five problems.
First, no single approach to the measurement of psychological constructs is
universally accepted. Since the measurement of psychological constructs is never
direct, different types of behavior can be used to assess the degree to which a
construct is demonstrated operationally. Second, the measurement of psychological
traits is based upon a limited sample of behavior. Test instruments can include only a
sample of all possible questions related to a construct, and designing tests that contain
the correct number of items assessing a variety of content within a domain is a
difficult procedure. Third, obtained measurement is always subject to error, and there
are a number of reasons why a measure will fail to provide evidence for a construct.
Measurement error may be attributed to the measure itself if it lacks the important
psychometric properties of reliability and validity, or to the examinee, in the case of
fatigue, boredom, guessing, or careless marking. Fourth, the units on the measurement
scale are not well-defined. If an examinee misses all the items on a test, can I
legitimately infer that he or she has zero mastery of the skills measured? If I correctly
answer five questions and you correctly answer 10, do you have twice the mastery of

a domain that I do? These questions point to two significant issues in test theory:
measurement error, and the interpretation of test scores. Finally, psychological
constructs are not defined solely in terms of operational definitions, but must also
possess relationship to other constructs or observable phenomena. A construct must
be defined in terms of observable behavior, and in terms of its logical or mathematical
relationship to other concepts within a theoretical system.
Classical test theory addresses these issues in two ways. First, a sound test
instrument possesses the psychometric properties of validity and reliability. Second,
no measure can perfectly identify a psychological construct, and thus the observed
score obtained from a measure contains a degree of error. Therefore, an examinee's
observed score is understood according to the true score model. The true score model
underlies the concept of reliability, which is a necessary but not sufficient condition
for validity. Since the true score model provides the theoretical underpinnings for
reliability and validity, it needs to be explained.
The True Score Model
The true score model is derived from the work of the British psychologist
Charles Spearman. According to Spearman's model, any observed test score is a
composite of two hypothetical components: a true score and a random error
component (Crocker & Algina, 1986, pp. 106-107). This model is represented as
X_{pf} = \tau_p + E_{pf}, where X_{pf} represents the observed score of examinee p on form f of a test, \tau_p
represents the examinee's true score, and E_{pf} represents a random error component

(Feldt & Brennan, 1988, p. 108). Theoretically, as the number of parallel
administrations of the test increases, the average of the resultant errors approaches
zero. The true score is therefore the average score an individual would receive on an
infinite number of test administrations using parallel forms.
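The limiting behavior described above can be illustrated with a small simulation. The sketch below is not part of the study reported here. It assumes a hypothetical examinee with a true score of 80 and normally distributed errors with a standard deviation of 5, and the function name simulate_observed_scores is invented for illustration.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_forms, seed=0):
    """Return n_forms observed scores, each a true score plus a random error.

    Hypothetical helper: errors are assumed to be normal with mean zero.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_forms)]


# Assumed values, chosen only for illustration.
TRUE_SCORE = 80.0
ERROR_SD = 5.0

few_forms = simulate_observed_scores(TRUE_SCORE, ERROR_SD, n_forms=5)
many_forms = simulate_observed_scores(TRUE_SCORE, ERROR_SD, n_forms=100_000)

# The mean over many parallel forms should lie very close to the true score,
# because the simulated errors average out toward zero.
mean_few = statistics.mean(few_forms)
mean_many = statistics.mean(many_forms)
print(f"mean of 5 forms:       {mean_few:.2f}")
print(f"mean of 100,000 forms: {mean_many:.2f}")
```

In practice only a single observed score is usually available, which is why the true score remains a theoretical quantity.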
How accurately a test measures what it was designed to measure is termed
reliability (Thorndike & Hagen, 1977, p. 73). In theory, the reliability of a measure is
estimated by correlating examinees' scores on multiple parallel administrations of
a measure. Since most educational tests measure constructs that are
indirectly observable, some degree of error will be introduced. Only the observed
score is known, although the teacher or researcher would like to know the examinee's
true score. In a population, the ratio of the standard deviation of true scores to the
standard deviation of observed scores is termed the reliability index. This ratio is
expressed as:
\rho_{XT} = \frac{\sigma_T}{\sigma_X}
where T refers to true scores and X to observed scores. This ratio expresses the
correlation between true scores and all possible observed scores from many repeated
testing administrations, a condition that is only theoretically possible. If a group of
examinees were tested on two occasions with the same test or parallel forms of the

test, it would be possible to establish a relationship between \rho_{XT}, the correlation
between true and observed scores, and \rho_{XX'}, the correlation between observed scores
on two parallel tests (Crocker & Algina, 1986, p. 115). Two tests are defined as
parallel if each examinee has the same true score on both forms of the test, and the
error variances for both forms of the test are equal. The reliability coefficient is
defined as the ratio of true score variance to observed score variance, which is equal
to the square of the reliability index. The reliability coefficient is expressed as
\rho_{X_1 X_2} = \frac{\sigma_T^2}{\sigma_X^2}    (2.2)
where X_1 and X_2 are the two versions of the test, \sigma_T^2 is the true score variance, and
\sigma_X^2 is the observed score variance.
The maximum value the reliability coefficient can assume is one. It is
interpreted as the proportion of observed score variance attributable to variance in examinees'
true scores, and is used to estimate two things. First, the reliability coefficient is the
proportion of the observed score variance on one parallel form of a test that can be
predicted from observed score variance on the second parallel form of the measure.
Second, the square root of the reliability coefficient is the correlation between the
examinees' observed scores and true scores. It should be noted that the reliability
coefficient is a theoretical concept based upon the true score model, and provides only
an estimate of the accuracy of a measure.
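The relationship between the reliability coefficient and the reliability index can be checked numerically. The following is a minimal sketch under stated assumptions, not an analysis from this study: true scores and errors are simulated as normal variables with assumed standard deviations of 10 and 5, the two forms are strictly parallel by construction, and pearson is a small helper written for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
N_EXAMINEES = 20_000
TRUE_SD = 10.0   # assumed spread of true scores
ERROR_SD = 5.0   # assumed spread of errors, equal on both forms

# Each examinee keeps the same true score on both forms (parallel forms).
true_scores = [rng.gauss(70.0, TRUE_SD) for _ in range(N_EXAMINEES)]
form_1 = [t + rng.gauss(0.0, ERROR_SD) for t in true_scores]
form_2 = [t + rng.gauss(0.0, ERROR_SD) for t in true_scores]

# Theoretical reliability coefficient: true variance over observed variance.
expected_coefficient = TRUE_SD**2 / (TRUE_SD**2 + ERROR_SD**2)

parallel_forms_r = pearson(form_1, form_2)   # estimates the reliability coefficient
index_r = pearson(form_1, true_scores)       # estimates the reliability index

print(f"expected coefficient:       {expected_coefficient:.3f}")
print(f"parallel-forms correlation: {parallel_forms_r:.3f}")
print(f"observed-true correlation:  {index_r:.3f}")
print(f"sqrt of coefficient:        {math.sqrt(expected_coefficient):.3f}")
```

With real data the true scores are unknown, so only the parallel-forms correlation could actually be computed. The simulation simply makes the otherwise hypothetical quantities visible.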

Types of Reliability
Since reliability is defined as the accuracy or consistency of a measure, there
are several kinds of reliability a measure may possess (Green, 1991, p. 28).
Consistency of a measure over time is termed stability, also known as test-retest
reliability. Consistency over different test forms is termed equivalence, or parallel
forms reliability. Consistency among raters is termed interrater reliability.
Consistency within a single test itself is called internal consistency reliability. As
noted by Hopkins, et al. (1990, p. 131), internal consistency reliability estimates are
the most commonly employed method for teacher-constructed instruments, since
reliability is estimated based upon a single administration of a measure. When
reporting a test's reliability, it is important to specify the kind of reliability that is
being estimated.
Estimating Internal Reliability
Since the concept of reliability is based upon a theoretical construct,
psychometricians have developed procedures for estimating reliability from observed
scores. Since multiple administrations of a test for purposes of estimating reliability are
usually not feasible, two methods are commonly used to estimate the reliability of
tests based upon a single administration. Each of these methods is based upon
assumptions that, if violated, jeopardize the reliability estimate.
The split-half method divides a test into two halves, often as odd- versus even-numbered
questions. The half-scores are then correlated to obtain a parallel-form

reliability of a half-length test. To account for the decreased number of items in each
half-test, the Spearman-Brown formula is used to estimate the reliability of the
full-length test (Hopkins, et al., 1990, p. 131). The Spearman-Brown formula can be used
only if the two forms of the test are truly parallel.
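The split-half procedure can be sketched in a few lines. The example below is illustrative only: the response matrix is invented (rows are examinees, columns are items scored 1 for correct and 0 for incorrect), the function names are hypothetical, and the Spearman-Brown step uses the standard form for doubling test length, r_full = 2 r_half / (1 + r_half).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half_reliability(responses):
    """Correlate odd- and even-numbered item scores, then apply Spearman-Brown."""
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd_scores, even_scores))


# Invented data: six examinees, six dichotomously scored items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

full_test_estimate = split_half_reliability(responses)
print(f"estimated full-length reliability: {full_test_estimate:.3f}")
```

Because the estimate depends on how the items are divided, a different split of the same test would generally give a somewhat different value.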
The Kuder-Richardson procedure can be used to estimate internal reliability
without splitting a test into halves. The Kuder-Richardson formula 20 is roughly
equivalent to calculating the mean intercorrelation of the items on the test,
considering this value to be the reliability of a single item on the test, and then
extending this average correlation to all items on the test (Hopkins et al.,
1990, p. 132). The Kuder-Richardson formula 21 is a simpler form of the K-R 20
procedure that is slightly less accurate but easier to compute. It requires only the test
mean, the variance, and the number of items on the test. Its value is never greater than
that of the K-R 20 procedure. Both versions of the Kuder-Richardson formulas
are used to estimate the lower bounds of reliability (Hopkins et al., p. 132). Cronbach's
alpha is a more general form of the K-R 20 procedure, and yields the same
result when items are scored dichotomously. Cronbach's alpha can also be used in situations where the test items are not
dichotomous, whereas dichotomous scoring is an assumption of the K-R 20 procedure (Green, 1991, p. 8).
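The three single-administration estimates can be compared on a small data set. This is a minimal sketch using the standard textbook forms of the formulas, not the computations performed in this study. The response matrix is invented, variances are computed as population variances (dividing by the number of examinees), and the function names are hypothetical.

```python
from statistics import mean, pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    # p is the proportion answering an item correctly; p * (1 - p) is its variance.
    item_p = [mean(row[i] for row in responses) for i in range(k)]
    sum_pq = sum(p * (1 - p) for p in item_p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))


def kr21(responses):
    """Kuder-Richardson formula 21: needs only the mean, variance, and item count."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    m = mean(totals)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * pvariance(totals)))


def cronbach_alpha(responses):
    """Cronbach's alpha; items need not be dichotomous."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))


# Invented data: six examinees, five items of increasing difficulty.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(f"K-R 20: {kr20(responses):.3f}")
print(f"K-R 21: {kr21(responses):.3f}")
print(f"alpha:  {cronbach_alpha(responses):.3f}")
```

For these invented data K-R 20 and alpha agree at about 0.83, while K-R 21 is lower at about 0.71 because the items differ in difficulty. K-R 21 matches K-R 20 only when all items are equally difficult.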
In summary, a variety of methods are available for calculating the internal
consistency reliability of a test based upon a single administration. These methods
have the true score model as a theoretical foundation. Reliability is the consistency of
test scores across multiple measurements. Validity is the appropriateness of what a

test measures. Just as a speedometer may consistently read high, an educational
measure may consistently overestimate a student's true score, or may measure
something else entirely. Therefore, reliability is a necessary but not a sufficient
condition for validity.
According to Messick (1989, p. 13), validity is an integrated evaluative
judgment of the degree to which empirical evidence and theoretical rationales support
the adequacy and appropriateness of inferences and actions based upon test scores or
other modes of assessment. An estimate of validity assesses the degree to which a
test measures what it is designed to measure. Test scores are used to draw inferences
from test performance to some other behavioral situation. Validity is a unified
concept, but multiple lines of evidence may be required to arrive at a validity
estimate. Validity is not an either/or concept, but expresses the degree to which a
test relates to the domain it is intended to measure. Therefore, a test's validity is a
matter of degree, and is contextual, in that learning objectives and learning activities
must be of a sufficient quality to warrant the use of a particular test. Evaluation
procedures are one component of the educational process, educational objectives and
learning activities being two others (Hopkins et al., 1990, p. 9). Thus, a measure's
degree of validity is estimated in the context of the stated learning objectives and the
activities by which learners are enabled to fulfill those objectives. Educational
measurement is never static, but is a dynamic process. Thus, reliability and validity
are not solely properties of a measure, but are properties of a measure used within a
specific context.
According to classical test theory, there are three types of evidence of validity
(Thorndike & Hagen, 1971; Crocker & Algina, 1986; Hopkins et al., 1990; Green,
1991). Content validity is an estimate of the degree to which the items on a test
adequately represent a performance domain or construct of interest. Content validity is the
relevant type of validity for academic achievement measures (Hopkins et al., 1990, p.
77). Criterion-related validity is an estimate of how accurately a test can be used to
draw inferences about subsequent behavior on a performance criterion that cannot be
directly measured with a test. Predictive validity (the ability of a measure to predict
future achievement on some criterion variable) and concurrent validity (the ability of
a measure to predict achievement on a related criterion variable) are both forms of
criterion-related validity. Construct validity is an estimate of how accurately a test
describes the degree to which an examinee manifests an abstract psychological trait or
ability. Table 2.1 summarizes traditional approaches to reliability and validity.

Table 2.1
Types of Reliability and Validity

Type                                Use
Stability reliability               Consistency of a measure over time; also known as
                                    test-retest reliability.
Equivalence reliability             Consistency over different test forms; also termed
                                    parallel forms reliability.
Inter-rater reliability             Consistency among raters; important in qualitative
                                    observation.
Internal consistency reliability    Consistency within a single test itself; commonly
                                    employed method for teacher-constructed instruments;
                                    includes alpha.
Content validity                    An estimate of the degree to which the items on a test
                                    adequately represent a performance domain or construct
                                    of interest; most appropriate for achievement tests.
Criterion-related validity          An estimate of how accurately a test can be used to draw
                                    inferences about subsequent behavior on a performance
                                    criterion that cannot be directly measured (includes
                                    predictive validity, concurrent validity).
Construct validity                  An estimate of how accurately a test describes the
                                    manifestation of an abstract psychological trait.

Changing Conceptions of Validity
This three-part classification of validity has recently been criticized as
insufficient. Messick (1989) contends that content validity and criterion validity are
subsumed within or under construct validity, and that relying on one type alone,
either criterion validity or content validity, is not sufficient. The meaning of a
measure, and hence its construct validity, must always be pursued not only to
support test interpretation but also to justify test use (Messick, p. 17). In
summarizing the proposed changes to the 1985 Standards for Educational and
Psychological Testing (AERA, APA, and NCME, 1985), Moss (1995, p. 12) suggests
that the consensus among validity theorists about the inadequacy of the three-fold
framework and the central role of construct validity emerged by 1980. In addition, she
explains that the explicit consideration of intended and unintended consequences of
assessment use should be incorporated into validity research. In explaining how score
meaning, relevance, utility, and the social consequences of assessment relate to
construct validity, Messick (1995, p. 6) recommends six distinguishable validity
aspects: a content aspect, a substantive aspect, a structural aspect, a
generalizability aspect, an external aspect, and a consequential aspect. He argues that these six aspects
function as general validity criteria for all psychological and educational
measurement. Finally, Moss (1994) considers us to be at a crossroads in education,
since the current conceptions of reliability and validity constrain the kinds of
assessment practices that are likely to find favor among those promoting educational
In response to recent challenges to a more traditional conception of validity
(Moss, 1995; Messick, 1989), the reader should remember that these criticisms have
not yet been formalized into standards or recommendations. Although there is general
agreement that the traditional conception of validity needs to be expanded, there is
still agreement among psychometricians that validity is not a dichotomous property of
a measure, but is a contextual evaluation of the accumulated evidence indicating the
degree to which a test fulfills its intended function. And because a test necessarily
functions within a specific context, estimating a test's reliability and validity is always
a context-dependent activity.
To summarize, the classical categories of reliability and validity provide a
benchmark for establishing important psychometric properties of a test. Although
there is widespread agreement that the concepts of validity and reliability need to be
expanded to include alternative forms of assessment, I would suggest that most
instructors and professors in higher education who give objective or problem-solving
tests fail to estimate the reliability and validity of their measures, since estimating the
reliability and validity of a test can be a difficult and time-consuming process. In
addition, educators may have different ideas about what constitutes validity. Sireci
(1995) notes that the current debate involves broad versus narrow definitions of
content validity. Broad descriptions focus upon test and item response properties,
whereas narrow definitions limit content validity to items, tests, and scoring
procedures. Regardless of how broadly or narrowly one defines validity, the
theoretical battles that will undoubtedly shape the future of psychometric theories for
the better are most likely far from the day-to-day assessment activities of many
educators, who are still faced with the responsibility of justifying the use of tests.
Even if content validity fails to incorporate the full range of evidence, partial evidence
is of greater value than no evidence. Educators who incorporate emerging
technologies for learning and assessment need some guidelines for assessing their
effectiveness. The classical definitions of validity and reliability provide educators
with a framework for evaluating a variety of tests, including measures that are based
upon new learning theories or incorporate new technologies.
Emerging Technologies: New Directions in Learning and Assessment
Recent advances in computer technologies for information storage, display,
and retrieval are impacting how computers are used to assist in learning and
assessment. Although the primary purpose of this chapter is to explain the use of
computers for assessment, new ways of conceptualizing the learning process are
challenging the traditional ways in which computers have been used for
There are differing theoretical orientations for explaining current trends in
computer-based assessment. An evolutionary/historical approach focuses upon
technological development, emphasizing the development of computer
technologies and the adaptation of testing to computers as a generative process. A
second approach focuses more upon alternative forms of assessment that have been
introduced as constructivist learning approaches expand the more traditional models
of learning. This approach describes the use of computers for assessment in terms of
the characteristics of computer-based tests, and how they potentially can make
amends for the mere replication of multiple-choice tests that are administered by
computer. As a theoretical shift in learning theory emphasizes the active construction
of knowledge by learners, new approaches to computer-based testing that place more
control in the hands of the test takers will continue to challenge traditional views of
assessment. These two theoretical perspectives are converging: alternative assessment
and advances in computer technology are broadening the concepts of reliability and
validity as described by classical test theory. In this section I discuss both
perspectives. I begin by defining four generations of computerized educational
measurement. I then discuss new approaches to assessment that are encouraged by
both constructivist approaches to learning and recent advances in microcomputer
Four Generations of Computerized Educational Measurement
Computerized educational measurement is a sub-field of educational
measurement brought about by the advent of computer technology (Bunderson et al.,
1989). The fundamental research question for computerized testing is the equivalence
of scores between the manual version and the computer-administered version of a
test. The American Psychological Association Committee on Professional Standards
and the Committee on Psychological Tests and Assessments developed the Guidelines
for Computer Based Tests and Interpretations (1986). These guidelines hold test
developers responsible for demonstrating score equivalence.
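One simple way to begin examining score equivalence is to compare each examinee's scores on the two versions with a paired t statistic. The sketch below is illustrative only: the scores are invented, the function name is my own, and a non-significant difference is not by itself proof of equivalence.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(manual, computer):
    """Return (mean difference, t statistic, degrees of freedom) for paired scores."""
    diffs = [c - m for m, c in zip(manual, computer)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return mean(diffs), t, n - 1

# Invented scores for eight examinees who took both versions.
manual_scores = [78, 85, 90, 72, 88, 95, 81, 76]
computer_scores = [80, 83, 91, 70, 89, 94, 84, 75]

mean_diff, t_stat, df = paired_t(manual_scores, computer_scores)
print(mean_diff, t_stat, df)
```

The t statistic would then be compared against a critical value for the stated degrees of freedom; correlations between versions and comparisons of score variances are other lines of evidence a developer might report.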

Computerized educational measurement can be summarized according to a
four-generation framework, where each successive generation represents an increase
in sophistication and power. According to Bunderson et al. (1989), these generations
do not differ significantly from one another with regard to the computer as a test
delivery system, but rather are distinguished by the degree to which the computer
system is programmed to give feedback to the user.
The first generation, computerized testing (CT), is characterized by the
development of non-adaptive tests that are similar to manually-administered tests, and
merely utilize the computer for the test administration process.
The second generation of computerized educational measurement is
computerized adaptive testing (CAT). Computer adaptive tests vary the presentation
of tasks according to the user's prior responses. Item response theory provides a
psychometric foundation for CAT tests that adapt items on the basis of the item
difficulty parameter.
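The adaptive logic can be sketched as follows, assuming a one-parameter (Rasch) model in which an item whose difficulty equals the examinee's ability has a 0.5 probability of a correct response. The item bank, the simulated examinee, and the halving step rule for updating the ability estimate are simplifying assumptions of mine; operational CAT systems typically use maximum-likelihood or Bayesian ability estimates.

```python
from math import exp

def p_correct(theta, b):
    """Rasch model: probability of a correct response given ability theta and difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def next_item(theta, difficulties, administered):
    """Pick the unused item whose difficulty is closest to the current ability estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta))

def run_cat(difficulties, answers_correctly, n_items=4, step=1.0):
    """Administer n_items adaptively; move the estimate up or down by a shrinking step."""
    theta, administered = 0.0, []
    for _ in range(n_items):
        item = next_item(theta, difficulties, administered)
        administered.append(item)
        correct = answers_correctly(difficulties[item])
        theta += step if correct else -step
        step /= 2
    return theta, administered

# Hypothetical bank of item difficulties and a deterministic simulated examinee
# who answers correctly whenever the item difficulty is at most 1.5.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
theta, order = run_cat(bank, lambda b: b <= 1.5)
print(theta, order)
```

Each response changes which item is presented next, so two examinees with different response patterns see different sequences of items drawn from the same bank.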
The third and fourth generations of computerized educational measures differ
from prior generations in that they provide a high degree of interpretation. Continuous
measurement (CM) systems belong to the third generation, while intelligent
measurement (IM) systems characterize the fourth generation. IM systems differ from
CM systems in that they incorporate a knowledge base. Third and fourth generation
systems need to be developed in conjunction with professionals who can scrutinize
the advice given in response to a selection, and are therefore too costly and difficult to
develop for use by individual faculty members or small departments. With the
increased capabilities and technological advances of desktop computers, individual
faculty members are afforded new ways of using computers for testing. Alternative
approaches to assessment that utilize computers make sense in settings where new
approaches to learning are being adopted. Specifically, technologies such as hypertext
that emphasize random access to information and a high degree of learner control
have opened the door to new opportunities for using computers for assessment.
New Approaches to Learning and Assessment: Knowledge Construction and
Alternative Assessments
New paradigms for learning that incorporate advances in technology to place
learners in control of the instructional objectives, instructional sequencing, and pace
of instruction are currently being emphasized in education. These approaches often
emphasize the use of computers for accessing large stores of information that may
become of interest during instruction. Hypertext and related technologies are viewed
as valuable cognitive tools for active, user-centered learning that is characterized by
knowledge construction.
The concept of knowledge construction is illustrated by comparing traditional
approaches to computer-based instruction with the use of cognitive tools for learning
such as hypertext. Computer-assisted instruction (CAI) emphasizes the computer as
an intelligent tutor that provides corrective feedback as information is transferred
from the system to the learner via an expert model (Venezky & Osin, 1991).
Though concerned with instruction, this approach tends to minimize or entirely ignore
the larger learning context comprised of the sum of the internal and external factors
that encourage or inhibit learning. Newer computer technologies such as hypermedia
can be designed to be less transmission-based and more easily appropriated by the
learner. As the use of these new technologies increases, a theoretical shift away from
the Intelligent Tutoring System (ITS) paradigm and its emphasis upon modeling,
monitoring, and feedback has been observed. The new paradigm being adopted is
more constructivist, since students are encouraged to monitor and diagnose their own
learning and problem-solving performance through the use of cognitive tools (Derry
& Lajoie, 1993). In contrast to the ITS paradigm, proponents of the constructivist
view consider learning more a matter of nurturing the ongoing processes of
knowledge construction rather than receiving information from external sources
(Moyse & Elsom-Cook, 1992).
The Impact of New Approaches to Learning Upon Assessment
As these new approaches to learning are adopted, an obvious question is the
extent to which assessment technologies are keeping pace with instructional
technologies. According to Reeves and Okey (1996), constructivist approaches to
learning encourage new approaches to assessment, as evidenced by a recent interest in
forms of assessment that diverge from the classical testing model. These include
computerized adaptive testing (Smittle, 1994), performance-based assessment
(Reeves & Okey, 1996; Baker, O'Neil, & Linn, 1993), using new technologies to
deliver assessment (Collins, Hawkins, & Frederiksen, 1993), and dynamic assessment
(Lajoie & Lesgold, 1992). Certain of these approaches are more appropriate in certain
domains than in others. As noted by Spiro and Jehng (1990), hypertext as a learning
technology is appropriate for teaching complex domains such as medicine, since the
random-access capabilities of hypertext allow learners to access information in
sequences that are unique and meaningful to them. It should not be surprising, then, to
learn that alternative approaches to computer-based testing, such as
performance-based assessment, have been implemented in medical education and training.
Performance-Based Assessment and Medical Education
Performance-based assessments have been used extensively in medicine.
Although there is no one definition that best defines performance-based assessment,
an emphasis upon higher-order skills in a real-world context simulating open-ended
tasks is a common theme (Swanson, Norman, & Linn, 1995, p. 5). According to
Baker, O'Neil, and Linn (1993, p. 1211), performance-based assessments:
1. Use open-ended tasks.
2. Focus on higher order or complex skills.
3. Employ context-sensitive strategies.
4. Use complex problems requiring several types of performance and significant
student time.
5. Consist of either individual or group performance.
6. May involve a significant degree of student choice.
Swanson et al. (1995) describe four performance-based assessment methods
that have been used in medicine: Patient Management Problems, computer-based
clinical simulations, oral examinations, and Standardized Patients (SPs).
Patient Management Problems (PMPs) provide learners with an opening
scenario about a patient, move to progressive scenes where additional information is
gathered, and are followed by one or more scenes where patient management
activities are initiated. PMPs were used in medical licensing and specialty
certification examinations until the late 1980s, when a recognition of their
psychometric deficiencies led to their elimination (Swanson, Norcini, & Grosso,
Computer-based clinical simulations use text and other modalities for
presenting high-fidelity models of the patient care environment, requiring examinees
to select from a full range of diagnostic and therapeutic options available. Combined
with the use of imaging technology, realistic, full-color presentations of patient
findings are now possible. Simulations are not currently used for high-stakes testing,
but computer-based simulations will probably become a component of
medical licensing examinations by the end of the decade (Swanson et al., 1995, p. 6).
Oral examinations were eliminated from U.S. licensing examinations over 30
years ago, although orals continue as a component of specialty certifications
(Swanson, Norman, & Linn, 1995, p. 6).
Standardized patients are non-physicians who are trained to portray patients in
a testing situation. A variety of methods for rating examinees are used, including
rating via checklist by the SP or rating by a physician-observer. This procedure was
recently introduced into the Canadian licensing examination (Reznick et al., 1993). In
the United States, the National Council on Education Standards and Testing (1992)
has proposed that performance-based assessment be a featured component of national
Studies About Computer-based Testing in Medicine
A number of recent studies and activities demonstrate the increased interest in
computer-based assessment in medicine. In 1988 a study was designed to determine
the feasibility of creating and administering computer-based problem-solving
examinations for evaluating second-year medical students in immunology. This
format was compared to objective and essay format examinations. No significant
differences were found among the three testing methods (Stevens, Kwak, & McCoy,

The National Board of Medical Examiners (NBME) has developed the
Computer-Based Examination (CBX) system that delivers simulations and multiple-
choice questions to examinees (Clyman & Orr, 1990). The system interfaces with a
videodisc player capable of accessing thousands of images. In 1987 the effectiveness
of the system was studied. The system was found to effectively assess clinical
competence. The NBME is currently in the distribution and testing phases of
implementing computer-based testing in a number of medical schools.
More recently, computer-based means of assessment have been introduced into
pathology and medical technology training. A computerized system for assessing
proficiency in cytology has been proposed (Breyer, Lewis, & Mango, 1994). This
system uses high-resolution digital images for cell recognition purposes. The
Department of Molecular and Cell Biology at the University of Connecticut has
developed The Virtual Classroom, which is a World Wide Web resource containing
instructional materials for microbiology instruction. In addition to class notes, syllabi,
and lecture materials, the web site contains practice examinations and a limited
number of images (Terry, 1995).
In 1994, the American Society of Clinical Pathologists (ASCP)
began administering all Board of Registry Examinations by computer (Castleberry &
Snyder, 1995). Although information concerning student responses to the exam in this
format is not available, more than 14,000 students were examined by computer in the
first year of computerized testing.

From the scope of the projects cited here, it may appear as though alternative
forms of assessment may be difficult to implement by individual faculty members or
departments using teacher-constructed tests. First, there is the expense associated with
developing complex measures. Second, there is the need to estimate the validity and
reliability of new measures, which is a time-consuming and labor-intensive process.
Third, from the absence of studies in the literature about establishing the reliability
and validity of teacher-constructed tests, I conclude that these activities are the
exception rather than the norm. Finally, the implementation and pilot-testing of new forms of
assessment take time, and therefore may be discouraged in favor of traditional means
of assessment.
An intermediate category of assessment has been identified, however, that is
not as simplistic as multiple-choice pencil-and-paper measures, but also not as
complex as detailed performance-based assessments or computer-based simulations.
Neither traditional testing nor performance-based assessment methods are a panacea.
Selection of assessment methods should depend upon the skills to be assessed, and
generally, use of a blend of methods is desirable (Swanson, Norman, & Linn 1995,
p. 11). Low-grade simulations are one form of this proposed blend of methods.
Low-grade simulations are similar to what are termed simulated identification tests,
which are useful in situations where mistakes made by the examinee could have
serious consequences. In several allied health fields, this type of test has been used
successfully. For example, medical laboratory technicians are often trained and tested
for their ability to identify bacteria and other microbes on prepared slides (Priestly,
1982, pp. 100-101). These measures would presumably require less time and effort to
develop than complex computer-based simulations or performance-based
assessments, and therefore may more readily be adopted and developed by small
departments and individual faculty members. Table 2.2 summarizes four methods for
implementing testing by computer, ranging from a low to high degree of complexity.
Table 2.2
Four Categories of Computer-based Assessment

Category                     Description
Multiple-Choice Tests        Easy to construct and administer; scoring can easily be
                             accomplished by computer.
Low-Grade Simulations        Less complex than problem-based assessments or computer
                             simulations; useful when examinees need to be protected
                             from hazardous materials.
Problem-based Assessments    Computer-based simulations utilizing low-fidelity models;
                             may be text-based, and usually do not offer the full range
                             of models characteristic of computer simulations.
Computer Simulations         Simulations using text and other modalities for presenting
                             high-fidelity models of the patient care environment; a full
                             range of diagnostic and therapeutic options available.

Advances in computer technologies have far-reaching implications for
assessment, and the challenges of these developments to measurement personnel will
be substantial (Linn, 1989, p. 3). Although the use of computers in assessment may
produce better diagnostic tests, integrate testing and instruction, and enhance the
instructional value of assessment, there is as yet no empirical evidence to suggest
that the possibilities will be realized. The adoption of technologies encouraging
knowledge construction has accelerated the need for new approaches to assessment,
although performance-based assessments and complex computer-based simulations
are time-consuming to design, develop, test, and implement. Even though the
traditional definitions of reliability and validity are currently being challenged in light
of new approaches to assessment, the need to estimate validity and reliability has not
lessened. Fundamentally, a test developer is still required to justify the implicit claim
that a test is reliable and valid, which is implied whenever a test is administered. The
standards for tests put forth by the American Educational Research Association concerning the
need to justify a test's reliability and validity pertain to all forms of assessment.
The literature I cite here describes new possibilities for computer-based
assessment in medical education. What is lacking in the literature, however, is a clear
description of how to develop and validate a computer-administered item simulation
measure. The faculty of the Medical Technology School at the University of Colorado
Health Sciences Center are ready to explore the potential of using simulated
identification tests that are administered by computer. They are interested in testing by
computer so that they will spend less time preparing practical examinations, granting
them more time to spend with students. Nitko (1989, p. 453) classifies instructional
outcomes according to both student variables and system variables. System variables
include cost-effectiveness and teacher-time allocation. According to one medical
technology instructor, it takes approximately four hours to set up a laboratory
practical simulated identification examination that will take students less than one
hour to complete. Since a laboratory practical is given each week during the academic
term, set-up time takes approximately 10 percent of the total time instructors have
with students (Christie Grueser, personal communication, March 22, 1996). Faculty
members are anxious to explore the development, testing, and implementation of
simulated identification tests by computer, in order to reduce teacher-time (a system
variable) allocated to preparing the examinations. For expediency, my goal is to
design and develop an item simulation test from the practical examinations currently
being used, which I describe in this study. The results may be of interest to other
faculty members who may want to convert current forms of assessment to computer-
administered versions that take advantage of current technologies for representing
digitized medical images as a component of the assessment.

The present study was first conceptualized in 1994, when I began producing
digitized medical images for use in computer-assisted medical technology training.
The research design, examination instruments, and data collection techniques were
designed to assist the faculty at the Medical Technology School at the University of
Colorado Health Sciences Center in assessing the viability of administering laboratory
practical examinations using microcomputers. This study is a continuation of two
previous studies. I conducted an environment analysis in conjunction with the medical
technology faculty to assess the readiness of the medical technology school to adopt
computer-based testing methods. After completing the environment analysis, I
conducted a qualitative pilot study assessing student attitudes about computer-based
testing (Koneman, 1994).
The primary purpose of this study is to compare two methods of administering
a laboratory practical examination. As noted in Chapter 2, computer-administered
tests must demonstrate equivalence with alternative methods for testing. I completed
the present experimental study as follows. First, I met with the faculty at the Medical
Technology School to determine which practical examination should be used for the
study. I then developed a procedure for estimating the validity of the existing
measure. After the results were compiled, I worked with two faculty members to
develop the computer-based version of the examination, and to develop a method for
randomly assigning students to one of two groups. We then established a time frame
for administering both versions of the examination to all students, according to a
repeated-measures design. At the conclusion of the second administration, all students
completed a brief survey about the examination experience, and participated in a
structured group interview. Finally, the examinations were scored, and the results
The entire first-year class of the medical technology program at the Medical
Technology School at the University of Colorado Health Sciences Center in Denver,
Colorado, participated in this study as a course requirement. Seventeen students were
admitted to the program for the Fall 1996 academic quarter. The class comprises
12 females and 5 males.
Program Requirements
Acceptance into the program requires successful completion of the equivalent
of 80 hours of undergraduate education from an accredited institution, including 16
hours of chemistry, 16 hours of biology, and at least one semester each of
microbiology and immunology. Students who complete the program will earn a
Bachelor of Science in medical laboratory sciences. After earning this degree,
students normally complete the Board of Registry examination, administered by the
American Society of Clinical Pathologists. Upon successfully passing this
examination, students receive certification as a Medical Technologist (MT, ASCP).
The Clinical Microbiology Course
The examination instruments used for this study were developed for the
clinical microbiology course, which is the first course students complete in the
medical technology program. This course meets Monday through Friday, from 8:00
AM to 4:00 PM, and is comprised of both lecture and laboratory sessions. The lecture
and laboratory learning objectives are listed in Appendices A and B respectively.
As a part of this course, students complete weekly bench practical
examinations, which assess each student's ability to integrate and apply material from
both the lecture and laboratory components of the course. Each bench examination
includes low-grade simulation items requiring students to correctly interpret medical
technology tests such as Gram stains and biochemical reagent tests, and to correctly
identify the salient colony morphology and growth zones of bacteria incubated on
agar plate media. During the course of the examination, students interact with live
organisms, Gram stains, and mock biochemical tests. These materials are set up at the
laboratory benches as "Stations." To present conditions as realistic as possible for
these practicals, faculty spend three to four hours preparing each weekly bench
exam. As noted in Chapter 2, preparing the bench examinations requires
approximately ten percent of the time allocated by faculty for teaching. The rationale
for conducting the study is this: if the time required to prepare and administer the
bench examination can be reduced, faculty will have more time to devote to students
in non-assessment
Each practical examination administered throughout the academic quarter is
comprised of compound questions requiring students to make identifications and
diagnoses based upon the test interpretations. The examination questions are written
at the application and synthesis levels of Bloom's taxonomy for the cognitive domain
(Bloom, 1956, pp. 144-164). Normally, the number of stations included in an
examination corresponds to the number of students.
The primary materials used in this study are a content-validity estimation
rating form, two versions of the practical laboratory examination (bench and
computer versions), and the survey each student completed at the conclusion of the
second administration of the exam. Additional materials include the instructions given
to students for completing the computer-based version of the exam, and a floppy
diskette for storing each student's computer responses.
The Content Validity Estimation
Developing reliable and valid test instruments is both an art and a science.
Although there is a recent emphasis upon reassessing the classical distinctions of
content, criterion, and construct validity, other sources view content validity as
relevant for objective measures with a high degree of specificity. From classical test
theory a number of recommendations have been made for designing teacher-
constructed tests that adhere to a clear and well-defined design strategy for producing
a valid and reliable measure.
The Standards for Educational and Psychological Testing (1985) recommend
that the content universe represented by the test should be clearly defined for content-
related evidence. A number of authors suggest an analysis strategy that entails
constructing a table of specifications for the content and process objectives for a
course, referenced to the taxonomy level of the cognitive domain for each test item
(Thorndike & Hagen, 1977, pp. 203-210; Hopkins et al., 1990, pp. 176-179; Crocker
& Algina, 1986, pp. 72-75). A list of instructional objectives is used to weight the
objectives according to their relative importance in the course, and each item is then
cross-checked against the table of specifications to ensure that no content area is
emphasized over the others. Using a strategy such as this provides one form of
evidence of content validity.
According to Crocker and Algina (1986, p. 218), content validation can be
viewed as a series of activities undertaken after an initial form of an instrument is
developed to assess whether its items adequately represent a given performance
domain. They recommend, at a minimum, that content validation entail the following
steps:
1. Define the performance domain of interest.
2. Select a panel of qualified experts in the content domain.
3. Provide a structured framework for the process of matching items to the
performance domain.
4. Collect and summarize data from the matching process.
Differing strategies for rating the fit of items to objectives have likewise
been proposed. Hambleton (1980, p. 210) recommends that expert reviewers use a
five-point scale to rate the degree of match. This method is less empirical than one he
recommended in 1977 (the Rovinelli-Hambleton procedure), but has the advantage of
being easily completed by a panel of expert reviewers, and the results are more easily
summarized without elaborate statistical analysis. Green (1991, p. 31) recommends
either a dichotomous or a scaled approach, favoring Hambleton's five-point Likert scale
approach. To estimate the content validity of one of the practical examinations, I
developed a content validity estimation rating form and selected four subject matter
experts to rate the degree to which each examination item corresponds to the course
objectives.
To assist the subject matter experts in estimating the content validity, I
followed Hambleton's recommendation and developed a rating instrument that
includes the laboratory objectives and a five-point scale for rating each item. The
laboratory objectives for the microbiology course appear in the left column of each
page of the rating form. Subsequent columns are used to rate each examination item
according to the following five-point scale: (a) 1, strongly meets the objective; (b) 2,
moderately meets objective; (c) 3, barely meets objective; (d) 4, does not meet
objective; and (e) 5, does not come close to meeting objective. As a pilot study, I sent
this rating form and a printed copy of the pencil and paper instrument accompanying
the bench exam to four subject matter experts, two of whom returned a completed
form. The results of this pilot study suggest that the examination addressed a majority
of laboratory objectives, although some objectives were not applicable (Koneman,
1996). As a result, I revised the rating form used in the present study.
Conducting the Content Validation Study
Based upon recommendations by the subject matter experts who participated
in the pilot study, I made three revisions to the content validity estimation rating form
prior to conducting the content validity estimation for this study. First, I added two
item series for rating examination questions 16 and 17, which were added
to the practical examination after the pilot study was conducted. Second, I added a
column to each item series for specifying that an objective is not applicable
to a particular item, since both experts had commented that some
objectives do not apply to this type of examination. Finally, I added shading over
the objectives that the experts indicated do not apply to this measure (objectives 1, 3,
4, 5, and 9 of Appendix B). I revised the cover letter accompanying the examination
questions and the rating form, and sent the content validation study materials to three
subject matter experts. The first expert is an MD who teaches pathology and
microbiology. The second expert is certified as a medical technician by the American
Society of Clinical Pathologists (ASCP), is employed by a major microbiological
manufacturing firm, and is considered an expert in proficiency testing. The third
expert is also an ASCP certified medical technologist who teaches microbiology in an
accredited microbiology program. The cover letter accompanying the pencil-and-
paper version of the examination and a sample of the updated content validity rating
form sent to each expert are included in Appendix C.
The Practical Examination
We developed two versions of the practical examination containing the same
content but differing in mode of administration. The first version of the exam
(hereafter referred to as the bench exam) is a pencil-and-paper instrument that is used
to record student responses in the laboratory as the bench exam is taken. The second
version (the computer exam) contains the same content, but in a computer-
administered format. The two exams differ primarily in how the specimens are
presented to the student, and in how student responses are collected. The bench
version contains mock-ups of the organisms and procedures and is accompanied by a
pencil-and-paper form for recording responses; the computer version contains high-
resolution digitized photographs of actual organisms and tests, and students enter their
responses directly into the computer.

The bench exam. The bench exam consists of seventeen laboratory stations,
each containing a Gram-stained slide viewed under a microscope, an agar plate with
colony growth, a biochemical test or procedure, or any combination of these items.
Students complete the bench examination by moving from station to station in
succession, and recording their responses on the pencil-and-paper form that is
included in Appendix D. In order to complete the entire examination in a timely
manner, students are allotted a maximum of three minutes at each station.
Many of the organisms and tests included on the exam are simulated mock-
ups of actual tests. Simulations are used for a variety of reasons. First, some of the
tests and procedures are time-sensitive. A color reaction in a biochemical test tube
may degrade rapidly, and therefore not remain stable for the hour it takes for students
to complete all portions of the bench exam. Other tests require time for incubation or
growth, and must be controlled for temperature or other atmospheric conditions, and
for the same reason, cannot be kept in the laboratory where the exam is conducted.
Finally, some tests identify potentially dangerous organisms, and therefore a mock-up
of the organism is used to protect students from exposure.
The examination items are designed to assess learning above the knowledge
level, and identify the student's ability to analyze and apply information. Knowledge
application has been identified as the ability to select and apply the correct abstraction
to a problem situation (Hopkins, Stanley, & Hopkins, 1990, p. 173). Complex testing
instruments are required to present a problem situation. The test items on the bench
exam fall on a continuum between factually-oriented multiple choice questions and
performance-based simulations. Items such as these are in a category referred to as
low-fidelity simulation (Swanson, Norman, & Linn, 1995, p. 11), and simulated
identification (Priestly, 1982, p. 100). Students visit all stations in order, and after the
last station is visited, students have ten minutes to revisit any exam stations and
change their responses.
The computer exam. The computer version of the exam contains the same
items as the bench version. Instead of presenting students with a mock test relating to
each exam item, the computer version contains high-resolution digitized color
photographs of the actual test procedures and results. The computer version runs on
any microcomputer using the Microsoft Windows 3.1 graphical user environment, or
the Microsoft Windows 95 operating system. Hardware requirements include a
minimum of 8 megabytes of random access memory, a mouse or equivalent pointing
device, a hard drive with 10 megabytes of free space, and a video display card and
monitor capable of displaying 256 colors simultaneously at a screen resolution of
640x480 pixels. The computer examination was developed using Asymetrix
Multimedia Toolbook Version 4.0. The entire examination is installed from two 1.44
megabyte floppy diskettes.
The computer exam consists of seventeen screens of information that
correspond to the seventeen stations listed on the pencil-and-paper bench exam form.
Each screen contains one or more selection boxes and text fields for students to enter
their responses. At least one photograph accompanies each exam question. To
emulate the time students have to complete each station on the bench exam, the
computer exam contains a timer routine that limits the student to three minutes per
screen. Each screen comprising the computer version of the examination is shown in
Appendix E.
The computer version requires a floppy diskette to be placed into the 1.44
megabyte diskette drive assigned the drive letter A. During execution, the
program looks for a key file on the floppy diskette. This key file
contains a binary list indicating the order in which the questions are to appear. If this
file is not present on the floppy diskette in the A drive when the Begin button is
selected, an error message appears in a dialog box in the middle of the screen
explaining that the user does not have access to the exam. When the user clicks the
OK button to remove the dialog box, the Quit button is enabled, allowing the user
to exit the program and return to the Windows desktop. This acts as a security feature,
whereby students (or anyone else with access to the library computers) cannot access
the examination questions without a key diskette.
Once the exam is launched from the desktop, an ASCII text file for logging
student responses is created on the floppy diskette, and the opening screen is
displayed. This screen explains the format for the exam, and provides the user with
two options. The Instructions button displays the instructions screen, which
explains how to make selections and enter responses. The Quit button is
intentionally disabled, so students cannot exit the program once they have started the
exam.
The Begin Exam button displays the first item, sets the timer to three minutes,
and begins counting down by seconds. When the timer reaches zero, the next item in
the student's list of items is displayed. This process is repeated until the last item is
displayed. When the counter for the last question reaches zero, a dialog box is
displayed that informs the student that he or she will have ten minutes to review any
of the examination questions and make changes. Navigation buttons for moving
between examination questions appear in the lower right corner of the screen. After
ten minutes, the ASCII file containing student responses is closed, and a copy of the
data file containing the student's responses is saved to the floppy diskette. The
program closes automatically after all files are stored to diskette.
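The timing rules described above (three minutes per screen, followed by a ten-minute review period) can be illustrated with a short sketch. This is not the Toolbook routine used in the study; the function name exam_schedule and its parameters are hypothetical, and the sketch assumes that every screen runs for its full three minutes.

```python
def exam_schedule(item_order, seconds_per_item=180, review_seconds=600):
    """Return per-item time windows and the total exam length in seconds.

    Assumes every item is displayed for the full time limit, as happens
    when the countdown timer (not the student) advances the screen.
    """
    schedule = []
    elapsed = 0
    for item in item_order:
        # Each entry is (item number, start second, end second).
        schedule.append((item, elapsed, elapsed + seconds_per_item))
        elapsed += seconds_per_item
    # The review period begins after the last timed item expires.
    return schedule, elapsed + review_seconds


if __name__ == "__main__":
    windows, total_seconds = exam_schedule(list(range(1, 18)))
    print(f"Items: {len(windows)}, total time: {total_seconds / 60:.0f} minutes")
```

Under these assumptions, seventeen screens plus the review period take 61 minutes, which is consistent with the roughly one-hour administration described for both versions.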
Research Design
The goal of this study is to compare student responses on both administrations of
the practical examination described above. To minimize the number of test
administrations required so as not to interfere unduly with the teaching schedule, I
selected a fixed factor, repeated measures, analysis of variance design. Two between-
subjects factors were included to alleviate a potential carry-over effect and explore
gender differences. These additional factors are also beneficial because they increase
the statistical power of the analysis, which is jeopardized by the small number of
participants. The first factor, order, is fixed with two levels, which indicates the
order in which the examinations were administered. The order factor levels are
computer/bench and bench/computer. The second factor, gender, is also fixed with
two levels. The repeated (within-subjects) factor, type of exam, includes two levels,
which correspond to the modes for administering the examination. Students were
randomly assigned to one of two treatment groups.
This design is a more complex instance of the one-factor repeated measures
design. A three-factor design is termed a split-plot design, and has the following
characteristics: it is a combination of the one-factor repeated measures model and the
three-factor fixed-effects model (Lomax, 1992, p. 232). Each subject responds to both
levels of the repeated measures factor, but to only one level of each non-repeated
factor. The advantage is the same for the one-factor repeated measures model:
subjects serve as their own controls in that individual differences are taken into
account, yielding a more economic design requiring fewer subjects. The potential
disadvantages to this design have been noted. Kirk (1968, p. 248) warns of the
potential for an order effect (also termed a carry-over effect). This potential can be
minimized by varying the order in which subjects are exposed to the levels of the
repeated factor. Estes (1991, p. 118) notes the potential difficulty in interpreting
multifactor designs. His concern, however, is more applicable to designs with three or
more levels of each factor. Finally, in unbalanced, nonorthogonal designs, the sums of
squares are not additive, and the tests of various effects may not be related in any
significant way to population parameters (Estes, 1991, p. 133). This effect is less
pronounced when the unbalance is relatively small. In the design I describe here, the
degree of unbalance is large. Minimum cell size must be greater than the number of
dependent variables included in the design (Hair, Anderson, Tatham, & Black, 1995,
p. 269). With the disparate ratio of males to females, one cell contains only the required
minimum of two observations. This degree of imbalance may limit the generalizability
of the findings.
In this design the dependent variable is examination score. There are three
independent variables. One of these, type of examination, is a within-subjects factor
with two levels, the bench score and the computer score. The two independent,
between-subjects factors are gender and order, and each has two levels. The following
hypotheses are tested by this design:
1. There is no main effect for order.
2. There is no main effect for gender.
3. There is no interaction between order and gender.
4. There is no main effect for type of exam.
5. There is no interaction between type of exam and order.
6. There is no interaction between type of exam and gender.
7. There is no interaction among type of exam, order, and gender.

The split-plot, three-factor repeated measures design is graphically depicted in
Table 3.1.
Table 3.1
Gender    Group                Computer Score    Bench Score
Male      1. Computer/Bench    Mean Score        Mean Score
          2. Bench/Computer    Mean Score        Mean Score
Female    1. Computer/Bench    Mean Score        Mean Score
          2. Bench/Computer    Mean Score        Mean Score
Note. Graphical depiction of a 2 x 2 x 2 split-plot, three-factor repeated measures
design.
Students were assigned to one of two order groups according to a systematic
random strategy. The instructor listed each student alphabetically by last name.
Beginning with the first name in the alphabetized class roster, odd-numbered students
were placed in Group One, and even-numbered students in Group Two. To comply
with the exemption granted by the Human Subjects Review Committee, I assigned
each student a four-digit identification number to protect his or her anonymity. All
identification numbers begin with 96, indicating that the study was conducted in the
ninth month of the year 1996. The remaining two digits in the identification number have no
significance, other than uniquely identifying each student. The systematic random
assignment of students to groups is shown in Table 3.2.

Table 3.2
Systematic Random Assignment of Students to Groups
Roster    Student Number    Order    Gender    First Item
1         6921              1        F         1
2         6932              1        F         2
3         6943              1        F         3
4         6954              1        F         4
5         6965              1        F         5
6         6976              1        F         6
7         6987              1        M         7
8         6998              1        M         8
9         6902              1        F         9
10        6913              2        M         1
11        6924              2        F         2
12        6935              2        F         3
13        6946              2        M         4
14        6957              2        M         5
15        6968              2        F         6
16        6979              2        F         7
17        6988              2        F         8
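
The alternating assignment rule described in the text (alphabetize the roster, then place odd-numbered students in Group One and even-numbered students in Group Two) can be sketched as follows. The function name assign_groups and the sample names are hypothetical and are used only for illustration; no actual student names appear in this study.

```python
def assign_groups(roster):
    """Alternate alphabetized students between Group One and Group Two.

    roster: iterable of last names (hypothetical values in this sketch).
    Returns a dict mapping group number (1 or 2) to a list of names.
    """
    ordered = sorted(roster, key=str.casefold)  # alphabetize, ignoring case
    groups = {1: [], 2: []}
    for position, name in enumerate(ordered, start=1):
        # Odd roster positions go to Group One, even positions to Group Two.
        groups[1 if position % 2 == 1 else 2].append(name)
    return groups


if __name__ == "__main__":
    sample = ["Cruz", "Adams", "Baker", "Diaz", "Evans"]
    print(assign_groups(sample))
```

With a roster of seventeen students, this rule yields nine students in Group One and eight in Group Two, which matches the group sizes reported for the study.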

Administering the Examinations
Prior to administering the examinations, the computer version was installed and
tested on nine computers in the Library Resource Center, located in the Dennison
Memorial Library of the University of Colorado Health Sciences Center. The nine
computers used for this study are identical in configuration: each is a Compaq
Prolinea desktop computer with a VGA monitor displaying 256 colors, a hard disk, a
keyboard, a mouse, and a 1.44 megabyte floppy diskette drive. Each computer uses the
MS-DOS and Microsoft Windows 3.1 operating environment. I standardized each desktop
after installing the examination software.
The bench and computer examinations were administered simultaneously
on two consecutive Fridays. On the Thursday before the first administration date, the
course instructors gave verbal instructions to the class for completing the practical
examination. They informed students that they would be taking a laboratory practical
examination in two parts, and that one part would be given in the laboratory and in
the same format as the two previous laboratory practicals. They gave each student
printed instructions for completing the computer version. This instruction sheet,
shown in Appendix F, also included each student's first name and identification
number (written in pencil by the instructor). Students were informed that they would
need their identification number during both parts of the exam. The instructors also told
students not to discuss the examination until they had completed both parts. Students
were not told that the two parts of the examination contain identical items.
Administering the Bench Exam
The eight laboratory benches were divided into seventeen stations, each
containing specimens representing one of the items listed in Appendix D. Each student
began the exam at a different station, according to the first item assignment listed in
Table 3.2. Students were given three minutes to correctly identify the salient features
of the sample or test presented at each station, and recorded their answers on the
pencil-and-paper instrument shown in Appendix D. A laboratory timer was used to
signal the time limit at each station. After three minutes, each student moved to the
next station.
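The rotation through stations can be sketched in a few lines. The function name station_order is hypothetical, and the sketch assumes that a student who reaches station 17 continues at station 1 until all seventeen stations have been visited; the text implies this wrap-around but does not state it explicitly.

```python
def station_order(first_station, n_stations=17):
    """Return the sequence of stations visited, starting at first_station.

    Assumes the student advances by one station every three minutes and
    wraps from the last station back to station 1 (an assumption).
    """
    return [((first_station - 1 + step) % n_stations) + 1
            for step in range(n_stations)]


if __name__ == "__main__":
    # A student whose first item in Table 3.2 is 7 would visit 7, 8, ..., 17, 1, ..., 6.
    print(station_order(7))
```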
Students in Group One completed the bench exam during the first
administration; Group Two during the second. One of the instructors stayed in the
laboratory during the exam. She collected the examination instruments when all
students had completed the exam.
Administering the Computer Exam
Students in Group One completed the computer exam at the same time that
students in Group Two completed the bench exam, and vice versa. One of the
instructors accompanied students in Group Two to the Library Resource Center for the
computer exam. I reserved the nine microcomputers containing the examination
software for two hours: one hour for the administration, and one-half hour before and
after the exam to attend to administrative details. Before each exam I turned on the
computers and opened the program group for the examination. After all students
arrived with the instructor, I presented each student with a floppy diskette for saving
their responses. Each diskette has the student's identification number clearly visible
on the label, and contains a computer data file specifying the order in which the
examination questions would be presented to the student. I then gave verbal
instructions and a demonstration for inserting the data diskette into the computer's
floppy diskette drive and launching the examination program from the desktop. I also
informed students that I would be available throughout the exam to answer any
questions about the program and attend to any technical problems they might
experience. I gave each student a pencil-and-paper form containing the numbers 1 to
17 for recording any notes to assist them in returning to any questions during the 10
minute review period. I also encouraged them to record their responses on the paper
form in addition to entering them directly into the computer, in case any technical
problems were to arise. As students completed the exam, the program updated their
responses after each screen. When students finished the last question, the program
allotted the student ten minutes to revisit any screen and make changes. After each
group completed the computer exam, I collected their floppy diskettes and shut down
each computer. After administering the exam to Group Two, I removed the software
from each of the nine library computers on which it was installed.
Administering the Student Survey
At the conclusion of the second administration, each student completed the survey
instrument included in Appendix G. The purpose of the survey was to query students
about the two versions of the examination, and gather additional data to explain
differences in scores. After students completed the survey, the instructor conducted a
fifteen-minute structured interview. I took notes in a field notebook as students
responded to her questions. The notes have been retained in case they are needed to
clarify individual responses on the survey.
Scoring the Examination
The course instructors scored both versions of the examination and provided
me with item response data in computer format. Since all of the items contain
multiple parts, it was necessary for them to break the questions into 78 distinct items,
each scored dichotomously.
The pencil-and-paper forms accompanying the bench exam were scored by the
instructor the same way the previous practical examinations had been scored. To
score the computer exam, I provided the instructors with three documents for each
student: their pencil-and-paper notes and responses, a printout of each exam screen
showing their responses, and a printout of the ASCII text file that was generated while
the examination was being completed.
After scoring all exams, the instructors provided me with student scores in two
formats. The first is a printed table displaying each student's examination number,
raw bench and computer scores, percentage computer and bench scores, each
student's composite score, and the mean for both levels of the dependent variable for
both groups. The second format for the scores is a computer ASCII text file
containing examination responses for each student in a table format. Each row of data
in the file constitutes a case in a fixed-field format, with the first 17 cases representing
the bench scores, and the last 17 cases the computer scores. Columns 1 to 4 in the
data file contain each student's identification number. Column 5 contains a
one-character letter designation for the exam version (B or C), and columns 6 to 83
contain the dichotomously scored responses. I imported this ASCII file into a
Microsoft Excel workbook to calculate the item analysis and reliability statistics. I
imported the same data file into SPSS Version 7.0 to conduct the repeated measures
analysis of variance procedure.
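The fixed-field layout described above can be read with a short routine. The following is a minimal sketch, not the import procedure used in Excel or SPSS; the function names parse_record and raw_score are hypothetical, and the sketch assumes that each dichotomous response is stored as the character 0 or 1.

```python
def parse_record(line):
    """Split one fixed-field record into (student ID, exam version, item scores).

    Layout taken from the text: columns 1-4 hold the ID, column 5 holds the
    version letter (B or C), and columns 6-83 hold 78 dichotomous responses.
    The 0/1 character coding is an assumption.
    """
    line = line.rstrip("\r\n")
    student_id = line[0:4]
    version = line[4:5]
    responses = line[5:83]
    if version not in ("B", "C"):
        raise ValueError(f"unexpected exam version: {version!r}")
    if len(responses) != 78 or set(responses) - {"0", "1"}:
        raise ValueError("expected 78 responses coded 0 or 1")
    return student_id, version, [int(ch) for ch in responses]


def raw_score(items):
    """Raw score is the number of items scored correct (1)."""
    return sum(items)


if __name__ == "__main__":
    # Hypothetical record: the response pattern is invented for illustration.
    record = "6921" + "B" + "1" * 68 + "0" * 10
    student_id, version, items = parse_record(record)
    print(student_id, version, raw_score(items))
```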
I designed this study to compare student scores on two versions of a
performance-based laboratory practical examination in medical technology. One
version of the examination is a bench examination administered in the laboratory. The
alternate version is a computer-based simulation of the same items using
high-resolution, digitized color photographs. I selected three experts in the field to rate the
congruence between examination items and course objectives, in order to estimate the
validity of the instrument. The entire first-year class of the Medical Technology
program at the University of Colorado Health Sciences Center participated. The
study utilized a three-factor, split-plot repeated measures design. The dependent
variable is the examination score. The independent variables are type of examination,
gender, and order. The seven null hypotheses tested by the repeated measures split-plot
design were introduced in this chapter. They include three main effects, three
two-way interactions, and one three-way interaction.
After completing both versions of the examination, students were
administered a brief survey assessing their concerns about the examination. I collected
these data as an additional means for interpreting the results of the analysis of
variance and item analysis data. I report the results in the next chapter.

I collected a variety of data in this study, which are reported in this chapter. I
first report the results of the content validity estimation study. Then I report each
student's scores for both versions of the practical examination. Next, I report the results
for the repeated measures analysis of variance test. I then describe the model I
developed for conducting an item analysis, and report the findings. I conclude this
chapter by presenting the results of the survey I gave to students after they completed
both versions of the exam.
Estimating the Content Validity of the Measure
Researchers have made numerous suggestions for summarizing the matching
data collected during a content validation study. Although Crocker and Algina (1986,
p. 221) consider the degree to which test items are representative of a given domain as
more a qualitative than a quantitative decision, they list the following five indices as
appropriate methods for summarizing ratings:
1. Percent of items matched to objectives.
2. Percent of items matched to objectives with high importance ratings.
3. Correlation between the importance weighting of objectives and the number of
items measuring those objectives.
4. Index of item-objective congruence.
5. Percentage of objectives not assessed by any items on the test.
After receiving a completed rating form from each of the subject matter experts, I
entered the responses into a Microsoft Excel workbook file comprised of two
worksheets. The first worksheet contains the rating data in a format that is displayed
as one printed page. The worksheet consists of a series of columns representing
laboratory objectives 2, 6, 7a, 7b, 7c, 7d, 7e, 7f, 8, and 10 for each question rated by
the experts. The worksheet contains column series for three questions. The row
data in the worksheet consists of the rating data provided by each of the three experts.
No summary statistics appear on this worksheet; its primary purpose is to present the
rating data in an easy to read format. This worksheet is shown in Appendix H.
The second worksheet contains linking formulas to the first worksheet that display
the data for each expert's ratings. The column data are the individual ratings for
each laboratory objective, and the row data comprise a data series for each
examination item. The leftmost column contains the examination item number. The
last column contains mean values for each expert's rating for the item. The first three
rows of data in each series display each expert's ratings. The fourth row contains the
mean item rating for each item, by objective. The fifth row in each data series
contains the percentage of experts who did not consider the objective to match the
question. The experts' responses are summarized in Appendix I. Of the methods
listed for assessing content validity, in this case it is best to analyze individual items
on an objective-by-objective basis, since each item was compared with multiple
objectives.
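The two summary rows described above (the mean rating and the percentage of experts who did not consider the objective to match the item) can be illustrated with a short sketch. This is not the worksheet formula used in the study. The function name summarize_objective is hypothetical, the ratings shown are invented, and the sketch assumes that a rating of 4 or 5 on the five-point scale counts as a non-match.

```python
from statistics import mean


def summarize_objective(ratings, no_match_from=4):
    """Summarize expert ratings of one item against one objective.

    ratings: one rating per expert on the 1-5 scale (1 = strongly meets).
    no_match_from: lowest rating treated as "does not match" (assumed to be 4).
    """
    non_matches = sum(1 for r in ratings if r >= no_match_from)
    return {
        "mean_rating": round(mean(ratings), 2),
        "percent_no_match": round(100 * non_matches / len(ratings), 1),
    }


if __name__ == "__main__":
    # Hypothetical ratings from three experts for a single item and objective.
    print(summarize_objective([1, 2, 4]))
```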
Student Scores
I calculated student scores for both versions of the practical examination using the
ASCII data file described in Chapter 3. To calculate the scores, I first imported the
ASCII data into a Microsoft Excel workbook. This workbook file contains seven
separate worksheets that display the examination data in a variety of formats. Four of
the worksheets in this workbook were developed to calculate the item analysis
statistics that are described later in this chapter. Two worksheets in this file contain
the raw scores that were imported into SPSS Version 7.0 for calculating the Analysis
of Variance F statistic. The last worksheet in the workbook file contains each
student's score on both versions of the exam, calculated from the dichotomous values.
Student scores are summarized in Table 4.1.

Table 4.1
ID      Group    Gender    Computer    Bench
6921    1        F         61          68
6932    1        F         60          72
6943    1        F         67          75
6954    1        F         66          71
6965    1        F         65          73
6976    1        F         49          56
6987    1        M         67          73
6998    1        M         44          56
6902    1        F         59          69
6913    2        M         73          71
6924    2        F         68          71
6935    2        F         68          71
6946    2        M         62          69
6957    2        M         67          74
6968    2        F         73          72
6979    2        F         71          69
6988    2        F         71          75
A total of 78 points is possible on both examination versions. With the
exception of three students, scores were higher on the bench version of the exam
(overall mean for the bench score, 69.71; overall mean for the computer score, 64.18).
The Pearson product-moment correlation coefficient between the scores for the entire
class is 0.87. The scores are represented graphically in Figure 4.1.
Figure 4.1
Student Scores: Graphical Representation
[Figure: Comparison of Computer and Bench Scores]
The mean scores between-groups varied as well. Overall, scores for students in Group
One were lower than those of students in Group Two for both examinations.
Comparing Mean Scores: Analysis of Variance
As explained in Chapter 3, I selected a repeated measures factorial analysis of
variance procedure for this study. This procedure was selected to counter a possible
carry-over effect between administrations of the exam, explore gender differences,
and increase the statistical power of the procedure, which is minimized due to the
small number of subjects. In this study, the between-groups variance exceeded the
within-groups variance. Table 4.2 displays the descriptive statistics for the eight cells
represented by the research design introduced in Chapter 3 (and graphically depicted
in Table 3.1).
Table 4.2
                  Order    Gender    Mean     SD       N
Bench Score       C/B      F         69.14    6.26     7
                           M         64.50    12.02    2
                           Total     68.11    7.18     9
                  B/C      F         71.60    2.19     5
                           M         71.33    2.52     3
                           Total     71.50    2.14     8
                  Total    F         70.16    4.97     12
                           M         68.60    7.30     5
                           Total     69.71    5.55     17
Computer Score    C/B      F         61.00    6.14     7
                           M         55.50    16.26    2
                           Total     59.78    8.20     9
                  B/C      F         70.20    2.17     5
                           M         67.33    5.51     3
                           Total     69.13    3.68     8
                  Total    F         64.83    6.69     12
                           M         62.60    11.10    5
                           Total     64.18    7.92     17
Note. C/B = Computer/Bench; B/C = Bench/Computer; F = Female; M = Male; SD =
Standard Deviation; N = Number of Students.

The score distributions have a number of interesting characteristics. First, the
mean composite score (defined as the sum of the bench and computer score for each
student) is higher for Group Two than for Group One (140.6 and 127.9 respectively).
Second, the mean scores on both versions of the exam are higher for Group Two than
for Group One. The mean bench scores are 71.5 for Group Two and 68.1 for Group
One. The mean computer scores are 69.1 for Group Two and 59.8 for Group One.
Third, the within-groups range of scores is less for Group Two than for Group One
for both exams. The maximum bench score for Group Two is 75, and the minimum bench
score is 69, with a range of 6. The maximum computer score for Group Two is 73, and the
minimum computer score is 62, with a range of 11. The maximum and minimum
bench scores for Group One are 75 and 56, with a range of 19. The maximum and
minimum computer scores for Group One are 67 and 44, with a range of 23. Finally, the
range of scores across groups is greater for the computer scores than for the
bench scores. The maximum and minimum computer scores are 73 and 44, with a
range of 29, whereas the maximum and minimum bench scores are 75 and 56, with a
range of 19. Figure 4.2 represents the score distributions graphically.

Figure 4.2
Graphic Representation of the Variability Between Overall Group Scores (Examination
Version by Group)
[Figure: Mean Scores: Exam Version by Group. Axis labels: Exam Version (Computer Score, Bench Score).]
When the within-groups and between-groups differences are this pronounced, one
would expect to see significant interactions produced by the analysis of variance. To
test the null hypotheses stated earlier, I selected the repeated measures procedure
under the general linear model (GLM) available in the Advanced Statistics module in
SPSS Version 7.0 for Windows 95 (SPSS Inc., p. 36). Prior to calculating the F
values for each hypothesis, I imported the raw dichotomous score data into an SPSS
data file from the Microsoft Excel workbook file containing these data. I then defined
variable and value labels in the SPSS data file consistent with the data source. Since
the number of subjects in the cells is unequal, this is an unbalanced (nonorthogonal)
model. Table 4.3 lists the repeated measures analysis of variance results.

Table 4.3
Repeated Measures Analysis of Variance Results
Source              df          SS          MS          F
Between Subjects
  O                  1      390.89      390.89       5.55*
  G                  1       74.93       74.93       1.06
  O x G              1       20.89       20.89        .29
  error             13    (915.16)     (70.40)
Within Subjects
  TE                 1      216.03      216.03      42.53**
  TE x O             1       58.62       58.62      11.54**
  TE x G             1        5.08        5.08       1.00
  TE x O x G         1        1.29        1.29        .25
  error (TE)        13     (66.03)      (5.08)
Note. Values enclosed in parentheses represent mean square or sums of squares error.
TE = Type of exam (bench or computer); G = Gender; O = Order; df = degrees of
freedom. *p < .05. **p < .01. All values were calculated with alpha = .05.
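Each F value in Table 4.3 is the effect mean square divided by the corresponding error mean square. The sketch below reproduces that arithmetic from the sums of squares and degrees of freedom printed in the table. The dictionary and function names are hypothetical, and the sketch does not recompute the p values, which would require the F distribution.

```python
# Sums of squares and degrees of freedom copied from Table 4.3: (SS, df).
between_effects = {"O": (390.89, 1), "G": (74.93, 1), "O x G": (20.89, 1)}
between_error = (915.16, 13)

within_effects = {
    "TE": (216.03, 1),
    "TE x O": (58.62, 1),
    "TE x G": (5.08, 1),
    "TE x O x G": (1.29, 1),
}
within_error = (66.03, 13)


def f_ratios(effects, error):
    """Return F = MS_effect / MS_error for each effect."""
    ms_error = error[0] / error[1]
    return {name: (ss / df) / ms_error for name, (ss, df) in effects.items()}


between_f = f_ratios(between_effects, between_error)
within_f = f_ratios(within_effects, within_error)

for name, f_value in {**between_f, **within_f}.items():
    print(f"{name:12s} F = {f_value:.2f}")
```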
By default, SPSS uses the Levene test to check the homogeneity of variance of
residuals assumption that accompanies the univariate repeated measures analysis of
variance. In mixed designs utilizing more than one within-subjects factor, the
assumption that the variance-covariance matrices are equal across the cells formed by
the between-subjects effect is tested using Box's M test. For this analysis, the
Box's M F value reported is not applicable, since I used only one within-subjects
factor. The F value for the Levene test is statistically significant at the 0.05 alpha level
for the bench score (F = 5.66, p = .031), but not for the computer score (F = 2.76,
p = .116). The homogeneity assumption has therefore been violated for the bench score;
the implications of this violation are discussed in Chapter 5.
The ANOVA results show a significant between-subjects main effect for the
order factor (F = 5.55, p = .035), a significant within-subjects main effect for type of
exam (F = 42.53, p = .00), and a significant interaction effect between type of exam and
order (F = 11.54, p = .005). The main and interaction effects are displayed graphically in
Figure 4.2. Three of the seven hypotheses listed in Chapter 3 are rejected:
1. There is no main effect for order.
2. There is no main effect for type of exam.
3. There is no interaction between type of exam and order.
There is a statistically significant difference for both the mode in which the
examination was administered, and the order in which students completed the exams.
There were no significant between-groups or within-groups differences for gender.

Estimating Examination Reliability: Conducting an Item Analysis
As noted in Chapter 2, an examination can be considered useful for
assessment only if it is both reliable and valid. I presented evidence for the validity of
the laboratory practical examination earlier in this chapter. I now describe the
procedures I used to calculate its reliability.
Central to the methods used to determine the reliability of a measure is the
discrimination index of individual exam items. This is true because the overall
reliability of a measure depends upon how well individual examination items are able
to distinguish between those students who know the material and those who do
not. Therefore, both item analysis and reliability data must be considered in
interpreting the reliability coefficient.
Works devoted to testing and measurement usually describe procedures for
computing a reliability coefficient by hand. In addition, computer programs such as
SPSS usually include procedures for calculating reliability and inter-item correlation
coefficients. For this study, I developed a computer model for computing reliability
coefficients and individual item discrimination indices. I used the following
procedures to create the model:3
1. Order the exam scores from highest to lowest.
2. Multiply the number of cases by .27 and round the result to the nearest whole
number. This number is represented by n.
3. Count off the top n scores. This is the high-scoring group.
4. Count off the n lowest scores. This is the low-scoring group.
5. Determine the proportion in the high group ($p_H$) answering each individual
item correctly, using the following formula:
$p_H = \dfrac{\text{Number of correct responses to the item in the high group}}{n}$ (4.1)
Repeat the procedure for the low-scoring group to obtain $p_L$ for each exam item.
6. Calculate the estimated difficulty index (p) by adding $p_H$ to $p_L$ and dividing the
resulting sum by 2: $p = (p_H + p_L)/2$.
7. Subtract $p_L$ from $p_H$ to obtain each item's index of discrimination:
$D = p_H - p_L$ (4.2)
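The seven steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Excel model used in the study: the function names and the six-student data set are hypothetical, the 27 percent cutoff is rounded half up, and students with tied total scores are kept in their original order.

```python
def group_size(n_cases, fraction=0.27):
    """Step 2: size of each extreme group, rounded half up (at least 1)."""
    return max(1, int(n_cases * fraction + 0.5))


def item_analysis(responses, fraction=0.27):
    """Return one dict per item with p_high, p_low, p, and D.

    `responses` holds one list per student; each list contains the
    dichotomous item scores (1 = correct, 0 = incorrect).
    """
    n = group_size(len(responses), fraction)
    # Step 1: order students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    # Steps 3 and 4: the n highest and the n lowest scorers.
    high, low = ranked[:n], ranked[-n:]
    results = []
    for item in range(len(responses[0])):
        # Step 5: proportion answering the item correctly in each group.
        p_high = sum(student[item] for student in high) / n
        p_low = sum(student[item] for student in low) / n
        results.append({
            "p_high": p_high,
            "p_low": p_low,
            "p": (p_high + p_low) / 2,  # Step 6: estimated difficulty index
            "D": p_high - p_low,        # Step 7: discrimination index
        })
    return results


# Hypothetical responses for six students on three items (not study data).
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
stats = item_analysis(example)
for number, item in enumerate(stats, start=1):
    print(number, item)
```

With 17 students, `group_size(17)` gives 5, the group size used in the worksheets described in the next section.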
Designing the Item Analysis Models
I developed a separate item analysis model for each version of the
examination. The rationale for two separate models is based upon the small number
of students participating in this study; developing four models (a unique bench and
computer item analysis model for each group) would further reduce the number of
cases for each model, making each item discrimination index potentially more
difficult to interpret, due to the heightened effect of single score variances upon p and
D.
I constructed each model from the 78 dichotomous item responses for each
student. The left-most column in each worksheet contains each student's four-digit
identification number. The second column contains a letter designation for gender
(M, F). Columns 3 through 80 contain the response values, and are labeled according
to the item number. Column 81 contains a SUM function to total the dichotomous
values to calculate each student's score.
Row 25 repeats the column headings for the item responses. Beginning with
row 27, a series of formulas calculates the following values: the percentage of
high-scorers correctly identifying an item, the percentage of low-scorers correctly
identifying an item, the item difficulty index (p), the item discrimination index (D),
the sum of correct responses, the sum of incorrect responses, the item variance, the
item mean, and the item standard deviation. The value 5 was used as n in each $p_H$ and $p_L$
formula (17 subjects times 0.27, rounded to a whole number, equals 5). Once I created
the initial formulas in the range C26:C35, I used the automatic replication feature in
Excel to copy these formulas across each worksheet.4
The worksheets also include formulas to calculate summary statistics for each
exam version. Column 81 of each worksheet reports the following statistics: exam
mean, exam standard deviation, exam variance, exam standard error of measurement,
and Cronbach's alpha (the internal-consistency reliability coefficient). These statistics
for both exams are summarized in Table 4.4. The Excel worksheet containing the
item analysis data for the bench examination is included in Appendix J. Item analysis
statistics for the computer exam are listed in Appendix K.
Table 4.4
Practical Examination Summary Statistics
Summary Statistics           Bench     Computer
Exam mean                    70.56        64.18
Exam standard deviation       4.29         7.68
Exam variance                18.37        58.97
Exam standard error           2.29         2.45
Cronbach's alpha            0.7157       0.8983
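For readers who wish to reproduce a coefficient such as the Cronbach's alpha reported in Table 4.4, the following Python sketch applies the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), to a matrix of dichotomous responses. The exact worksheet formulas are not reproduced in this chapter, so the use of population variances and the function name `cronbach_alpha` are assumptions, and the four-student data set is hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-student item-score lists.

    Uses population variances for both the items and the total scores.
    The total-score variance must be greater than zero.
    """
    k = len(responses[0])  # number of items
    item_variances = [
        pvariance([student[item] for student in responses])
        for item in range(k)
    ]
    total_variance = pvariance([sum(student) for student in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses for four students on three items (not study data).
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(example)
print(round(alpha, 4))
```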
Interpreting the Item Analysis Data
The internal-consistency reliability coefficient for each exam can be
interpreted as follows: 71 percent of the observed score variance is attributable to
the true score variance of these examinees for the bench version of the exam,
compared to 89 percent for the computer exam. In addition, the correlation between
observed and true scores is .85 for the bench exam and .95 for the computer exam.
The reader should remember that these values are only estimates.
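The values quoted above are consistent with two standard classical test theory relations: the correlation between observed and true scores is the square root of the reliability coefficient, and the standard error of measurement is the standard deviation multiplied by the square root of one minus the reliability. The sketch below applies both relations to the figures in Table 4.4; the function and dictionary names are hypothetical.

```python
import math

# Standard deviation and Cronbach's alpha copied from Table 4.4.
summary = {
    "bench": {"sd": 4.29, "alpha": 0.7157},
    "computer": {"sd": 7.68, "alpha": 0.8983},
}


def reliability_index(alpha):
    """Estimated correlation between observed and true scores."""
    return math.sqrt(alpha)


def standard_error_of_measurement(sd, alpha):
    """Estimated standard error of measurement for an exam."""
    return sd * math.sqrt(1 - alpha)


for exam, values in summary.items():
    index = reliability_index(values["alpha"])
    sem = standard_error_of_measurement(values["sd"], values["alpha"])
    print(f"{exam}: index = {index:.2f}, SEM = {sem:.2f}")
```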
The D (discrimination index) and p (difficulty index) values resulting from an
item analysis provide the test designer with information about the ability of an item to
discriminate between students. Items that are very easy or very difficult fail to
discriminate, and as noted by Hopkins, et al. (1990, p. 270), the potential
measurement value of an item is highest when its difficulty index is 0.5. The more
crucial issue, however, is the degree to which the item's discrimination index
indicates that the item does indeed discriminate between test takers. Crocker and
Algina (1986, p. 315) suggest using Ebel's guidelines for interpreting D values:
1. If D ≥ .40, the item discriminates satisfactorily.
2. If .30 ≤ D ≤ .39, little or no revision is required.
3. If .20 ≤ D ≤ .29, the item is marginal and needs revision.
4. If D ≤ .19, the item should be eliminated or completely revised.
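The four guidelines above can be expressed as a simple classification function. In this Python sketch the category labels paraphrase the guidelines, the function name is hypothetical, and the cut points are treated as continuous thresholds (for example, any value from .30 up to but not including .40 falls in the second category), which is an assumption about how values between the printed bounds should be handled.

```python
def ebel_category(d):
    """Classify a discrimination index D using Ebel's guidelines."""
    if d >= 0.40:
        return "satisfactory"
    if d >= 0.30:
        return "little or no revision required"
    if d >= 0.20:
        return "marginal; needs revision"
    return "eliminate or completely revise"


# Hypothetical D values spanning the four categories.
for d in (0.6, 0.35, 0.2, 0.0, -0.2):
    print(d, ebel_category(d))
```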
Table 4.5 lists the number of D values at levels ranging from less than 0 to 1 on both
versions of the examination.

Table 4.5
Number of Discrimination Indices From Less than Zero to One
Discrimination Index    Bench    Computer
D < 0                       5           0
D = .0                     36          29
D = .2                     20          29
D = .4                     14          10
D = .6                      3           7
D = .8                      0           2
D = 1                       0           1
The bench version of the examination has a greater percentage of negative or
zero D values, indicating that items on the computer exam were better at
discriminating between high- and low-scoring students. An additional consideration is
important when interpreting the discrimination index for criterion-referenced
measures. Items on mastery or criterion-referenced tests need not discriminate in
order to serve their purpose (Hopkins, et al., 1990, p. 268). Although I argued in
Chapter 3 that the items on this examination are above the knowledge level of
Bloom's taxonomy, as an assessment instrument in the microbiology curriculum, it
would be preferable for all students to obtain a high score on the exam, resulting in
small D values. By the time students sit for the Board of Registry examination, they
ought to be able to score highly on a measure that assesses the degree to which critical
analyses can be made. Therefore, one can argue that a greater percentage of zero D
values could be a desirable trait.
To conclude the discussion of reliability, I offer additional evidence for the
overall validity and reliability of the bench exam. For the last three years, students
enrolled in the program who sat for the Board of Registry Examination have passed,
with mean scores above the national average on the Microbiology subtest. These
statistics, available in an unpublished report from the Board of Registry, are
summarized in Table 4.6.
Table 4.6
Board of Registry Computer Adapted Test Statistics
Date of Examination       Program Scores    Population Scores
1993 Part B                          569                  491
1994 Part B                          651                  496
1994 Part F Bacteria                 612                  511
1994 Part F Micro.                   714                  503
1995 Part B                          635                  497
1995 Part F Bacteria                 730                  512
1995 Part F Micro.                   598                  502
Note. Program Scores = the mean scaled scores for HSC students; Population
Scores = mean scaled scores for the appropriate comparable population. These are
unpublished statistics included in a report sent directly to the instructors at the
University of Colorado Health Sciences Center Medical Technology School.
Without access to individual scores, there is no way to determine the
predictive validity of the practical examinations given throughout the year. The fact
that students enrolled in this medical technology program have scored so well on the
national examination provides additional evidence that the testing procedures are
reliable and valid.
Post-Examination Survey Results
I developed the post-examination survey for the purpose of collecting some
qualitative data to assist in interpreting the quantitative results of the repeated
measures analysis of variance. All but one student completed the post-examination
survey at the conclusion of the second administration of the practical examination.
The survey instrument, included in Appendix G, contains 6 Likert-scale items, and 3
open-ended questions. The mean Likert-scale responses are summarized in Table 4.7.

Table 4.7
Mean Likert-Scale Survey Responses
Item          Mean Response
Question 1    2.9
Question 2    1.8
Question 3    1.4
Question 4    1.8
Question 5    2.8
Question 6    2.0
Note. The following Likert-scale categories were used: 1. Definitely True; 2. True; 3.
No Opinion; 4. Not True; 5. Definitely Not True.
The three open-ended questions ask students to report about the computer
version of the examination. Question 1 regards potential distractions while taking the
examination in either form. Question 2 asks students about how the examinations
were administered. Question 3 asks the students whether the inability to handle
specimens during the computer version of the examination is a significant issue.
Student responses to these questions can be grouped into three categories:
environmental issues, interface issues, and fidelity issues.
Environmental Issues
There are 6 comments regarding the level of noise in the library during the
computer examination. The library computers are located in a relatively high-traffic
area of the library. In addition, the only printers attached to the student-use computers
are dot-matrix printers, and during both administrations of the computer exams, I
noticed the printer noise.
Interface Issues
The surveys include 11 comments concerning the timer function in the
computer version of the exam. Specifically, students would like the computer exam to
be self-paced. Also related to the computer program interface are two general
comments about the computer images, without reference to the inability to make a
critical diagnosis. These comments concern issues such as image size and color
quality. Finally, 2 comments address the issue of having to type responses on the
computer screen.
Fidelity Issues
Students made a total of 18 comments regarding the fidelity of the
photographic images included in the computer exam. Six comments concern
the photographs of agar plates showing hemolysis. Five comments address the
photographs of biochemical tubes, and the difficulty of distinguishing either a sodium
chloride reaction, or the presence of hydrogen sulfide gas in a biochemical tube. Five
comments address the role smell plays in making a diagnosis, and two comments
concern the importance of being able to hold a plate up to the light to make a
diagnosis. All survey responses are included in Appendix K.

In this chapter I present the results of the repeated measures design described
in Chapter 3. I first summarize the content validity estimation data according to the
mean rating for each item on the practical examination, and conclude that there are
sufficient data to consider the laboratory practical examination used in this study a
valid measure. To compare student responses on both versions of the examination, I
used a repeated measures analysis of variance procedure with one dependent variable
and three independent variables, each with two levels. I discovered two significant
main effects and one significant interaction, and therefore must reject the following
null hypotheses: that there is no main effect for order, that there is no main effect for
type of examination, and that there is no interaction effect between type of
examination and order. I developed two models in Microsoft Excel for conducting an
item analysis for both versions of the examination, and present the reliability
coefficient, item discrimination indexes, and item difficulty indexes for both exams. I
conclude this chapter by presenting the results of the post-examination survey that
was completed by all but one of the students who participated in this study. The
results indicate that students were primarily concerned with issues relating to the
testing environment, the computer examination interface, and the fidelity of the
color photographs displayed in
the computer exam to tests and procedures encountered on the laboratory bench. In
the next chapter I discuss the results reported here.

The data presented in this study are a continuation of two previous studies I
conducted about using computers for assessment at the Medical Technology School
of the University of Colorado Health Sciences Center. The main purpose of this study
is to determine whether student performance on the computer examination is at a
sufficient level to consider the bench and computer examinations equivalent. This
study makes a valuable contribution to the research base concerning the use of
computers for administering teacher-constructed measures. In this chapter I will
interpret the results of this study and discuss implications for further research. I begin
by discussing the content-validity data. I then interpret the results of the repeated
measures analysis of variance, and explain the observed differences by referring to
data from both the item analysis study and the post-examination survey. I conclude
this chapter with recommendations for improving the computer version, and
suggestions for further research.
The Content Validation Study
As noted in Chapter 2, there is an apparent discrepancy between the recent
emphasis upon all forms of validity as a subset of construct validity (Messick, 1989)
and the designation of varying kinds of validity appropriate for different measures
(Hopkins, et al., 1990; AERA, APA, and NCME, 1985). This discrepancy can be resolved by
focusing upon content validity as "...the family of test construction and validation
procedures pertaining to measurement of the underlying domain" (Sireci, 1995, p.
26). For this study, I sought to emphasize the agreement of individual test items with
the greatest number of laboratory objectives as evidence of content representation. As
noted in Chapter 3, the instrument used by subject matter experts to rate the items by
objective was revised from an instrument used in a pilot study. Although certain
objectives clearly do not apply to this measure (e.g., Care and use of the microscope),
the ratings demonstrate that additional objectives are also not appropriate to this test. For
example, the results listed in Appendices H and I indicate that laboratory objective 7e
(Perform susceptibility tests and interpret the results) was not applicable to any item,
with the exception of item 8, where two of the three experts indicated that the
objective does indeed apply. The ambivalence is understandable upon examination of
the test item. Students do not perform the tests; they merely interpret them, and
interpreting the tests is only one component of the identification. Therefore, two of
the three experts rated this objective as applicable to the item since one component of
the objective applies.
The experts also indicated that objective 8 (Given any culture, the student will
be able to perform and interpret any appropriate staining procedure) clearly does not
apply to five of the exam items (items 2, 5, 7, 10, and 11), because they have nothing
to do with staining. Of the items where the objective was deemed applicable by one or
more of the experts (items 1, 3, 4, 6, 8, 9, 12, 14, 15, 16, and 17), identification of the
correct response involves the interpretation of a staining procedure, but not the
procedure of staining the culture. Thus, the same ambivalence as with objective 7e is
evidenced here.
Other characteristics of the item-content matching task should be noted. First,
as is seen in Table 5.1, there are a total of 170 rating tasks in the content validity form
(17 examination items times 10 laboratory objectives). Of these tasks, Expert 1 gave
the highest number of level 1 ratings (94), while Expert 3 gave the highest number of
Not Applicable ratings (71). Expert 2 tended to rate items slightly lower than
Expert 1, as evidenced by the frequency of level 2 ratings (90). Second, across all
three experts, 153 of the 510 cumulative rating tasks, or 30 percent, were rated as
having the highest level of agreement with the stated objectives, and 135, or 26.47
percent, were rated as having a high degree of agreement with the laboratory objectives.
Thus, 288 (56.47 percent) of the rating tasks were rated as having a level 1 or 2
congruence with the course objectives.

Table 5.1
Summary Frequencies: Content-Validity Ratings By Expert
                               Rating Scale
               1         2        3        4        5         N     Total
Expert 1      94        24        9        0        0        43       170
Expert 2      13        90       15        1        3        48       170
Expert 3      46        21       13        9       10        71       170
Total        153       135       37       10       13       162       510
          30.00%    26.47%    7.25%    1.96%    2.55%    31.76%   100.00%
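The totals and percentages in Table 5.1 can be checked with a short tally. The sketch below sums the three experts' rating frequencies by column and expresses each column total as a percentage of the 510 rating tasks; the variable names are hypothetical.

```python
# Rating frequencies copied from Table 5.1 (columns: 1, 2, 3, 4, 5, N).
labels = ["1", "2", "3", "4", "5", "N"]
ratings = {
    "Expert 1": [94, 24, 9, 0, 0, 43],
    "Expert 2": [13, 90, 15, 1, 3, 48],
    "Expert 3": [46, 21, 13, 9, 10, 71],
}

# Sum each rating category across the three experts.
column_totals = [sum(column) for column in zip(*ratings.values())]
grand_total = sum(column_totals)
percentages = [round(100 * total / grand_total, 2) for total in column_totals]

# Rating tasks judged to have level 1 or level 2 congruence.
level_1_or_2 = column_totals[0] + column_totals[1]
level_1_or_2_percent = round(100 * level_1_or_2 / grand_total, 2)

for label, total, percent in zip(labels, column_totals, percentages):
    print(f"{label}: {total} ({percent:.2f}%)")
print(f"Level 1 or 2: {level_1_or_2} ({level_1_or_2_percent:.2f}%)")
```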
Finally, it should be noted that there are no items on the examination that do
not match at least four of the ten objectives (40 percent), and a majority of
the items on the examination match at least 50 percent of the course objectives. In
conclusion, using the quantitative index of the percentage of items matched to
objectives recommended by Crocker and Algina (1986, p. 221), this examination is
estimated to possess a high degree of content validity. I make this estimation
recognizing current conceptions of validity, and with full awareness that content validity is
only one of multiple lines of evidence for estimating validity.

Interpreting the Results of the Repeated Measures Analysis of Variance:
Assumptions, Violations, and Explanations
The null hypothesis of primary interest in this study is that there is no
significant difference in test scores between the two groups of students for each
version of the examination. From the results presented in Chapter 4, it is clear that
statistically significant differences exist both within groups and between groups,
leading me to reject three null hypotheses: that there is no main effect for order, that
there is no main effect for type of examination, and that there is no interaction effect
between type of examination and order. In this section I explain the potential sources
of within-groups and between-groups variance.
First, I acknowledge the effect that the small number of students has upon the
results reported here. I selected a 2 x 2 x 2 repeated measures ANOVA design with
order of examination and gender as factors in order to increase statistical power. From the
ANOVA results presented in Table 4.3, the within-subjects effect is statistically
significant at the 0.01 level for both exam scores (type of exam) and the scores-by-order
interaction. Between subjects, the order in which the two versions of the examination
were taken was significant at the 0.05 level. There are three possible explanations for
these differences. First, the between-groups variance may be due to unobservable and
random factors, not the order factor. Second, the between-groups variance may be due
to the order factor, and specifically, differences between the students' exposure to the
two modes of test administration. Third, the differences within groups are most likely
due to differences in the two examinations, such as the quality of some of the
photographic images. As I explain below, this conclusion is supported by both the
item analysis data and the post-examination survey responses. First, however, I
discuss the assumptions accompanying the analysis of variance.
Assumptions and Violation of Assumptions for the ANOVA Model
In general, the analysis of variance procedure makes three assumptions
concerning the underlying data (Lomax, 1992, pp. 107-111). First, residual errors are
assumed to be random and independent errors. This assumption is assessed by
analyzing residual plots by group, and random assignment of subjects to groups can
be used to achieve independence. Second, the distributions of the residual errors for
each group are assumed to have a constant variance (homoscedasticity). A number of
statistical tests are available for assessing this assumption, but only certain tests apply
to an unequal-n design. If the violation of this assumption is serious, data
transformations, such as power transformations, can be used (Norusis, 1992, pp.
180-181). Third, it is assumed that the conditional distribution of the residual errors is
normal in shape. It should be noted that violations of these assumptions are less of a
consideration with equal-n and large-n designs.
The repeated measures split-plot analysis of variance retains the three
assumptions listed above for the between-subjects factors, and makes an additional
assumption concerning the within-subjects factor: the homogeneity of covariance
(compound symmetry).
The data used in this study failed to meet the assumption of homogeneity of
variance across groups, as indicated by a p value of .016 for the computer score
groups on the Levene test. The Levene test is appropriate for testing the variance of a
single metric variable across groups, whereas Box's M test is appropriate in multivariate
designs (Hair, et al., p. 68).
An alternative strategy for analyzing homoscedasticity is to compare observed
versus expected residual plots and normal probability plots of standardized residuals.
In the first case, if a pattern is lacking in the observed versus predicted residuals plot,
homoscedasticity can be assumed. In the second case, if the points in the normal
probability plot generally fall along the normal line, the assumption of normality is met
(Norusis, 1994, p. 43).
Using the Plot subcommand in SPSS 7.0, I created the recommended plots,
which are included in Appendix M. Although the Levene test indicated a significant
difference across groups for the computer score (second level of the repeated
measure), a significant departure from normality was not apparent. This is most likely
due to the small number of cases, which makes visual detection of a pattern difficult.
However, the normal probability plot for standardized residuals did not show an
extreme departure from normality.