The technical feasibility of using large-scale measures of opportunity-to-learn to inform policy and practice

Material Information

The technical feasibility of using large-scale measures of opportunity-to-learn to inform policy and practice: illustrations using the Colorado TIMSS
Snow-Renner, Ravay Lynn
Publication Date:
Physical Description:
178 leaves : ; 28 cm


Subjects / Keywords:
Educational equalization -- Colorado ( lcsh )
Academic achievement -- Evaluation -- Colorado ( lcsh )
Academic achievement -- Evaluation ( fast )
Educational equalization ( fast )
Colorado ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 171-178).
General Note:
School of Education and Human Development
Statement of Responsibility:
by Ravay Lynn Snow-Renner.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
45545383 ( OCLC )
LD1190.E3 2000d .S56 ( lcc )

Full Text
Ravay Lynn Snow-Renner
B.A., College of William and Mary, 1986
M.A., University of Colorado, 1994
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Educational Leadership and Innovation

© 2000 by Ravay Lynn Snow-Renner
All rights reserved.

This thesis for the Doctor of Philosophy
degree by
Ravay Lynn Snow-Renner
has been approved

Snow-Renner, Ravay Lynn (Ph.D., Educational Leadership and Innovation)
The technical feasibility of using large-scale measures of opportunity-to-learn to inform
policy and practice: Illustrations using the Colorado TIMSS
Thesis directed by Assistant Professor Nancy M. Sanders
The study explores the feasibility of large-scale measures of opportunity-to-learn
(OTL) for informing policy and practice at state and local levels, using the TIMSS as an
exemplary measure of OTL with linked student achievement data. The extent and nature of
the relationship between TIMSS measures of classroom mean mathematics achievement on
content-specific subscales and content-specific OTL is examined through exploring their
correlations. To make these comparisons, it was first necessary to conduct a series of
intermediate data manipulations and analyses. Content-specific measures of mathematics
achievement were created for use at the classroom level, and their technical qualities
(reliability, validity) were examined through a series of analyses. Building on a model of
OTL that addresses content coverage and instructional practices, classroom-specific
indicators of OTL were explored, described, and mapped to specific mathematics content.
Correlations between content-specific achievement scales and OTL indicate small
relationships that vary by grade level and the specific content measured.
The overall finding is that the relationship between OTL and achievement is
complex and not easily captured by large-scale measures. Ideally, straightforward and
strong relationships between OTL and achievement measures might be used to ensure
educational equity, as well as to inform policies that attempt to link classroom practices with
student outcomes. However, the conceptions and judgments of researchers and policy
makers about content domains, specifications, item format, and definitions of OTL and
achievement determine and delimit the nature of the relationships that can be described
using large-scale measures. How achievement and OTL are defined bounds the possible
findings. These findings indicate the importance of developing focused and technically
sound measures of OTL and achievement, rather than using existing assessments and
prototypical measures. Within complex databases, the researcher's choices also play into
findings, and the author provides a series of decision points from her work with the TIMSS
illustrating the complexity of the choice and research process.
This abstract accurately represents the content of the candidate's thesis. I recommend its publication.

I dedicate this dissertation to Jon, my husband, who has been unfailingly patient, loving, and
supportive of me during this entire process. I also wish to dedicate it to my parents, who
were the family educational trailblazers of their generation.

I would not have been able to complete this dissertation without the assistance,
supervision, and mentorship of my advisor, Nancy Sanders. She has been invaluable in
providing feedback, support, and intellectual stimulation throughout the research and writing
process. I would also like to acknowledge the contributions of other individuals and entities
who supported me during this study:
Fran Berry and Mattye Pollard-Cole, who provided me with invaluable technical
assistance and content area expertise;
Shep Roey, of WESTAT, Inc., who assisted me initially with the logistics of
accessing and manipulating the Colorado data files;
the National Center for Education Statistics (NCES), which sponsored my training
on the use of the International TIMSS database in 1997;
the UC-Denver Education faculty, who awarded me a Graduate Fellowship during
the same year; and
the National Science Foundation, which provided partial support for this study.
While Nancy was the individual who helped me finish the process, I would like to
acknowledge the person who helped me start this process, my brilliant friend, Renee. When
we were in middle school, I took algebra without even thinking about it, while Renee, who
came from the trailer park, had to fight the administration in order to take math. She was
told by the counselor that she should take Home Economics classes instead. The anger that
I felt at this unfairness was the impetus for beginning doctoral study and focusing on issues
of student opportunity.
The work reported in this study was partially supported by grant number REC
9905548 from the National Science Foundation. The findings and opinions expressed herein
do not reflect the position or policies of the National Science Foundation.

Figures .........................................................................ix
Tables.......................................................................... x
1. INTRODUCTION TO THE PROBLEM........................................... 1
Opportunity to Learn.............................................. 2
OTL Indicators of Effective Practice....................... 4
Investigating Classroom Practices in Relation to Achievement .. 5
Research Problem and Research Questions........................... 6
Study Design and Methods.......................................... 7
Overview of the Dissertation...................................... 8
2. A REVIEW OF THE RESEARCH ON OTL ..................................... 10
OTL: A Conceptual History........................................ 10
OTL Within the Context of Standards-Based Reform
and Politics..................................................... 11
Uses of OTL Data................................................. 14
OTL to Inform Accountability Processes.................... 16
OTL to Monitor Reform Implementation...................... 18
OTL to Monitor Educational Equity......................... 20
OTL and Achievement: Shedding Light on What Works......... 22
OTL Research About Technical Issues.............................. 23
Developing Specific Models of OTL......................... 23
Validating OTL Measures and the OTL Construct:
Approaches With and Without Achievement Measures ......... 26
The TIMSS Achievement Survey: A Prototype OTL Measure............ 30
Chapter Summary ................................................. 35
3. METHODOLOGY ......................................................... 36
Research Questions............................................... 36
Design of the Study ............................................. 37
Sampling.................................................. 37
Data Analysis............................................. 38
Key Technical Decisions: Choices and Rationales ................. 42
Initial Decisions Addressing Issues of Sampling
and Weighting............................................. 43
Decisions about Operationalizing Achievement.............. 45
Decisions about Operationalizing OTL...................... 51
Decisions About Linking the Data and Methods of Comparison 58
Limitations of the Study.................................. 60

OF TECHNICAL QUALITIES.......................................... 61
Subscale Scores: Patterns of Achievement ...................... 61
Estimating Subscale Reliability................................. 64
Exploring Evidence for Subscale Validity:
Comparisons with Weighted Colorado Data......................... 67
Exploring Evidence of Subscale Validity:
Achievement Differences by Grade Level.......................... 69
Summary of Subscale Patterns and Technical Qualities............. 71
Data Patterns in OTL Variables ................................. 73
Curricular Focus ........................................ 74
Topic Coverage........................................... 75
Duration of Instruction on Content....................... 86
Student Learning Activities ............................ 90
Summary: Data Patterns in OTL Variables......................... 98
6. CORRELATIONS BETWEEN ACHIEVEMENT AND OTL VARIABLES.................101
Curricular Focus and Achievement................................103
Topic Coverage/Duration of Instruction, and Achievement.........103
Student Learning Activities and Achievement.....................104
Summary: OTL/Achievement Correlations...........................105
7. CONCLUSIONS.........................................................107
OTL and Achievement: A Summary of Intermediate Data Patterns .... 107
OTL/Achievement Correlations and Implications for the Technical
Feasibility of Large-Scale OTL Measures to Inform Policy........108
Technical Issues in Large-scale OTL
And Achievement Measurement .............................109
Recommendations for Research....................................118
Improved Collaboration Around These Issues ..............118
Increased Local Capacity for
Systematic Use of Data...................................119
Further Study of the Effects of Accountability Policies
On Learning..............................................120
A. TIMSS Teacher Questionnaire, Population 1 ......................122
B. Summary Report on the State TIMSS Sample for Colorado...........150
C. Weighting Report for the Colorado State TIMSS Survey ...........156
D. Tests of Between-Subjects Effects-Subscale Achievement by Grade ..163
E. Duration of Instruction on Content .............................164
F. Frequencies of Instructional Strategies Reported in Mathematics.170

4.1 Box-and-whisker plot illustrating Fractions subscale distributions by grade .... 63
4.2 Distributions of class performance on Fractions by grade level ................. 64

3.1 OTL--An Operational Model and Guiding Intermediate Research Questions .... 53
4.1 Summary Statistics of Mathematics Subscales................................. 62
4.2 Summary of Subscale Reliability Estimates .................................. 65
4.3 Comparison of Subscale Scores with Reports of Achievement
Using Weighted Data................................................ 68
5.1 Descriptive Statistics: Curricular Focus (MATHTOPS) by Grade................ 74
5.2 ANOVA of Curricular Focus-Differences by Grade Level........................ 75
5.3 Teacher Topic Coverage on Whole Numbers and Grade Level Differences .... 77
5.4 Teacher Topic Coverage on Fractions and Grade Level Differences........... 78
5.5 Teacher Topic Coverage on Measurement and Grade Level Differences........ 80
5.6 Teacher Topic Coverage on Data and Grade Level Differences................ 81
5.7 Teacher Topic Coverage on Geometry and Grade Level Differences............ 82
5.8 Teacher Topic Coverage on Other Topics and Grade Level Differences........ 84
5.9 Descriptive Statistics on Learning Activities Variables..................... 92
5.10 Factor Loadings and Structure Matrix........................................ 97
6.1 Correlations Between OTL Variables and Classroom Subscale Achievement ... 102

Accountability systems and assessment-driven school reform are becoming key
strategies in policy makers' attempts to improve education and raise student achievement.
As McDonnell (1994) noted early in the recent reform movement, policy makers lack
understanding about the appropriate uses of tests and the information they provide.
According to McDonnell, they need to acknowledge that "even the best assessments are
imprecise measurement tools with real limits on their generalizability and appropriate use"
(McDonnell, 1994, p. 42). Inappropriate uses of tests to make decisions about educational
systems and students can undermine the reform agenda and have the potential for
significant harm to individuals and public confidence in education.
By extension, policy makers also lack understanding about the limitations of other
measures of educational processes, particularly those of educational quality. While
student achievement is often treated by policy makers as a definitive outcome measure that
comprises proof of educational quality, the truth is that it cannot stand alone. In the interest
of fairness to students, teachers, and schools, before achievement data are used to inform
policy decisions, concerted efforts should be made to examine the context of the learning
environment (American Educational Research Association [AERA], 1999; Schmidt &
McKnight, 1995; Shavelson & Webb, 1995). It is not fair to hold students accountable for
knowledge to which they have not had access, just as it is not fair to hold schools and
teachers who have not been provided with adequate resources and training accountable for
student achievement.
Indicators of student opportunity to learn may hold considerable potential for
informing policy about the quality and equitable distribution of classroom processes.

Considerable research has been conducted on classroom level opportunity to learn
indicators and their measures, as described by McDonnell (1995). The examination and
analysis of opportunity to learn indicators may not only offer a potential way to improve
fairness in educational policy, but also might provide empirical information about the
relationship between instruction and achievement.
Opportunity to Learn
Opportunity to learn (OTL) as a broad concept makes intuitive sense. At its most
general level, it addresses whether students have had the opportunity to study a particular
topic or to leam how to solve a particular type of problem presented by an assessment
(McDonnell, 1995). Originally OTL was conceptualized by international researchers in the
early 1960s as a way to increase the validity of cross-national comparisons of student
mathematics achievement. Since then, OTL measures have been refined to address
classroom-specific processes, including whether teachers had taught the content needed to
respond to specific items administered on the test, as well as about their general goals,
beliefs, instructional strategies, and professional preparation (McDonnell 1995; Schmidt &
McKnight 1995).
By the mid-1980's, researchers working on the development of curriculum and
process indicators in the United States were influenced by international findings about OTL,
and made recommendations to include OTL information in the educational indicator data
collected regularly by states and the federal government (McDonnell, 1995). Various studies
and policy reports expressed the need to develop and collect large-scale data about school
and classroom processes (Blank, 1993; Guiton & Burstein 1993). By the early 1990's, OTL
had become part of the policy language of standards-based education reforms, which aimed
for sweeping shifts in the nature of classroom instruction, partially driven by new forms of

assessment. By measuring indicators of OTL (or OTL standards, as they were framed in
policy disputes), it was thought that more equitable learning opportunities could be
developed for students before they were held accountable for their achievement (Conference
Report, 1994; McLaughlin & Shepard, 1995; National Council on Education Standards and
Testing [NCEST], 1992). The redistributive potential of OTL standards, however, resulted in
heated partisan debate in the mid-1990s about their use and definition, with the result that OTL
was stripped from legislation. Although the construct has been studied for some time, holds
promise for exploring the immediate context of learning, and draws the interest of
researchers, OTL has not resurfaced within the policy context.
In the literature, OTL indicators have been defined by a variety of different models
that vary in complexity. Some models address multiple levels of the educational system,
focusing on different levels of opportunity simultaneously, either in terms of nations, schools,
and classrooms (Schmidt & McKnight, 1995) or schools and students (Wang, 1998). Others
tend to emphasize OTL at the classroom level (Burstein, McDonnell, Van Winkle, Ormseth,
Mirocha, & Guiton, 1995). I address OTL at the classroom level, focusing on indicators that
relate to content coverage and instructional strategies, consistent with a large portion of the
research literature.
The technical qualities of large-scale OTL measures have also been explored, and
most researchers agree that the most feasible large-scale measures are teacher surveys.
Within limits, teacher surveys can provide reasonably accurate information about classroom
processes (Burstein, et al., 1995; Schmidt & McKnight, 1995). However, when examining
technical qualities of OTL measures, the issue of stakes associated with different uses of the
data needs to be addressed.
Stakes are described in the Standards for Educational and Psychological Testing
(AERA, 1999) as "the importance of the results of testing programs for individuals,
institutions, or groups" (p. 139). Because OTL describes the conditions necessary for
students to learn what is expected on tests, by extension OTL becomes part of the
educational stakes at issue. The Standards note that higher-stakes use of data makes
technical requirements for the quality of the measure more stringent:
In particular, when the stakes for an individual are high, and important
decisions depend substantially on test performance, the test needs to exhibit
higher standards of technical quality for its avowed purposes than might be
expected of tests used for lower-stakes purposes...(p. 139)
One aspect of technical quality might be in how closely the content measured by the test is
reflected in the content covered in the classroom. Another aspect might be how closely test
format relates to the format of student classroom experiences. OTL provides a potential
measure for assessing these aspects of test validity; therefore the requirements for quality of
measurement apply not only to the achievement measure used, but also to OTL measures. I
investigate here the technical qualities of large-scale OTL measures in relation to student
achievement.
OTL Indicators of Effective Practice
One of the purposes described in the OTL literature is to measure "what works"
instructionally in relationship to student achievement (Brewer & Stasz, 1996). The use of
large-scale measures of OTL to inform policy about "what works" is consistent with a long line
of research studying relationships between educational processes and achievement.
Generally, this type of research has been characterized by a broad variety of approaches
resulting in mixed results. Early on, Coleman, et al. (1966) conducted a sociological analysis
examining educational factors and student achievement, concluding that variables directly
related to schooling only account for about 10% of the variance in student achievement.
However, the findings of this study have been met with considerable criticism from other
researchers (see Berliner & Biddle, 1995; Epps, 1974). Among other approaches,

researchers have examined school level effects, such as that of school resources on student
achievement (Finn & Achilles, 1999; Greenwald, Hedges, & Laine, 1996; Hanushek, 1997;
Payne & Biddle, 1999), and organizational factors (Smith, Hocevar, & Wohlstetter, 1998)
with little agreement about the nature of the variables that support student learning. Other
studies have addressed classroom-specific variables such as "teacher quality" (e.g., Darling-
Hammond, 1999) and have come up with similarly mixed results about the specific
constructs and relationships.
Investigating Classroom Practices in Relation to Achievement
Without definitive evidence about what works in classrooms, the Third International
Mathematics and Science Study (TIMSS) provides an approach to defining and measuring
OTL on a large scale in conjunction with student achievement. The development of the
TIMSS measure grew out of the international research that generated the OTL concept to
estimate the validity of cross-national comparisons, and draws on more than thirty years of
comparative research about opportunities and achievement.
The TIMSS provides policy makers and researchers with a comprehensive and well-
established model of OTL and student achievement. The instruments consist of
comprehensive, widely-studied survey measures of teacher, school, and student background
information, and large-scale comparative measures of achievement in mathematics and
science. Development of the measure is based on an extensive and empirically-based
model of OTL (Schmidt & McKnight, 1995) that underlies both achievement and
contextualizing measures.
Additionally, TIMSS has high visibility for U.S. policy makers, as it served as the
overall indicator for measuring progress toward Goal 5 of the eight widely-touted national
goals formulated in 1994 under the ambitious Goals 2000 education initiative:

Goal 5: "By the year 2000, United States students will be first in the world in
mathematics and science achievement" (Goals 2000, Sec. 102, 1994).
However, TIMSS results were relatively disappointing for U.S. policy makers. Fourth grade
students were closest to meeting the fifth national goal in 1995; they ranked 8th of 25
participating countries, but eighth and twelfth grade students performed well below the
international averages for those levels (U.S. Department of Education, National Center for
Education Statistics [NCES], 1996, 1997, 1998). These results, as well as some findings
about international curricular differences, have led to much publicity as well as a variety of
recommendations for U.S. education policy, largely emphasizing more focus on fewer
curricular topics and more attention to progression and articulation at the middle school level
(Macnab, 2000; National Research Council, 1999; Schmidt, McKnight, Cogan, Jakwerth, &
Houang, 1999).
In 1995, when the TIMSS was administered around the world, Colorado opted to
participate with a state-level sample in the study. Representative data are available from the
9-year-old population of third and fourth grade students in both mathematics and science,
and student data can be linked with teacher level data. Thus OTL information and
achievement information can be explicitly connected within the database and their relations
assessed directly. This provides the background to the research problem and questions that
guided this study.
Research Problem and Research Questions
The purpose of the study is to examine the technical feasibility of using large-scale
measures of OTL to inform policy and practice using the TIMSS database. I define OTL
within the context of classroom-level processes measured in the TIMSS teacher surveys,
focusing on aspects of content coverage and instructional strategies. Additionally, I examine
the relationship between teacher-generated OTL data and classroom-level mathematics
achievement data measured for third and fourth grade students who took the TIMSS. As

part of my analyses, I investigate alternative methods of addressing technical issues. By
using the TIMSS, an established, prototypical measure of OTL and achievement that has
been extensively studied and highly visible, I hope to examine the potential for OTL
indicators to inform practice and policy. To narrow the study, I focus on issues pertaining
only to the mathematics OTL and achievement items. The study questions are:
1. What is the technical feasibility of using large-scale measures of OTL to
inform policy and practice at state and local levels?
2. Using data from TIMSS as an exemplary illustration of OTL and student
achievement measures, what is the nature and extent of the relationships
between OTL and student achievement?
Study Design and Methods
The study consists of a statistical analysis of existing data from the 1995
administration of the TIMSS at grades 3 and 4 in Colorado, obtained from WESTAT, Inc.,
the national data contractors for TIMSS. I developed aggregated measures of student
achievement in mathematics for correlation with OTL variables measured through teacher
surveys. This entailed the exploration and use of a series of variables from two separate
databases, one with student achievement data and one with teacher survey data, and
combining the two. Analyses were complicated by the structure of the data, and, in order to
answer my research question, it was necessary to address a series of technical issues,
including sampling, weighting, and unit of analysis issues; the operationalization and
development of achievement measures; the operationalization and development of OTL
variables, and appropriate methods of exploring the relations between OTL and
achievement. These technical issues illustrate the decision-making processes that are
required to link OTL and achievement data. Decisions arose and are presented in an
iterative fashion. As I used the TIMSS to explore the technical feasibility of OTL measures
relative to student achievement, I found that I needed to make decisions about

operationalizing terms and that decisions I made and emergent data patterns early on limited
later choices about analyses and interpretation. The entire study was exploratory to
investigate the data structures and the technical requirements for such efforts. The technical
decisions described here, while informed by and organized around my overarching
questions, were by no means the only choices available, and they illustrate the technical
values and complexities in this endeavor.
Answering the research questions required the following:
1) operationalizing constructs of achievement and OTL (deciding which TIMSS
items to use);
2) determining the level of analysis and creating variables at that level; and
3) analyzing the interrelationships between achievement and OTL variables.
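The three steps above can be sketched with a small, self-contained example. Everything in this sketch is an invented illustration, not the actual TIMSS file structure: the identifiers (`student_scores`, `teacher_otl`, `class_id`) and all values are hypothetical assumptions made only to show the aggregate-link-correlate logic.

```python
# Illustrative sketch of the pipeline: aggregate student achievement to the
# classroom level, link it with teacher-reported OTL, and correlate.
# All data and field names below are hypothetical, not the real TIMSS layout.
from collections import defaultdict
from math import sqrt

# Hypothetical student-level subscale scores, keyed by classroom
student_scores = [
    {"class_id": 1, "fractions": 480}, {"class_id": 1, "fractions": 520},
    {"class_id": 2, "fractions": 430}, {"class_id": 2, "fractions": 450},
    {"class_id": 3, "fractions": 560}, {"class_id": 3, "fractions": 540},
]

# Hypothetical teacher survey responses (e.g., periods spent on fractions)
teacher_otl = {1: 3, 2: 1, 3: 4}

# Step 2: create classroom-level achievement variables (class means)
by_class = defaultdict(list)
for row in student_scores:
    by_class[row["class_id"]].append(row["fractions"])
class_means = {cid: sum(v) / len(v) for cid, v in by_class.items()}

# Step 3: link the two data sources and compute a Pearson correlation
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

classes = sorted(class_means)
r = pearson([class_means[c] for c in classes], [teacher_otl[c] for c in classes])
print(round(r, 3))  # correlation between class mean achievement and OTL
```

The sketch makes the unit-of-analysis decision concrete: the correlation is computed across classrooms, not students, because the OTL data exist only at the teacher (classroom) level.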
Overview of the Dissertation
In Chapter 2, I present a detailed discussion of OTL indicators as they have been
operationalized within research and policy contexts. I particularly focus on four overlapping
ways in which different researchers and policy makers have conceptualized OTL indicators
and describe the current policy context of OTL. The debates and empirical studies
emphasize the importance of OTL for providing contextual information around assessment
and the need for further study and refinement of measures. In the final section of Chapter 2,
I discuss the TIMSS measure as an exemplar of state-of-the-art OTL measurement, as well
as a highly-visible measure of student achievement.
In Chapter 3, I describe the research methods used. I provide a brief description of
how TIMSS measures achievement and contextual factors and information about the study
design, methods, and data analysis procedures. Additionally, I describe decisions about
construct operationalization, data manipulation, units of analysis, and procedures for
merging achievement and OTL data. I provide a description of and rationale for the creation

of content-specific, classroom level subscales as achievement measures in mathematics.
Then I describe a model of OTL that I developed based both on established OTL research
and the structure of OTL variables in the TIMSS teacher background database. Finally, I
describe the processes by which I merged OTL and achievement data and explored their
relations.

Chapter 4 addresses the nature of classroom level achievement subscales and
describes the results of reliability and validity estimates of these subscales. Reliability was
estimated for each subscale using coefficient alpha. Content validity of subscales was
explored by:
1) comparing data patterns across subscales with other achievement
information; and
2) in terms of subscale sensitivity to instructional effects, by examining
differences in classroom scores by grade level.
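The reliability estimation named above can be illustrated with a minimal sketch of coefficient (Cronbach's) alpha. The item responses below are invented solely for illustration and do not come from the TIMSS database.

```python
# Minimal sketch of coefficient (Cronbach's) alpha for one subscale,
# assuming dichotomously scored items (1 = correct, 0 = incorrect).
# The response matrix is hypothetical illustration data.
from statistics import pvariance

# Rows = students, columns = items on one hypothetical subscale
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

k = len(responses[0])                 # number of items
items = list(zip(*responses))         # transpose: one tuple of scores per item
item_vars = [pvariance(col) for col in items]
total_var = pvariance([sum(row) for row in responses])

# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))  # prints 0.8
```

Alpha rises when item scores covary (students who answer one item correctly tend to answer the others correctly), which is why it serves as an internal-consistency estimate for each content-specific subscale.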
Chapter 5 describes the data patterns in the OTL model relative to content coverage
and instructional strategies in mathematics. These include patterns of teacher curricular
focus, topic coverage, duration of instruction on topics, and instructional strategies related to
student learning activities. These data patterns were explored in order to see whether they
were similar to other general findings about OTL.
Chapter 6 contains descriptions of the correlations between the TIMSS achievement
measure and elements of OTL as defined above. Significant correlations between OTL and
achievement measures are noted and findings described in relation to the second research
question.

Chapter 7 summarizes overall data patterns and discusses conclusions. The
technical feasibility of using large-scale measures of OTL to inform policy is assessed in light
of the findings described in Chapter 6, the state of current OTL technology, and issues
arising from using measures in ways for which they have not been designed.

This chapter begins with a history of OTL in the TIMSS, then describes its current
application in standards-based education and the political controversies about OTL policies
and uses. The literature is described in two sections: first the literature about the construct
and its uses, and second, empirical studies and technical issues in operationalizing and
measuring OTL in relation to achievement.
OTL: A Conceptual History
McDonnell (1995) reports that OTL originated as a technical research term designed
to increase the validity of cross-national comparisons of student mathematics achievement
conducted by the IEA. IEA researchers realized that they needed to take national curricular
differences into account when conducting comparative studies of international achievement.
OTL was introduced as part of the First International Mathematics Survey in the early 1960's,
but was refined by the second administration of the study (the Second International
Mathematics Study, or SIMS), conducted between 1976 and 1982. During SIMS, teachers
were surveyed about whether they had taught the content needed to respond to specific
items administered on the test, as well as about their general goals, beliefs, instructional
strategies, and professional preparation (McDonnell 1995; Schmidt & McKnight 1995).
By the third administration of the study (the Third International Mathematics and
Science Study, or TIMSS), administered in 1995, the OTL measure was further refined,
using teacher surveys that addressed a broader range of OTL than previously. Item-specific
coverage questions were limited to a subsample of test items at the seventh and eighth
grade levels, while additional general questions about content coverage and instructional

strategies were incorporated. Additionally, alternate data sources were incorporated into the
TIMSS design to provide validity information about the OTL surveys. These included a
videotape study and curriculum and textbook analyses (Schmidt & McKnight 1995).
OTL Within the Context of Standards-Based Reform and Politics
In the mid-1980's, after the administration of SIMS, IEA findings about OTL began to
influence the development of curriculum and process indicators, particularly in the areas of
mathematics and science. Recommendations were made to include such information in the
educational indicator data collected regularly by states and the federal government
(McDonnell, 1995). Apart from IEA efforts, the research about classroom processes tended
to be small-scale and based on unrepresentative samples or else were focused on traditional
resource inputs that did not capture variation at the classroom level (Brewer & Stacz 1996).
A variety of reports (Blank, 1993; Guiton & Burstein 1993; Raizen & Jones 1985; Stecher,
1992) indicated the need to develop and collect data about school and classroom processes
on a broad scale.
OTL became a central concept in standards reforms of the early 1990s because of
the reform focus on the connections among content, assessment and instruction. The
reforms emphasized a thinking curriculum (Resnick & Resnick, 1992) that focused on
higher order activities, such as problem-solving, and similarly complex assessments worth
teaching to. Standards were described to establish common expectations and to improve
learning for all students. Additionally, the reforms emphasized the role of accountability
testing to drive instructional processes. Resnick and Resnick addressed the ways in which
assessment may drive instructional change by discussing OTL as overlap:
In sophisticated discussions of the relationship between testing and
curriculum, there is usually considerable attention to the question of
overlap, the extent to which test items and curriculum activities are the
same. When overlap is high, test scores are high; when overlap decreases,

so do test scores...School districts and teachers try to maximize overlap
between the tests they use and their curriculum by choosing tests that match
their curriculum. When they cannot control the tests-which is increasingly the
case when states mandate the tests-they strive for overlap by trying to
match curriculum to the tests, that is, by curriculum alignment.
(Resnick & Resnick, 1992, p. 57)
Since what is assessed tends to drive instruction in this way, the Resnicks argue that, if we
want instruction to target higher-level thinking, we must build higher-level assessments
that measure what we want teachers to teach. They advocate the use of complex,
performance-based assessments.
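The Resnicks' notion of overlap lends itself to a simple quantitative illustration: an alignment index defined as the share of tested topics that the enacted curriculum covered. A minimal sketch (the topic labels below are hypothetical, not drawn from any actual test or curriculum):

```python
# Sketch of a test-curriculum "overlap" index in the Resnicks' sense:
# the share of tested topics that the enacted curriculum actually covered.
# Topic labels are invented for illustration only.

def overlap_index(test_topics, curriculum_topics):
    """Fraction of topics on the test that were taught."""
    test = set(test_topics)
    if not test:
        return 0.0
    return len(test & set(curriculum_topics)) / len(test)

test = ["fractions", "decimals", "ratio", "linear_equations", "geometry"]
taught = ["fractions", "decimals", "geometry", "measurement"]

print(overlap_index(test, taught))  # 3 of 5 tested topics taught -> 0.6
```

When the index is high, test scores should track the curriculum; when it falls, scores fall with it, which is the mechanism the Resnicks describe.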
This way of thinking about instruction and testing is contradictory to psychometric
assumptions underlying norm-referenced testing practices. One of the assumptions in norm-
referenced test construction is that individual test items are drawn from a curriculum-neutral
content domain from which the sampling of items represents the entire domain. By sampling
across the domain, curriculum and instruction-specific effects are supposed to be reduced.
Teaching students how to respond to specific items violates this assumption. Yet numerous
researchers (for example, Frederiksen & Collins, 1989; Haladyna, Nolen, & Haas, 1991)
have noted that teaching to norm-referenced tests is widespread, particularly when test
results are attached to high stakes consequences. The new paradigm of curriculum and
instruction-specific assessment has become a centerpiece of standards reform, replacing the
norm-referenced testing model.
OTL emerged during the national policy debate on standards and testing as school
delivery standards, which were conceived as an integral part of the systemic approach. As
envisioned by the National Council on Education Standards and Testing (NCEST), a full
system of standards included:
content standards that describe the knowledge, skills, and other understandings that
schools should teach in order for students to attain high levels of competency in
challenging subject matter;

student performance standards that define various levels of competence in the
challenging subject matter set out in the content standards;
school delivery standards developed by the states collectively from which each state
could select the criteria that it finds useful for the purpose of assessing a school's
capacity and performance; and
system performance standards that provide evidence about the success of schools,
local school systems, states, and the Nation in bringing all students, leaving no one
behind, to high performance standards.
(NCEST, 1992, p. 13)
Later, the policy terminology changed from delivery standards to opportunity-to-learn standards.
OTL standards proved to be contentious during early conversations about standards
reform. Advocates envisioned them as a way to hold policy makers accountable for
providing adequate opportunities for learning to students traditionally underserved by the
educational system (O'Day & Smith, 1992). Without OTL standards, it was feared that
students would unfairly assume all the consequences for the current inequities in curriculum,
instruction, and learning environments, rather than schools and systems. However, others
raised concerns about whether OTL standards were appropriate vehicles for addressing the
equity and quality problems of education (McLaughlin & Shepard, 1995; Traiman, 1993). In
general, the policy debate over OTL standards has tended to center on how such standards
would be defined, what their purpose and use should be, when they should be developed
during the implementation process, and what the role of the federal government should be
(Traiman, 1993, p. 12).
The intensity of conversations about OTL has abated in the policy realm since the
mid-90's. OTL standards were incorporated into law in 1994 and were defined as:
the criteria for, and the basis of, assessing the sufficiency or quality of the
resources, practices, and conditions necessary at each level of the
education system (schools, local educational agencies, and States) to
provide all students with an opportunity to learn the material in voluntary
national content standards or State content standards.
(Goals 2000: Educate America Act, 1994, Section 3)

However, participation in Goals 2000 is entirely voluntary. The legislation specifically states
that the law should not be interpreted to support mandates around school-finance
equalization or school-building standards. McDonnell (1995) notes that, as policy, OTL
standards thus operationalized fall somewhere between a hortatory policy (e.g., indicators of
OTL provide information that is available for people to use voluntarily in guiding practice)
and a weak inducement (e.g., by developing OTL standards, states may receive some small
amounts of federal funding). Additionally, she reports that OTL standards were the major
stumbling blocks to passage of Goals 2000, particularly because of their redistributive
potential, splitting political support along generally partisan lines.
Consequently, since these debates, OTL standards have remained largely outside
the realm of policy discussions around standards. Although the topic was renewed
somewhat in the context of the Clinton administration's 1997 proposal for national student
achievement testing in reading and mathematics, related policy proposals never addressed
OTL explicitly. Weakened by criticisms from Republicans (who wanted a smaller federal
role in education) and Democrats (who expressed more concern with issues of fairness in the
use of test results), the proposal was effectively blocked by the 106th Congress.
Uses of OTL Data
Four main purposes are described for OTL indicators by the writers who focus
mainly on their uses:
1) providing contextual information to inform accountability processes;
2) monitoring the extent to which teachers are implementing reform strategies;
3) measuring and monitoring student learning activities in the interest of equity;
4) providing information about instructional factors that facilitate or hinder
student achievement, or "what works."

Writers who address purposes for OTL use the general definition of OTL as a way to
determine "whether all students have been exposed to the learning opportunities they need to
prepare them to meet high academic standards" (Traiman, 1993, p. 5). However, they differ
significantly in their opinions about what those uses should be.
In the literature, the issue of purposes for OTL indicators tends to become confused
with the issue of stakes, or the importance of the results of testing programs for individuals,
institutions, or groups (AERA, 1999, p. 139). Porter (1995), in examining OTL in the context
of standards-based reform, distinguishes between monitoring OTL indicators for
accountability (e.g., comparative and high stakes) purposes and for monitoring for school
improvement purposes. McDonnell provides a policy framework that offers somewhat finer
distinctions, noting that OTL could potentially serve as the basis for a variety of different
policy mechanisms, depending on whether policy is developed as mandates, hortatory
policy, inducements, or as part of a strategy to build local capacity (McDonnell, 1995, pp. 313-314).
Mandates around OTL would impose rules and require compliance, as in the approach argued for by
O'Day and Smith (1992), in which sanctions might be imposed on schools and districts that
have resources, but that do not meet the requirements of the mandates. Hortatory policies
involve the use of indicator data for informational purposes, with the assumption that
individuals will act accordingly. Inducements might involve some sorts of rewards for
systems that work toward OTL systems. Finally, strategies to build school capacity might
entail the use of OTL information as a way to encourage change and to make decisions
about investments-particularly focusing on areas that are weak.
The issue of how OTL data would be used, in terms of stakes, tends to dominate
debate and adds an extra dimension to the more general issues of purposes for OTL data
collection. For each of these purposes, OTL may conceivably influence policy in a variety of
ways, along a continuum of low-to-high stakes scenarios. For example, one purpose

proposed for OTL indicators is to inform policy by providing information about the extent of
reform implementation in classrooms (Brewer & Stacz, 1996; Herman & Klein, 1997; Traiman,
1993). A mandate around this purpose might entail the requirement that teachers teach
according to reform recommendations, with related criteria for teacher retention and
evaluation. Hortatory policy around such a purpose might involve a school district
administering a current practices survey to characterize instruction, with educators free to act
on the information or not. Inducements might involve cash incentives for teachers who
teach in certain ways, and a capacity-building approach might use OTL data in order to
target professional development needs.
It is difficult to define when test use or use of OTL data might be high-stakes or not.
The AERA testing program standards (1999) note that stakes may be variable and hard to
predict:
Even when test results are reported in the aggregate and intended for a low-
stakes purpose such as monitoring the educational system, the public
release of data can raise the stakes for particular schools or districts.
Judgments about program quality, personnel, and educational programs
might be made and policy decisions might be affected, even though the
tests were not intended or designed for those purposes. (AERA, 1999)
OTL to Inform Accountability Processes
One of the broad purposes cited in the research for gathering OTL data is for
informing accountability processes. O'Day and Smith (1992) advocate holding schools and
districts directly accountable for the opportunities that they provide their students, as one of a
variety of policy tools to support systemic reform and equal educational opportunity. By
redefining educational opportunity as the opportunity to learn well the content of academic
frameworks or standards, O'Day and Smith indicate that, "...Specifically, to meet the school
standards, schools of the poor and otherwise less advantaged may require more and quite
different resources than schools of the more advantaged." (p. 290). Softer accountability

purposes could provide contextual information in addition to student achievement reporting,
and OTL indicators may be particularly useful for providing this information in light of the
current move toward expanded educational accountability systems. Education Week, in a
recent analysis of accountability reform in the United States, notes that 49 of the 50 states
have state assessments to measure student achievement, and that 36 states generate
school report cards. While all 36 states generating report cards report on student
achievement, fewer than half that number report on related contextual factors such as
teacher qualifications, school philosophy or programs, or course-taking patterns (Education
Week, 1999, p. 11). Colorado's own accountability measure, SB00-186, has incorporated
some contextualizing information, although none of it specifically addresses classroom practice.
Schmidt & McKnight (1995) note that information about how learning opportunities
are distributed as well as information about the sensitivity of tests to such learning
opportunities help illuminate the extent to which comparisons of different students'
performance are valid (e.g., the extent to which their learning is due to similar instructional
experiences and opportunities). Within the context of accountability, this type of information
can take various roles. Muthen, Huang, Jo, Khoo, Goff, Novak, & Shih (1995) note that
differential item functioning relative to students' opportunities to learn (e.g., how well algebra
items, for instance, discriminate between students who have taken algebra and those who
have not) may well play a role in issues of scoring and reporting student achievement:
One could entertain the provocative idea of using adjusted scores by
allowing these items to have different difficulty parameters. We are then
attempting to measure potential: What can these students do given
opportunities to learn? One could argue that with persons getting such items
right, the ones not in algebra classes should get higher scores than those in
algebra classes. Also, if students not in algebra classes do not get all such
items right, their scores should not be as low as students in algebra classes
with the same responses. This information is useful in addition to knowing
the actual proficiency. (Muthen, et al., 1995, p. 376)

As is frequently the case, this is an instance in which the purpose of OTL information (e.g.,
informing accountability reporting) overlaps with technical concerns. Muthen, et al., concern
themselves primarily with technical issues in operationalizing and measuring OTL effects
relative to achievement, but the two concerns are inextricable, particularly as the purpose for
which OTL information is collected directly relates to technical requirements about the
quality of measurement.
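The item-level sensitivity to OTL that Muthen, et al. describe can be illustrated with the Mantel-Haenszel procedure, a standard technique for flagging differential item functioning (used here purely for illustration; it is not necessarily the method in their study). Examinees are stratified by total score, and within each stratum the odds of answering an algebra item correctly are compared for students who have and have not taken algebra:

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across score strata.

    Each stratum is a tuple (a, b, c, d):
      a = OTL group correct,    b = OTL group incorrect,
      c = no-OTL group correct, d = no-OTL group incorrect.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts for one algebra item, stratified by total test score.
strata = [
    (30, 20, 10, 40),   # lower-scoring stratum
    (50, 10, 20, 20),   # higher-scoring stratum
]
print(mantel_haenszel_odds_ratio(strata))  # -> 5.5
```

An odds ratio well above 1, as here, would suggest the item favors students who had the opportunity to learn the content even after conditioning on overall proficiency, which is exactly the situation in which adjusted scoring becomes a live question.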
OTL to Monitor Reform Implementation
Another proposed purpose for OTL indicators is to inform policy by providing
information about the extent of reform implementation in classrooms (Brewer & Stacz, 1996;
Herman & Klein, 1997; Traiman, 1993). Measuring the impact of comprehensive systemic
education reform efforts is frequently a complex task, one that addresses multiple levels of
implementation. For instance, Smith and O'Day (1990), in their systemic school reform
manifesto, identify a variety of system elements at different levels, and propose change in all
of these elements, organizing around newly ambitious standards. By aligning curriculum,
instruction, assessments, and structures such as policy and funding with these standards,
they argue that a new coherence will take place.
The National Forum on Education Statistics (NFES), in its recommendations for a
comprehensive system of multilevel, comparable education indicators (1997), recently noted
the necessity for gathering OTL information. In its report, OTL is primarily addressed as
school process statistics (student access to rigorous courses, school attendance, and the
nature of the classroom or school environment), and education resource statistics (teacher
qualifications and supply) (NFES, 1997, p. A-2). These types of statistics, in addition to
student outcomes and background data, are considered to be part of a set of basic data
elements that provides the information needed to operate schools and districts, support state

and federal program reporting, and guide education policy at all levels. (NFES, p. iii)
However, while district, state, or even school level policy changes may take place to
align with standards and assessments, at least nominally, it has frequently been the case
that such changes may occur far away from the core technology of schooling (Cuban,
1988, 1989; Elmore, 1996). This helps account for substantial variation in classroom
practice, particularly given the complexity of such reforms and the frequently unmet needs of
teachers for time and professional development training in how to incorporate reform ideas
into their teaching (Cohen, 1988; Cohen & Spillane, 1993; Prestine & McGreal, 1997). OTL
indicators offer potential measures of classroom-specific reform implementation activities
and an examination of how these activities relate to student outcomes may help further our
understanding of the educational process and illuminate the inside of the black box. While
other, broader indicators of resources (e.g., input measures and school organizational
structures like grouping or tracking practices) provide a wider swath of contextualizing
information, the definition of opportunity-to-learn standards as focusing specifically on the
enacted curriculum, the appropriateness of content taught, and the quality of pedagogical
strategies used (Traiman, 1993) helps narrow the arena of inquiry to that most directly
related to the teaching and learning process, the classroom.
Herman and Klein (1997) explore the role of OTL data primarily for use in informing
policy makers about how reform policies are operating and for providing essential feedback
on whether assumptions underlying the policy are accurate (p. 3). They outline the
following policy assumptions underlying assessment-driven reform:
1. Assessment can effectively communicate educational goals and
expectations for student performance.
2. With suitable incentives and sanctions (either external or internal), teachers
and schools will respond to a new assessment by changing their curriculum
and instruction to provide students with appropriate opportunities to learn
what is expected.

3. Students, encouraged by the threat of adverse consequences or the promise
of future reward, will take the standards seriously and will be motivated to
learn what is expected of them.
4. All in the system will take feedback about performance seriously and use
that information to improve what they do. (Herman & Klein, pp. 3-4)
Herman and Klein particularly emphasize the second assumption, noting that, without data
about OTL, it is impossible to know that necessary changes in instruction and curriculum are
taking place. Additionally, they raise concerns about differential distributions of OTL across
schools and classrooms and note that fairness demands the provision of appropriate OTL for
all students, particularly given the possibility of high-stakes consequences for students
attached to test results. From this perspective, OTL may potentially be used to inform policy
makers in developing education policy more supportive of learning.
It should be noted that, while this model takes into account the possibility of high-
stakes accountability uses of OTL as part of feedback to schools, the primary use discussed
here is within the context of a hortatory perspective. Assumption 4 spells that out clearly,
although it is plausible that the introduction of high-stakes measures may force some to take
feedback about performance seriously and to act accordingly. Within this context, Herman
and Klein have included policy makers as participants who may use OTL information for their
own self-improvement purposes, as feedback to revise and maximize effective policy.
However, the assumption that OTL information will be used for similarly effective and
equitable purposes across sites is doubtful, particularly considering capacity differences of
sites to implement OTL-related reforms (Porter, 1995). Aschbacher (1999) is clearer about
the appropriate level of use for OTL data; she advocates school and classroom-level use for
self-evaluation and capacity-building purposes, rather than state-level use.
OTL to Monitor Educational Equity
Other purposes for OTL indicators include the monitoring of equitable access to

learning (Guiton & Oakes, 1995; Herman, Klein, & Wakai, 1996). Howe (1989) provides an
argument for a possible use of OTL indicators for assessing the equity of opportunities
through examining how student outcomes are distributed. He begins by critiquing the
argument that unequal outcomes in education are a result of choices on the parts of students
whether or not to take advantage of opportunities, and argues that outcomes rather than
choices should be used as ways to measure student equality of opportunity. He notes,
however, that these outcomes should be qualified by morally irrelevant characteristics
relating to achievement. Such characteristics might include race, while morally relevant
characteristics might include academic ability.
Since there are likely to be a number of different opinions about what is morally
relevant and irrelevant, it is important to conduct empirical research about the nature of
educational opportunities related to achievement. Indicators of OTL may provide an
empirical approach to defining morally relevant characteristics of instruction by examining
the specific instructional processes that relate to student achievement. At any rate, in the
interest of fairness, and given abundant studies showing that student opportunities vary by
school (Kozol, 1991), track placement (Oakes, 1985), and individual class (Gamoran, 1987),
the study of classroom processes is important to help determine the equitability of student opportunities.
O'Day and Smith (1992) advocate a variety of ways in which OTL standards may
inform the distribution of student opportunities. Additionally, OTL information has been
advocated as a way of validating interstate and international comparisons of student
performance (Schmidt & McKnight, 1995; Wiley & Yoon, 1995) in the interest of fairness of
comparisons. Finally, the use of OTL indicators to measure equity has also been
recommended specifically within the context of informing accountability procedures, with the
option of high stakes use of the data for schools (Herman & Klein, 1997).

One high stakes use of OTL data in the interest of educational equity might
encompass lawsuits against schools that fail to provide adequate opportunity to learn (O'Day
& Smith, 1992; Weiner & Oakes, 1996). The precedent for such lawsuits was set in the 1981 case
Debra P. v. Turlington, in which the National Association for the Advancement of Colored
People (NAACP) successfully argued that it was unconstitutional to deny high school
diplomas to students who had not been given the opportunity to learn the material covered
by the required graduation test and that requirements of due process must be met.
OTL and Achievement: Shedding Light on What Works
A considerable body of the research tends to focus on the role of OTL indicators in
helping to determine what works relative to student achievement (Brewer & Stacz, 1996;
Hanushek, 1997; Porter, 1995). In addition to looking at what works as a general purpose, I
have grouped those researchers who focus primarily on technical issues of OTL
measurement into this category. All of these researchers focus largely on OTL as it relates
to achievement, with an interest in establishing which indicators are most relevant to student
achievement. Additionally, the validation of OTL measures frequently takes place by
examining the relation of OTL indicators to achievement; in a sense, the validity of OTL in
some studies is supported by a link with achievement. While the concerns of purpose, use,
and technical qualities overlap everywhere in the literature, the primary concerns of these
researchers tend to be on three general, but interconnected areas: the development of
specific models of OTL, exploring technical qualities of OTL measures, and examining the
relations between OTL and achievement in specific studies. Overall, the research in this
area has been mixed, with little unified information about what works in classrooms, in part
because studies have relied on a variety of methods, measures, and operationalizations of OTL.

OTL Research About Technical Issues
Developing Specific Models of OTL
In contrast to the general definitions of OTL standards addressed by researchers
who emphasize the issues of how they might be used, a distinct subset of studies has
emerged that emphasizes specific models of OTL and operationalizes different variables in
their models. These models focus on the classroom or address different levels of policy
action that might constrain OTL at lower levels, depending on how comprehensive the nature
of the model is.
At the classroom level, OTL has been traditionally measured through three different
components (Brewer & Stacz, 1996; Herman & Klein, 1997):
1) Curriculum content, which typically includes topic coverage, time spent, and
teacher emphasis on topics;
2) Instructional strategies, which might include methods, pace, questioning
strategies, expectations, grading policies, and content organization; and
3) Instructional resources, e.g., books, supplies, and the physical classroom environment.
Researchers have operationalized these elements of OTL differently; Tate (1995)
summarizes OTL indicators primarily in terms of content (coverage, exposure, and teacher
emphasis) and quality of instructional delivery, defined as how classroom instruction affects
students academic achievement (Tate, 1995, p. 429). Guiton and Burstein (1993), in
describing their early work on the Survey of Mathematics and Science Opportunity (SMSO),
an early survey measure that eventually became the TIMSS OTL measure, note that they
address the three areas listed above, but they also describe the complexities inherent in
developing specific survey items to measure those areas. In Colorado, a survey of OTL
around mathematics and science standards measured variables in the three general areas
above, as well as assessment and teacher background information (Sanders, 1998).

Researchers define models of OTL at different levels of specificity in terms of
methodology and measures. Two that focus primarily on highly complex methods and
models include a model developed by Wang (1998) in science, and Muthen, et-al. (1995) in
mathematics. Wangs study utilizes Hierarchical Linear Modeling (HLM) at two levels of
instruction--the classroom and the student level. Wang argues that OTL needs to be
explored both in terms of examining different variables simultaneously and at different
levels. The model used by Wang defined OTL along the 4 dimensions described by Tate
(1995) as well as incorporating student attendance rate into the model as an individual OTL
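A first step in two-level analyses like Wang's is partitioning achievement variance into between-classroom and within-classroom components via the intraclass correlation. The sketch below simulates hypothetical nested data rather than reproducing Wang's actual model or data:

```python
import random
import statistics

random.seed(0)

# Simulate 20 classrooms x 25 students: achievement depends on a
# classroom-level effect (e.g., OTL) plus student-level noise.
# Hypothetical data: true between-class variance 25, within-class 100.
n_class, n_stud = 20, 25
scores = []
for _ in range(n_class):
    class_effect = random.gauss(0, 5)
    scores.append([class_effect + random.gauss(0, 10) for _ in range(n_stud)])

# Variance components from a balanced one-way layout.
class_means = [statistics.mean(c) for c in scores]
between = statistics.variance(class_means)               # variance of class means
within = statistics.mean(statistics.variance(c) for c in scores)

# Intraclass correlation: share of total variance lying between classrooms,
# after correcting the class-mean variance for within-class sampling noise.
tau = between - within / n_stud
icc = tau / (tau + within)
print(round(icc, 3))
```

A nontrivial ICC (here the simulated truth is 0.2) is what motivates modeling classroom-level OTL predictors separately from student-level ones, rather than pooling all students into a single regression.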
Muthen, et al. (1995), in comparison, described a set of methods by which one might
analyze OTL and achievement on large scale measures, using data from the National
Assessment of Educational Progress (NAEP) and the National Educational Longitudinal
Survey (NELS), and argued for the use of multivariate information, similar to Wang. The
illustrations included factor analyses of teacher emphases by type of class (e.g.,
remedial/enriched), the exploration of differential item functioning to explore item sensitivity
to OTL, logistic regression analyses, and the use of path analysis models. Additionally, the
methods and models described were chosen and developed based on the theoretical
underpinning of OTL, the nature of existing achievement measures, and the structure of the
different data sets and their limitations.
Other models address OTL within a larger context than the specific classroom. The
model proposed by Shavelson, McDonnell, and Oakes (1989) for the development of math
and science indicator systems involves three different components that shape OTL:
1) Inputs: fiscal/other resources, teacher quality, student background;
2) Processes: school quality, curriculum quality, teaching quality, instructional
quality; and
3) Outputs: achievement, participation, attitudes, and aspirations.

Wiley and Yoon (1995), in contrast, focus more on educational goals and do not use the
input-output language derived from economics in describing their model. In their study,
which involves exploration of OTL related to the California Curriculum Frameworks and
student achievement on the California Learning Assessment System (CLAS), they list five
different levels:
1) Goals formulated (goals in the California Curriculum Frameworks);
2) Goals and practices communicated (teacher familiarity with reform-oriented
documents and teacher participation in training opportunities);
3) Practices implemented (specific math strategies, types of problems
assigned, and general instructional strategies);
4) Practices experienced (perceptions and experiences of students); and
5) Learning accomplished (student achievement on the CLAS).
In the Wiley and Yoon study, only the interrelationships between elements 2), 3), and 5)
were studied.
The model of opportunity developed by IEA researchers is comprehensive and
addresses a variety of implementation levels, along 11 different elements. It is intended
to address the alignment in countries among the intended curriculum (or official policy
intentions), the implemented curriculum (what is actually taught by teachers), and the
attained curriculum. Specific elements of this model include:
1) National/regional curriculum goals (the intended curriculum);
2) School goals;
3) Teachers' learning goals (the implemented curriculum);
4) Official teacher certification/qualifications;
5) Teacher social organization/environment;
6) Teacher characteristics (including background, subject matter orientation,
pedagogical beliefs, status, and incentives);
7) System characteristics (tracking, grade levels, etc.);
8) School course offerings;
9) Instructional activities (the implemented curriculum);
10) Student characteristics (background, socioeconomic status, attitudes,
activities, and expectations); and
11) Test outcome (the attained curriculum). (Schmidt & McKnight, 1995, p. 349)
Validating OTL Measures and the OTL Construct:
Approaches With and Without Achievement Measures
Some OTL studies have utilized classroom observations; however, most large-scale
collection of OTL data has been through surveys because of the costs associated with
classroom observations, the lack of standardized observation protocols, and the minimal
generalizability associated with such measures (Brewer & Stacz, 1996; Kennedy, 1999;
McDonnell, 1995). In the course of exploring the potential of survey measures of OTL for
informing indicator systems, many researchers have explored the technical aspects of these
measures, in conjunction with achievement measures when possible. In general, the
findings about the quality of OTL surveys have been somewhat mixed, partly reflecting the
variety of measures studied and methods used.
A variety of researchers have addressed validity issues in survey measures of OTL
by triangulating survey data (most frequently administered to teachers) with other artifacts
and data collected about classroom procedures. In perhaps the largest-scale validation
study of OTL indicators to date, Burstein, et al. (1995) gathered teacher assignments to
validate their reports about practice. They found that among other things, teacher reports
about current topics taught were more accurate than those taught over the whole year and
that questions about instructional tools (e.g., tables and graphs, charts and calculators) had
lower reliability than questions about content topics. Additionally, in keeping with Wiley's
(1993) recommendations for highly specific OTL questions, Burstein, et al. (1995) found that

more specific curricular topic items were necessary for improved accuracy in content
reporting. In general, the study found teacher survey data may provide reasonably accurate
information about topic coverage: "If the standard is knowing whether or not a topic has
been taught and, if it has been taught, whether it has been covered over several periods, for
a week or two, or for several weeks, then teacher self-reports are reliable" (p. 33). However,
when comparing survey items about tests or homework to actual tests and artifacts,
agreement was low. This suggests that surveys are not very reliable for assessing teacher
expectations for students, particularly in the case of more innovative items reflecting
reform-oriented perspectives (e.g., student-centered activity and construction of knowledge).
Findings about content built on those from an earlier study in which Yoon, Burstein,
& Gold (1991) explored the reliability and validity of teacher reports about content coverage
over a period of two years, finding that reports tended to be consistent over time. They also
compared teacher reports with student achievement on the IEA Second International
Mathematics Study (SIMS) and found that when content was reported as covered over both
years, student achievement was higher. However, the relationship between
coverage and achievement varied according to topic difficulty. Findings indicate that, while
teacher reports on content may be fairly reliable, the effect of content coverage on
achievement is sensitive to the level of item difficulty. Herman and Klein's (1997) approach to
validating teacher reports on practice found that comparing them with student survey
responses did not shed much light on the validity of content items, but did find that
information about specific classroom practices relative to the 1993 California Learning
Assessment System (CLAS) tended to be corroborated by student reports.
While teacher surveys of OTL tend to provide reasonably valid information about
content coverage, information about instructional strategies seems to be less reliable. For

instance, Wiley and Yoon's (1995) study comparing teacher familiarity with reform-oriented
mathematics documents, classroom practices, and student achievement on the 1993 CLAS
assessment indicated correspondences between reform-oriented instructional practice and
achievement. However, the teachers reporting reform practices were the teachers who were
least familiar with reform documents and least often participated in in-service training. This
raises questions about the extent to which teachers understand the conceptual underpinnings
of the reforms (as phrased in survey measures) relative to their own practice.
Mayer (1999) studied test-retest reliability and validity of teacher survey items
measuring teacher use of strategies consistent with the National Council of Teachers of
Mathematics (NCTM) standards. When looking at composite scales, reliability was high, but
for individual instructional practices, the data indicated that the survey did not provide a
reliable measure of how teachers divide their time among various individual approaches.
Additionally, classroom observations indicate that, while the survey was reasonably valid, in
terms of distinguishing teachers at opposite ends of an NCTM-oriented scale, teachers who
were high in reported use of NCTM strategies did not look alike, similar to findings by Cohen
(1990) and Spillane and Zeuli (1999). This indicates that more work needs to be done "to
devise surveys that better distinguish between teachers who perfunctorily use reform
practices and those who use them effectively" (Mayer, 1999, p. 42).
One element contributing to these variable findings is the variability of measures and
the complexity of decisions to be made in designing them. These might entail decisions
about which elements of OTL to measure, the level of specificity at which to measure them,
and the appropriate response frame (Guiton & Burstein, 1993). Sanders (1998), in exploring
these types of measures, notes that the effect of format on validity can be problematic. In
Colorado, an OTL study indicated that content coverage of some topics was not adequately
captured by commonly-used frequency scales, particularly topics covered in short units.

In some of these validation studies, achievement has been included as part of
exploring the validity of the measure. Underlying this, however, is another aspect of validity:
construct validity. Some of these researchers have come at the process of validating OTL
by defining elements of the concept at least partially by virtue of their relations to
achievement. In a sense, this is a circular definition--OTL, as defined on a certain measure,
is assessed by the extent of learning measured on a different measure. Findings along
these lines are mixed, largely because of issues raised above: variations in the achievement
measure-particularly in terms of its sensitivity to specific instruction, as well as the level at
which achievement has been operationalized (e.g., the item-level or at a larger level),
interact with the variations in OTL measures described above by Guiton and Burstein (1993).
Wiley and Yoon (1995) attribute some of the difficulty in linking achievement to OTL to
inconsistencies in OTL measures, noting that OTL effects on achievement "should be
interpreted with caution" (p. 369).
Muthen, et al. (1995) emphasize the role of the OTL and achievement measures in
determining the relationship. They note that the OTL information needs to be carefully
analyzed and should be multivariate in nature. They then provide a variety of OTL and
achievement studies designed as examples for future data that "will be more likely to show
interesting effects the better the OTL measures" (Muthen, et al., 1995, p. 375). These studies
utilize a variety of methods (DIF analysis, path analysis using a predeveloped model of
multidimensionality in mathematics performance, and factor analyses), in which
OTL/achievement relationships varied depending on the measures and analyses used. The
researchers primarily emphasize the importance of OTL sensitivity in individual test items
and its implications for test construction, its significance for scoring and reporting
achievement, and the need to develop more detailed, multilevel information about OTL in
order to make large-scale assessments more useful.

In Wang's (1998) model of OTL in science, as described above, content exposure
was a significant predictor of student achievement on written science tests, while quality of
instructional delivery was the most significant predictor of hands-on science tests. This
again, builds on the circular approach to validating OTL through achievement, although in
this instance, the role that the achievement measure may play in this equation is highlighted.
The TIMSS Achievement Survey: A Prototype OTL Measure
As a measure of OTL, the TIMSS may well be considered the "Cadillac" of
measures. Consisting both of comprehensive, broadly-studied survey measures of teacher,
school, and student background information, and large-scale comparative measures of
achievement in mathematics and science, the TIMSS represents the culmination of almost
40 years of research that has gone toward refining and validating subsequent measures of
OTL. Additionally, the survey materials have been widely studied, with triangulation studies
conducted, including curriculum analyses and a videotape analysis (Schmidt & McKnight, 1995).
The TIMSS is a large-scale comparative study of national education systems in 45
different countries. TIMSS researchers examined mathematics and science curricula,
instructional practices, and school and social factors, as well as conducting achievement
testing of students at three different instructional levels. Data were collected from
representative documents laying out official curricular intentions and plans and from
mathematics and science textbooks; researchers also searched K-12 textbook series for
selected in-depth topics (subareas within the broader subject matter). In six countries, TIMSS
conducted classroom observations, teacher interviewing, and videotaping (Martin, 1996;
Schmidt, McKnight, & Raizen, 1997). In addition to the international administration of
TIMSS, three U.S. states opted to administer the TIMSS to selected student, teacher, and

administrator populations at three different instructional levels. Colorado was one of these
states, and participated in the TIMSS for age 9 students (sampled in grades 3 and 4).
TIMSS has been organized to measure curriculum as a broad explanatory factor
underlying student achievement; curriculum is considered to have three manifestations,
originally conceived for the IEA's Second International Mathematics Study:
1. What society would like to see taught (or the intended curriculum);
2. What is actually taught in the classroom (or the implemented curriculum);
3. What students learn (the attained curriculum). (Martin, 1996)
While this three-pronged approach does focus attention on three different interpretations of
curriculum, the underlying concept unifying these approaches, according to key TIMSS
researchers, is based on the provision of educational opportunities to students (Schmidt &
McKnight, 1995; Martin, 1996) at different levels of the system. The eleven components of
this model were described above in the section titled Developing Specific Models of OTL on
pp. 25 and 26.
Studies of the U.S. TIMSS data are extensive and reports of findings ongoing
(United States Department of Education, National Center for Education Statistics (NCES),
1996, 1997, 1998). The Survey of Mathematics and Science Opportunities (SMSO) project
of Michigan State University, in focusing on the intended curriculum, has generated a
number of findings based on curriculum analyses (Schmidt, et al., 1997). These researchers
have also focused on school level processes (Schmidt, et al., 1999), utilizing achievement
data in connection with OTL information gathered both from a textbook study and survey
data. Several other entities have generated reports based on state-level data, such as
Voelkl's (1998) summary of mathematics achievement in the Colorado sample, or Lawrenz &
Huffman's (1998a, 1998b, 1998c) reports on the Minnesota administration of TIMSS.

Additionally, the international TIMSS datafiles are publicly available for secondary
research. The NCES has sponsored several training sessions in the use of public-access
data, including achievement and background datafiles, and data from the videotape study
are available through a joint effort of the NCES and the University of California at Los
Angeles. International TIMSS data files available to secondary researchers address:
student achievement;
interrater reliability on constructed-response achievement items;
results of performance assessments taken by select groups of students;
international curriculum analyses;
videotaped lessons, transcripts, and coding documents from a comparative study of
classroom processes in Japan, Germany, and the United States;
student background questionnaires;
school-level questionnaires; and
teacher questionnaires about classroom processes, pedagogical beliefs and
influences, and content coverage.
In addition to its comprehensive nature and public presence in the research
community, TIMSS has influenced policy considerably. Policy makers consider the TIMSS
data to be very important and its findings have received an unprecedented amount of public
attention, largely due to their emphasis on international comparisons of mathematics and
science achievement. TIMSS broadest findings, which involve ranking of nations by grade
levels in international achievement, have been widely cited by policy makers, researchers,
and other groups involved in public education as an example of the U.S. "competitive
vulnerability" relative to other nations (Campbell & Clewell, 1999). Frequently, the TIMSS
achievement results are used as justification for a variety of proposed policy changes in
education to reduce this perceived international vulnerability (Wolf, 1998).

Pascal Forgione, then-Commissioner of the NCES, spoke in 1997 of TIMSS as a
way both to provide an international perspective on U.S. educational practices and
diagnostic information about how we can better those practices:
TIMSS provides a lens through which we can view ourselves in an
international perspective. Because of today's global economy, the U.S. must
compete as a nation against other nations. Increasingly, residents of states from
California to Maine are realizing that they have a stake in what schools in other
places are doing. Citizens are expanding their conversations on education to talk
about what should be taught and what should be expected from students. Thanks to
TIMSS, we can engage in a more informed discussion-one enriched by international
comparisons that will contribute to the process of improving our systems of teaching
and learning (Forgione, 1997, p. 3).
U.S. President Bill Clinton echoed this proposed use of TIMSS data as diagnostic
information in his public statement on education standards, calling fourth and eighth grade
results "a road map to higher performance." The President incorporated the relatively
optimistic TIMSS results for fourth grade mathematics and science into his call for states to
embrace national education standards, as well as to push for national standards-based
examinations, a theme echoed by Secretary of Education Richard Riley (Riley, 1997, p. 1).
Arguing that student scores had improved since a similar test in 1991, Clinton's speech
asserted that those results illustrated "what can happen in a few short years if people are
working together for the right things for our children and the future of our country," thus
charging support was needed for national standards movements such as Goals 2000 and
related testing initiatives (The White House, Office of the Press Secretary, p. 2).
One role of TIMSS data was to provide comparative achievement information
relative to Goals 2000, the widely-touted federal education initiative. Goals 2000 was an
inducement to states to pursue standards-based reform by providing incentives and limited
seed money. TIMSS was the comparative measure designed to track progress toward Goal
5, the assertion that the U.S. would be "first in the world" in mathematics and science, and
its findings were disappointing for policy makers. The data indicated that U.S. fourth graders

were closest to that goal in 1995; they ranked 8th of 25 participating countries, while eighth
graders ranked 20th of 41 participating countries and twelfth graders performed among the
lowest of the 21 TIMSS countries participating at that education level. U.S. students at both
eighth and twelfth grade levels performed below the international averages for those levels
(NCES, 1996, 1997, 1998).
Policy makers tend to discuss these discouraging TIMSS results primarily within the
context of control over classroom instruction. Underlying policy discussions that address
TIMSS results is the implication that the poor international achievement ranking of U.S.
students is an accurate reflection of a comparatively poor educational system. OTL, in
terms of the implemented curriculum, is seen as needing change, possibly along the lines of
implemented curricula in more successful TIMSS countries, most frequently Asian countries
like Japan and Singapore. While TIMSS is depicted as pointing out problems in the
classroom, it is also presumed that, should certain instructional changes occur, a
concomitant improvement in international achievement ranking will result. Thus the
assessment is perceived simultaneously by policy makers as serving multiple purposes
equally well--as an objective diagnostic tool for reform, a motivating factor for change, and a
continuous measure of student and system success (as defined comparatively across
nations). Primarily, however, consistent with the nature of thinking about assessment-driven
standards reforms, it is described as an instrument for driving instructional change.
TIMSS may well be the most comprehensive measure of OTL available. Not only
does it function as a rich source of information about reform implementation and distribution
of resources, information that may be used in different ways, it also holds promise for
exploring relations between instruction and achievement. There are, however, numerous
technical issues inherent in exploring these relations, as outlined by the literature.

Chapter Summary
In this chapter, I have reviewed the literature about four potential purposes for OTL
indicators: 1) OTL indicators might provide contextual information to inform accountability
processes; 2) OTL indicators may be useful to monitor the extent to which teachers are
implementing reform strategies; 3) OTL indicators hold promise for measuring and
monitoring student learning activities in the interest of equity; and 4) OTL indicators may
provide information about instructional factors that facilitate or hinder student achievement.
Second, I reviewed empirical technical studies of OTL, which include: 1) specific OTL
models, 2) validity of OTL models and measures, and 3) the relationship between OTL and
achievement. I then discuss the TIMSS as an exemplary measure of OTL and achievement.
In the next chapter, I describe the methodology of the study.

This chapter describes methods and sampling in the TIMSS and the methods in this
study. The study methods include defining the subsample, methods and specific variables
used to operationalize achievement and OTL, procedures for linking the data and various
intermediate analyses of these operationalizations, and analyses of the relationships
between OTL and achievement. Additionally, this chapter highlights key decision points that
arose specifically in this data set and which are germane to large-scale studies of OTL and
achievement.
Research Questions
This study addresses two research questions relevant to policy and how it might be
informed by indicators of OTL. The first question is phrased generally, and is informed by
the specific empirical findings that address the second question. The research questions are
stated below:
1. What is the technical feasibility of using large-scale measures of OTL to
inform policy and practice at state and local levels?
2. Using data from TIMSS as an exemplary illustration of OTL and student
achievement measures, what is the nature and extent of the relationships
between OTL and student achievement?
The focus here is on OTL and achievement related to mathematics for nine-year old
students in third and fourth grade classrooms.

Design of the Study
The study is a statistical analysis of existing data from the 1995 administration of the
TIMSS at age 9, in grades 3 and 4, in Colorado, obtained from WESTAT, Inc., the national
data contractors for TIMSS. I developed aggregated measures of student achievement for
correlation with OTL variables measured through teacher surveys. This entailed the use of a
series of variables from two separate databases, one with student achievement data and one
with teacher survey data, and combining the two. Classroom achievement on six
mathematics subscales was correlated with teacher-provided information about content
coverage and instructional aspects of OTL.
TIMSS achievement and OTL data were gathered in spring, 1995, from a
representative sample of Colorado nine-year-old students enrolled in public schools in
grades 3 and 4. Based on the student sample, information about instructional and school-
level processes was gathered from their teachers and administrators as well (Martin, 1996;
WESTAT, 1997). A two-stage sampling design of classes within schools was used; 50
schools were randomly sampled, with a random sample of third and fourth grade classrooms
sampled within schools. All students in the sampled classrooms took the TIMSS survey,
while teachers were selected to complete surveys addressing OTL based on whether they
taught assessed students mathematics and/or science (see Appendix B, WESTAT's
sampling report, for clarification).
For the purposes of this study, I used a selective subsample of the larger WESTAT
data, combining data from teacher and student databases and utilizing information only from
those classes with both student achievement and teacher OTL data. Data used are from 104
third and fourth grade classes in 50 schools--47 third-grade classrooms and 57 fourth-grade

classrooms--and account for a total of 2,163 third and fourth grade students who took the
TIMSS achievement test.
Data Analysis
Operationalization of Achievement. Achievement was operationalized using data
from a WESTAT-provided database (INTACHCO) of individual student responses to each
item tested. WESTAT also provided a scoring program that recoded raw responses as
correct or incorrect. Using an identifier of the specific TIMSS form taken by individual
students (IDBOOK), I recoded student achievement data to indicate whether the student had
answered the item correctly, had been tested on the item, but had not answered the item
correctly, or had not been tested on the item. I then cross-referenced individual student data
with information about which items mapped to each of six different math subtopics to create
subscale variables using student means. These variables (WHOLM, FRACTM, MEASM,
PROBM, GEOMM, PATTEM) represented the percentage correct of items on each subtopic
administered for that particular student. Subtopics included:
Whole numbers (WHOLM)
Fractions and proportionality (FRACTM)
Measurement, estimation, and number sense (MEASM)
Data representation, analysis, and probability (PROBM)
Geometry (GEOMM)
Patterns, relations, and functions (PATTEM)
These variables were further aggregated to the classroom level, using a class identifier
(IDCLASS) and again using means for aggregation. Final aggregated variables (WHOLM_1,
FRACTM_1, MEASM_1, PROBM_1, GEOMM_1, and PATTEM_1), or subscale scores, were
correlated with teacher variables of OTL. I provide a more extended discussion of these
procedures and data choices in the section below titled Decisions About Operationalizing
Achievement, on pages 45-50.
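The subscale construction described above can be sketched in a few lines of pandas. This is a toy stand-in: the item codes (M01, M02, F01) and their mapping to subtopics are invented for illustration, while IDCLASS, the subscale names, and the _1 aggregation suffix follow the text.

```python
import numpy as np
import pandas as pd

# Toy student-level data; np.nan marks items the student's booklet did not include.
students = pd.DataFrame({
    "IDCLASS": [1, 1, 2, 2],
    "M01": [1, 0, 1, np.nan],   # hypothetical whole-numbers item
    "M02": [1, 1, 0, 1],        # hypothetical whole-numbers item
    "F01": [0, np.nan, 1, 1],   # hypothetical fractions item
})

subtopics = {"WHOLM": ["M01", "M02"], "FRACTM": ["F01"]}

# Student-level subscale score: percent correct of the items actually
# administered (NaNs are skipped, so only tested items enter the mean).
for scale, items in subtopics.items():
    students[scale] = students[items].mean(axis=1) * 100

# Classroom-level subscale scores (WHOLM_1, FRACTM_1, ...): class means.
class_scores = students.groupby("IDCLASS")[list(subtopics)].mean()
class_scores.columns = [c + "_1" for c in class_scores.columns]
```

Because the mean skips untested (NaN) items, each student's score reflects only the items on the booklet that student took, matching the percent-correct-of-administered-items definition in the text.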

Based on standards of current testing practice (AERA, 1999), I conducted reliability
estimates using a coefficient of internal consistency (Crocker & Algina, 1986) on student-
level achievement variables (WHOLM, FRACTM, MEASM, PROBM, GEOMM, PATTEM) by
individual booklet. Additionally, I explored aggregated subscale validity for the same
variables at the classroom level (WHOLM_1, FRACTM_1, MEASM_1, PROBM_1,
GEOMM_1, PATTEM_1) by comparing patterns of achievement across content to published
data using IRT scores and estimating the sensitivity of subscale scores to instructional
effects using a Multivariate Analysis of Variance (MANOVA) to check for differences by
grade level. Based on these findings of technical quality, I made decisions about which
subscales were the best measures of achievement for comparison with OTL data.
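The per-booklet internal-consistency estimate can be sketched as Cronbach's alpha computed on a students-by-items matrix of 0/1 scores. The formula is the standard coefficient (ratio of summed item variances to total-score variance); the scores below are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (students x items) matrix of 0/1 scores.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Invented 0/1 scores for one booklet's subscale items (4 students x 3 items).
scores = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])
alpha = cronbach_alpha(scores)
```

In the study this estimate would be run separately for each booklet and each subscale (WHOLM, FRACTM, and so on), since different booklets carry different item sets.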
Operationalization of OTL. I operationalized OTL using data from a WESTAT-
provided database (COPOP1) of individual teacher responses to an extensive teacher
background survey; OTL was operationalized as content coverage and instructional strategies. (A copy of the
international teacher survey instrument and related WESTAT codes is attached as Appendix
A.). Specific elements of OTL, variables addressing each element, and their construction
are described below.
Content Coverage. Content coverage addressed questions of 1) curricular focus, 2)
topic coverage, and 3) duration of instruction on specific topics. These OTL elements were
measured using data based on a series of variables addressing coverage of 36 different
specific mathematics topics and subtopics (variables ATBMTA through ATBMTT on the
database; items 25a through 25t on the TIMSS Population 1 teacher questionnaire, pp. TQ1-
15 through TQ1-17). Response options on these variables ranged from 0 (the topic was not covered
during the year prior to the TIMSS administration) to 4, which indicated considerable coverage.

1. Curricular Focus was operationalized through an additive variable that I
developed by recoding all 36 topic variables so that a value of 0 indicated
noncoverage and a value of 1 indicated coverage of any length of duration.
I analyzed the variable (MATHTOPS) through the exploration of descriptive
statistics and compared means by grade level (IDGRADE) using a one-way analysis of variance (ANOVA).
2. Topic Coverage was operationalized using the 36 variables referenced
above. Frequencies of teachers who reported coverage and noncoverage
were analyzed to provide information about the breadth of topic coverage
across the sample by grade level. To explore differences in topic coverage
by grade level, chi-square tests of independence were conducted.
3. Duration of Instruction was analyzed through examining teacher frequencies
on different response options (0 through 4) for variables ATBMTA through
ATBMTT. Empty cells and some cell values of less than five precluded the
use of the chi-square to explore grade level differences. For purposes of
comparing OTL defined by duration of instruction with content-specific
student achievement, I developed variables representing additive duration
(WHOLEDUR, FRACTDUR, MEASDUR, DATADUR, GEOMDUR, and
PATTDUR) that mapped specifically onto the six subscale content domains.
These duration scores were based on math experts' alignment of the 36
specific topic variables (ATBMTA through ATBMTT) on the teacher survey
with the six content areas addressed by the achievement subscales.
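The two recodings above, curricular focus and additive duration, can be sketched as follows. The variable names (ATBMTA-style topic codes, MATHTOPS, WHOLEDUR) follow the text, but the topic values and the topic-to-subscale alignment shown are invented for illustration.

```python
import pandas as pd

# Two illustrative teachers; the real survey has 36 topic variables coded
# 0 (not covered) through 4 (considerable coverage).
teachers = pd.DataFrame({
    "ATBMTA": [3, 0],   # assumed here to be a whole-numbers topic
    "ATBMTB": [1, 4],   # assumed here to be a whole-numbers topic
    "ATBMTC": [0, 2],   # assumed here to be a fractions topic
})
topic_cols = ["ATBMTA", "ATBMTB", "ATBMTC"]

# Curricular focus (MATHTOPS): recode each topic to covered (1) vs. not
# covered (0), then sum across all topics.
teachers["MATHTOPS"] = (teachers[topic_cols] > 0).sum(axis=1)

# Additive duration for one subscale: sum the raw 0-4 codes over the topics
# the experts aligned with that subscale (alignment here is hypothetical).
teachers["WHOLEDUR"] = teachers[["ATBMTA", "ATBMTB"]].sum(axis=1)
```

Note the two variables answer different questions: MATHTOPS counts how many topics a teacher touched at all (breadth), while the duration scores weight topics by how long they were taught (emphasis).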
Instructional Strategies. Instructional strategies were operationalized using one
block of six variables from the teacher surveys that address specific student learning
activities in the classroom (items 28a through 28f on the teacher survey, page TQ1-20;
variables ATBMASK1 through ATBMASK6 on the teacher database). Intermediate analyses of these variables included:
1. A Multivariate Analysis of Variance (MANOVA) by grade to determine if
differences varied by level;
2. Examination of response frequencies to characterize data patterns, which
included the exploration of common strategies, defined by most (more than
50%) of the teachers reporting them most (most lessons or every lesson) of
the time; and
3. A series of confirmatory factor analyses to examine the extent to which
individual instructional strategies as measured by these variables fell into
recognizable "reform" or "traditional" orientations. Teachers were assigned
factor scores based on the final analysis (FAC1_1, FAC2_1) in order to
explore possible relationships between groups of practices and student
achievement.

The intermediate analyses described above were conducted primarily to explore the
validity of these variables by triangulating them with other findings in the literature. By
exploring the data patterns that emerged through these analyses, it was possible to estimate
to some extent the survey's validity for measuring these aspects of the OTL construct. By
comparing the ways in which data about content coverage and instructional practices related
to other OTL studies using these types of variables, I could see whether the data patterns
converged or diverged with other findings. I provide a description of the OTL model I
developed, specific questions asked for each subtopic of the model, and rationales for these
approaches in more detail in the section below titled Decisions About Operationalizing OTL,
which starts on page 51.
Examining Correlations Between Achievement and OTL Variables. Based on these
preliminary analyses, I ran bivariate correlations on a variety of variables as an exploration
of the feasibility for these indicators of OTL to help predict achievement. For each of the
four reasonably reliable subscale classroom-level achievement measures (WHOLM_1,
FRACTM_1, MEASM_1, and PROBM_1), scatter plots and correlations were calculated by
grade level for the following OTL variables:
MATHTOPS (Curricular focus)
WHOLEDUR (Duration of instruction on whole numbers)
FRACTDUR (Duration of instruction on fractions and proportionality)
MEASDUR (Duration of instruction on measurement, estimation, and
number sense)
DATADUR (Duration of instruction on data representation, analysis, and
probability)
ATBMASK1 (Students explain the reasoning behind an idea)
ATBMASK2 (Students represent and analyze relationships using tables,
charts, or graphs)
ATBMASK3 (Students work on problems for which there is no immediately
obvious method of solution)
ATBMASK6 (Students practice computational skills)
FAC1_1 (Teacher reform practice factor)
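The correlation step above can be sketched as a grade-by-grade Pearson r between one OTL variable and one classroom subscale score. The variable names follow the text; the classroom values below are invented, and only one OTL/subscale pair is shown.

```python
import pandas as pd

# Invented classroom-level records: grade, one duration variable, one subscale.
classes = pd.DataFrame({
    "IDGRADE": [3, 3, 3, 4, 4, 4],
    "WHOLEDUR": [2, 5, 8, 3, 6, 9],
    "WHOLM_1": [55.0, 60.0, 72.0, 58.0, 70.0, 74.0],
})

# Pearson correlation (pandas default) computed within each grade level,
# mirroring the by-grade scatter plots and correlations described above.
by_grade = {
    grade: grp["WHOLEDUR"].corr(grp["WHOLM_1"])
    for grade, grp in classes.groupby("IDGRADE")
}
```

In the study this pairing would be repeated for each of the four retained subscales against each selected OTL variable, with scatter plots inspected alongside the coefficients.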

These OTL variables were selected from an earlier, more extensive list because of resulting
intermediate analyses and logical interconnections. Initially, two more duration variables had
been included in the study (GEOMDUR and PATTDUR) which mapped onto geometry and
patterns subscales. However, these were omitted from the final study because of low
reliability estimates on these two subscales (the lowest booklet reliability for geometry items
was .2201 and the lowest booklet reliability for patterns items was .4256). In terms of
instruction variables, in addition to variables addressing "reform" and "traditional"
orientations, I included individual variables that loaded on the reform factor, to assess the
relative usefulness of such a factor in explaining achievement.
Key Technical Decisions: Choices and Rationales
The second research question seems relatively straightforward--to examine the
nature and extent of the relationships between achievement and OTL as they are measured
on the TIMSS. However, this analysis was complicated by the structure of the data. I found
that in order to answer the question, it was necessary to first address a series of technical
issues. Below I describe key decision points around these issues. These types of decisions
are illustrative of the processes that are required when one conducts a comprehensive
exploration of the relations between OTL and achievement. While I describe these points
specifically within the context of my use of TIMSS, these are broader issues that relate to the
use and analysis of OTL and achievement data, regardless of the measure.
Technical issues requiring operational and procedural decisions arose in an iterative
fashion. As I delved into the TIMSS database, I found that decisions I made early in the
exploration based on emergent patterns in the data limited later choices about analyses and
interpretation. The entire study was responsive to constraints caused by the nature of the
TIMSS data structures and also by the sequence of choices I made about what "counted."
These choices, while well-informed and organized based on my overarching questions, were

were sampled; technical reports are contradictory on this point (see Appendix B, WESTAT's
sampling report; Foy, Rust, & Schleicher, 1996; and WESTAT, 1997).
Teachers were selected to complete questionnaires that addressed a number of OTL
variables based on whether they taught the assessed students. Any teacher linked as a
mathematics or science teacher to any assessed student was eligible to receive a
questionnaire, although WESTAT staff took steps to limit teacher sampling and ensure
teacher independence. However, in some cases teachers taught mathematics to more than
one sampled classroom (WESTAT, 1997) and, in those cases, teacher data were duplicated
in the database.
WESTAT also generated a series of sample weights assigned to each student
selected into the survey and who completed an assessment. I did not use these weights for
this analysis. My rationale for this decision is based on my arena of interest, which is
partially related to the issue of level of analysis. The WESTAT sample weights were
designed to facilitate large-scale comparative analyses of Colorado's results along with those
of the nation and other countries. I, in contrast, have framed my level of inquiry at the
classroom level and am interested in how the data patterns showing the relationships
between OTL and achievement might help inform policy. Since I was not making any
comparisons between Colorado's students and other students who took the TIMSS, the use
of these weights was inappropriate, though it may lead to differences between my findings
and those of studies that utilize weighted data. (For a full description of WESTAT's weighting
procedures, reference Appendix C.)
Additionally, because I focus here on the relationships between OTL and
achievement, I needed to decide about how best to merge the databases measured at two
different levels (teacher-level OTL and student-level achievement). Inconsistencies between
the databases (e.g., OTL data for a certain teacher, but no linkable student achievement

data) meant that, in order to have a consistent database, I needed to delete some data. I
decided to delete all achievement data that did not have corresponding teacher information,
and vice versa. This changed the sample so that the weights would be inaccurate.
Decisions about Operationalizing Achievement
The TIMSS Achievement Measure. The elementary level TIMSS achievement
measure consists of 102 mathematics items and 97 science items (WESTAT, 1997, 3.3).
Based on extensive curriculum frameworks in mathematics and science (Martin, 1996;
Schmidt & McKnight, 1995), these items are designed to address overlapping categories of
content (8 subtopics), performance expectations (5 subtopics) and perspectives (5
subtopics). The same curriculum frameworks provided grounding and orientation for items
on other TIMSS measures.
The nature of TIMSS test items reflects a compromise similar to that demonstrated
in other currently-used large-scale achievement measures, in that it attempts to combine
multiple-choice and open-ended items. Multiple choice items provided students with a stem
and four or five answer choices. Open-ended items asked students to construct their own
responses to the test questions by writing or drawing their answers, and consisted either of
short-answer items or extended response items. The majority of items on the assessment
of nine-year-old students were multiple-choice items.
Plausible Values as Achievement Measures. Widely-reported student scores on
TIMSS measures (NCES, 1996; 1997; 1998) are based on item-response-theory plausible
values, generated for each student in each assessed area. These plausible values do not
incorporate or communicate the complexity of the curriculum frameworks used in the
instrument's development. They are much more global and do not take into account
variations within the overall mathematics domain of content (e.g., geometry) or varying
performance expectations. In the TIMSS database, 5 plausible values are provided for each

student in mathematics to generate consistent estimates of population proficiency
distributions, in accord with TIMSS' primary nature as a cross-nationally comparative
measure of education (Crocker & Algina, 1986; WESTAT, 1997, 7-1).
In addition to the TIMSS comparative focus on national and/or state-level
populations, another reason for the use of plausible values in determining student
achievement in TIMSS is the scope of the content to be measured and the related need for a
complex matrix sampling design. Put simply, it was administratively impossible for all
students to take all 199 items on the elementary TIMSS assessment, particularly when more
complex, extended response items were part of the assessment. TIMSS assigned different
blocks of assessment items to different students using BIB spiraling methods. All in all, eight
different assessment booklets were developed for and administered to sampled students
(WESTAT, 1997, 7-2).
Problems with Plausible Values. Plausible values were not used in this study for two
main reasons. First of all, the plausible values derived related only to general mathematics
proficiency. This broad concept is by no means unitary; it has numerous discrete content-
related subsets, performance on which is potentially influenced by diverse student
opportunities to learn specific content (e.g., geometry, decimal fractions) and which may be
masked by more global achievement measures (Kupermintz, Lee, & Snow, 1999). Schmidt,
Jakwerth, and McKnight (1998) note this masking of variation within the TIMSS data
specifically, as well as the problem of appropriate level of aggregation:
When a test's purpose is to aid educational change, lower level aggregations
[of performance scores] that can be linked to curriculum are more policy
relevant and potentially more useful, even were more global results not
inherently misleading as indicators of educational effectiveness. (p. 514)
Particularly in light of the OTL research that emphasizes the coverage of specific content in
relation to achievement, it is necessary to consider the achievement measure as specifically
as possible relative to content. Some researchers theorize that broadly aggregated scores

tend to measure general mathematics or science ability, rather than achievement that can be
linked to specific curriculum or instruction (Hamilton, Nussbaum, Kupermintz, Kerchoven, &
Snow, 1995; Kupermintz, Ennis, Hamilton, Talbert, & Snow, 1995; Muthen et al., 1995).
When one analyzes aggregated test results, this general ability factor may confound any
correlation between curriculum and achievement (Burstein, 1991).
The second reason for developing alternate achievement measures was based on
my focus on classroom-level processes. Since plausible values are designed to generalize
in ways that are not relevant at the classroom level, I chose to return to the raw achievement
data generated by individual students and aggregated to the classroom level. While such
raw-score-generated information should be viewed with caution, as it does not take into
account item-level variation in difficulty or the complexities of the IRT process, it may still
provide insights about classroom-level achievement that are impossible to gain using more
globalized plausible values.
Decisions About the Appropriate Aggregation of Content. In the TIMSS study
described by Schmidt, et al. (1998) in the section above, disaggregation of eighth grade
performance data along 20 different mathematics topics demonstrates considerable variation
in how countries were ranked. The authors conclude that "For policy relevant analyses,
unidimensionality is a myth at all but the most specific levels. In interpreting these data, simple
stories that are not misleadingly simplistic are as rare as they are serendipitous" (p. 520).
In deciding the appropriate level at which to aggregate measures of content, I tried
to strike a balance between capturing what the data say simply enough for policy makers to
consider possible action without taking an overly simplistic approach. For these purposes,
item-level data were considered too prone to measurement error, particularly given the unit
of analysis and the complexities of sampling. One alternative might have been the twenty
topic subscales based on the TIMSS curriculum frameworks that Schmidt, et al. (1998, 1999)

constructed and reported on. But information about how these were developed was not
available at the time of the study. Another alternative might have been the creation of
subscales that reflected content area and item format, but this possibility was precluded by
the relatively small numbers of constructed-response items and their uneven distribution
across content areas.
Defining Content-Specific Achievement Measures. I chose to use the six
mathematics content areas used for the international TIMSS reports about the relative
achievement of nine-year-old students. They are listed below, along with the number of test
items addressing each reporting area:
Whole numbers 25 items
Fractions and proportionality 21 items
Measurement, estimation, and number sense 20 items
Data representation, analysis, and probability 12 items
Geometry 14 items
Patterns, relations, and functions 10 items
I created subscales for each of these areas. Because it was impossible to subject
TIMSS items to expert scrutiny in order to determine their appropriate content classification
(approximately one-third of the items are being kept secure for possible future use), I used
item maps showing which of the items measured achievement in each of the six content
areas listed above (WESTAT, 1997, 4-18, 4-19) to group items by content area. Since items
were organized differently across the eight different booklets used in the matrix sampling
scheme, I cross-mapped the items by booklet number. Once I had these maps completed, I
could track individual student responses in connection to the actual items on which they were
tested, and thus create a map of student achievement on the different content domains.
I decided that, given the interaction between the sampling structure and booklet
construction, the most reasonable content-specific measure of individual student
achievement to be developed from the raw data would be a percentage score. In other
words, for each student, in each of the six content areas, it was possible to generate an

individual scale score representing the percentage correct of items administered for that
particular student. Scale scores consisted of mean individual student responses on all items
addressing the relevant content area tested in that student's particular test booklet, or an
estimation of the percentage of items that each student answered correctly in each content
area, based on the number of items on which the student was assessed for that content area.
Aggregating Subscale Scores to the Classroom Level. I had bounded my definition
of OTL to classroom-level indicators of curriculum and instruction, which influenced my
decision to establish the unit of analysis at the classroom level. OTL information was
provided by teachers, and student data were readily connected to that information by a
linking classroom ID variable. However, I needed to develop a classroom level measure of
student performance. To do this, I took student-level scale scores and aggregated them
using mean student percentage correct for each content area to derive a classroom-level
percentage score. This figure represented the average percent correct that students in each
classroom achieved, of the TIMSS items that they took.
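The two-step computation described above (a per-student percentage correct on the items that student was actually administered, then a classroom mean of those percentages) can be sketched as follows. This is a minimal illustration only; the data values, item labels, and column names are hypothetical stand-ins, not the actual TIMSS database fields.

```python
import pandas as pd

# Each row is one student; 1 = correct, 0 = incorrect, None = item not in
# that student's booklet (matrix sampling).
items = pd.DataFrame({
    "IDCLASS": [101, 101, 102, 102],
    "geom_q1": [1,    0,    1,    None],
    "geom_q2": [1,    1,    None, 0],
    "geom_q3": [None, None, 1,    1],
})

# Step 1: per-student percentage correct on administered geometry items
# (mean() skips the missing items for that student's booklet).
items["geom_pct"] = items[["geom_q1", "geom_q2", "geom_q3"]].mean(axis=1) * 100

# Step 2: aggregate student scale scores to the classroom level.
classroom = items.groupby("IDCLASS")["geom_pct"].mean()
print(classroom)
```

Here each classroom's score is simply the mean of its students' percentage-correct scores, mirroring the aggregation described in the text.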
Because each classroom in the study was expected to administer at least one copy of each of
the eight different TIMSS booklets, student performance on all assessment items should have
been included in the database. To check this, I examined the distribution of booklets by classroom; all
classrooms except one administered at least one copy of all eight TIMSS test booklets
during the testing session. I deleted that classroom from the database.
The issues that I describe here are not idiosyncratic to the TIMSS measure, but are
important in considering appropriate interpretations of test results across testing contexts.
The sampling procedure, the large domain of items assessed, the combination of complex
and simpler test items, and choices about the level of analysis all have implications for how
findings may be interpreted. The choices that I made were to reorganize the TIMSS
database to provide classroom level data and to investigate OTL/achievement relationships.

Estimating the Technical Quality of Classroom Subscales. Since the TIMSS
consisted of a complex design, in which plausible values were meant to generalize to state
or national populations rather than to provide information about specific students, it was
important to examine the technical quality of the subscales that I had created. This was
particularly the case since I was building the remainder of my study on them. I decided to
examine their quality in terms of both reliability and validity.
The 1999 Testing Standards (AERA, 1999) note that "For each total score, subscore,
or combination of scores that is to be interpreted, estimates of relevant reliabilities and
standard errors of measurement or test information functions should be reported" (AERA,
1999, Standard 2.1, p. 31). Since I had created subscales after the administration of TIMSS, I
needed to compute a coefficient of internal consistency. I used the coefficient alpha method
(Crocker & Algina, 1986) of reliability estimation for individual students' subscale scores by
test booklet. Therefore, up to eight different reliability estimates were conducted for each
content area, one for each test booklet used.
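A coefficient alpha estimate of the kind described can be sketched as below. The score matrix (students by items for one booklet's items in one content area) is hypothetical, and this is a generic implementation of the standard formula, not the study's actual computation.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha: (k/(k-1)) * (1 - sum of item variances / variance of totals)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 1/0-scored responses: rows = students, columns = items
# from one content area as they appear in a single test booklet.
booklet_scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])
print(round(cronbach_alpha(booklet_scores), 3))
```

Repeating this computation for each booklet within a content area yields the up-to-eight reliability estimates per subscale mentioned above.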
To examine how well subscale scores corresponded to the weighted IRT-generated
scores generally used in reporting TIMSS achievement, classroom-level patterns of
achievement across content were compared to published data using IRT scores.
Descriptive statistics of the subscales were explored and compared with Colorado data about
achievement on the same six content areas based on weighted WESTAT data (Voelkl, 1998).
General math achievement patterns were compared by specific content area and by grade
level. If the subscales tend to capture patterns convergent with patterns captured by the
weighted achievement measures, they are likely to be relatively valid measures.
Results of these comparisons are described in Chapter 4.
Another way in which to address the validity of subscales was through examining
their sensitivity to instructional effects. TIMSS researchers using weighted international data

have approached this aspect of the measure by using a quasi-growth or cross-sectional
approach to compare achievement differences between students by grade. The differences
in achievement might be attributed to instructional coverage at the higher grade level. Such
an analysis would require meeting one primary assumption: that there were no cohort
differences between pairs of grades: that third graders were not qualitatively different from
fourth graders, other than having not yet experienced fourth grade (Schmidt, et al., 1999).
As adjacent grades were sampled in the same schools, and students were tested near the
end of the school year, this seems like a reasonable assumption: that comparing
achievement across grades represents a grade's gain.
Information about Colorado that utilizes weighted data has shown differences in
achievement between third and fourth grade students. To check whether subscales
generated convergent or divergent information about achievement by level, I conducted a
series of analyses of variance (ANOVAs) on all subscale scores at the classroom level by
grade, first having conducted an initial Multivariate Analysis of Variance (MANOVA) to
control for the inflation of Type I error across multiple ANOVAs of grade-level
differences in the six mathematics subscale scores, as recommended by Hair, Anderson,
Tatham, & Black (1998). Results of these analyses are presented in Chapter 4.
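The grade-level comparison on a single subscale might be sketched as follows, assuming the MANOVA screening step has already been passed. The classroom-level percentage scores here are invented for illustration; they are not the Colorado data.

```python
from scipy import stats

# Hypothetical classroom-level percent-correct scores on one subscale.
grade3_classrooms = [52.0, 48.5, 55.0, 50.0, 47.5]
grade4_classrooms = [61.0, 58.5, 64.0, 60.0, 57.5]

# One-way ANOVA testing for a grade-level difference on this subscale.
f_stat, p_value = stats.f_oneway(grade3_classrooms, grade4_classrooms)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

With two groups this reduces to a t-test, but the same call generalizes to any number of grade levels, which is why an ANOVA framing is convenient here.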
Decisions about Operationalizing OTL
The TIMSS teacher survey holds a number of different possibilities for
operationalizing a measure of classroom-level OTL. Variables that might fit into an
operational definition of OTL include time spent on various classroom activities, instructional
focus, generic and content-specific instructional strategies, teacher beliefs about the learning
process and specific aspects of mathematics content, limitations on their teaching, and
homework practices. This is by no means an exhaustive list. Therefore it is necessary to
make decisions about what counts in an operational definition of OTL.

In the literature, as described more extensively in Chapter 2, two predominant
elements of classroom-level OTL are content topic coverage and instructional strategies
(Brewer & Stasz, 1996; Herman, 1997). I chose to focus on these two general elements of
OTL partially because of their extensive coverage in the literature and because they were
well-operationalized in the TIMSS measure. However, it was necessary to explore these two
general elements in more detail, particularly as the TIMSS offers a variety of relevant
variables from which one might choose to operationalize the construct, particularly in the
category of instructional strategies.
A Working Model of OTL and Rationale. I developed an operational model of OTL
to help prioritize my questions. Within the two general aspects of OTL defined above, I
embedded other, more detailed OTL subtopics, mainly in the area of content coverage. For
each subtopic, I generated specific questions to guide me through a series of analyses of
OTL-specific data.
These analyses were designed to illuminate the data patterns captured by the
variables I selected in the subtopic areas of: teacher topic focus, content coverage and
duration, and student learning activities. In some cases, these data patterns were addressed
primarily to explore their connection with achievement to answer my second research
question. In others, I examined data patterns in order to compare them with those found in
other studies of OTL in which OTL was defined in similar ways. In both instances, the
intermediate questions and analyses were developed so that I could get a general sense of
the extent to which various aspects of OTL were being validly measured. Table 3.1 provides
a schematic diagram of the OTL model, general aspects of OTL, specific subtopics, and
intermediate research questions. I describe here the analyses I used for various subtopics in
the model and provide a rationale for their inclusion.

Table 3.1
OTL: An Operational Model and Guiding Intermediate Research Questions
General Aspect of OTL: Content Coverage
    Subtopic: Focus of curriculum
    Intermediate Research Question: What is the extent of teacher focus (number of topics reported covered during the current year) in content?
    Subtopic: Topic coverage
    Intermediate Research Question: What topics are most covered (e.g., by most teachers)?
    Subtopic: Duration of instruction on topic
    Intermediate Research Question: What are the patterns of duration in topic coverage (e.g., which receive more class time than others)?
General Aspect of OTL: Instructional strategies
    Subtopic: Strategies related to specific student learning activities
    Intermediate Research Question: How do teachers organize student learning activities in their classes? How are these activities distributed across the sample (e.g., which are more common and less common)? Do learning activities fall into recognizable "reform" or "traditional" orientations?
I also addressed a secondary question for each of these elements that is not explicit
in the model; I explored differences by grade level for each of these intermediate OTL
research questions. This was done for several reasons. First of all, in the initial analyses of
subscales by level, I found that fourth grade classrooms performed significantly better on all
subscales than did third grade classrooms, and I wanted consistency in my analyses of
possible OTL predictors. Additionally, in content-specific subtopics, it is reasonable to
expect third-grade teachers to cover different numbers of topics and different topics than
fourth grade teachers. It is also reasonable to expect differences in duration of instruction
between grades as well, given issues of child development. In the cases where it was
possible, I explored data patterns in content coverage and duration by grade, so that grade
would not be a confounding variable in examining OTL. I also checked the instructional
variables described on p. 55 for differences by grade, for the same reason, using a

MANOVA, and found no significant differences by grade for these variables, so I examined
data patterns based on data from all teachers.
Curricular Focus. This variable is included primarily because of current trends in the
TIMSS literature. One of the key arguments in Schmidt, et al.'s (1999) analyses of U.S.
TIMSS data is that the U.S. curriculum is relatively unfocused, compared to other countries,
and that this is connected to the relatively poor performance of U.S. students on TIMSS
measures. The argument is summarized thus:
These results [e.g., poor achievement and lack of focus] suggest that, to a large
extent, we got what we intended to get from US teachers. Curricular goals and
intentions for science and mathematics in US educational systems lacked focus as
they were reflected in official documents and textbooks. US science and
mathematics teachers for nine- and thirteen-year-old students responded to these
diverse intentions by covering a large, diverse, and unfocused collection of topics.
(Schmidt, et al., 1999, p. 55)
When one says "we got what we intended to get," this specifically assumes a direct
link between curricular focus and student achievement on the TIMSS. I included this
variable primarily because I wanted to explore whether this relationship is verified using my
own operationalization of curricular focus (MATHTOPS) and the Colorado data in addressing
my second research question. Intermediate analyses involved the exploration of descriptive
statistics by grade level and an ANOVA to examine significant grade level differences.
Topic Coverage/Duration of Instruction. These subtopics were addressed because
of their importance in the OTL literature. Findings by Schmidt, et al. (1997) indicate that, in
general, U. S. teachers tend to concentrate on lower-level topics and that they tend to devote
little time to any one topic. Additionally, these types of items, when phrased relatively
specifically, have been found to be reasonably reliable by other researchers (Burstein, et al.,
1995). On the TIMSS survey, these types of items account for a large section of the survey.
Frequencies were analyzed to provide information about the breadth of topic
coverage across the sample and duration of instruction by grade level. By examining these

patterns and comparing them with other findings about teacher-reported content coverage
and duration, I could make some estimates about the validity of the measure relative to
other findings about OTL. Differences in grade level coverage of individual topics
(measured by individual variables) were analyzed using chi-square tests of independence.
Differences in duration coverage were impossible to analyze using the chi-square because of
gaps in the data and insufficient numbers in individual cells, but I examined frequencies and
describe their patterns.
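The chi-square test of independence used for topic coverage by grade can be illustrated as follows. The 2x2 counts (teachers reporting a topic covered or not covered, by grade) are hypothetical, chosen only to show the mechanics.

```python
from scipy import stats

# Rows: grade 3 teachers, grade 4 teachers.
# Columns: topic covered this year, topic not covered.
coverage_table = [[30, 10],
                  [18, 22]]

chi2, p, dof, expected = stats.chi2_contingency(coverage_table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```

A significant result indicates that coverage of the topic is not independent of grade level; as noted above, sparse cells make this test inappropriate for the duration variables.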
Specific Student Learning Activities. In the six variables I used to measure student
learning activities, teachers were asked how frequently (on a four-point Likert scale with
response options of 1 = never or almost never, 2 = some lessons, 3 = most lessons, and 4 =
every lesson) their mathematics students participated in six learning activities.
These activities include:
a) explaining the reasoning behind an idea;
b) representing/analyzing relationships using tables, charts, or graphs;
c) working on problems with no immediately obvious method of solution;
d) using computers to solve exercises or problems;
e) writing equations to represent relationships; and
f) practicing computational skills.
Some, but not all, of these different activities can be interpreted in terms of their relative
orientation to reform-oriented, constructivist practices in mathematics, as exemplified by the
National Council of Teachers of Mathematics (NCTM) Curriculum and Evaluation Standards
for School Mathematics (1989), or more traditional practices that reflect the influence of drill-and-practice
or other more teacher-centered types of pedagogy. Items a), b), and c) are
clearly reform-oriented, emphasizing reasoning, multiple methods of analysis, and problem-
solving, all important tenets of the NCTM documents. The item that relates to practicing
computation, by contrast, is equally clearly linked with traditional practice. The other two
items are more problematic relative to reform practices; it is conceivable that they may be
used in either traditional or constructivist ways.

An initial MANOVA showed no significant differences in the frequency of these
activities by grade level. I examined frequencies and then conducted an analysis of
"common strategies" in the classroom. This decision, also, was based in the literature
(Lawrenz & Huffman, 1998a; 1998b; 1998c). In general, teachers have been found to
display a mixture of instructional techniques rather than being primarily reform or primarily
traditionally-oriented in their classrooms (Cohen, 1990; Snow-Renner, 1998; Spillane &
Zeuli, 1999), and intermediate analyses of data patterns could provide information about
whether the patterns in these data were similar to other findings.
I conducted a series of factor analyses based on these variables for two reasons. In
part this was done as a way to explore scale validity. In order to confirm empirically that
these types of practices tended to group together in reform and traditional categories in this
data set, it was necessary to identify whether these dimensions were latent in the data. In
this sense, although the final analysis was not truly confirmatory, I used it to evaluate the
proposed dimensionality of reform and traditional classroom practices (Hair, et al., 1998).
Another reason for this analysis was to examine the possibility for reasonable data reduction;
by developing overall factors that combined the elements of specific variables, I could better
focus my selection of variables for final correlations between OTL and achievement.
Additionally, should underlying factors corresponding to reform and traditional factors be
derived, individual teacher factor scores and their relation to achievement might indicate the
usefulness of such descriptors as predictors of student achievement.
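The dimensionality check might be sketched as follows, here using an eigendecomposition of the inter-item correlation matrix with the Kaiser (eigenvalue > 1) retention rule as a stand-in for the factor extraction actually used in the study. All teacher responses below are simulated: three items are driven by a latent "reform" trait and one by a "traditional" trait, loosely echoing the hypothesized structure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60  # simulated teachers

reform = rng.normal(size=n)
traditional = rng.normal(size=n)
activities = np.column_stack([
    reform + rng.normal(scale=0.5, size=n),       # a) explaining reasoning
    reform + rng.normal(scale=0.5, size=n),       # b) tables, charts, graphs
    reform + rng.normal(scale=0.5, size=n),       # c) non-routine problems
    rng.normal(size=n),                           # d) computers (ambiguous)
    rng.normal(size=n),                           # e) writing equations (ambiguous)
    traditional + rng.normal(scale=0.5, size=n),  # f) computational practice
])

corr = np.corrcoef(activities, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # descending order
n_factors = int((eigenvalues > 1).sum())      # Kaiser criterion
print(np.round(eigenvalues, 2), "factors retained:", n_factors)
```

If the retained components separate the reform-oriented items from the computation item, that supports the proposed two-dimensional interpretation; if not, individual variables may be more useful than factor scores, as the study ultimately found.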
One of my interests in the instructional factors lay in their possible usefulness for
policy makers, particularly if they could serve as potential predictors of student achievement.
The initial exploration of the reform factor indicated that this was not the case. However, the
computation variable that I used as a proxy for the traditional factor did correlate significantly
and negatively with student achievement. This type of relationship might be expected if the

reform rhetoric is correct. However, it was important to see if perhaps individual reform-
oriented variables might capture similar relations with achievement, and if they might be
somewhat more useful than a more generic reform factor. Therefore, I incorporated the
individual variables into the final correlation as well.
Once I had developed this model, however, a new problem arose, which required a
further series of decisions. In order to address my second research question, content-
specific topic variables needed to be directly comparable to subscales, and they were not.
The 36 topic variables that I used to operationalize the content-specific aspects of OTL were
not framed at the same level of specificity as subscale categories; they were much less
general and may conceivably have applied to a number of the subscales or to none at all,
depending on the content. I needed to decide how to organize the content-specific elements
of the OTL model so that they mapped onto the domains measured by the classroom subscales.
Decisions About How to Group OTL Content Topics. In some cases, elements of
teacher topic lists were linked clearly to the categories I used for subscale measures.
However, often an obvious one-to-one linkage was lacking, and many of the actual TIMSS
test items were controlled-access. This ruled out an item-by-item analysis of test questions
relative to topics addressed under OTL content. Yet it was necessary to develop a common
scheme for comparing OTL and achievement on the subscales.
To solve this problem, I enlisted the assistance of two state-level content experts in
mathematics to reorganize the 36 OTL topics to map onto the six TIMSS achievement
subscales. Both of these individuals had extensive K-16 teaching experience in
mathematics, as well as district and state-level curriculum and standards development
experience. To assist them, I developed a checklist in the format of a grid. Across the top
of this grid, the subscale titles were listed, while down the left side, teacher questionnaire

topic descriptions were written verbatim from the elementary teacher questionnaire. I asked
the content area experts to complete the grid by placing a check in the boxes corresponding
to all achievement reporting areas that mapped specifically to any particular topic on the
teacher questionnaire.
If the experts did not consider any particular OTL topic to map appropriately to any
student performance area at the elementary level, I omitted those variables from the
analysis addressing the second research question of the study: the correlation between OTL
variables and subscale scores. This applied to eleven different topics, and I omitted them
from the final correlation, although I did explore them in terms of content coverage.
Additionally, if OTL topics from the teacher questionnaire mapped onto more than one
reporting area, they were counted as a component of all relevant areas. One mathematics
topic, "Problem solving," mapped onto all subscale domains, and I incorporated coverage
information for this variable into duration information for each topic area, as described below.
These content maps were used to develop additive scale scores related to duration
of instruction on specific topics, addressed briefly in the section titled Topic
Coverage/Duration of Instruction above. All topics mapped by the experts to a given content
reporting area were combined into a duration score (therefore duration scores for all six
mathematics reporting areas included the duration value for the "Problem solving" variable).
These scores varied in range and size; data values on individual variables ranged from 0
(recoded variable indicating content had not been covered at all during the current school
year) to 4 (15 or more lessons). Therefore duration scale scores varied according to the
number of variables included in the scale.
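The additive duration-scale construction can be sketched as below. The topic variable names, their 0-4 codings, and the expert mapping shown are hypothetical stand-ins for the actual questionnaire variables and content maps.

```python
import pandas as pd

# Hypothetical teacher-level duration codes:
# 0 = topic not covered this year ... 4 = 15 or more lessons.
teachers = pd.DataFrame({
    "dur_whole_num_ops":   [3, 0, 4],
    "dur_place_value":     [2, 1, 4],
    "dur_problem_solving": [4, 2, 3],  # mapped onto all six reporting areas
})

# Topics the content experts mapped to the "Whole numbers" reporting area,
# always including the Problem solving variable per the scheme above.
whole_number_topics = ["dur_whole_num_ops", "dur_place_value", "dur_problem_solving"]
teachers["whole_numbers_duration"] = teachers[whole_number_topics].sum(axis=1)
print(teachers["whole_numbers_duration"].tolist())  # [9, 3, 11]
```

Because each reporting area draws on a different number of mapped topics, the resulting scale scores have different possible ranges, as noted in the text.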

Decisions About Linking the Data and Methods of Comparison
To finally address the second research question, the first decision involved how to
develop the subsample and to merge OTL and achievement data. This decision was
informed by my earlier decisions about the unit of analysis and the operationalizations of
achievement and OTL. As the unit of analysis had been determined to be the classroom, I
developed a merged database, where each case represented one teacher (with OTL
information) and classroom achievement data (subscale scores). Since early reliability
estimates of the subscales indicated that the Patterns and Geometry
subscales were unreliable, I did not include them in the final analysis of OTL and achievement.
I merged classroom-level subscale scores into the teacher database, using the
WESTAT-generated classroom identification number (IDCLASS) as the linking variable.
Because of inconsistencies in the data related to matches between teachers and classes
(cases in which teachers either taught multiple sampled classes, team-taught the same
classes, or in which either OTL or achievement information was lacking) I made the following
choices in data cleanup:
Cases of redundancy in teacher data: In the case of two teachers, OTL information
was linked to two different assessed classes each. These cases have been omitted
from the OTL analyses, although they have been retained during the phase of the
study in which the relationships between OTL and achievement data were explored.
Cases of redundant classroom-level achievement data: For 13 initial sets of
classroom-level data, more than one teacher was assigned, which was accounted for
primarily by differences in teachers content assignments (mathematics or science).
The majority of cases with redundant classroom achievement data could be parsed
out by the content taught by teachers, and this is what I did. In these classrooms,
the only relevant achievement data coded for each teacher is data that relates
directly to that teacher's content code. In cases where one teacher taught both
mathematics and science, I made decisions about coding depending on the other
available data for that classroom; if the other assigned teacher taught science, for
instance, I decided to count mathematics achievement data linked to the teacher
coded as teaching both mathematics and science. In this way, I was assured that
achievement data were not duplicated.

Cases in which either OTL or achievement data were missing. In some cases,
teacher questionnaire data were available without related achievement data. In
other cases, student achievement data were provided, but the associated teacher
questionnaire data were missing. I deleted these cases from the database.
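The merge-and-cleanup logic described above can be sketched in Python. Only the IDCLASS linking variable and the content-assignment rule come from the text; the record layout, field names, and data values below are hypothetical:

```python
# Sketch of the classroom-level merge described above. The record layout and
# values are invented; only IDCLASS and the content-assignment rule (keep
# mathematics achievement only for the teacher coded for mathematics) are
# taken from the text.

def merge_otl_achievement(teachers, class_scores):
    """Link each teacher's OTL record to classroom subscale scores via
    IDCLASS, dropping cases with missing OTL or achievement data."""
    merged = []
    for t in teachers:
        scores = class_scores.get(t["IDCLASS"])
        if scores is None:  # achievement data missing: delete the case
            continue
        # When two teachers share a class, keep mathematics achievement only
        # for the teacher whose content code includes mathematics, so that
        # achievement data are not duplicated.
        if t["content"] in ("math", "both"):
            merged.append({"IDCLASS": t["IDCLASS"],
                           "teacher": t["id"],
                           "subscales": scores})
    return merged

teachers = [
    {"id": "T1", "IDCLASS": 101, "content": "math"},
    {"id": "T2", "IDCLASS": 101, "content": "science"},  # science: excluded here
    {"id": "T3", "IDCLASS": 102, "content": "both"},
    {"id": "T4", "IDCLASS": 999, "content": "math"},     # no scores: dropped
]
class_scores = {101: {"WHOLM_1": 0.52}, 102: {"WHOLM_1": 0.61}}

linked = merge_otl_achievement(teachers, class_scores)
print([r["teacher"] for r in linked])  # ['T1', 'T3']
```

The same logic extends to the science content code by swapping the filter; the sketch shows only the mathematics side analyzed in this study.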
The second choice that needed to be addressed was that of appropriate methods. I
chose to use correlations because the study is designed as an exploration of policy issues
around the local use of OTL and achievement measures, with TIMSS as the prototype
example. It is not a technical measurement study, but a policy study. In that sense, while
measures needed to be grounded in the OTL literature and to meet certain general technical
requirements (e.g., reliability and validity), the overall emphasis of the study is on
potential policy uses for, and implications of, findings about OTL and achievement data.
Therefore, it was necessary to use a straightforward approach both to operationalizing
study terms and to determining statistical methods to explore relationships in the data.
While more technical methods, such as HLM (Raudenbush & Bryk, 1989) are available to
correct for the error inherent in nested data designs such as this one, these types of methods
involve a level of abstraction and complexity that is neither readily accessible to policy
makers, nor easily translatable into policy approaches. I decided to use the relatively
straightforward approach of correlations because of their accessibility to policy makers.
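As an illustration of that straightforward approach, the classroom-level analysis amounts to computing Pearson's r between an OTL measure and a subscale score. The data values below are invented for illustration, not drawn from the TIMSS database:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical classroom-level data: topics covered vs. subscale proportion correct.
topics_covered = [12, 18, 22, 25, 30]
subscale_pct = [0.35, 0.41, 0.44, 0.52, 0.55]
r = pearson_r(topics_covered, subscale_pct)
print(round(r, 3))
```

A single coefficient of this kind is what makes the approach accessible to policy makers, at the cost of ignoring the nesting that HLM would model.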
Limitations of the Study
The findings here are based on analyses of unweighted TIMSS data. They are not
intended to generalize to the populations of Colorado teachers and students sampled.
These analyses represent feasible and likely approaches by state and local policy users of
OTL data, and, therefore, do not include all possible analyses. The TIMSS data set, as
noted in the conclusions, represents a well-documented and often-cited approach to
collecting OTL data in conjunction with achievement data, and its limitations as the
paradigm example of OTL are discussed in the following chapters and the conclusions.

This chapter describes the data patterns in the classroom-level subscale scores:
Whole Numbers (WHOLM_1), Fractions and Proportionality (FRACTM_1), Measurement,
Estimation, and Number Sense (MEASM_1), Data Representation, Analysis, and Probability
(PROBM_1), Geometry (GEOMM_1), and Patterns, Relations, and Functions (PATTEM_1).
I also discuss the results of my explorations of the technical qualities of subscale scores. I
explored technical qualities using three different analyses:
1. I estimated the reliability of subscale scores at the student level. I ran these
estimates based on individual student scores because of the variations in
booklet structure within the classroom;
2. I examined evidence for subscale validity based on the relations of subscale
scores to other variables, in this case, information about content-specific
patterns of achievement based on weighted plausible values scores; and
3. I examined the sensitivity of subscales to instructional exposure relative to
reports based on weighted data. I did this by comparing subscale
achievement across grade levels, using an initial MANOVA and multiple
ANOVAs to test for significant differences.
These data patterns and results of my inquiries into the technical qualities of subscale scores
are described below.
Subscale Scores: Patterns of Achievement
Table 4.1 describes mathematics achievement results. The table illustrates the
following general patterns of achievement. The lowest achievement areas for both grades
were 1) Fractions and proportionality (Fractions) and 2) Measurement, estimation, and
number sense (Measurement). The area of highest achievement across grades was in Data
representation, analysis, and probability (Data). Across levels, achievement in Whole
numbers was the second strongest subscale area, followed by Geometry. Fourth grade
classes achieved higher average percentage scores than third grade classes on all
subscales, although there is considerable overlap in class performance by grade level. For
example, for the Fractions subscale, the lowest third-grade class score is 15%, compared to
18% for fourth grade classrooms. On the high end of the range on this subscale, the highest
third grade classroom score was 59% correct, compared to 66% correct for the highest-
scoring fourth grade classroom.
Table 4.1
Summary Statistics of Mathematics Subscales
Subscale Content                                  Grade 3 (N = 45)                Grade 4 (N = 51)
                                                  Inclusive range   M      S      Inclusive range   M      S
Whole Numbers                                     23%-81%           48.7%  .1337  32%-82%           61.5%  .1063
Fractions and Proportionality                     15%-59%           35.6%  .0951  18%-66%           44.9%  .1062
Measurement, Estimation, and Number Sense         19%-58%           38%    .0914  17%-72%           46.9%  .1174
Data Representation, Analysis, and Probability    18%-87%           53.8%  .1452  31%-86%           65.8%  .1291
Geometry                                          19%-66%           45.9%  .1053  22%-79%           56.8%  .1320
Patterns, Relations, and Functions                13%-72%           42.2%  .1390  23%-82%           55.2%  .1329
To explore these data more closely, I examined box-and-whisker plots for each
subscale by grade and also compared the shapes of classroom score distributions by
examining histograms by grade. I provide Figure 4.1 for Fractions to illustrate.

Figure 4.1. Box-and-whisker plot illustrating Fractions subscale distributions by grade
(N = 45 third grade and 51 fourth grade classrooms).
The box-and-whisker plot indicates that the low-scoring fourth grade classroom with the 18%
Fractions score is an outlier.
Figure 4.2 provides an example of comparative histograms of classroom
achievement on the Fractions subscale. This figure illustrates the very different distributions
of subscale scores at different grade levels. Although the means are not extremely far apart
(35.6% vs. 44.9%), the histograms indicate that more third grade classes perform at the
lower end of the distribution than do fourth grade classes, which are clustered toward the
higher end of the distribution. I conducted similar explorations of the data for all subscales,
with generally similar results. These patterns in the data indicate that subscale scores
capture variation in student performance both in content area and by grade level.

Figure 4.2. Distributions of class performance on Fractions items, by grade.
Estimating Subscale Reliability
To explore how consistently the content subscales functioned, and to establish the
lower bounds of their coefficients of precision, I used coefficient alpha to estimate subscale
reliability (Crocker & Algina, 1986). This was done by conducting reliability estimates of
each subscale by booklet, using scale performance information disaggregated to the
individual student level rather than at the classroom level. I did not use classroom level
subscales because of the variations in booklet distribution within the classroom.
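Coefficient alpha for one subscale within one booklet can be sketched as follows. The item-score matrix is invented (rows as items, columns indexed by student); only the coefficient alpha method itself is taken from the text:

```python
def cronbach_alpha(items):
    """Coefficient alpha; items is a list of item-score columns, one list of
    scores per item, all over the same students."""
    def pvar(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-student total scores
    item_var_sum = sum(pvar(col) for col in items)
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return (k / (k - 1)) * (1 - item_var_sum / pvar(totals))

# Hypothetical 0/1 scores for 6 students on 3 items mapped to one content
# area within a single booklet.
items = [
    [1, 1, 0, 1, 0, 1],  # item 1
    [1, 1, 0, 1, 0, 0],  # item 2
    [1, 0, 0, 1, 1, 1],  # item 3
]
print(round(cronbach_alpha(items), 3))
```

Because alpha is undefined for a single item, booklets containing only one item for a content area drop out of the estimates, which is why the number of usable booklets varies by subscale in Table 4.2.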
Subscale reliability varied by content area and booklet number. Using data for all
students in the Colorado sample, I generated student-level consistency reliability estimates
for each subscale by booklet. The numbers of students who were assigned each booklet
ranged between 296 and 301. Table 4.2 is an aggregated summary of the distribution of
numbers of booklets on which reliability could be calculated for each subscale, the mean
reliability coefficient, the standard error of the mean across booklets, and the inclusive range
of reliability coefficients across booklets.

Table 4.2
Summary of Subscale Reliability Estimates
Subscale Content                                  N of booklets on which reliability could be estimated (booklets with more than 1 item addressing the content area)   M      SE     Inclusive range of Cronbach's alpha (a) across all booklets for which reliability could be estimated
Whole Numbers                                     8   .7696  .0016  .7013-.8214
Fractions and Proportionality                     8   .6456  .0040  .4913-.8131
Measurement, Estimation, and Number Sense         8   .5572  .0039  .4307-.7474
Data Representation, Analysis, and Probability    6   .6710  .0043  .4835-.7754
Geometry                                          7   .5160  .0056  .2201-.6947
Patterns, Relations, and Functions                4   .5211  .0039  .4256-.5924
Average reliability estimates for most subscales are moderate, with the largest
estimate for Whole numbers, at .7696, although the high ends of the
inclusive ranges for Whole numbers, Fractions, and Data are fairly high, between .77 and
.83. In four mathematics content areas, at least seven of the eight booklets contain more
than one item addressing that specific content, although Patterns, relations, and functions
(Patterns) and Data are not as extensively covered across booklets. In the content areas of
Whole numbers, Fractions, and Measurement, all eight booklets contain at least three items
addressing each area. Both Geometry and Patterns demonstrate relatively low levels of
reliability, combined with overall lower numbers of booklets that address the content.

This shows that not all content areas are equal within the sampling scheme. In some
cases, not only are items sparsely distributed across booklets, they are also sparsely
distributed within booklets, as is the case with items on the Patterns subscale. Although
these items were included in 7 of the 8 test booklets administered, in 3 booklets, only one
such item was included. Hence an estimate of reliability for those booklets is impossible,
and it is likely that scores for students who were assigned those booklets for this content
category are less stable and reliable.
This is particularly the case when considering the sampling scheme; once scores are
aggregated to the classroom level, the interaction of variability in booklet structure with the
BIB spiraling matrix techniques indicates that such scores are even less reliable
representations of class performance. For instance, in smaller classes, where only one
student may have been assessed on each booklet, it is possible that class percentage scores
on Patterns are based on the performance of seven students, three of whom were
assessed on only one item addressing knowledge about patterns. In a case such as this, the
knowledge about a content area, or lack thereof, of a small group of students may skew
class subscale results. The Geometry subscale is affected by similar considerations,
although not so extreme; 4 of the 8 booklets addressed Geometry using three items or fewer.
This interaction between the booklet structure and sampling techniques is not unique
to my study; it also poses complications for international comparative studies. Beaton
(1998) addresses this problem specifically regarding the use of content-specific subscales:
In order to extend the coverage of the TIMSS tests not all students were assigned
the same items. The test items for grades 3 and 4 were divided into 26 blocks, and
the blocks were placed into test booklets in a methodic way. One block was placed
in all test booklets. Seven blocks were rotated so that each was in three booklets.
The other 16 blocks were placed in only one booklet each. The result of this method
of booklet construction was that different items were answered by different numbers
of students within a country. The first block was answered by all students, the
second seven blocks by about 3/7 of the students, and the remaining items by

approximately 1/7 of the students... the accuracy of the estimates is expected to vary
depending on the number of students in the sample who were assigned the item.
(Beaton, 1998, p. 533)
To summarize these estimates of reliability, in general, subscale scores are
moderately reliable, with the exception of the Patterns and Geometry subscales. The low
reliability estimates of these scales, combined with their minimal inclusion in the booklet
structure, particularly in connection with the sampling scheme in smaller classes, indicate
that their sensitivity to differences in student achievement is problematic. Otherwise,
subscale reliability varies according to content area and booklet, in relation to the relative
emphasis on different content within specific booklets.
Exploring Evidence for Subscale Validity:
Comparisons with Weighted Colorado Data
To further explore the technical qualities of subscale scores, I compared
achievement measured on the subscales to achievement utilizing the same database, but
using weighted data and plausible values scores. Voelkl (1998) provides a content-specific
summary of Colorado's mathematics achievement along the same six content domains
measured by subscale scores. This summary uses the initial statewide sample and the
WESTAT-generated weights described in Chapter 3 and Appendix C. Table 4.3 illustrates
student mathematics performance (in terms of average percent correct by content area) as
reported by Voelkl for Colorado, compared with subscale means from my database, which
are labeled Dissertation.

Table 4.3
Comparison of Subscale Scores with Reports of Achievement Using Weighted Data
Data Source                                    Grade   Whole numbers (25 items)   Fractions & proportionality (21 items)   Measurement, estimation, & number sense (20 items)   Data representation, analysis, & probability (12 items)   Geometry (14 items)   Patterns, relations, & functions (10 items)
Voelkl, 1998 (weighted N = 103,776 students)   3       54 (-)                     33 (-)                                   37 (-)                                               53 (-)                                                    55 (-)                48 (-)
                                               4       67 (1.6)                   45 (1.6)                                 48 (1.5)                                             68 (1.7)                                                  68 (1.4)              61 (1.7)
Dissertation (N = 96 classrooms)               3       48.7%                      35.6%                                    38%                                                  53.8%                                                     45.9%                 42.2%
                                               4       61.5%                      44.9%                                    46.9%                                                65.8%                                                     56.8%                 55.2%
() Standard errors appear in parentheses. Because results are rounded to the nearest whole number, some totals may
appear inconsistent.
(Weighted data derived from Voelkl, 1998, pp. 38, 39)
In comparing the weighted data with subscale-generated data, one descriptive
pattern that emerges is the relative consistency in patterns of strong and weak performance.
This is the case, in general, even though the actual percent correct is different for each data
set. For both data sets, the lowest achievement areas across grades were Fractions and
Measurement, areas in which fourth grade students performed well below the international
averages (Voelkl, 1998). Similarly, in Data, students show relatively high levels of
achievement, particularly at the fourth grade level. However, these broad similarities do not
hold across all content areas and for all grade levels; the relatively high performance of third
grade students that the weighted data show on Geometry, for example, is lost in the
Differences may be partially explained by the low levels of reliability on some
subscales, the weighting process, the nature of the subsample, and in how percentage
correct was calculated relative to the use of weighted or unweighted data. I calculated
percentage correct based on individual student data, as described in Chapter 3. Although
specific technical information about how Voelkl calculated this is lacking in her report, other

researchers (Beaton, 1998) have utilized a different method using TIMSS data. It is likely
that Voelkl used this approach, in which the average proportion correct in each content area
is calculated from each selection of appropriate items (e.g., those mapping to the relevant
content area) to produce national averages. Sampling weights are used to calculate
proportions, in order to compensate for the unequal probability of the students in the TIMSS
samples being selected to take any given item, an artifact of the booklet sampling structure
described in chapter 3.
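A minimal sketch of this kind of weighted proportion-correct calculation follows. The weights and responses are invented; actual TIMSS computations use the WESTAT sampling weights and the full item-level data:

```python
def weighted_pct_correct(responses):
    """Weighted average proportion correct for one content area. Each entry
    is (weight, n_correct, n_attempted) for one student on the items mapped
    to that area; the weight compensates for unequal selection probability."""
    num = sum(w * correct for w, correct, _ in responses)
    den = sum(w * attempted for w, _, attempted in responses)
    return num / den

# Hypothetical: three students with unequal sampling weights, reflecting the
# unequal probability of being assigned any given item under the booklet
# sampling structure.
responses = [
    (1.5, 3, 4),  # weight 1.5; 3 of 4 items correct
    (0.8, 1, 4),
    (1.2, 4, 4),
]
print(round(weighted_pct_correct(responses), 3))
```

With equal weights the formula reduces to the simple unweighted proportion correct used for the dissertation subscales, which is one source of the discrepancies between the two rows of Table 4.3.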
In the most general sense, this analysis indicates that, while they have technical
limitations, particularly in the areas of Patterns and Geometry, subscale scores can be
somewhat useful for making broad-brush distinctions between levels of performance across
curriculum-specific dimensions of achievement. They capture broad trends in performance
that are roughly congruent with those shown using the weighted data.
Exploring Evidence of Subscale Validity:
Achievement Differences by Grade Level
Another aspect of exploring the technical quality of subscales involved examining
differences in achievement by grade level. Other researchers exploring the TIMSS and
using plausible values have found systematic differences in performance between upper and
lower-grade students (Voelkl, 1998; Schmidt, et al., 1999). An analysis of whether subscale
scores also demonstrate this sort of pattern may provide further evidence of score validity, in
terms of measure sensitivity to instructional effects. Building on the grade-level patterns
described above in the section titled Subscale Scores: Patterns of Achievement, I further
explored grade-level achievement patterns through an analysis of mean achievement by grade.
As my question about achievement was intrinsically multivariate, focusing on
performance across six different content dimensions, I conducted a Multivariate Analysis of
Variance (MANOVA). Rather than a series of six different ANOVAs (one for each

mathematics achievement measure), I first conducted the MANOVA to control for the
increased likelihood of Type I error inherent in multiple tests (Hair, et al., 1998). The results
of the MANOVA are provided in Appendix D, Tests of Between-Subjects Effects: Subscale
Achievement by Grade.
Initial exploration of the data indicated two low outliers, and these classrooms were
excluded from the study, as MANOVA is highly sensitive to outliers (Hair, et al., 1998). The
assumption of independence was met, as each classroom was coded as either third or fourth
grade level. To test the assumption of normality of the data, I conducted a Kolmogorov-
Smirnov test of univariate normality on all six subscales. Findings indicated that the
distributions of data for all mathematics subscales by grade level were normal. Results of
Box's test of equality of covariance matrices on the achievement variables indicated a
significance level of .003, which means that the covariance matrices were unequal, violating
one of MANOVA's underlying assumptions. However, according to Hair, et al. (1998), a violation of
this assumption has minimal impact if the groups are of approximately equal size. The
groups in this case were very nearly equal; the analyses were conducted using achievement
data from 44 third grade classrooms and 50 fourth grade classrooms.
I used the Wilks' Lambda criterion to determine whether individual ANOVAs were
warranted, and obtained a significance level of less than .001, indicating that, across the six
subscales, differences were sufficient to warrant six separate ANOVAs without exceeding a
.05 probability of a Type I error. All ANOVAs were significant (F values ranged from a low of
18.854 for the Measurement subscale to 29.636 for the Whole numbers subscale; df for all
subscales = 1), at a significance level of less than .001, indicating that all subscales captured
significant differences in classroom performance by grade level. Student achievement on all
subscales was significantly higher in fourth grade classrooms (see Table 4.1, Summary
Statistics of Mathematics Subscales, on p. 62).

Summary of Subscale Patterns and Technical Qualities
In this chapter, I described the subscale data patterns. I also discussed the results of
analyses that I conducted in order to estimate the technical qualities of the subscale scores,
addressing issues of subscale reliability and, to a lesser extent, the validity of subscale
measures in capturing variation in achievement across content areas and by grade level. I
found three different things:
1) With two exceptions, mathematics subscales demonstrate reasonable levels of
consistency reliability, as estimated using the coefficient alpha method. The
exceptions are the Geometry and Patterns subscales. Based on these findings, I
omitted these subscales from my analysis of achievement and OTL that was
designed to address the second research question. Reliability estimates for other
subscales vary by content area due to idiosyncrasies in sampling and booklet
structure, with mean alphas ranging between .7696 for Whole numbers and .5572 for
Measurement.
2) Subscales capture broad variation in achievement across specific content area
domains, showing areas of relative strength and weakness. Comparisons of these
unweighted data with data based on Colorado weighted achievement data indicated
that subscales captured broadly similar patterns to those described using weighted
data, in terms of relative strengths and weaknesses, although these similarities did not
hold across all content areas.
Students at both grade levels performed most strongly in the areas measuring data
analysis skills and most poorly on fractions and measurement-related items.
3) Subscales captured significant variation in student performance by grade level.
Although the distributions of classroom scores overlapped considerably by grade

level, analyses of classroom means (MANOVAs and subsequent ANOVAs) on
subscale scores by grade indicated that fourth grade classrooms performed
significantly better than third grade classrooms on all mathematics subscale
measures. This shows that, in addition to being broadly responsive to curriculum-
specific elements of achievement, these subscale scores are also relatively sensitive
to grade level differences that might be explained by a general instructional effect,
broadly defined.
Taking into account the varying requirements for technical quality depending on the
projected use of the data (AERA, 1999), most of these subscales are reasonable, moderately
reliable, and potentially valid measures of achievement. However, this is only the case
when the projected uses are low-stakes, for instance, for the purposes of learning more
about how to measure achievement. They would be problematic for any higher-stakes use
of the data, although they may be useful in providing broad-brush information about
content-specific patterns of achievement.

This chapter describes data patterns along the dimensions of OTL that I described in
Chapter 3 and illustrated in Table 3.1, titled OTL: An Operational Model and Guiding
Intermediate Research Questions. It addresses the findings of the intermediate research
questions that I used to guide my inquiry into how specific variables described classroom
processes. It also describes my various analyses. I conducted these intermediate analyses
for several reasons: to compare these data patterns to those found in other studies of OTL,
to get a general sense of the validity with which these variables capture the construct, and
to develop intermediate variables that would be used in my analysis to answer the second
research question, addressing the extent and nature of the relationships between OTL and
student achievement, using data from TIMSS as an exemplary measure.
Data Patterns in OTL Variables
Within the overall areas of Content Coverage and Instructional Strategies, I
conducted analyses of four different subtopics:
1) Curricular Focus, operationalized by an additive variable (MATHTOPS);
2) Topic Coverage, based on teacher reports of whether or not they had
covered up to 36 specific mathematics topics and subtopics;
3) Duration of Instruction on Content, based on reports of how much time
teachers spent on these 36 topics during the year prior to TIMSS; and
4) Student Learning Activities, based on the frequencies in which students
participated in six different types of learning experiences in their
mathematics classrooms.

For each subtopic, I conducted tests to determine whether OTL differed along that dimension
by grade level because of measured differences in subscale scores by grade and the logical
expectation that content would differ by grade. In the following section, I describe the
intermediate research questions, analyses used, and specific data patterns for each subtopic.
Curricular Focus
The intermediate research question that I address in this area is:
What is the extent of teacher focus (number of topics reported covered during the
current year) in content?
I answer this question primarily through the exploration of descriptive statistics on the focus
variable (MATHTOPS). A secondary research question addresses whether teacher focus
differs by grade level, and I use a one-way ANOVA to test for mean differences by grade
level (IDGRADE).
Table 5.1 shows descriptive statistics on the focus variable by grade level.
Table 5.1
Descriptive Statistics: Curricular Focus (MATHTOPS) by Grade
Grade Level Mathematics Topics Reported (of 36 possible)
N Min. Max. M S
3 38 8 29 16.39 5.11
4 51 9 35 22.02 6.08
In mathematics, the number of topics taught tended, in general, to be lower for
third grade teachers than for fourth grade teachers. This is verified by analysis of the
difference of means. A one-way ANOVA of mean topics covered confirmed a highly
significant difference (p < .001) between the curricular focus of third and fourth grade
teachers, as shown in Table 5.2, below.

Table 5.2
ANOVA of Curricular Focus: Differences by Grade Level
Source of Variance (SV) SS df MS F Sig.
Between Groups 688.952 1 688.952 21.285 <.001
Within Groups 2816.059 87 32.368
Total 3505.011 88
These data indicate that Curricular Focus, as defined by the number of topics taught,
varies considerably across these classrooms. Some teachers report teaching
almost all topics listed, while others report covering far fewer. The extent of mathematics
focus varies systematically by grade level, with fourth grade teachers teaching significantly
more topics than third grade teachers.
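The variance partition behind a one-way ANOVA like the one in Table 5.2 can be sketched directly. The topic counts below are invented, not the actual MATHTOPS data:

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over a list of
    groups, each a list of observations."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    # Between-groups sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: deviations from each group's own mean.
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    df_b = len(groups) - 1
    df_w = len(all_vals) - len(groups)
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Hypothetical topic counts for a few third- and fourth-grade teachers.
grade3 = [12, 15, 16, 18, 14]
grade4 = [20, 23, 21, 25, 24]
f_stat, df_b, df_w = one_way_anova([grade3, grade4])
print(round(f_stat, 2), df_b, df_w)
```

With two groups, df between is 1 and F reduces to the square of the two-sample t statistic, which is why Table 5.2 reports a single between-groups degree of freedom.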
However, these analyses indicate that these data do not correspond to international
reports of curricular focus; the fourth grade teachers in this sample report covering only an
average of roughly 22 topics, considerably fewer than the 32 textbook topics found by
Schmidt, et al. (1997) in the international TIMSS textbook study. Also, these data patterns
indicate that, at least on the macroscopic level, the positive relationship between focus and
achievement put forth by international researchers is not apparent here. We can already see
that fourth grade teachers report covering more topics than third grade teachers (thus
demonstrating less curricular focus), and we know from analyses of subscale achievement
that fourth grade classes have higher achievement levels.
Topic Coverage
The intermediate research question about Topic Coverage was:
What topics are most covered (e.g., by most teachers)?

This question addressed topic coverage (measured as a dichotomous variable) on 36
different mathematics topics and subtopics. A secondary question addressed differences in
topic coverage by grade level. I analyzed the data patterns by examining the frequencies of
teachers who reported coverage and noncoverage by grade level across the list of topics.
To explore differences in coverage by grade level, I conducted chi-square tests of
independence for all topics. Findings are presented here by subscale content categories,
which entailed mapping these individual variables onto broader content domains, as
described in Chapter 3, in the section titled Decisions About How to Group OTL Content
Topics, beginning on p. 57.
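Each of these grade-level comparisons is a 2x2 chi-square test of independence (covered vs. not covered, by grade). As a sketch, the following reproduces the Pearson statistic for one topic, using coverage counts inferred from the percentages reported later in Table 5.4 (roughly 14 of 47 third grade and 34 of 55 fourth grade teachers covering Common and decimal fractions):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]] (rows = grades, columns = covered / not covered)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row[i] * col[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Counts inferred from the Table 5.4 percentages (29.8% of 47; 61.8% of 55).
table = [[14, 33], [34, 21]]
stat = chi_square_2x2(table)
# With df = 1, the .05 critical value is 3.841, so this difference is significant.
print(round(stat, 2), stat > 3.841)
```

The computed statistic (about 10.44) closely matches the 10.437 reported in Table 5.4 for this topic; note that this sketch applies no continuity correction.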
Whole Numbers. OTL variables measuring four specific topics map onto Whole
Numbers. These topics are:
1) Whole numbers;
2) Place value and numeration;
3) Whole number meanings, operations, and properties; and
4) Problem solving strategies (problem solving heuristics and strategies).
Whole numbers topics are taught by large proportions of teachers at each grade
level. Additionally, there is little difference between third and fourth grade teachers in terms
of coverage; emphasis is considerable across grades. In fact, for the three topics specifically
related to whole numbers, there were empty cells for teachers who reported not covering the
topics, so chi-square tests are not usable for assessing systematic grade-level differences.
For the Problem solving variable, the chi-square test indicated no significant difference in
topic coverage across grades (significance level .382): similar numbers of third and fourth
grade teachers cover this topic.
Across all topics, including the Problem solving variable, more than 80% of teachers reported
coverage. (The Problem solving variable mapped onto all six content coverage categories,
but I report information about coverage only in this section.)

Table 5.3 illustrates the numbers and proportions of teachers reporting topic
coverage, along with the Pearson chi-square statistics in the tests of independence. In the
instances where values were fewer than 5 in any one cell, I have noted that in the table.
Table 5.3
Teacher Topic Coverage on Whole Numbers and Grade Level Differences
Topic Grade 3 Grade 4 Pearson Chi-Square
N % covering topic N % covering topic Value df 2- tailed Sig.
Whole numbers 38 100% 54 83.3% Insufficient values
Place value and numeration 47 97.9% 55 92.7% Insufficient values
Whole number meanings, operations, and properties 47 97.9% 55 90.9% Insufficient values
Problem solving strategies-- problem solving heuristics and strategies 47 89.4% 54 83.3% .765 1 .382
Fractions and Proportionality. In addition to the Problem solving variable, variables
measuring 10 specific topics map onto the Fractions and Proportionality subscale. They are:
1) Common and decimal fractions;
2) Meanings, representation and uses of decimal fractions;
3) Operations of decimal fractions;
4) Properties of decimal fractions;
5) Meaning, representation and uses of common fractions;
6) Operations of common fractions;
7) Properties of common fractions;
8) Relationships between common & decimal fractions;
9) Finding equivalent fractions and forms; and
10) Ordering of fractions (common and decimals).
Table 5.4 illustrates the numbers and proportions of teachers reporting topic coverage, along
with the Pearson chi-square statistics in the tests of independence. Variables that are
significantly different by grade level are noted.

Table 5.4
Teacher Topic Coverage on Fractions and Grade Level Differences
Topic Grade 3 Grade 4 Pearson Chi-Square
N % covering topic N % covering topic Value df 2- tailed Sig.
Common and decimal fractions ** 47 29.8% 55 61.8% 10.437 1 .001
Meanings, representation and uses of decimal fractions** 47 19.1% 55 41.8% 6.049 1 .014
Operations of decimal fractions** 47 17% 55 41.8% 7.366 1 .007
Properties of decimal fractions** 47 14.9% 55 38.2% 6.901 1 .009
Meaning, representation and uses of common fractions 47 61.7% 55 78.2% 3.315 1 .069
Operations of common fractions* 47 38.3% 55 60% 4.774 1 .029
Properties of common fractions* 47 38.3% 55 63.6% 6.519 1 .011
Relationships between common & decimal fractions** 47 8.5% 55 53.9% 15.701 1 .000
Finding equivalent fractions and forms** 47 34% 55 67.3% 11.211 1 .001
Ordering of fractions (common and decimals)** 47 17% 55 49.1% 11.564 1 .001
These data patterns indicate considerably more variation between grades for
Fractions topics than for Whole numbers topics; all topics but one, Meaning, representation
and uses of common fractions, are covered by significantly more fourth grade teachers than
third grade teachers. Additionally, considerably fewer teachers at both grade levels reported

coverage on Fractions topics than was the case for Whole numbers. This is particularly true
for topics addressing more complex content. The topic most often reported as covered
across grades is Meanings, representations, and uses of common fractions, while coverage
of more complex content, such as Operations and Properties of decimal fractions, is less
frequently reported.
Although more fourth grade teachers report covering Fractions content than third
grade teachers, there are much smaller proportions doing so at either grade level than is the
case with Whole numbers topics. At the fourth grade level, for example, less coverage is
reported on topics related to decimal fractions than on topics addressing common fractions.
Compared with around 60% of teachers who report that they have addressed the topics of
Operations and Properties of common fractions, and the generally-phrased Common and
decimal fractions, only 38%-49% of fourth grade teachers cover topics like Operations,
Properties, and Meanings of decimal fractions, Ordering of fractions, or Relationships
between common and decimal fractions.
The chi-square analyses indicate considerable grade-level articulation in Fractions.
They identified statistically significant differences by grade on all topics but one: Meaning,
representation and uses of common fractions. On this variable, approximately 78%
of fourth grade teachers cover the topic, compared to about 62% of third grade teachers.
Although this difference is not statistically significant at p<.05, it would be if the alpha level
were set at p<.10. These findings are congruent with other studies indicating that, in the
U.S., decimal fractions topics are introduced into the curriculum at fourth grade (Schmidt, et
al., 1999).
Measurement, Estimation, and Number Sense. Initial maps of topic coverage onto
the Measurement category included the following seven variables:

1) Whole numbers;
2) Place value and numeration;
3) Whole number meanings, operations, and properties;
4) Problem solving strategies-problem solving heuristics and strategies;
5) Estimation and number sense, which involves estimating quantity and size, rounding
and significant figures, estimating the results of computations (including mental
arithmetic and deciding if solutions are reasonable), and scientific notation;
6) Measurement units and processes, including ideas of measurement and units,
standard (metric) units; length, area, volume, capacity, time, money and so on, the
use of measurement instruments; and
7) Number theory, which includes topics like prime numbers, factors of whole numbers,
greatest common divisors, least common multiples, permutations, combinations, and
systematic counting.
Because the first four variables have already been explored in the context of the Whole
numbers category, I focus here only on the last three variables specific to the Measurement
subscale. Table 5.5 illustrates the numbers and proportions of teachers reporting topic
coverage on these variables, along with the Pearson chi-square statistics in the tests of
independence. Variables that are significantly different by grade level are noted.
Table 5.5
Teacher Topic Coverage on Measurement and Grade Level Differences
Topic; Grade 3 (N, % covering topic); Grade 4 (N, % covering topic); Pearson chi-square (value, df, 2-tailed sig.)
Estimation and number sense 47 91.5% 55 92.7% .054 1 .817
Measurement units and processes 47 85.1% 55 81.8% .197 1 .657
Number theory** 47 40.4% 55 76.4% 13.616 1 .000
Estimation and number sense is reported as taught by more than 90% of teachers at both
grade levels, similar to the findings on Whole numbers variables. Slightly fewer teachers
(82%-85%) addressed Measurement units and processes. In the topic of Number theory,
however, there is a considerable difference by grade; only 40.4% of third grade teachers
reported covering the topic, compared to more than three-quarters of fourth grade teachers.
The chi-square analyses indicate that, among these topics, this is the only significant
difference in coverage by grade level, with a significance level of p<.001.
Data Representation, Analysis, and Probability. Variables measuring four content
topics map onto this area. They are:
1) Problem solving strategies-problem solving heuristics and strategies;
2) Number theory;
3) Data representation and statistics, which includes collecting data from experiments
and simple surveys; representing and interpreting data (tables, charts, plots, and
graphs); means, medians, and other simple statistics; samples, uses and misuses of
simple statistics; and
4) Probability, addressing concepts of more likely and less likely and computing
probabilities (including informal computations or estimation of probabilities).
The first two variables are described in the sections above. Table 5.6 illustrates the numbers
and proportions of teachers reporting topic coverage on these variables, along with the
Pearson chi-square statistics in the tests of independence.
Table 5.6
Teacher Topic Coverage on Data and Grade Level Differences
Topic; Grade 3 (N, % covering topic); Grade 4 (N, % covering topic); Pearson chi-square (value, df, 2-tailed sig.)
Data representation and statistics 47 83% 55 74.5% 1.065 1 .302
Probability 47 51.1% 55 50.9% .000 1 .988
Similar to findings about whole numbers-related topics, these topics are relatively broadly
covered. More than 80% of third grade teachers and approximately three-quarters of fourth grade
teachers report covering Data representation and statistics in their classes, and
approximately half of the teachers at each grade level reported that they covered Probability
in their classes. The chi-square analyses indicate no significant differences by grade level.

Geometry. Variables measuring a total of five topics map onto this content area,
including the Problem solving variable discussed above. The other four topics are:
1) Perimeter, area and volume-which might include perimeter and area of triangles,
quadrilaterals, circles, and other two-dimensional shapes; calculating, estimating and
solving problems involving perimeters and areas; surface area and volume;
2) Basics of one and two dimensional geometry-number lines and graphs in two
dimensions; triangles, quadrilaterals, other polygons, and circles; equations of
straight lines; Pythagorean Theorem;
3) Congruence and similarity-concepts, properties and uses of congruent and similar
figures, especially for triangles, squares, rectangles, and other plane shapes; and
4) Transformations and Symmetry-patterns; tessellations; symmetry in geometric
figures; symmetry of number patterns; transformations and their properties.
Table 5.7 illustrates the numbers and proportions of teachers reporting topic coverage on
these variables, along with the Pearson chi-square statistics in the tests of independence.
Table 5.7
Teacher Topic Coverage on Geometry and Grade Level Differences
Topic; Grade 3 (N, % covering topic); Grade 4 (N, % covering topic); Pearson chi-square (value, df, 2-tailed sig.)
Perimeter, area and volume 47 46.8% 55 60% 1.775 1 .183
Basics of one and two dimensional geometry 47 42.6% 55 56.4% 1.933 1 .164
Congruence and similarity 47 53.2% 55 45.5% .607 1 .436
Transformations and Symmetry 47 51.1% 55 49.1% .039 1 .843
Fewer teachers cover Geometry topics than Whole numbers and Data topics. Coverage was
closer to the 50% mark for both third and fourth grade classes, compared to more than 80%
of teachers who reported teaching Whole Numbers topics across grade levels. Additionally,
there is little evidence of articulation between the grades, with similar proportions of third and
fourth grade teachers covering these topics. Chi-squares indicated no statistically significant
differences in topic coverage by grade for any of the geometry topics.

Patterns and functions. In addition to the Problem solving variable, only one topic
was included on this scale:
Functions, relations, and patterns-number patterns; properties, uses, and graphs of
functions; problems involving functions, relations and their properties.
Based on the grade level Ns of 47 third grade teachers and 55 fourth grade teachers, more
coverage of this topic took place at the third grade (in 80.9% of classes, compared to 69.1%
at fourth grade). The chi-square value of 1.845, with one degree of freedom, indicates that
this difference is not statistically significant, with an actual p value of .174.
Other mathematics topics. When variables addressing OTL topics were mapped
onto subscale content areas, eleven topics were considered age-inappropriate. They were
omitted from subsequent explorations of OTL and achievement, but are included here to help
clarify the data patterns emerging from topic coverage. These topics include:
1) Percentages-concepts of percentage; computations with percentage; types of
percentage problems;
2) Number sets and concepts, including integers (negative and positive), rational, real,
and other number sets; number bases other than ten, and exponents;
3) Estimation and error of measurements-estimation of measurements other than
perimeter and area; precision, accuracy, and errors of measurement;
4) Three dimensional figures and construction-constructions with compass and
straight-edge; three-dimensional geometry; conic sections;
5) Ratio and proportion-general definition;
6) Ratio and proportion-concepts and meanings;
7) Ratio and proportion-application and uses-maps and models; solving practical
problems based on proportionality; solving proportional equations;
8) Equations and formulas in general;
9) Linear equations and formulas-representing linear numerical situations; solving
simple linear equations;
10) Other equations and formulas-representing other numerical situations; solving other
simple equations; use of algebraic expressions and inequalities; and
11) Sets and logic-sets, set notation and set operations; classification; logic and truth
tables.
Table 5.8 illustrates the numbers and proportions of teachers reporting topic coverage on
these variables, along with the Pearson chi-square statistics in the tests of independence.

Table 5.8
Teacher Topic Coverage on Other Math Topics and Grade Level Differences
Topic; Grade 3 (N, % covering topic); Grade 4 (N, % covering topic); Pearson chi-square (value, df, 2-tailed sig.)
Percentages 47 6.4% 55 36.4% (not computed: insufficient cell values)
Number sets and concepts 47 25.5% 55 25.5% .000 1 .993
Estimation and error of measurements 47 44.7% 55 50.9% .394 1 .530
Three dimensional figures and construction 47 14.9% 55 27.3% 2.296 1 .130
Ratio and proportion-general definition 47 12.8% 55 23.6% 1.976 1 .160
Ratio and proportion- concepts and meanings* 47 14.9% 55 34.5% 5.153 1 .023
Ratio and proportion- application and uses 47 19.1% 55 36.4% 3.691 1 .055
Equations and formulas 47 36.2% 55 49.1% 1.725 1 .189
Linear equations and formulas 47 40.4% 55 52.7% 1.539 1 .215
Other equations and formulas 47 36.2% 55 47.3% 1.281 1 .258
Sets and logic 47 48.9% 55 36.4% 1.643 1 .200
Coverage of all of these topics is low across grade levels, possibly reflecting the
expert judgment that these topics are complex and age-inappropriate. For only two
topics (Estimation and error of measurements and Linear equations and formulas) do more
than one-half of teachers at either grade level report coverage-and those teachers are at the
fourth grade level. Larger proportions of fourth grade teachers report coverage than do third

grade teachers for all topics except for Sets and logic; however, only one difference is
statistically significant at p<.05, for the topic addressing Ratio and proportion-concepts and
meanings. Should significance be set at p<.10, the proportion of fourth grade teachers who
cover Ratio and proportion-application and uses would also be significantly higher than that
of third grade teachers.
Summary of Data Patterns in Topic Coverage. These data patterns emphasize topic
coverage that focuses on Whole numbers and the simpler aspects of Fractions. Most
teachers at each grade level address topics related to whole numbers. However, there is
slightly less emphasis on whole numbers at grade four than at grade three; fewer teachers at
grade four report general coverage of whole numbers. This may be explained partially by
the relatively greater emphasis on fractions-related topics at the fourth grade.
Fourth grade teachers are significantly more likely to address aspects of Fractions
than third grade teachers, most of whom do not report coverage of fractions content beyond
meanings of common fractions. This is particularly true for more complex content.
However, fourth grade coverage of fractions is broadly characterized by an emphasis on
lower-level content, particularly on properties of common fractions. Aspects of decimal
fractions and other, more complex content, are left largely unaddressed in most fourth grade
classrooms.
Teachers across grades report that they address issues of Measurement, particularly
relative to a generally-phrased variable about estimation. Roughly three-quarters or more of
teachers also report coverage of topics related to Data, although Probability is
less broadly addressed than other data-related topics. Geometry was lightly
covered across the sample; for specific geometry topics, roughly half of the teachers at each
grade reported coverage, with similar proportions at the two grade levels. These data are
similar to findings from the international TIMSS analysis indicating

that U.S. teachers tend to focus more on arithmetic than on geometric thinking (Schmidt, et
al., 1999).
Duration of Instruction on Content
The intermediate question about Duration of Instruction on Content was phrased:
What are the patterns of duration in topic coverage (e.g., which receive more class
time than others)?
Additionally, there was a secondary question addressing whether patterns of duration varied
by level. These questions were analyzed through examining teacher frequencies on
different response options (0 through 4) for variables ATBMTA through ATBMTT. As empty
cells and cell values of fewer than five in a majority of these variables precluded the use of
the chi-square to explore grade level differences, I relied on the descriptive statistics to
address the issue of grade level differences, but cannot provide statistical evidence about
them.
I provide here a descriptive summary of the results of exploring these frequencies,
organized by subscale reporting area. For reference, I have provided a series of bar graphs,
organized by subscale categories, in Appendix E, titled Frequencies of Duration of
Instruction on Content. These graphs illustrate both the proportions of teachers who cover
specific topics and, of those proportions, the distribution of teachers (in valid percentages)
who report spending 1-5 lessons, 6-10 lessons, 11-15 lessons, or more than 15 lessons on a
specific topic over the course of the current school year.
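The "valid percentages" reported in these graphs can be illustrated with a small sketch (the response codes, labels, and data below are hypothetical stand-ins for the actual ATBMTA-ATBMTT codings):

```python
# Hypothetical recoding of the duration items (codes 0-4) into the
# labels used in Appendix E, with "valid percentages" computed only
# over teachers who cover the topic (code 0 = topic not taught)
LABELS = {1: "1-5 lessons", 2: "6-10 lessons",
          3: "11-15 lessons", 4: "more than 15 lessons"}

def valid_percentages(codes):
    covered = [c for c in codes if c in LABELS]
    n = len(covered)
    return {LABELS[k]: round(100 * covered.count(k) / n, 1) for k in LABELS}

# ten hypothetical teacher responses for one topic
print(valid_percentages([0, 1, 1, 2, 4, 3, 0, 4, 4, 1]))
```

Here two of ten teachers skip the topic, so the percentages are taken over the eight who cover it.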
Whole Numbers. Data on the duration of instruction on Whole numbers topics
further highlight their importance for teachers at both grade levels. Consistent with the broad
coverage of Whole numbers topics across the sample, these topics also take up
considerable instructional time. Across grades, almost two-thirds of teachers who cover the

topic say that they devote more than 15 lesson periods to it. Lessons addressing Place value
and numeration similarly tend to be of longer duration; although fewer teachers spend more
than 15 lesson periods on this topic (37% of third grade teachers and 23.5% of fourth grade
teachers), clear majorities of teachers (67.4% at third grade and 60.8% at fourth grade)
spend 11 or more lesson periods working on this concept. Patterns are similar for the other
variables in the Whole numbers category, including the Problem solving variable; instruction
of these topics accounts for a good deal of instructional time in many third and fourth grade
classrooms.
Fractions. Content coverage of Fractions topics is much shorter in duration. For
instance, at the third grade level, the most broadly-covered fractions topic (Meanings,
representations, and uses of common fractions) was addressed by just over 60% of the
teachers, and of those teachers, almost half spent 5 lessons or fewer on the topic. This
stands in contrast to the extensive focus on Whole numbers--an area in which more than
60% of all third grade teachers spent more than 15 lesson periods. At the fourth grade, in
addition to more teachers covering fractions, they tended to spend more time, as well. Of
the 78.2% of fourth grade teachers who had taught Meanings, representations, and uses of
common fractions, approximately one-third spent more than 10 lesson periods on the topic,
compared to only 21% of the third grade teachers who covered it. Duration of instruction
varies across more complex topics for teachers at both levels, although fourth grade
teachers seem to cover topics in more depth than do third grade teachers.
It is at the more complex levels of fractions-related content (e.g., Relationships
between common and decimal fractions, Meaning, representation, and uses of decimal
fractions, and Operations of decimal fractions) that the few (generally 40-50% at fourth
grade) teachers who do address these topics tend to do so in relatively short-term ways. Of
the teachers who address these topics, between 52% and 60% spend less than 6 lessons on

them. More than 80% spend fewer than 11 lesson periods on any of them. However, more
students receive longer-term instruction on Properties of decimal fractions (of the 38.2% of
fourth grade teachers who cover the topic, almost one-half spend 11 or more
lessons on it) than on the other topics.
Measurement, Estimation, and Number sense. As noted above in discussing
patterns related to the Whole numbers category, coverage of whole-number-oriented topics
on this subscale is widespread and tends to be long-term. Additionally, duration of
instruction on the generally-phrased topic of Estimation and number sense follows a similar
pattern: roughly 30% of teachers at each level spend more than 15 lessons on these topics.
However, for the two more content-specific topics on the Measurement scale (Measurement
units and processes and Number theory), instruction is of shorter duration. Only one-third of
third grade teachers and one-quarter of fourth grade teachers who do address Measurement
units and processes spend more than 10 lessons on it. In Number theory, the duration of
coverage at fourth grade tends to be slightly longer than at third grade; only one-third of
fourth grade teachers teach this topic in 5 lessons or fewer, compared to almost one-half of
the third-grade teachers.
Data Representation, Analysis, and Probability. Of the roughly 75%-83% of
teachers who report coverage of Data representation and statistics, roughly two-thirds
dedicate 10 or fewer lessons to it. However, more than 30% of third grade teachers who
report coverage teach more than 10 lessons on the topic, compared to only about one-
quarter of fourth grade teachers. Probability receives less instructional time; approximately
two-thirds of the teachers who cover Probability do so in five lessons or less, and this pattern
holds across grade levels.
Geometry. Geometry-specific topics receive similar patterns of duration at both third
and fourth grade levels. At the third grade level, most geometry-specific topics are taught in
five lessons or fewer (by approximately 60% to 80% of the teachers who report coverage),
while fourth grade instruction is generally of slightly longer duration. Of the fourth grade
teachers who cover Perimeter, area, and volume, more than one-half do so in fewer than 6
lessons, but roughly another 40% spend from 6 to 10 lessons on this topic, and almost 10%
say they spend more than 11 lessons on it. Basics of one and two-dimensional geometry
receives some extended instruction as well at both grade levels; approximately 35% of third
grade teachers and 60% of fourth grade teachers report covering basic geometry topics in
more than 5 lesson periods. Overall, on the Geometry variables, the predominant pattern is
that of short-term coverage at both levels, although there is more evidence of extended
coverage at the fourth grade level.
Patterns and Functions. Instruction in topics related to Functions, relations, and
patterns is relatively longer than in the Geometry, Measurement, and Data categories. In the
more than 80% of third grade classrooms and almost 70% of fourth grade classrooms that
cover Functions, relations, and patterns, more extended instruction is the norm. At third
grade, more than one-third of teachers who report coverage spend 11 or more lessons on
the topic, and another 30% spend between 6 and 10 lessons. At the fourth grade level, 42%
of teachers who report coverage say they spend 1-5 lessons on the topic, but 29% spend
more than 11 lesson periods on it.
Summary of Data Patterns on Duration of Instruction. These data exhibit patterns of
considerable variation, similar to international findings (Schmidt, et al., 1999). However,
some consistent patterns exist. Across grade levels, large proportions of teachers report that
they spend extended amounts of time on relatively simple topics like Whole numbers and
Measurement (a category that includes many whole-number-oriented variables). For more
complex topics, particularly on the Fractions subscale, or topics addressing
Geometry, the pattern of duration is much more short-term. Of the teachers who cover these
topics, larger proportions teach them for shorter durations than is the case for
the simpler topics that are also more broadly covered.
Student Learning Activities
Student learning activities were measured using one block of six variables from the
teacher survey (items 28a through 28f, page TQ1-20), described in Chapter 3, p. 55. In this
section of the study, I conducted three different analyses in order to answer three
intermediate research questions. Through these analyses I explored the data patterns that
these variables captured and compared the findings with other OTL studies, as a way of
estimating the validity of these variables for capturing aspects of OTL.
Each intermediate analysis in this section of the study addressed a separate
question. One overarching question addressed the extent to which these learning activities
varied systematically by grade level, and to inform the remainder of these analyses, I first
conducted a Multivariate Analysis of Variance (MANOVA) by grade to determine if individual
ANOVAs were warranted for the six variables. Once I established whether grade level
differences existed, I examined teacher response frequencies to characterize data patterns,
an analysis which included the exploration of common strategies, defined by most (more
than 50%) of the teachers reporting them most (most lessons or every lesson) of the time.
Finally, I conducted a series of confirmatory factor analyses to examine the extent to which
individual instructional strategies as measured by these variables fell into recognizeable
reform or traditional orientations, and assigned factor scores to teachers to explore the
possible relations between groups of practices and student achievement. I describe each of
these processes below, with descriptions of intermediate research questions, analyses
conducted, and findings.

Exploring Grade-Level Differences in the Organization of Learning Activities. To
explore potential differences in instructional practice around the six learning variables, I
conducted an analysis of mean differences in frequency by grade. As this question focused
on multiple learning activities, I conducted a Multivariate Analysis of Variance (MANOVA) to
control for the increased likelihood of Type I error inherent in multiple tests (Hair et al.).
The assumption of independence was met, as each classroom was coded as either
third or fourth grade level. To test the assumption of normality, I conducted a
Kolmogorov-Smirnov test of univariate normality on all six variables; findings indicated that
the distributions of all six instructional variables were normal within each grade level.
Results of Box's test of equality of covariance matrices on these variables indicated a
significance level of .256, meaning that the covariance matrices were equal across groups.
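The univariate normality screen can be illustrated with a hand-rolled one-sample Kolmogorov-Smirnov statistic against a normal distribution fitted to the sample (a simplified sketch on hypothetical Likert-style data, not the study's actual computation):

```python
import math

# One-sample K-S statistic of `sample` against N(mean, sd) fitted to it
def ks_normal(sample):
    n = len(sample)
    mu = sum(sample) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in sample) / (n - 1))
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2))))
    # D = sup |ECDF - F|, checked just before and after each jump
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        d = max(d, abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
    return d

# hypothetical 4-point Likert-style responses for one instructional variable
print(round(ks_normal([1, 2, 2, 3, 3, 3, 4, 4, 2, 3]), 3))  # 0.224
```

A statistic this size would then be referred to the K-S sampling distribution for its p-value; the discreteness of Likert data makes this at best an approximate screen.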
Using the Wilks' Lambda criterion to determine whether individual ANOVAs were
warranted, I obtained a significance level of .922, indicating that, across the six variables,
separate ANOVAs to check for grade-level differences were unwarranted. This indicates
that there are no statistically significant differences on the organization of learning activities
practices by grade level. Therefore, I conducted the following three analyses of instructional
aspects of OTL using data from all participating teachers, rather than by grade level.
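The gatekeeping logic of this step (compute Wilks' Lambda, and proceed to univariate ANOVAs only if it indicates group separation) can be sketched from first principles on hypothetical data for two of the six variables, using Lambda = det(E)/det(T), the ratio of the within-groups to total sum-of-squares-and-cross-products determinants:

```python
# One-way MANOVA building block: Wilks' Lambda for two groups and two
# dependent variables, computed from scratch. The data are hypothetical
# Likert-style responses, not the actual TIMSS teacher data.
grade3 = [(2, 3), (3, 3), (2, 4), (3, 2), (4, 3)]
grade4 = [(3, 3), (2, 4), (3, 2), (4, 3), (3, 3)]

def mean(rows):
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)

def sscp(rows, center):
    """2x2 sum-of-squares-and-cross-products matrix about `center`."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        d = (x - center[0], y - center[1])
        for i in range(2):
            for j in range(2):
                m[i][j] += d[i] * d[j]
    return m

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# E = within-groups SSCP (pooled about each group mean);
# T = total SSCP about the grand mean; Wilks' Lambda = det(E) / det(T)
e3, e4 = sscp(grade3, mean(grade3)), sscp(grade4, mean(grade4))
E = [[e3[i][j] + e4[i][j] for j in range(2)] for i in range(2)]
T = sscp(grade3 + grade4, mean(grade3 + grade4))
wilks = det2(E) / det2(T)
print(round(wilks, 3))  # 0.974: close to 1, i.e., little between-grade separation
```

Lambda near 1 corresponds to the non-significant result reported here; only a Lambda well below 1 (significant by its F approximation) would justify follow-up univariate ANOVAs.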
Data Patterns in Learning Activities. The first intermediate research question was:
How do teachers organize student learning activities in their classes?
To investigate this question, I analyzed descriptive statistics and frequencies on the six
variables, assuming ratio level data from the 4-point Likert response format. Findings are
described for all third and fourth grade teachers in Table 5.9 below.