Citation
Reliability, validity, user acceptance, and bias in peer evaluations of self-directed interdependent work teams

Material Information

Title:
Reliability, validity, user acceptance, and bias in peer evaluations of self-directed interdependent work teams
Creator:
Thompson, Robert S
Publication Date:
2000
Language:
English
Physical Description:
xv, 160 leaves ; 28 cm

Subjects

Subjects / Keywords:
Group work in education ( lcsh )
Engineering students ( lcsh )
Peer review ( lcsh )
Self-directed work teams ( lcsh )
Engineering students ( fast )
Group work in education ( fast )
Peer review ( fast )
Self-directed work teams ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 155-160).
General Note:
School of Education and Human Development
Statement of Responsibility:
by Robert S. Thompson.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
47035205 ( OCLC )
ocm47035205
Classification:
LD1190.E3 2000d .T46 ( lcc )

Full Text
RELIABILITY, VALIDITY, USER ACCEPTANCE, AND BIAS IN PEER
EVALUATIONS OF SELF-DIRECTED INTERDEPENDENT WORK TEAMS
by
Robert S. Thompson
Professional Engineers Degree, Colorado School of Mines, 1969
M.B.A., University of Houston, 1977
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Educational Leadership and Innovation
2000


This thesis for the Doctor of Philosophy
degree by
Robert S. Thompson
has been approved
by
Rodney Muth
Ruth Streveler
Date
Brent G. Wilson


Thompson, Robert S. (Ph.D., Educational Leadership and Innovation)
Reliability, Validity, User Acceptance, and Bias in Peer Evaluations of Self-
Directed Interdependent Work Teams
Thesis directed by Professor Laura D. Goodwin, Ph.D.
ABSTRACT
Teamwork education has become increasingly important over the last decade.
Peer evaluations are being used as a source of information to provide feedback to
team members as well as a mechanism to determine an individual grade from a
group grade. Peer evaluations have an appeal because the team members are in the
best position to observe the team skills of their fellow team members. Despite this
advantage, peer evaluations can be abused and may have undesirable effects on
individuals in the group. Because of these concerns and the corresponding use of
peer evaluations, there is a need for a better understanding of the reliability,
validity, and bias in peer assessments of students working on self-directed
interdependent problem-solving teams.
Generalizability theory techniques, structured interviews, and survey data were
used to answer research questions related to the reliability, validity, user
acceptance, and bias in peer evaluations. Seniors from a multidisciplinary team
focused engineering design course were selected for the study. The level of
consensus in the peer evaluations was relatively low. Expected behavior is
believed to account for a significant proportion of the rater effects but further
research is needed to confirm this hypothesis. Friendship bias was a factor in the
peer evaluations but was not significant in reducing the reliability or validity of the
measurements. In the case of multiple measurements of the same construct,
residual variance accounted for approximately 60% of the total variance in the peer
ratings. The relatively low level of consistency between multiple measures is
believed to reflect the contextual nature of teamwork. Ratings at any given point in
time more closely reflect the current stage of team development and the current
activities of the team. Convergent validity for effort applied to the task and
technical knowledge applied to the task was high. User reaction to the peer
evaluations was generally positive with only limited targeting of specific
individuals by team members. Although common practice is to use peer
evaluations for evaluative purposes, the emphasis should be placed on
developmental uses of the ratings rather than evaluative.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Laura D. Goodwin


DEDICATION
I dedicate this dissertation to my parents, Leonard and Arlene Thompson, my wife,
Geri, and my daughter, Katy, for their love and support for me. My daughter, a
student at the University of Colorado at Boulder, is waiting for the day she can call
her dad, Dr. Dad. I am too!


ACKNOWLEDGMENT
My thanks to the faculty of the Graduate School of Education at the University
of Colorado at Denver. The School's core values of "learning should lead to
effective practice," "learning should be applied and focused on difficult problems of
practice," and "students' plans of study must be individualized to reflect their
interests and roles" were achieved. Thanks for the support over the past five years.
In particular, thanks to my advisor, Laura D. Goodwin, for her guidance, Brent G.
Wilson for his continuing enthusiastic support over the last five years, Rodney
Muth for his candid reviews, and to Ruth Streveler (Director of the Center for
Engineering Education at the Colorado School of Mines) for her review and
support.
The Multidisciplinary Petroleum Design Course, the source of data for this
research, has evolved over the years. There are three points in time in which
significant contributions were made by our graduate students. The pilot years were
influenced by former graduate student Andrew L. Prestridge. He spent long hours
contributing both ideas and elbow grease. Later, Major John Sutton developed the
strategy for multidisciplinary problem-solving and an instruction methodology that
is currently the backbone of our multidisciplinary education efforts. More recently,
graduate student Carole Edwards Knight refined the peer evaluation instruments


and developed the training program for team skills. Carole Edwards Knight
deserves another thanks for conducting the voluntary confidential interviews used
as a data source for this research.
My thanks to the faculty team for the Multidisciplinary Petroleum Course. In
particular, John B. Curtis from the Department of Geology and Geological
Engineering and Tom Davis from the Department of Geophysics and Geophysical
Engineering. Thanks to Jennifer Miskimins, my teaching assistant for the course,
for making possible the seamless transition from data collection to student feedback
that was required for this research. Thanks to Clark Huffman, a graduate student,
for automating the printing of the peer assessment instrument, and to Chris
Cardwell, our department secretary, for printing and coordinating the collating of
the instrument. Finally, my thanks to our Department Head, Craig W. Van Kirk.
He has always supported our efforts to improve multidisciplinary education at the
Colorado School of Mines.
Last, but not least, thanks to the Spring 2000 class taking the Multidisciplinary
Petroleum Design Course. They provided a rich data set for my research. Over the
years, our graduates have made us look good. We appreciate their hard work.
And of course, "Keep up the good work!"
If it is not obvious, let me finish by saying, "It is a gratifying experience to be
part of a great team effort."


CONTENTS
Figures.................................................................xii
Tables..................................................................xiv
CHAPTER 1: INTRODUCTION...................................................1
General Problem.......................................................1
Specific Problem and Research Questions...............................3
Methodology Overview..................................................4
Participants.......................................................5
Data Collection....................................................6
Data Analysis......................................................6
Theoretical Framework.................................................8
Theoretical Model..................................................9
Model Assumptions.................................................14
Model Implications................................................15
Consensus.........................................................15
Validity..........................................................19
Variance Partitioning.............................................21
Structure of Dissertation............................................22
CHAPTER 2: LITERATURE REVIEW.............................................24
Overview.............................................................24
Methods of Peer Evaluation........................................24
Typology for Review...............................................25
User Acceptance and Bias......................................25
Reliability and Validity......................................26
Purpose, Confidentiality, and Method of Peer Evaluation.......26
Small Group Peer Assessment in Higher Education......................27
User Reaction and Bias............................................27
Reliability and Validity..........................................31
Purpose, Confidentiality, and Method of Peer Evaluation.............35
Summary..............................................................37


CHAPTER 3: METHODS.........................................................39
Participants...........................................................39
Course Overview....................................................40
Description of Teamwork............................................41
Description of Projects............................................42
First Project...................................................42
Second Project..................................................43
Team Composition...................................................44
Methods................................................................46
Generalizability Theory............................................46
Research Design....................................................47
Statistical Model Description......................................48
Relative Variance Partitioning..................................50
Relative Rater Effect...........................................50
Relative Ratee Effect...........................................53
Relative Relationship Effect....................................53
Zero Relative Variance..........................................53
Individual Level Correlations...................................54
Dyadic Correlation..............................................56
Model Limitations...............................................56
Procedures.............................................................57
Instructional Phase................................................59
Strategy for Multidisciplinary Integration......................59
Team Skills.....................................................59
Project Phase......................................................62
Individual Interviews..............................................64
Survey.............................................................65
Summary................................................................65
CHAPTER 4: RESULTS.........................................................66
User Reaction Research Question 1....................................67
Interviews.........................................................68
Mid-Point First Project.........................................68
End First Project...............................................69
Mid-Point Second Project........................................72
End Second Project..............................................74
Survey.............................................................76
Summary............................................................78
Variance Partitioning Research Question 2............................80
Relative Variance Partitioning.....................................81
Reliability of Rater and Ratee Effects.............................83


Bias Research Question 3..........................................85
Interviews.......................................................86
Survey...........................................................91
Statistical Model................................................92
Summary..........................................................95
Stability of Peer Evaluations Research Question 4.................96
Validity of Peer Evaluations Research Question 5.................101
Impact of Group Membership Research Question 6...................102
Summary............................................................104
Research Question 1 User Reaction to Peer Evaluations...........104
Research Question 2 Variance in Peer Evaluations................104
Research Question 3 Bias in Peer Evaluations....................105
Research Question 4 Stability of Peer Evaluations...............105
Research Question 5 Validity of Peer Evaluation.................105
Research Question 6 Impact of Group Membership..................106
CHAPTER 5: CONCLUSIONS.................................................107
Discussion of Findings.............................................108
User Reaction...................................................108
Reliability Variance Partitioning.............................110
Ratee Effects.................................................111
Rater Effects.................................................114
Relationship Effects..........................................120
Summary.......................................................120
Bias.............................................................121
Stability of Ratings.............................................123
Validity...........................................................126
Group Membership...................................................127
Limitations........................................................128
SRM Specifications...............................................128
Interviews....................................................129
Survey........................................................131
Summary.......................................................133
Team Skills Instrument...........................................133
Motivation.......................................................135
Recommendations for Further Work...................................135
Conclusions........................................................136
Implications for Practice..........................................138


APPENDIX
A: TEAM SKILLS INSTRUMENTS.....................................140
Back-up Behavior............................................141
Communication...............................................142
Coordination................................................143
Feedback....................................................144
Team Leadership.............................................145
Team Orientation............................................146
Effort Applied to the Task..................................147
Technical Knowledge Applied to the Task.....................148
B: STUDENT INTERVIEW QUESTIONS.................................149
Mid-Point First Project.....................................149
End First Project...........................................150
Mid-Point Second Project....................................151
End Second Project..........................................152
C: END OF COURSE SURVEY QUESTIONS..............................153
REFERENCES.....................................................155


FIGURES
Figure
1.1: Conceptual model for inter-rater consensus after Kenny (1991)...............10
1.2: Consensus as a function of acquaintance and overlap. The
weighting factors for unique impression (k) and for stereotype (w)
are 0.0. Similar meaning systems (r2) is assumed to be 0.50 and
limits the level of consensus. Within rater consistency (r1) is
assumed to be 0.10..................................................16
1.3: Consensus as a function of acquaintance and overlap. The
weighting factor for unique impression (k) is 1.0 and for stereotype
(w) is 0.0. Similar meaning systems (r2) is assumed to be 0.50 and
limits the level of consensus. Within rater consistency (r1) is
assumed to be 0.10..................................................17
1.4: Consensus as a function of acquaintance and overlap. The value for
within rater consistency (r1) is increased to 0.50 compared to 0.10
in Figures 1-2 and 1-3. The weighting factor for unique impression
(k) is 1.0 and for stereotype (w) is 0.0. Similar meaning systems
(r2) is assumed to be 0.50 and limits the level of consensus........18
1.5: Consensus as a function of acquaintance and communication at zero
overlap. The parameter values are the same as Figure 1-2. Similar
meaning systems (r2) is assumed to be 0.50 and limits the level of
consensus for the no communication case.............................19
1.6: Consensus and accuracy as a function of acquaintance. No
communication and 100% overlap. The parameter values are the
same as Figure 1-2. Similar meaning systems (r2) is assumed to be
0.50 and limits the level of consensus..............................21
3.1: Team Task Performance Strategy for Multidisciplinary Teams..........60
3.2: Average peer evaluations for Team A, Mid-point of the first
project. Unique, confidential, codes were assigned to each team
member..............................................................63
3.3: Self-evaluations for Team A, Mid-point first project. Unique codes
were assigned to each team member...................................64
4.1: Framework for presenting results of peer evaluations................67


4.2: Survey Question 1, Is the peer assessment process fair? 48/49
responded...........................................................76
4.3: Survey Question 2a, Did you personally benefit from the process?
47/49 responded.....................................................77
4.4: Survey Question 2b, Did the feedback make any difference? 48/49
responded...........................................................77
4.5: Response to Survey Question 3, Was there bias in any of your peer
evaluations? 49 (100%) responded....................................91
5.1: Response to survey question, communication between raters about
how you were going to rate each other. Forty-nine (100%)
responded..........................................................132
5.2: Response to survey question, communication between raters on
rating a specific individual. Forty-nine (100%) responded..........132


TABLES
1.1: Description of Parameters in Theoretical Model for Consensus.........12
2.1: Summary of Purpose, Confidentiality, and Method of Peer
Evaluations in Higher Education........................................36
3.1: Gender and Discipline of Students Enrolled in Multidisciplinary
Petroleum Design Course, Spring 2000 Semester.........................40
3.2: Team Discipline and Gender Composition First Project..................45
3.3: Team Discipline and Gender Composition Second Project.................45
3.4: Round Robin Research Design...........................................48
3.5: Example of 100% Relative Rater Effect................................50
3.6: Example Demonstrating Rater and Ratee Effects Accounting for
Missing Self Data......................................................51
3.7: Example of 100% Relative Ratee Effect.................................53
3.8: Example of 100% Relative Relationship Effect..........................54
3.9: Example of Zero Relative Variance.....................................54
4.1: Summary of Research Questions Peer Evaluation in Self-Directed
Work Teams.............................................................66
4.2: User Reaction to Peer Assessment Mid-Point First Project (n = 8)......69
4.3: User Reaction to Peer Assessment End First Project (n = 9)............70
4.4: User Reaction to Peer Assessment Mid-Point Second Project (n =
10)....................................................................72
4.5: User Reaction to Peer Assessment End Second Project (n = 8)...........74
4.6: Portion of Variance Due to Rater, Ratee, and Relationship Effects
Mid-Point First Project................................................82
4.7: Portion of Variance Due to Rater, Ratee, and Relationship End First
Project................................................................82
4.8: Portion of Variance Due to Rater, Ratee, and Relationship Effects
Mid-Point Second Project...............................................83
4.9: Portion of Variance Due to Rater, Ratee, and Relationship Effects
End Second Project.....................................................83
4.10: Reliability of Rater and Ratee Effects Mid-Point First Project.......84
4.11: Reliability of Rater and Ratee Effects End First Project.............84
4.12: Reliability of Rater and Ratee Effects Mid-Point Second Project......85


4.13: Reliability of Rater and Ratee Effects End Second Project............85
4.14: Bias in Peer Assessment Ratings Mid-Point First Project (n = 8)....86
4.15: Bias in Peer Assessment Ratings End First Project (n = 9)..........88
4.16: Bias in Peer Assessment Ratings Mid-Point Second Project (n =
10)......................................................................89
4.17: Bias in Peer Assessment Ratings End Second Project (n=8)...........90
4.18: Relationship (Friendship Bias) Correlation Mid-Point First Project.93
4.19: Relationship (Friendship Bias) Correlation End First Project.......94
4.20: Relationship (Friendship Bias) Correlation Mid-Point Second
Project..................................................................94
4.21: Relationship (Friendship Bias) Correlation End Second Project......95
4.22: Proportion of Variance that is Stable and Unstable Peer Ratings
Mid-Point and End First Project..........................................99
4.23: Proportion of Variance that is Stable and Unstable Peer Ratings
Mid-Point and End Second Project........................................100
4.24: Correlation Coefficients for Criteria and Ratee Effect First Project.101
4.25: Correlation Coefficients for Criteria and Ratee Effect Second
Project.................................................................101
4.26: Proportion of Variance that is Stable and Unstable for First and
Second Project........................................................103
5.1: Summary of Research Questions Peer Evaluation in Self-Directed
Work Teams..............................................................109
5.2: Partitioning Relative Rater Effect into Response Set and Residual
Rater Effect............................................................118
5.3: Comparison of Sources of Variance Found in Hennen Research
with Current Research...................................................125
5.4: Communication Between Raters Mid-Point First Project (n = 8).......129
5.5: Communication Between Raters End First Project (n = 9).............130
5.6: Communication Between Raters Mid-Point Second Project (n = 10).....130
5.7: Communication Between Raters End Second Project (n = 8)............131


CHAPTER 1
INTRODUCTION
General Problem
Teamwork education has become increasingly important over the last decade.
In 1996, the Accreditation Board for Engineering and Technology (ABET), the sole
agency responsible for the accreditation of engineering programs, approved new
standards for accreditation reviews. The new standards, Engineering Criteria 2000,
require programs to demonstrate specific skills. One specific criterion is the need
to demonstrate that graduates have "an ability to function on multidisciplinary
teams" (ABET, 2000, p. 32). In a recent survey conducted at the Purdue School of
Engineering, over 76% of the students who responded reported that they had been
involved as members of student work teams (486 out of 1,953 responded) (Goodwin & Wolter,
1998). This emphasis on teamwork skills stems from the widespread use of teams
in industry.
Effective teams are characterized by team members that exhibit the following
team skills: cooperation, feedback, backup behavior, coordination, team
orientation, and team leadership (Morgan, Glickman, Woodard, Blaiwes, & Salas,
1986). In addition to these team skills, team members must apply effort to the task,
and technical knowledge to the task (Hackman & Morris, 1975). Individual
accountability, including team member effort, is a critical component for
cooperative group work (Slavin, 1990). Although theory has been developed to
minimize free-rider effects (Kerr & Bruun, 1983), free-riders are a concern to
educators responsible for team oriented projects. A related issue is the problem of
group grading. Millis and Cottell (1998) recommend against undifferentiated
group grades for group projects. This recommendation stems from the need for
individual accountability and a desire for fairness in grading.
Thus, improving team performance and accounting for individual contributions
to a group project are two concerns for teamwork educators. Peer evaluations are
being used as a source of information to provide feedback to team members
(McGourty, Dominick, & Reily, 1998; Thompson, 2000) as well as a mechanism to
determine an individual grade from a group grade (Andersen, 2000).
Peer evaluations as a source of information for small self-directed group work
have an appeal because the team members are in the best position to observe the
team skills of their fellow team members. Despite this advantage, concerns have
been levied against the use of peer evaluations. Abson (1994), for example,
suggested that peer evaluations can be abused and have undesirable effects on
individuals in the group. Mathews (1994) studied peer assessment of small group
work in a management studies program. He noted patterns of response included
giving all group members the same score, collusion between group members, and
potential ganging-up on one member. Mathews also noted that perceptions can
vary between people, accounting for some of the variability. Mathews' comments
were based on his observations; he did not report any statistical data to support his
claims.
Design projects are a common source of teamwork in engineering education.
Teamwork in these settings is characterized by three attributes: team members
having a common goal, dependence on each other to achieve their goal, and intense
work over an extended period of time. Because of the extended nature of the group
work and the interdependence among team members, friendships have time to
develop over the duration of the project. In many cases, friendships were formed
prior to the group work. With this nature of the teamwork and the corresponding
use of peer evaluations, there is a need for a better understanding of the reliability,
validity, and bias in peer assessments of students working on these interdependent,
self-directed, problem-solving teams. These concerns are the basis for the specific
research questions that are proposed in the following section.
Specific Problem and Research Questions
The research focuses on the specific problem of the reliability, validity, bias,
and user acceptance of peer evaluations in small interdependent self-directed
problem-solving teams in education. Generalizability theory techniques, structured
interviews, and survey data were used to answer the following six specific research
questions:
Research Question 1
What is the user reaction to peer evaluations?
Research Question 2
How much of the variance in peer evaluations is due to rater, ratee, and
relationship effects?
Research Question 3
What is the level of bias in peer evaluations?
Research Question 4
What is the stability of peer evaluations?
Research Question 5
What is the level of validity in peer evaluations for the team skills of technical
knowledge applied to the task and effort applied to the task?
Research Question 6
What is the impact of group membership on the level of consensus and stability
in peer evaluations?
Methodology Overview
This section provides an overview of the participants in the study, data
collection, and data analysis.


Participants
Seniors from the disciplines of geology and petroleum engineering at the
Colorado School of Mines were selected for the study. There were 49 students in a
senior capstone design class (Multidisciplinary Petroleum Design) taught during
the Spring 2000 semester. The Multidisciplinary Petroleum Design course is a
team focused applied problem-solving course. During the semester, student teams
worked on two major design projects. Each project was approximately 6 to 7
weeks in duration. Team assignments were random with the constraint that each
team should have one nonpetroleum engineer. Ten teams were formed for each
project. For each project, there were nine teams with five members and one team
with four members. In many cases, team members have known each other for over
two years. The projects were sufficiently long for friendships to form, regardless of
whether or not the team members knew each other at the beginning of the project.
The assigned problems were open-ended and required the input from multiple
disciplines and data sources. Students define specific objectives, plan, and
schedule their work to meet deadlines set by the faculty team. Thus, the teamwork
can be described as interdependent self-directed teamwork. Based on these
parameters, the participants were ideal for studying the reliability, validity, user
acceptance, and bias in peer evaluations conducted on interdependent self-directed
work teams.


Data Collection
Two peer evaluations were conducted for each project: one near the mid-point
of the project and one at or near the end of each project. The peer evaluation
feedback was kept confidential. Each student was presented a copy of the average
peer evaluations given on each of eight identified team skills. Student names were
not included on the evaluation feedback. At the beginning of the semester, each
student was given a unique code.
The confidential peer feedback data were given to the students at the project
mid-point and end-of-project review sessions. These sessions followed in the same
week in which the peer data were collected. Concurrent with the review sessions,
confidential interviews were conducted by a third party. Thus, four confidential
interview sessions were conducted. Eight to ten students were interviewed at each
session. A total of thirty-five students were interviewed. The interview questions
focused on user acceptance, bias in the ratings, and communication between raters.
Finally, the structured confidential interviews were augmented by an end of course
survey completed by all the participants. The survey questions paralleled the
interview questions.
Data Analysis
The data from the four peer evaluations were analyzed using the statistical
technique referred to as generalizability theory (Cronbach, Gleser, Nanda, &
Rajaratnam, 1972; Goodwin & Goodwin, 1991). Reported data on reliability and
validity commonly use correlation techniques, percentage of agreement,
comparison of average scores (with ANOVA), and rank order comparisons. There
are limited examples applying generalizability theory to peer ratings of small
groups (Hennen, 1996; Kenny, Lord, & Garg, 1983; Montgomery, 1986). The
group work in the Montgomery research was limited in duration to two 15-minute
sessions. Kenny, Lord, and Garg used data from research conducted by Lord,
Phillips, and Rush (1980). In the Lord, Phillips and Rush study, the group work
was limited to four 15-minute sessions. Hennen (1996) was the only example
found that applied the generalizability technique to interdependent self-directed
group work in an education setting. In Hennen's research, each team performed
simulated self-directed interdependent group work. The groups worked together
for three class periods.
Generalizability theory focuses on identifying multiple sources of variation that
occur simultaneously in any measure. The focus is on variance and correlations
rather than differences in means. The social relations model (SRM) (Kenny & La
Voie, 1984) was used in this research. The model uses a two-way random effects
ANOVA. The two factors, rater and ratee, are the independent variables. Each
factor is one of the roles in the dyadic process. Each level for each factor is one of
the team members where the team members are randomly selected. The model
partitions the ratings into three sources of variation that are of interest in peer
assessments: 1. What is the tendency for raters to give similar ratings (rater effect)?
2. What is the level of consensus among raters (ratee effect)? and 3. What is the
variance unaccounted for by the rater and ratee effects (rater by ratee interaction)?
Using correlations derived from variance/co-variance matrices, the SRM measures
the interdependencies that exist in the peer assessment data.
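To make the partitioning concrete, the following sketch shows a simplified decomposition of a round-robin rating matrix into rater, ratee, and relationship components. It uses row and column means rather than the exact SRM estimators, and the team size and ratings are hypothetical.

```python
# A minimal sketch (not the exact SRM estimators) of splitting a round-robin
# peer-rating matrix into rater, ratee, and relationship (residual) parts.
# The team size and ratings below are hypothetical.
import numpy as np

ratings = np.array([                # ratings[i, j] = rating given by rater i to ratee j
    [np.nan, 4.0, 3.0, 5.0, 4.0],
    [3.0, np.nan, 3.0, 4.0, 3.0],
    [5.0, 5.0, np.nan, 5.0, 5.0],   # this rater rates everyone high (a rater effect)
    [2.0, 3.0, 2.0, np.nan, 3.0],
    [4.0, 4.0, 3.0, 5.0, np.nan],
])                                  # diagonal (self-ratings) is excluded

grand = np.nanmean(ratings)
rater_effect = np.nanmean(ratings, axis=1) - grand    # row means: a rater's general tendency
ratee_effect = np.nanmean(ratings, axis=0) - grand    # column means: consensus about a ratee
fitted = grand + rater_effect[:, None] + ratee_effect[None, :]
relationship = ratings - fitted                       # dyad-specific residual (plus error)

total = np.nanvar(ratings)
for name, part in [("rater", np.var(rater_effect)),
                   ("ratee", np.var(ratee_effect)),
                   ("relationship", np.nanvar(relationship))]:
    print(f"{name:13s} {100 * part / total:5.1f}% of total variance (approximate)")
```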
Theoretical Framework
A theoretical model developed by Kenny (1991) provides a framework for
understanding the interdependency of factors that determine the level of consensus
and validity in peer assessment data. The theoretical model was used to understand
the factors that influence the partitioning of the variance into rater effects, ratee
effects, and rater by ratee interaction. This mathematical model is a modified
version of Anderson's (1981) weighted average model. The model variables are
presented in the following paragraphs. The model, as presented, is a generalization
of the Spearman-Brown prophecy formula from classical test theory. This
relationship will also be demonstrated in subsequent paragraphs.
Consensus in peer ratings is a measure of the extent of agreement between two
raters on their impressions of a specific person being rated. The person being rated
is referred to as the ratee in this research. Although consensus and validity are
related topics, consensus does not necessarily imply validity (Kenny, 1991).
Generally, validity does imply consensus. It is both theoretically (Hastie &
Rasinski, 1988) and empirically (Kenrick & Stringfield, 1980) possible for two
raters to not agree and be partially accurate. This can occur when there are two
independent sources of variance to which the raters have differential access. In this
case, the raters may disagree but both can be partially accurate. Accordingly,
Kenny (1991) stated that, technically, consensus is neither a necessary nor
sufficient condition for validity. However, questions of consensus and validity are
closely related.
Theoretical Model
The theoretical model, Figure 1.1, includes three components that influence
rater perception and inter-rater consensus: meaning attached to observed behavioral
acts, meaning attached to stereotypes, and meaning not attached to observed
behavioral acts or stereotypes (unique component). The schematic shown in Figure
1.1 is for two raters interacting with one ratee. The following terminology is used
in the model:
An, the nth behavior act
Sjn, meaning (scale value) given by rater j, act n
Ij, the impression formed by the jth rater
Sju, unique meaning (scale value) not attached to observed behavioral acts or
stereotypes by rater j
k, the weighting factor for the unique meaning
Sjp, stereotype meaning (scale value) by rater j
w, weighting factor for stereotype meaning
a, the degree to which raters influence one another
Figure 1.1: Conceptual model for inter-rater consensus after Kenny (1991)
The variables that determine the level of consensus are defined as follows and
summarized in Table 1.1:
1. Acquaintance (n). Acquaintance is the amount of information to which the
rater is exposed.
2. Overlap (q). Overlap is the extent that two raters observe the ratee at the
same time.
3. Consistency within a rater across acts (r1). Within-rater consistency is the
correlation between S11 and S12, as shown in Figure 1.1. This can also reflect
the consistency of the ratee's acts.
4. Shared meaning systems (r2). The extent to which an act is given the same
meaning by two raters; the correlation between S12 and S22, as shown in Figure
1.1.
5. Consistency between raters across acts (r3). The model assumes the between-
rater consistency correlation equals r1 x r2.
6. Agreement between raters about stereotypes (r4). The extent to which the
raters agree with each other about stereotypes; the correlation between
S1P and S2P, the meanings (scale values) attached to stereotypes by raters 1 and
2.
7. Assumed consistency within a rater between stereotypes and an act (r5). The
correlation between S1P and S11, the meanings (scale values) attached to
stereotypes and behavioral act 1 by rater 1.
8. Consistency between a rater's evaluation of a stereotype and another rater's
evaluation of an act (r6). This parameter can be viewed as a "kernel of truth"
since it represents the correlation between truth (the ratee's behavior) and
the stereotype that the rater has about the ratee's behavior. The kernel-of-
truth correlation equals r4 x r5.
9. Communication between raters (a). The degree to which raters influence one
another.
The rater impression is the weighted average of each of these components. This
relationship is shown in the following equation for two raters observing one ratee
over n behavioral acts:
$$I_1 = \frac{wS_{1P} + kS_{1u} + \sum_{j=1}^{n} S_{1j}}{w + k + n} + aI_2 \qquad \text{(Equation 1-1)}$$
where:


Table 1.1
Description of Parameters in Theoretical Model for Consensus

Parameter   Description
n           Acquaintance (number of acts each rater observes)
q           Overlap
r1          Within-rater consistency (correlation between S11 and S12)
r2          Similar meaning systems (correlation between S12 and S22)
r3          Between-rater consistency (r3 = r1 x r2)
r4          Agreement about stereotypes (correlation between S1P and S2P)
r5          Assumed "kernel of truth" in stereotypes (within a rater; correlation between S1P and S11)
r6          Kernel of truth (between raters; r6 = r4 x r5)
w           Weighting factor for stereotype
k           Weighting factor for unique impression
a           The degree that raters influence (communicate with) each other
c           Consensus; correlation between two raters over multiple targets
c'          Consensus (correlation between two raters over multiple targets) adjusted for communication between raters
I1, the impression formed by rater 1
w, the weighting factor for stereotype impressions
S1P, the stereotype meaning attached to the ratee by rater 1
k, the weighting factor for unique impressions (impressions not attached to
observed behavior acts or stereotypes)
S1u, the unique meaning attached to the ratee by rater 1 that is not based on
observed behavioral acts or stereotypes
ΣS1j, the summation of the meaning attached to the n behavioral acts by rater 1
a, the degree to which the two raters influence each other
I2, the impression formed by rater 2
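As a sanity check on Equation 1-1, the short sketch below computes a rater's impression as the weighted average just defined; the function name and all values are illustrative assumptions, not data from the study.

```python
# Hedged sketch of Equation 1-1: a rater's impression as a weighted average of
# stereotype meaning, unique meaning, and act-level meanings, plus the influence
# of the other rater. All values are illustrative, not data from the study.
def impression(S_1P, S_1u, acts, w, k, a, I_2):
    # acts holds the scale values S_1j that rater 1 attaches to the n observed acts
    n = len(acts)
    return (w * S_1P + k * S_1u + sum(acts)) / (w + k + n) + a * I_2

# With no stereotype or unique weighting and no communication, the impression is
# simply the average of the act-level scale values.
print(impression(S_1P=0.0, S_1u=0.0, acts=[3, 4, 4, 5, 4], w=0.0, k=0.0, a=0.0, I_2=0.0))  # 4.0
```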


As shown, the behavioral acts are weighted equally while individual weighting
factors are applied to the stereotype (w) and unique (k) impression variables. The
degree of consensus (c) between a pair of raters evaluating a common set of ratees
with no communication effect (a = 0) is defined by the following equation (Kenny,
1994).
$$c = \frac{w^2 r_4 + 2wnr_6 + qnr_2(1 - r_1) + n^2 r_1 r_2}{k^2 + w^2 + n^2 r_1 + n(1 - r_1) + 2wnr_5} \qquad \text{(Equation 1-2)}$$
where:
w, the weighting factor for stereotype impressions
k, the weighting factor for unique impressions (impressions not attached to
observed behavior acts or stereotypes)
n, the number of behavioral acts observed by each rater
q, the fraction of observations that the raters have in common
r1, consistency within a rater across acts; also can reflect the consistency of the
ratee's behavior
r2, the extent that an act is given the same meaning by two raters
r4, agreement between two raters about stereotypes
r5, consistency within a rater between a stereotype and an act
r6, consistency between a rater's evaluation of a stereotype and another rater's
evaluations of an act


This equation for consensus reduces to the Spearman-Brown prophecy formula
(Crocker & Algina, 1986, p. 119) if q, k, w, r4, r5, and r6 are set to 0.0 and r2 is set
to 1.0. In this case, r1 is analogous to the inter-item correlation in the Spearman-
Brown prophecy formula, and n is analogous to the number of items. This
reduction is shown in the following equation using the nomenclature used in the
consensus model:

$$c = \frac{nr_1}{1 + (n-1)r_1} \qquad \text{(Equation 1-3)}$$
Thus, the theoretical model for consensus is a generalization of the Spearman-
Brown prophecy formula (Kenny, 1991) used to estimate the reliability of a
composite of parallel tests when the reliability of one of these tests is known. The
consensus model adds the unique impression, stereotype, and r2, r3, r4, r5, and r6
parameters.
Finally, the consensus correlation adjusted for communication between raters is
shown below:
$$c' = \frac{c + a^2c + 2a}{1 + a^2 + 2ac} \qquad \text{(Equation 1-4)}$$
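The sketch below implements Equations 1-2 through 1-4 as reconstructed above; the parameter values are assumptions chosen only to illustrate the Spearman-Brown reduction and the effect of communication, not estimates from the study.

```python
# Minimal sketch of the consensus model (Equations 1-2 to 1-4 as reconstructed
# above). All parameter values below are illustrative assumptions.
def consensus(n, q, r1, r2, r4, r5, r6, w, k):
    """Consensus between two raters with no communication (Equation 1-2)."""
    num = w**2 * r4 + 2 * w * n * r6 + q * n * r2 * (1 - r1) + n**2 * r1 * r2
    den = k**2 + w**2 + n**2 * r1 + n * (1 - r1) + 2 * w * n * r5
    return num / den

def spearman_brown(n, r1):
    """Spearman-Brown form (Equation 1-3)."""
    return n * r1 / (1 + (n - 1) * r1)

def with_communication(c, a):
    """Consensus adjusted for communication between raters (Equation 1-4)."""
    return (c + a**2 * c + 2 * a) / (1 + a**2 + 2 * a * c)

# With q = k = w = r4 = r5 = r6 = 0 and r2 = 1, Equation 1-2 collapses to Equation 1-3.
print(consensus(n=10, q=0, r1=0.3, r2=1.0, r4=0, r5=0, r6=0, w=0, k=0))  # ~0.81
print(spearman_brown(n=10, r1=0.3))                                      # ~0.81

# Communication can push consensus well above the r2 ceiling (compare Figure 1.5):
print(with_communication(c=0.50, a=0.5))  # ~0.93
```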
Model Assumptions
The model contains several assumptions (Kenny, 1991). First, individual acts
are weighted equally with equal weights for both raters. Second, the number of
acts observed is the same for each rater. Third, the unique impression for a rater is
independent of the unique impression of the other rater. Fourth, the consistency
between raters across acts (r3) is assumed to be the product of the within-rater
consistency correlation (r1) and the extent to which an act is given the same
meaning by two raters (r2). This assumes that the partial correlation of rater i's
scale value for act j with rater k's scale value for act m, controlling for rater k's
scale value for act j, is zero. Finally, it is assumed that the communication (a)
between raters influences only the impressions, not the individual acts or scale
values.
Model Implications
The theoretical model is used to demonstrate several important determinants of
consensus and accuracy in peer evaluations. These implications are presented in
the following sections.
Consensus. The limiting factor (maximum value) for consensus is r2, the
extent to which an act is given the same meaning by two raters. This statement
assumes that there is no communication between raters. Consensus increases
rapidly for cases where overlap is high. Thus, if communication between raters is
zero, overlap and similar meaning systems control the level of consensus. In
Figure 1.2, stereotypes and unique impression are given weighting factors of zero.
This figure demonstrates the importance of overlap in the observations, and the
maximum value of consensus being equal to the assumed value of 0.50 for r2
(similar meaning systems).
Figure 1.2: Consensus as a function of acquaintance and overlap. The weighting
factors for unique impression (k) and for stereotype (w) are 0.0. Similar meaning
systems (r2) is assumed to be 0.50 and limits the level of consensus. Within rater
consistency (r1) is assumed to be 0.10.
Figure 1.3 is presented to demonstrate the impact of adding a nonzero value for the
unique impression weighting factor. Again, the value for r2 (similar meaning
systems) is assumed to be 0.50. This figure again demonstrates the importance of
similar meaning systems and overlap. The impact of adding a nonzero unique
impression weighting factor is to increase the number of observations required to
reach a given level of consensus. This is most noticeable in the 100% overlap (q =
1.0) case.
Figure 1.3: Consensus as a function of acquaintance and overlap. The weighting
factor for unique impression (k) is 1.0 and for stereotype (w) is 0.0. Similar
meaning systems (r2) is assumed to be 0.50 and limits the level of consensus.
Within rater consistency (r1) is assumed to be 0.10.
In Figures 1.2 and 1.3, the level of within-rater consistency (r1) was assumed to be
0.10, a low value. Keeping the value for r2 (similar meaning systems) at 0.50 and
increasing r1 from 0.10 to 0.50, the level of consensus can reach the maximum
value of 0.50 at relatively low levels of acquaintance. This is demonstrated in
Figure 1.4. The weighting factors for unique impression and stereotype are 1.0 and
0.0, respectively. The important concept here is that acquaintance may not be a
significant variable in reaching the maximum value for consensus in peer ratings
when there are moderate levels of within rater consistency. Within rater
consistency also reflects the consistency of the ratees behavior.
Figure 1.4: Consensus as a function of acquaintance and overlap. The value for
within rater consistency (r1) is increased to 0.50 compared to 0.10 in Figures 1-2
and 1-3. The weighting factor for unique impression (k) is 1.0 and for stereotype
(w) is 0.0. Similar meaning systems (r2) is assumed to be 0.50 and limits the level
of consensus.
Finally, communication between raters can mask all the other parameters. This
is demonstrated in Figure 1.5. In this figure, the assumptions are the same as in
Figure 1.2. Again, the limiting value for consensus is 0.50, the value for r2 (similar
meaning systems), when there is no communication between raters. Under the
same assumptions but with the communication between raters factor of 0.50 added,
the level of consensus increases rapidly to 0.90. Figure 1.5 demonstrates the
importance of understanding the level of communication between raters that takes
place in peer evaluations.
Figure 1.5: Consensus as a function of acquaintance and communication at zero
overlap. The parameter values are the same as Figure 1-2. Similar meaning
systems (r2) is assumed to be 0.50 and limits the level of consensus for the no
communication case.
Validity. Some researchers assume that consensus implies accuracy (Kenny,
1991). As Kenny points out, this is true if the raters do not communicate and there
is no overlap in the observation of the acts (q = 0). In this case, the square root of
the consensus correlation can be used to determine the maximum level of accuracy.
This fits classical test theory, where for two equally valid measures of a construct,
the validity coefficient equals the square root of the correlation between the two
indicators. For any other situation, it is unlikely that consensus can be used to
forecast accuracy.
Accuracy is the square root of the consensus correlation with the overlap terms
dropped. This equation is shown below:
$$c = \frac{w^2 r_4 + 2wnr_6 + n^2 r_1 r_2}{k^2 + w^2 + n^2 r_1 + n(1 - r_1)} \qquad \text{(Equation 1-5)}$$
As demonstrated in the section on consensus, as the level of acquaintance
increases, consensus reaches a limit equal to r2, the extent to which an act is given
the same meaning by two raters. This is not the case for the level of accuracy.
Accuracy will continue to increase since it is not affected by overlap. These
statements assume that r1 (within rater consistency) is positive and less than 1.0. A
comparison of accuracy and consensus (assumes 100% overlap) is demonstrated in
Figure 1-6. Under these assumptions, it is clear that consensus is not a proxy for
accuracy (Kenny, 1991).
Figure 1.6: Consensus and accuracy as a function of acquaintance. No
communication and 100% overlap. The parameter values are the same as Figure 1-
2. Similar meaning systems fa) is assumed to be 0.50 and limits the level of
consensus.
Variance Partitioning. The theoretical model has implications for the variance
partitioning that is performed in the statistical model (SRM). Stereotypes that are
unique to a rater but apply to all ratees are reflected in the SRM's rater effect. The
level of consensus is reflected in the SRM's ratee effect. As discussed in the previous
sections, consensus (and therefore ratee effect) is attributable to overlap in the
observations, similar meaning systems between raters, consistency within a rater
(includes consistency between acts), agreement about stereotypes, and
communication between raters. Finally, the rater-ratee interaction (referred to as
relationship effect in this research) is attributable to unique impressions, lack of
similar meaning systems, and the lack of overlap in observations. The rater by
ratee interaction also captures ratings that are unique because of friendships that
may exist. The SRM model estimates the correlation between pairs of raters. For
example, what is the relationship between how rater A rates B and rater B rates A
for a given variable. This correlation is used as an indication of the level of
friendship bias in the ratings (Kenny et al., 1983). Interpretations of SRMs rater,
ratee, and relationship effects as well as the dyadic relationship correlations are
discussed in more detail in Chapter 3, METHODS.
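The dyadic correlation described above can be illustrated with a short sketch: after rater and ratee effects are removed, pair A's residual rating of B with B's residual rating of A and correlate across dyads. The function name and residual matrix below are hypothetical.

```python
# Hedged sketch of the dyadic (relationship) correlation used as an indicator of
# friendship bias: correlate A's residual rating of B with B's residual rating of A.
# The relationship residuals below are hypothetical.
import numpy as np

def dyadic_correlation(relationship):
    # Pair r_ij with r_ji for each unordered dyad (i < j); the diagonal is unused.
    n = relationship.shape[0]
    ab = [relationship[i, j] for i in range(n) for j in range(i + 1, n)]
    ba = [relationship[j, i] for i in range(n) for j in range(i + 1, n)]
    return float(np.corrcoef(ab, ba)[0, 1])

resid = np.array([
    [0.0,  0.6, -0.2, -0.3, -0.1],
    [0.5,  0.0,  0.1, -0.4, -0.2],
    [-0.1, 0.2,  0.0, -0.2,  0.1],
    [-0.3, -0.5, -0.1, 0.0,  0.9],
    [-0.1, -0.3,  0.2, 0.8,  0.0],
])
# A strongly positive value suggests reciprocated inflation (or deflation) within
# dyads, consistent with friendship bias; a value near zero suggests little of it.
print(dyadic_correlation(resid))
```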
Structure of Dissertation
The following is an outline of the structure of the dissertation.
Chapter 1 introduces the problem and the specific research questions. An
overview of the research methods, participants, data collected, and data analysis to
answer the specific research questions are also presented. This is followed by a
theoretical model that is used to help understand the factors impacting rater, ratee,
and rater by ratee interaction in peer evaluations.
Chapter 2 presents a summary of the literature related to peer evaluations in
self-directed interdependent problem-solving teams in education. An overview of
three methods of peer evaluation is presented along with a typology for organizing
the reported research results. The final section summarizes the salient issues
discovered in the review.
Chapter 3 provides a detailed description of the participants in the study, the
statistical model, procedures used to collect the peer assessment data and to provide
feedback, instruction in the use of the peer evaluation instrument, and questions
included in the structured interviews and final course survey.
The results of the statistical and qualitative data analysis of the peer data are
presented in Chapter 4. The chapter presents the results organized by each of the
six research questions.
In the final chapter, Chapter 5, the results are discussed. Chapter 5 also
includes a discussion of the limitations of the study, recommendations for further
work, conclusions, and implications for practice.


CHAPTER 2
LITERATURE REVIEW
The focus of the literature review is on the uses of peer evaluations in the
context of interdependent, self-directed, problem-solving teams in higher
education. First, an overview of peer evaluation methods and the typology for
organizing the review are presented. This is followed by a description of peer
evaluation literature using the typology, and a summary of the literature review.
Overview
Methods of Peer Evaluation
There are three basic methods of peer assessment: peer nomination, peer rating,
and peer ranking (Kane & Lawler, 1978). Peer nomination consists of having each
member of a group select a member or members as possessing the highest standing
on a rating dimension. Often the group members are asked to select the member or
members that have the lowest standing on the same rating dimension. If multiple
members are selected, a priority ranking is often performed. Self-nominations are
usually excluded in the nomination process.


Peer rating consists of having each member of a group rate each other member
of the group on a given set of performance characteristics. Behaviorally anchored
rating scales are common practice. Each interval on a behaviorally anchored rating
scale is anchored with a description of the level of the construct being rated. A
modification of peer ratings is forced peer ratings. In some peer rating examples,
the evaluators are directed to allocate a total of 100 points among group members.
These cases will be referred to as forced peer ratings.
Peer ranking is the final assessment method. This method consists of each
member of the group ranking all the other members of the group from high to low
on a set of performance characteristics.
Typology for Review
The literature review findings are organized into three categories: user
acceptance and bias, reliability and validity, and reported purpose, confidentiality,
and method of peer evaluation. The definitions for each category are described
below.
User Acceptance and Bias. User acceptance is the degree to which members of
the group react positively or negatively to the peer assessment process. A method
is considered free from bias if there is no systematic tendency for the peer
assessment scores to be influenced by anything other than the behavior being
measured.


Reliability and Validity. Reliability in this research is focused on the internal
consistency or agreement among multiple raters and the stability or agreement
between two or more measures of the same characteristics made at two different
points in time. Validity is the degree of relationship between a measured behavior
and a criterion that is believed to be a measure of the true standing on the measured
behavior.
Purpose, Confidentiality, and Method of Peer Evaluation. The reported purpose
of the evaluation will be classified as either evaluative, feedback, or combination.
Peer evaluations used to determine a component of a team members grade are
classified as evaluative. Peer evaluations used for developmental purposes are
classified as feedback. If the purpose includes both evaluation and feedback
objectives, the purpose is classified as a combination of evaluative/feedback.
The level of confidentiality captures the privacy of the evaluator in the peer
evaluation. If the rater and corresponding ratings are made public, then the level of
confidentiality would be classified as public. If the rater and corresponding ratings
are kept confidential, then the level of confidentially would be classified as
anonymous.
The method of peer evaluation is classified as either peer nomination, peer
ranking, or peer rating using the definitions provided by Kane and Lawler (1978).
In some peer rating examples, evaluators are directed to allocate a total of 100
points among group members. These cases will be referred to as forced peer
ratings.
Small Group Peer Assessment in Higher Education
The following sections present a description of cases found in higher education
in which peer evaluations were used in the context of teams working on self-
directed, interdependent tasks. The literature review is focused on reported
descriptions of user acceptance and bias, reliability and validity, and purpose,
confidentiality, and method of peer evaluation. In each section, the references are
listed in temporal order from oldest to the most recent.
User Acceptance and Bias
Burnett and Cavaye (1980) reported on the user perception of peer evaluations
used by fifth year medical students in the Department of Surgery at the University
of Queensland. The authors stated that approximately 78% (175 students
responding) felt that they had made a fair and responsible assessment of their peers,
2% indicated that they had not, and 20% were not sure.
Falchikov (1988) reported on the use of peer evaluations to determine
individual contributions to group work for four students working on a project in a
psychology course. Falchikov stated that students felt that the calculation of the
final mark was fair and accurately reflected the working of the group.


Farh, Cannella, and Bedeian (1991) used a quasi-experimental design to
determine the effects of purpose (evaluative versus developmental) on peer rating
quality and user acceptance. User acceptance based on student recommendations
for future use was more favorable under the developmental conditions compared to
the evaluative conditions. There were no significant differences in level of
friendship bias between the two treatment groups based on responses to a survey
question.
Keaten and Richardson (1993) used peer evaluations in a speech
communications class to assess individual contributions to a group project. They
reported that approximately 88% of the 110 participants thought peer assessment
was fair. Approximately 79% of the students thought that the peer evaluation form
was accurate. The authors recognized the need to assess the reliability and validity
of their instrument. The authors also stated a need for peer assessment training.
Finally, the authors stated the desirability of administering the instrument twice
during the semester.
Saavedra and Kwun (1993) reported on peer evaluations of business students
working on self-managing teams. Saavedra and Kwun investigated the potential
bias in peer evaluations based on individual contributions to the group work. They
reported that individuals making strong contributions to the group project were
more discriminating in the peer evaluations than below-average or average
contributors. The outstanding contributors perceived the peer rating system to be
most fair. Saavedra and Kwun advised caution in using peer evaluations for evaluative purposes in group work with "a moderately high level of task interdependence and group-level rewards" (p. 459). This caution is based on the finding that the peer-rating process in a self-managing group context "promotes anchoring and adjustment heuristics as well as self-enhancement biases that compromise the interrater reliability of performance ratings" (pp. 459-460). Saavedra and Kwun also stated that peer ratings may serve as useful feedback under conditions of pooled and sequential interdependence (p. 460).
Abson (1994) studied one self-directed team of five working on a marketing
research project. Abson was concerned about individual accountability in group
project work. Interviews were conducted approximately one month into the project
and again at the end of the course. Abson reported a definite bias in the peer
ratings. Abson also believes that in this particular case, the peer assessment
process created dysfunctional behavior. Interview responses indicated that one
peer was targeted by one rater. In this case, the ratings were reduced. Some team
members felt that the dynamics changed for the worse after the first evaluation.
Some worked harder after the first evaluation but others only worked hard enough
to get a good grade. One team member admitted that he tried to be seen as doing
more even though he probably was not. The results from this dysfunctional team led Abson to conclude that peer evaluations may make students work harder in some cases. However, peer evaluations may also generate dysfunctional behavior
(targeting of students) and thus impair the validity of peer evaluations.
Mathews (1994) reported on the development and use of a peer evaluation
system for small group work in a marketing course at the University of Teesside,
England. The group project was a significant effort in the class and addressed a
real management problem supplied by industry. The author was interested in the
extent to which the final individual grade reflects individual contributions to the
group project. From a review of the responses, there were indications of leniency
in the ratings, no variance in the ratings, collusion or targeting of individuals, and
varying perceptions between raters. Mathews concluded that peer evaluation "forms a part of the assessment but is not sufficiently robust to be the sole source of informational input" (p. 19). Statistical data were not reported.
In an introductory engineering design class at Humboldt State University, Eschenbach (1997) conducted confidential peer evaluations at the mid-point and at the end of the semester. The final peer evaluations accounted for 15% of the students' final grade. Each team member received an anonymous copy of
comments made by other team members. Later, Eschenbach and Mesmer (1998)
reported on the results of a questionnaire completed by 55 out of 150 students in
the engineering design class completing the peer evaluations. Approximately 81%
of the students responding found the evaluation useful. Approximately 73% felt
that they had learned more about team dynamics by completing the peer
evaluations.
Druskat and Wolff (1999) conducted peer appraisals for developmental
purposes in self-directed work groups. Druskat and Wolff reported that peer
appraisals can have a positive effect on relationships and task focus and are
influenced by the timing of the evaluations relative to the project deadline.
Layton and Ohland (2000) used peer evaluations in project teams where the
majority of the students were African-American. The peer evaluations were used
to assign individual grades from group grades for design projects in a junior-level
mechanical engineering course. Data analysis includes ratings given and ratings
received organized by gender and race/ethnicity. Layton and Ohland reported a
definite trend in the ratings given and received by minorities. The highest ratings
were given by minorities to non-minorities, and the lowest were given by non-minorities to minorities. The authors hypothesized that students "seem to base ratings on perceived abilities instead of real contributions" (p. 6).
Reliability and Validity
Burnett and Cavaye (1980) reported on the validity of peer evaluations used by
fifth year medical students in the Department of Surgery (University of
Queensland). The authors calculated the correlation coefficient (r = 0.99) between the average peer evaluation and the final course grade as an indication of validity.
Falchikov (1988) reported on peer evaluations of a four person group working
on a cooperative project in a developmental psychology course. Rank ordering was
used as an indication of validity. In each case, the rank ordering of peers was
identical. Falchikov (1993), in a follow-up study, reported on the rank ordering of task and maintenance functions for a group of seven. Kendall's coefficient of concordance for the task and maintenance functions was 0.70 and 0.28, respectively.
Clark (1989) investigated the reliability and validity of a peer evaluation instrument used to assign grades for group projects in a marketing research course. The methodology relied on rank comparisons using the Spearman rank correlation coefficient. Group rankings based on average peer evaluations were compared to students' perceived rank using self-evaluations. The Spearman rank correlation coefficients ranged from 0.70 to 0.94 (four groups) for this analysis. Group rankings based on average peer evaluation were also compared to students' perceived rank using a pair comparison technique. The Spearman rank correlation coefficients ranged from 0.89 to 0.96 (six groups) for this comparison. Finally, Clark used a rank comparison of the peer ratings at the midterm to the final rankings as an indication of reliability. The Spearman rank correlation coefficients in this case ranged from 0.61 to 0.88. The peer evaluations were used to determine 30% of the students' grade in the course.
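As a point of reference for the rank-comparison approach Clark used, the short sketch below computes a Spearman rank correlation between hypothetical average peer ratings and self-evaluations for a single group; the numbers are invented for illustration and are not Clark's data.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical average peer ratings and self-evaluations for a five-member group
    avg_peer_rating = np.array([4.2, 3.1, 4.8, 2.9, 3.6])
    self_evaluation = np.array([4.0, 3.5, 4.6, 3.0, 3.8])

    # Spearman's rho compares the two rank orderings of the group members
    rho, p_value = spearmanr(avg_peer_rating, self_evaluation)
    print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")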
Farh, Cannella, and Bedeian (1991) used a quasi-experimental design to
determine the effects of purpose (evaluative versus developmental) on peer rating
quality. Peer rating quality included leniency, uniformity bias (raters give peers
nearly the same rating), halo error (raters do not discriminate across dimensions for
a given ratee), and inter-rater reliability. Farh, Cannella, and Bedeian reported that
purpose of peer ratings had a significant impact on the quality of peer ratings. Peer
ratings for evaluative purposes had greater leniency, greater halo effect, more
uniformity, and less inter-rater reliability than peer evaluations conducted for
developmental purposes.
Rafiq and Fullerton (1996) used peer evaluations in a civil engineering course
at the University of Plymouth, United Kingdom to determine an individual grade
from a group grade using a complex two part method developed by Goldfinch
(1994; 1990). The authors found that it was possible for an individual to "contribute little ... and to receive unfairly high marks causing the marks of the high performing student to suffer" (p. 73). A modification to the method was tested using the same data. This method gave better results according to Rafiq and Fullerton. Rafiq and Fullerton stated that the method was useful in "the fair assessment of group work in large classes ... and the students perceived that it was fair and relevant" (p. 79). Reported statistical support was sparse.
Hennen (1996) applied generalizability theory and the Social Relations Model
(SRM) (Kenny, 1994) to study peer appraisal in simulated self-directed work
groups. Undergraduate students rated each other four times. One evaluation took
place at the beginning of class before group work commenced. The other
evaluations were completed at the end of three working sessions. The group task
was to manufacture words and package them into sentences. A raw material
word or phrase was provided from which new words and phrases were developed.
The group task required cooperation among the group members. Hennen's research was the only study found that provided a direct basis for comparison to the research in this study. Reliability, measured by the proportion of variance that is due to consensus, was 15%. Hennen measured convergent validity by comparing a criterion value to the level of consensus (ratee effects). The correlation coefficients ranged from 0.28 to 0.33 for the construct of individual performance.
McGourty, Dominick, and Reilly (1998) reported on peer evaluation data collected from 158 engineering students working on team design projects at New Jersey Institute of Technology. Faculty ratings on the same constructs were significantly correlated with students' average team peer ratings across all learning outcomes. The size of the correlations was not reported.
Van Duzer and McMartin (1999) addressed the issue of response set in peer
ratings in an engineering teamwork setting. Response set was reduced by reversing several items. Lack of variance in some items was addressed by removing these items from the assessment. Improvements were noted by changing the scale from "strongly agree, agree, disagree, strongly disagree" to "agree, tend to agree, tend to disagree, disagree" (p. 3). Improvements were measured by comparing the variability in the ratings in the pilot instrument to a revised
instrument. Van Duzer and McMartin also reported that the accuracy of the ratings
was validated by interviews with team members after they had completed the
instrument. They reported strong agreement between the interview data and the
instruments. One exception noted by Van Duzer and McMartin was "the lack of correlation between a member's effort as discussed in the interview and as rated on the forms" (p. 6).
Ohland and Layton (2000) applied a nested single-facet generalizability study
(G-study) design (Crocker & Algina, 1986) to estimate inter-rater reliability for two
peer evaluation instruments. The G-study results are an estimate of how well "a single rater's score approximates the true score that would be obtained if enough raters evaluated each student" (p. 3). Ohland and Layton reported inter-rater reliabilities of 0.34 and 0.41 for the two instruments.
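The single-rater reliability Ohland and Layton report can be viewed as an intraclass correlation from a one-way random effects ANOVA with raters nested within ratees. The sketch below illustrates that general calculation with invented ratings; it is not a reproduction of their G-study.

    import numpy as np

    # Hypothetical ratings: rows = ratees (students), columns = raters nested within each ratee
    ratings = np.array([
        [3.0, 4.0, 3.0],
        [2.0, 2.0, 3.0],
        [5.0, 4.0, 4.0],
        [3.0, 3.0, 2.0],
    ])
    n_ratees, k_raters = ratings.shape

    grand_mean = ratings.mean()
    ratee_means = ratings.mean(axis=1)

    # One-way random effects ANOVA mean squares
    ms_between = k_raters * np.sum((ratee_means - grand_mean) ** 2) / (n_ratees - 1)
    ms_within = np.sum((ratings - ratee_means[:, None]) ** 2) / (n_ratees * (k_raters - 1))

    # Variance component for ratees and the reliability of a single rater's score (ICC(1,1))
    var_ratee = max((ms_between - ms_within) / k_raters, 0.0)
    single_rater_reliability = var_ratee / (var_ratee + ms_within)
    print(f"Estimated single-rater reliability: {single_rater_reliability:.2f}")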
Purpose, Confidentiality, and Method of Peer Evaluation
The purpose, confidentiality, and method of peer evaluation are summarized in
Table 2.1. The results summarized in the table indicate clear patterns for the
purpose, confidentiality, and method of peer evaluation. The reported purpose was
dominated by evaluative applications. The issue of grading and individual
accountability in small group work is the main reason for the emphasis on
evaluative peer evaluations. Essentially all of the peer evaluations were collected
and reported anonymously. The only noted exception was the work by Druskat
and Wolff (1999). Finally, peer ratings are the dominant method of peer
evaluations. There was one example where forced peer ratings were used.
Table 2.1
Summary of Purpose, Confidentiality, and Method
of Peer Evaluations in Higher Education
Author Purpose Method Confidentiality
Abson (1994) Evaluative Peer ratings Anonymous
Burnett and Cavaye (1980) Evaluative Peer ratings Not reported
Clark (1989) Evaluative Peer ratings Anonymous
Druskat and Wolff (1999) Feedback Peer ratings Public
Eschenbach and Mesmer (1998) Evaluative Peer ratings Anonymous
Falchikov (1988) Evaluative Peer ratings Not reported
Falchikov (1993) Evaluative Peer ratings Not reported
Farh, Cannella, and Bedeian (1991) Evaluative and Feedback treatment groups Peer ratings Anonymous
Hennen (1996) Simulated Evaluative (grades not assigned) Peer ratings Anonymous
Keaten and Richardson (1993) Evaluative Peer ratings Anonymous
Layton and Ohland (2000) Evaluative Peer ratings Anonymous
Mathews (1994) Evaluative Peer ratings Anonymous
McGourty, Dominick, and Reilly (1998) Feedback Peer ratings Anonymous
Rafiq and Fullerton (1996) Evaluative Peer ratings Anonymous
Saavedra and Kwun (1993) Evaluative Forced peer ratings Anonymous
Van Duzer and McMartin (1999) Not Reported Peer ratings Not reported
Summary
Clear patterns emerge from the literature review of peer evaluations in self-
directed interdependent work groups in higher education. Reported user
acceptance was positive, the common method of evaluation was peer ratings, and
the evaluations were anonymous. Several authors reported a bias in the ratings or a
concern over bias. The evaluative use of the peer evaluations is driven by the
desire to identify individual contributions in group work. Comprehensive research
by Farh, Cannella, and Bedeian (1991) indicates a potential trade-off in using peer
evaluations for evaluative purposes. The trade-off was greater leniency, greater
halo effect, more uniformity, and less inter-rater reliability in peer evaluations
conducted for evaluative versus developmental purposes. Each of these factors
reduces peer evaluation reliability.
There was a range of methods used to estimate the reliability and validity of
peer evaluations. Falchikov (1988; 1993) and Clark (1989) relied on rank correlations. Farh, Cannella, and Bedeian (1991) used the Pearson product-moment correlation and ANOVA methods. Ohland and Layton (2000) and Hennen (1996) applied generalizability theory.
Ohland and Layton applied a nested single-facet generalizability study (G-
study) design (Crocker & Algina, 1986) to estimate inter-rater reliability for two
peer evaluation instruments. The inter-rater reliability was 0.34 and 0.41 for two
peer evaluation instruments. Hennen used a round robin design and the same
statistical model (Kenny, 1994) used in this research. Hennen reported relatively
low levels of consensus when peer measurements were partitioned into variance
due to the rater, the ratee, and rater by ratee interaction. The variance attributed to
consensus (a measure of inter-rater reliability) was 0.15. Hennen also reported
relatively low convergent validity correlations (ranging from 0.28 to 0.33). Clark (1989), using rank correlations, reported relatively high reliability (ranging from 0.61 to 0.88) and validity (ranging from 0.70 to 0.96).
In summary, there is a clear need for a better understanding of the reliability
and validity of peer ratings conducted in self-directed, interdependent group work
in education.
CHAPTER 3
METHODS
This chapter has three main sections: participants, methods, and procedures.
The Participants section includes a description of the participants, an overview of
the course in which the participants were enrolled, the type of teamwork involved,
and the gender and discipline composition of the teams included in the study. The
next section, Methods, provides an overview of generalizability theory, the research
design, and a description of the Social Relations Model (SRM) statistical model.
The final section, Procedures, describes the peer evaluation process, the peer
evaluation instruments, team skills training, and the anonymous peer feedback
procedure. Also included are the questions for the structured confidential
interviews that were conducted at four points during the semester, and the questions
included on a final course survey.
Participants
Seniors from the disciplines of geology and petroleum engineering at the
Colorado School of Mines were selected for the study. The participants were
students in a senior capstone design class (Multidisciplinary Petroleum Design) that
was required for all students in the Petroleum Engineering program and was an
option for students from the Geology and Geological Engineering program. The
course was also open to seniors from the Geophysical Engineering program.
However, no students from the Geophysical Engineering program were enrolled in
this particular semester. There were 49 students in the course. The following table
summarizes the demographic characteristics (discipline and gender) of the students
enrolled in the course during the Spring 2000 semester. As shown in the table,
petroleum engineers and males dominated the composition of the teams.
Table 3.1
Gender and Discipline of Students Enrolled in
Multidisciplinary Petroleum Design Course, Spring 2000 Semester
                     Gender
Discipline    Male    Female    Count    Percentage
GE            6       1         7        14%
PE            33      9         42       86%
Count         39      10        49
Percentage    80%     20%                100%
Note: PE = Petroleum Engineering, GE = Geological Engineering
Course Overview
The Multidisciplinary Petroleum Design course was a team-focused applied problem-solving course. The instruction team consisted of faculty from the programs of Geology and Geological Engineering, Geophysical Engineering, and Petroleum Engineering. Each program had a Teaching Assistant (TA) assigned to the course. There was a training phase lasting approximately 2 1/2 weeks. During
this phase, students were instructed in the use of a performance strategy (Sutton &
Thompson, 1998) for working open-ended problems in a multidisciplinary
environment, the identification and application of team skills, and the use of a team
skills instrument. The assigned problems require information from multiple
sources to solve and do not have a unique solution.
The training phase was followed by two major projects. Each project was
approximately 6 to 7 weeks in duration. Students were instructed to assume that
the faculty and TA team were management. Management did not have a set of lectures but was available for support if there were content items that teams believed needed to be addressed at the class or subgroup level. However, requests for support had to be made to management. Self-learning was encouraged. The following
sections describe the type of teamwork, and the gender and discipline of the
students in each of the teams included in the research.
Description of Teamwork
The teams in the Multidisciplinary Petroleum Design course were similar to
self-directed work teams (SDWTs) found in industry. SDWTs (also referred to as
taskforces) are characterized by (Gersick & Davis-Sacks, 1990):
1. a limited life,
2. usually heterogeneous membership because of the diverse needs of the project,
3. a limited time frame to solve a specific problem,
4. members who may not know each other and their capabilities,
5. non-routine work, and
6. a mix of autonomy (self-directed) and dependence (client).
Teams structured in this way are a challenge to the team members. There is no clear path to the goal, members may not know the capabilities of the individual team members, and the team may not know the contribution that can be made by each of the disciplines. In these cases, "the skills of team members-and those who establish and manage the teams-in dealing with uncertainty and heterogeneity strongly influence the ultimate effectiveness of task forces" (Gersick & Davis-Sacks, 1990, p. 154).
Description of Projects
Two major projects were worked by the student teams during the semester.
Each project lasted approximately six to seven weeks. The following sections
describe each of the projects.
First Project. The first project was the unitization of a small oil field. The goal
of a unitization project is to determine the equitable sharing of production revenues
and costs in cases where there are multiple owners before unitization. For example,
Company A may own Well 1 and 2, and Company B owns Well 3, prior to
unitization. After unitization, the oil field is operated as if there were a single
owner. With unitization, owners are only concerned about total production and
costs and not with the production from a specific well (e.g., wells previously owned by Company A before the unitization). Unitization permits more efficient
development of natural resources in cases where fluids, such as water, are injected
into the reservoir to improve the recovery of oil. Since unitization removes
individual well ownership, the selection of which wells to convert to water
injection is based on efficient oil recovery for the entire reservoir.
Teams were randomly assigned ownership in different wells in the oil field.
Teams were asked by management to conduct a pre-negotiation unitization study.
Teams were to propose a unitization formula that best represents their Company's equitable share of the oil field. Teams were also asked to estimate a range of
ownership expectations for management. Data available for the analysis included
production data, well logs, core, two-dimensional (2-D) seismic lines, fluid
properties, and reservoir pressures. The data set provides the opportunity to
integrate data and information from multiple sources. The data were needed to
estimate the spatial distribution (and ownership) of the reservoir fluids. Thus, the
project requires the cooperative efforts of the team members. There was also a
competitive component between teams. The actual negotiations were not a part of
the task.
Second Project. The second project was an offshore field development project.
Management requested that teams select two potential development well prospects,
compare the two development sites, and recommend one of the locations. Data
available for the analysis included production data, well logs, core, three-dimensional (3-D) seismic volumes, fluid properties, and reservoir pressures. The 3-D data and
analyses were critical components for development well selection. New well
performance was based on analogy from existing wells and reservoir performance.
The project requires the integration of data and information from multiple sources
and the cooperative effort of the team members. Compared to the first project,
there was greater uncertainty in the data and information.
Team Composition
Ten teams were formed for each of the two major projects. There were nine
teams having five members each and one team with four members. To the extent
possible, teams had one non-petroleum engineer. The ideal goal was one geologist
and one geophysicist per team. This semester, TAs and faculty from Geophysical
Engineering acted as the team member for each of the teams. In particular, they
provided three-dimensional (3-D) and two-dimensional (2-D) seismic interpretations.
Team assignments were made randomly subject to the constraints just listed. There
was a complete change in team membership for the second project with the
exception of one team where two members had worked together on the first team.
Students worked in several groups during the instructional phase. Team
composition, organized by discipline and gender, is presented in Tables 3.2 and 3.3
for the first and second projects, respectively.
Table 3.2
Team Discipline and Gender Composition
First Project
Discipline Gender
Team PE GE MALE FEMALE
1 5 0 3 2
2 4 1 5 0
3 4 1 5 0
4 4 1 4 1
5 4 1 3 2
6 4 1 4 1
7 3 1 3 1
8 5 0 4 1
9 4 1 4 1
10 5 0 4 1
Total 42 7 39 10
Note: PE = Petroleum Engineers, GE = Geological Engineers
Table 3.3
Team Discipline and Gender Composition
Second Project
Discipline Gender
Team PE GE MALE FEMALE
1 4 1 3 2
2 3 1 2 2
3 4 1 5 0
4 5 0 4 1
5 4 1 5 0
6 5 0 5 0
7 4 1 4 1
8 5 0 4 1
9 4 1 4 1
10 4 1 3 2
Total 42 7 39 10
Note: PE = Petroleum Engineers, GE = Geological Engineers
Methods
A statistical approach was used to answer the questions related to the reliability,
validity, and bias in peer evaluations. Specifically, the model partitions the total
variance in the dependent variable into rater, ratee, and relationship (interaction)
variance. Thirty-five individual interviews were conducted over the course of the
semester to answer questions about user acceptance and to augment information
about friendship bias. The interviews were supplemented by an end of course
survey completed by all 49 students in the class. The following sections focus on
the statistical model and its use in analyzing the peer evaluation data. The
structured interviews and end of course survey are discussed in the section covering
procedures.
Generalizability Theory
The statistical model used in this research is specifically designed for small group research. The model is an application of generalizability theory.
Generalizability theory is a relatively new approach to reliability estimation
(Goodwin & Goodwin, 1991). Generalizability theory focuses on identifying
multiple sources of variation that occur simultaneously in a set of data. The focus
is on variance associated with various factors and their interactions rather than
differences in means. There are three sources of variation that are of interest in
peer assessments: (1) What is the tendency for raters to give similar ratings (rater effect)? (2) What is the level of consensus among raters (ratee effect)? and (3) What is the variance unaccounted for by the rater and ratee effects (rater by ratee interaction)?
Research Design
A round-robin research design was used for this analysis of the peer assessment
data. A round-robin research design is one in which observations are made on
every possible dyad in the team. The data layout for this research design is shown
in Table 3.4. The statistical model, referred to as a social relations model (SRM)
(Kenny, 1994; Kenny & La Voie, 1984; Warner, Kenny, & Stoto, 1979) is
particularly suited for the stated research objectives. The statistical model is
closely related to Cronbach's (1955) component model for interpersonal data. Cronbach partitioned variance of raters' ratings into a rater effect, a trait effect, and a rater by trait interaction. The social relations statistical model partitions the variance in the rater's rating into rater effect, ratee effect, and rater by ratee interaction (referred to as the relationship effect by Kenny (1994)). Cronbach's model examines individual differences between raters given a set of ratees and traits. On the other hand, the social relations model examines a specific trait for a set of raters
and ratees (Warner et al., 1979). For example, on a multidisciplinary team, each
team member plays both roles of rater and ratee for observable team skills.
Correlations in the social relations statistical model are used to measure the
interdependence that exists in the rater-ratee relationship in peer assessments. The
following sections present a description of the statistical model including
interpretations of each of the partitioned effects (rater, ratee, and relationship) and
the correlations.
Table 3.4
Round Robin Research Design
                               Ratee
Rater        Subject 1   Subject 2   Subject 3   Subject 4
Subject 1        -           X           X           X
Subject 2        X           -           X           X
Subject 3        X           X           -           X
Subject 4        X           X           X           -
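In code, a single team's round-robin data can be represented as a square matrix with the self-rating diagonal left empty. The sketch below is a generic illustration (the ratings are invented), not part of the study's analysis.

    import numpy as np

    # Rows = raters, columns = ratees; np.nan marks the missing self-ratings
    round_robin = np.array([
        [np.nan, 3.0,    2.0,    4.0],
        [3.0,    np.nan, 2.0,    5.0],
        [4.0,    3.0,    np.nan, 4.0],
        [3.0,    4.0,    2.0,    np.nan],
    ])

    row_means = np.nanmean(round_robin, axis=1)   # mean rating given by each rater
    col_means = np.nanmean(round_robin, axis=0)   # mean rating received by each ratee
    grand_mean = np.nanmean(round_robin)
    print(row_means, col_means, grand_mean)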
Statistical Model Description
The statistical model, referred to as a social relations model (SRM), was
developed by Kenny and La Voie (1984). The model is analyzed using a two-way
random effects ANOVA. The two factors, rater and ratee, are the independent
variables. The level for each factor is one of the team members. The team
members are randomly selected. Using correlations derived from variance/covariance matrices, the SRM measures the interdependencies that exist in the data
from a round-robin design. In generalizability theory, the focus is on variance
partitioning, not on the variable means. The SRM specifically partitions the
variance into rater effects, ratee effects, and relationship effects. The goal is to
generalize the results to other individuals.
A dyadic variable in the social relations model is represented as the sum of
three variables and a constant. The variables are rater effect, ratee effect, and the
relationship effect. The constant is the group mean. For example, assume the dyad
consists of John and Mary. The rater effect is a measure of how an individual rater (e.g., John) evaluates others on a behavior or skill. The ratee effect is a measure of how the individual in the dyad (e.g., Mary) is rated by others on the same behavior or skill. The relationship effect measures John's unique rating of Mary on the trait or skill. In symbolic form, the dyadic variable can be represented as:
X_ijk = μ + α_i + β_j + γ_ij + ε_ijk     (Equation 3-1)

where (using feedback as the observed skill):

X_ijk = individual i's measure of individual j's observed feedback skills at time k
μ = the mean level of feedback skill in the team; a constant that represents the norm for the group
α_i = individual i's rater effect, a measure of how individual i rates others on the feedback they provide the team
β_j = individual j's ratee effect, a measure of how others rate j on the team skill feedback
γ_ij = the relationship effect, which accounts for the uniqueness of individual i with j; the relationship effect is the variance that is left over after accounting for rater and ratee effects
ε_ijk = the residual or noise between rater i and ratee j at time k; the residual is embedded in the relationship effect if the behavior or skill is only measured once
The rater and ratee effects resemble the main effects in a two-way ANOVA.
The relationship effects resemble the interaction effects in a two-way ANOVA.
The following sections discuss the relative variance partitioning, interpretation of
the relative rater, ratee, and relationship effects, and interpretation of specific SRM
correlations used to account for the interdependencies in peer assessment data.
Relative Variance Partitioning. The relative variance for the rater, ratee, and
relationship effects is calculated by dividing the absolute variance for each effect
by the sum of the absolute variances.
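As a trivial sketch of this normalization (the variance components below are made up for illustration only):

    # Hypothetical absolute variance components for one team skill
    rater_var, ratee_var, relationship_var = 0.20, 0.15, 0.65
    total = rater_var + ratee_var + relationship_var

    relative = {
        "rater": rater_var / total,
        "ratee": ratee_var / total,
        "relationship": relationship_var / total,
    }
    print(relative)   # each effect's share of the total variance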
Relative Rater Effect. The relative rater effect measures the deviation of the
rater (row) means from the grand mean. In peer evaluations, the relative rater
effect is a measure of the similarity in the ratings for a given rater. If raters give
each individual they rate the same rating, and the ratings given by at least one rater
are different than the ratings given by the other raters, the relative rater variance
would be equal to 1.0. An example is presented as Table 3.5.
Table 3.5
Example of 100% Relative Rater Effect
                                    Ratee
Rater          Jim      John     Bev      Bill     M_i. Mean of Row i
Jim              -      3.00     3.00     3.00     3.00
John           2.00       -      2.00     2.00     2.00
Bev            1.00     1.00       -      1.00     1.00
Bill           4.00     4.00     4.00       -      4.00
M_.j Mean
of Column j    2.33     2.67     3.00     2.00     2.50 (M.. Grand Mean)
At first glance, looking at the column means, the example above looks like it
also includes a component of ratee effect (a difference in the column means from
the grand mean of 2.50). One of the unique features of the SRM model is its ability
to remove the biasing effect of the missing self-rating. In the following table, rater
effects and ratee effects calculated by the SRM model have been included. The
rater effect reflects the difference between the rater mean for each row and the
grand mean when the biasing effect of the missing self-data is taken into account.
The ratee effect reflects the difference between the ratee mean for each column and
the grand mean when the biasing effect of the missing self-rating is taken into
account.
Table 3.6
Example Demonstrating Rater and Ratee Effects Accounting for Missing Self Data
                                    Ratee
Rater          Jim      John     Bev      Bill     M_i. Mean of Row i    Rater Effect
Jim              -      3.00     3.00     3.00     3.00                   0.50
John           2.00       -      2.00     2.00     2.00                  -0.50
Bev            1.00     1.00       -      1.00     1.00                  -1.50
Bill           4.00     4.00     4.00       -      4.00                   1.50
M_.j Mean
of Column j    2.33     2.67     3.00     2.00     2.50 (M.. Grand Mean)
Ratee Effect   0.00     0.00     0.00     0.00
From Table 3.6, Jim's average rating given is greater than the grand mean by +0.50 (rater effect). Since Jim does not rate himself, the value for his average rating received (2.33) would be understated. When the adjustments are made for the missing self-data, there is no difference between the column mean for Jim and
the grand mean of 2.5. This is indicated by the zero ratee effect for Jim. When the
biasing effect is taken into account, the example is one where the relative rater
effect is 100%. All of the variance is the result of differences in the row means
compared to the grand mean.
Although the self-ratings could be included, this is not done because self-
ratings may be qualitatively different from other ratings (Kenny et al., 1983). Thus,
the empty diagonal cell in the round robin design (Table 3.4) complicates the
calculation of the rater and ratee effects (Warner et al., 1979). Because of the
empty cell, the rater effect (row effect), α_i, is not simply the row mean less the grand mean. The empty cell results in a bias in the estimate that is corrected by weighting the rater mean (M_i., the row mean for rater i), the ratee mean (M_.i, the column mean that excludes the rater), and the grand mean (M..) using the following relationship:

Rater (row) effect (Equation 3.3):

α_i = [(n-1)^2 / (n(n-2))] M_i. + [(n-1) / (n(n-2))] M_.i - [(n-1) / (n-2)] M..

where n is the number of team members on a team.

A similar approach is taken for the ratee effect, as shown below:

Ratee (column) effect (Equation 3.4):

β_j = [(n-1)^2 / (n(n-2))] M_.j + [(n-1) / (n(n-2))] M_j. - [(n-1) / (n-2)] M..
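A minimal sketch of these two estimators, applied to the same four-person example as Table 3.6, is shown below; it is offered only to illustrate Equations 3.3 and 3.4, not as the software used in this study. The relative rater, ratee, and relationship variances discussed above are then each effect's variance divided by the sum of the three.

    import numpy as np

    # Round-robin ratings from Table 3.6: rows = raters, columns = ratees, NaN = missing self-rating
    X = np.array([
        [np.nan, 3.0,    3.0,    3.0],
        [2.0,    np.nan, 2.0,    2.0],
        [1.0,    1.0,    np.nan, 1.0],
        [4.0,    4.0,    4.0,    np.nan],
    ])
    n = X.shape[0]

    row_means = np.nanmean(X, axis=1)    # M_i. : mean rating given by each rater
    col_means = np.nanmean(X, axis=0)    # M_.j : mean rating received by each ratee
    grand_mean = np.nanmean(X)           # M..

    c1 = (n - 1) ** 2 / (n * (n - 2))
    c2 = (n - 1) / (n * (n - 2))
    c3 = (n - 1) / (n - 2)

    rater_effects = c1 * row_means + c2 * col_means - c3 * grand_mean   # Equation 3.3
    ratee_effects = c1 * col_means + c2 * row_means - c3 * grand_mean   # Equation 3.4

    print(rater_effects)   # approximately [ 0.5 -0.5 -1.5  1.5], matching Table 3.6
    print(ratee_effects)   # approximately [ 0.  0.  0.  0.] (floating-point noise aside)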
Relative Ratee Effect. The relative ratee effect is a measure of the consensus
among the raters. High ratee variance indicates a high degree of consensus among
raters for a given individual. If raters agree on the rating for each individual, the
relative ratee variance would be equal to 1.0. This is true as long as the raters do not give every ratee the same rating; that case would be an example of zero variance. An example of 100% ratee effect is presented in Table 3.7.
Table 3.7
Example of 100% Relative Ratee Effect
                                    Ratee
Rater          Jim      John     Bev      Bill     M_i. Mean of Row i    Rater Effect
Jim              -      2.00     1.00     4.00     2.33                   0.00
John           1.00       -      1.00     4.00     2.00                   0.00
Bev            1.00     2.00       -      4.00     2.33                   0.00
Bill           1.00     2.00     1.00       -      1.33                   0.00
M_.j Mean
of Column j    1.00     2.00     1.00     4.00     2.00 (M.. Grand Mean)
Ratee Effect  -1.00     0.00    -1.00     2.00
Relative Relationship Effect. The relationship effect is the variance that is left
over after accounting for rater and ratee effects. High relative relationship effect in
peer evaluations may be the result of unique friendships, or may reflect behavior
that is observed by only a subset of the group. This could result from a delegation
of team tasks. Table 3.8 is an example where all the variance is unique to each
rater and the relative relationship effect is 100%.
Zero Relative Variance. For cases where each rater gives each ratee the same
rating, the relative variance would be zero. Zero relative variance in the
evaluations is demonstrated in Table 3.9.
Table 3.8
Example of 100% Relative Relationship Effect
                                    Ratee
Rater          Jim      John     Bev      Bill
Jim              -      3.00     2.00     1.00
John           3.00       -      1.00     2.00
Bev            2.00     1.00       -      3.00
Bill           1.00     2.00     3.00       -
Table 3.9
Example of Zero Relative Variance
                                    Ratee
Rater          Jim      John     Bev      Bill
Jim              -      3.00     3.00     3.00
John           3.00       -      3.00     3.00
Bev            3.00     3.00       -      3.00
Bill           3.00     3.00     3.00       -
Individual Level Correlations. The SRM model calculates three correlations at
the individual level. The individual level correlations were not needed to answer
the questions posed in this research. The following brief discussions are included
to demonstrate some of the features in the SRM model. The individual level
correlations are referred to as the rater-ratee correlation, the rater-rater correlation, and the ratee-ratee correlation. The symbolic form of the dyadic variable, Equation 3-1, will be used to clarify these correlations.
Rater-Ratee Correlation. The rater-ratee correlation represents the correlation between the rater and ratee effect for individuals and a given trait. The following equations demonstrate the correlation for individual i and individual j:
X_ijk = μ + α_i + β_j + γ_ij + ε_ijk
X_jik = μ + α_j + β_i + γ_ji + ε_jik
This correlation was not used in this research.
Rater-Rater Correlation. The rater-rater correlation represents the
relationship for individual raters across measured skills. As an example, assume
the team skills of feedback and leadership.
X_ijk = μ + α_i + β_j + γ_ij + ε_ijk     (variable for feedback)
X_ijk = μ + α_i + β_j + γ_ij + ε_ijk     (variable for leadership)
This correlation gives the relationship between the rater effects across variables.
In peer evaluations, the rater-rater correlation can be used as an indication of the
level of correlational bias (also called response set) in the ratings if at least one of
the variables is independent of the other variables. If one of the variables is
independent, then the level of rater-rater correlation between the independent
variable and the other variables can be used as an estimate of the fraction of the
rater effect that is response set. This correlation was not needed to answer the
research questions.
Ratee-Ratee Correlation. The ratee-ratee correlation represents the relationship
for individual ratees across measured skills. Again, as an example, assume the
team skills of feedback and leadership.
X_ijk = μ + α_i + β_j + γ_ij + ε_ijk     (variable for feedback)
X_ijk = μ + α_i + β_j + γ_ij + ε_ijk     (variable for leadership)
This correlation was not used in this research.
Dyadic Correlation. The SRM model includes a correlation for pairs of
evaluators. This dyadic correlation is between the relationship effects for a given
variable. The correlation represents the relationship between how rater A rates B
and rater B rates A for a given variable. This correlation is used as an indication of
the level of friendship bias in the ratings (Kenny et al., 1983). This relationship in
symbolic form is shown below.
X_ijk = μ + α_i + β_j + γ_ij + ε_ijk     (for a given variable)
X_jik = μ + α_j + β_i + γ_ji + ε_jik     (for a given variable)
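A rough sketch of how such a dyadic correlation can be computed once the rater and ratee effects have been removed is shown below. The rating matrix is invented, the residual and relationship effects are confounded when a skill is measured only once, and this is offered only as an illustration of the idea rather than the estimator implemented in the SRM software.

    import numpy as np

    def rater_ratee_effects(X):
        # Rater and ratee effects for one round-robin matrix (NaN diagonal), per Equations 3.3 and 3.4
        n = X.shape[0]
        row_m, col_m, grand = np.nanmean(X, axis=1), np.nanmean(X, axis=0), np.nanmean(X)
        c1, c2, c3 = (n - 1) ** 2 / (n * (n - 2)), (n - 1) / (n * (n - 2)), (n - 1) / (n - 2)
        rater = c1 * row_m + c2 * col_m - c3 * grand
        ratee = c1 * col_m + c2 * row_m - c3 * grand
        return rater, ratee, grand

    X = np.array([
        [np.nan, 4.0,    3.0,    5.0],
        [4.0,    np.nan, 2.0,    4.0],
        [2.0,    3.0,    np.nan, 3.0],
        [5.0,    3.0,    2.0,    np.nan],
    ])
    rater, ratee, grand = rater_ratee_effects(X)

    # Relationship (gamma) effects: what is left after removing the group mean, rater, and ratee effects
    gamma = X - grand - rater[:, None] - ratee[None, :]

    # Dyadic correlation: pair gamma_ij with gamma_ji for every dyad (i < j)
    i_up, j_up = np.triu_indices(X.shape[0], k=1)
    r = np.corrcoef(gamma[i_up, j_up], gamma[j_up, i_up])[0, 1]
    print(f"Dyadic (relationship) correlation: {r:.2f}")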
Model Limitations. The model assumes that individuals do not communicate
with each other on how they are going to rate a specific individual. Since this is an
important model assumption, the issue of communication between raters was
included in the individual interviews and the final course survey. The individual
interviews and survey are introduced in the Procedures section of this chapter.
Other model assumptions include the need for random samples, the assumption that the effects (rater, ratee, relationship) are additive, and the assumption that the dyadic correlation is linear.
Finally, the round robin design requires a minimum of four per group and a
minimum of six groups (Kenny, 1990). Large group sizes are desirable if the focus
is on relationship effects. There are four or five individuals per group and ten
groups in this study.
Procedures
The Multidisciplinary Petroleum Design course consisted of an instructional
phase lasting approximately three weeks followed by a project phase. The project
phase included two major open-ended design problems. At the end of the
instructional phase, students were informed that data collected during the project
phase would form the basis for a doctoral dissertation by Professor Thompson.
Participation in the research was voluntary. All of the data collected had been standard practice in the course for several years, with the exception of the confidential interviews conducted at four points in time during the semester. The interviews
were conducted by a third party. The course instructors did not know when
specific individuals would be interviewed until after course grades were completed.
Students were also informed that in reporting research results, student identity
would remain anonymous. Students were also advised that they could contact any
faculty member or teaching assistant if they had any concerns about the research
being conducted.
The objectives of the research were presented to students as the course
objectives included in the course syllabus. These objectives are presented below:
1. Understanding, development, and application of task-related skills.
Task-related skills include analyzing data and information from your
discipline and the integration of this information with data and
information from other disciplines.
2. Understanding, development, and application of meeting management,
brainstorming, and critical team skills. Critical team skills include:
communication, coordination, leadership, feedback, back-up behavior,
and team orientation.
3. Understanding and application of a task performance strategy for
multidisciplinary teamwork.
Thus, the students were led to believe that the research objectives were broader
than the specific research questions related to the reliability, validity, and bias in
peer evaluations.
The following sections present the scope and content of the instructional phase,
and the timing of peer and project feedback for each of the two major projects.
Instructional Phase
The instructional phase had two major objectives. The first was instruction in the use of a task performance strategy developed to facilitate dialogue and problem solving in a multidisciplinary team environment. Team skills were the focus of the second objective. The team skills component focused on peer evaluations of team skills
and anonymous feedback. These two objectives are discussed in the following
sections.
Strategy for Multidisciplinary Integration. Sutton (1997) developed and tested
a method of instruction for multidisciplinary teamwork that utilizes the team task
performance strategy shown as Figure 3.1. The application of the strategy was
practiced during the instructional phase of the course utilizing problems supplied
by industry. The instruction strategy is to teach the method in small steps and
provide frequent feedback.
Team Skills. Peer evaluations of team skills and feedback were used for developmental purposes. A specific grade was not assigned to the peer evaluations. Peer evaluations focused on the following eight attributes of effective teams:
cooperation, feedback, leadership, communication, coordination, back-up behavior,
effort applied to the task, and technical knowledge applied to the task. The
following sections describe the peer evaluation instrument, and team skills training.
Figure 3.1: Team Task Performance Strategy for Multidisciplinary Teams. The six-step flowchart moves from defining the problem and objectives, through identifying critical information (what data are available, areas of uncertainty, and whether to revise objectives and scope), identifying critical data needs and developing criteria for judging the value of incremental data, to identifying and comparing alternatives, developing criteria for comparing alternatives, and making recommendations and a plan of action.
Peer Evaluation Instrument. Measures for the variables of cooperation,
feedback, leadership, communication, coordination, and back-up behavior were
originally developed by Morgan et al. (1986). Anhalt (1995) modified the
instrument for use as a peer evaluation tool. Anhalt reported that the reliability for
peer and self-reports ranged from 0.79 to 0.84. The instrument utilizes a 1-5 Likert
scale. The instrument used in this research is a further refinement of Anhalt's work. Pilot peer evaluation research using a modification of Anhalt's work resulted in reliabilities (Cronbach's α) ranging from 0.89 to 0.94 (Thompson,
2000). Based on the pilot study, additional modifications were made to the
instrument. The behavioral anchors were clarified and two new measures were
added: effort applied to the task, and technical knowledge applied to the task.
Cronbach's αs for the instrument used in this research ranged from 0.90 to 0.95. The constructs and behaviorally anchored scales for each of the eight measured team skills are presented in Appendix A.
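For reference, Cronbach's α for a set of items can be computed as sketched below; the Likert scores are invented, and the snippet only illustrates the statistic reported above rather than the study's own computation.

    import numpy as np

    def cronbach_alpha(items):
        # items: (n_respondents, n_items) array of scores
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical 1-5 Likert ratings: five raters scoring one teammate on four items
    scores = np.array([
        [4, 4, 5, 4],
        [3, 3, 4, 3],
        [5, 4, 5, 5],
        [2, 3, 3, 2],
        [4, 5, 4, 4],
    ])
    print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")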
Team Skills Training. The team skills game "Who Wants to Be a Team Player?" was developed to introduce the eight team skills considered to be critical
for effective teamwork. These team skills are: communication, coordination,
backup behavior, feedback, team leadership, team orientation, effort applied to
task, and technical knowledge applied to task. The game helps students focus on
these team skills by asking them to identify which team skills are being
demonstrated in a series of hypothetical teamwork scenarios.
During a class period, the game coordinator explains the rules of the game to
the students by reading the following:
1. I am going to throw this cush ball out into the classroom. Whoever catches
it listens to a scenario and answers a question. If no one catches the ball, the
person closest to the ball picks it up and answers the question.
2. I will read one work-related scenario with four choices of answers A), B),
C), or D) and you will tell us your answer or answers. Some scenarios have
more than one correct answer. You will have 30 seconds to give your answer.
3. In turn, I will tell you whether or not you have given the correct answer and
reveal the correct answer or answers.
4. At this point, you will throw the ball out into the classroom and whoever
catches it will be given the next scenario.
5. The game continues until we run out of time or scenarios.
Before the exercise begins, students were given a copy of the peer evaluation
instrument. During the exercise, class discussion was used to clarify and solidify
components of each team skill. Examples were provided on how to best apply
these skills in team situations. Class discussion was also used to clarify the use of
the team skills instrument. Scenarios covered each of the listed team skills. An
example scenario is presented below:
You are on a team charged with looking for re-completion and in-fill drilling
potential in your region. John is on your team. John is having trouble
evaluating well logs and asks for help. You have expertise in this area.
What could you offer John to help him?
A) Your sympathy
B) Your calculator
C) Your assistance* (demonstrates Backup Behavior)
D) A three inch thick log analysis manual
The asterisk is used to denote the correct answer.
Project Phase
There were two major design projects during the project phase. Each project
was approximately 6-7 weeks in duration. Each team was scheduled for
management reviews at the mid-point and at the end of each project. Peer
evaluations were completed during the class prior to these mid-project and end of
project review sessions. This semester the mid-project evaluations took place at the
beginning of the third week of teamwork for each project. The end of project peer
evaluations took place at the beginning of the 7th week of teamwork for the first
project. For the second project, the peer evaluations took place at the beginning of
the 6th week of teamwork.
The peer evaluations were anonymous. Confidentiality was maintained by
providing each student a code. Thus, students knew the average results given to them for each team skill but did not know who gave them a specific rating. An
example of the feedback provided to each student at the review session is presented
as Figures 3.2 and 3.3. As shown, each student received anonymous feedback on
the average evaluation (excludes self-evaluation) given to them by their team
members but was also given a copy of the self-evaluations for the team.
Figure 3.2: Average peer evaluations for Team A, Mid-point of the first project. Unique, confidential codes were assigned to each team member; the bar chart shows each coded member's average rating on the measured team skills (e.g., backup behavior, leadership, and team orientation).
Figure 3.3: Self-evaluations for Team A, Mid-point of the first project. Unique codes were assigned to each team member.
Individual Interviews
Confidential interviews were conducted at four points in time during the
semester. The interviews coincided with the mid-point and end of project reviews
held by the faculty and TA team. The interviews were conducted by a third party.
The course instructors did not know when specific individuals were interviewed
until after course grades were completed. Thirty-five separate interviews were
conducted during the semester. Two individuals were interviewed twice.
Appendix B includes a copy of the interview questions for each interview session.
Survey
A survey questionnaire was completed by all students during the same class
period that the students completed the end of project peer evaluations for the
second project. This survey questionnaire was designed to parallel the individual
interview questions. The survey questions addressed user acceptance and bias in
the peer evaluations. The survey questions are shown in Appendix C.
Summary
Peer evaluations from multidisciplinary teams of geologists and petroleum
engineers were used to investigate the reliability, validity, friendship bias, and user
reaction to peer evaluations in self-directed interdependent work teams. The
statistical model (SRM) partitions the variance in the ratings into rater effect, ratee
effect, and relationship effect. The SRM model is specifically designed to take into
account interdependencies that exist in small group work. Specifically for this
research, the model estimates a correlation between pairs of raters and ratees that is
an indication of the friendship bias that may occur in peer ratings. The statistical
data were augmented by individual interviews and a final course survey.
CHAPTER 4
RESULTS
This chapter addresses the six specific research questions related to user
acceptance, reliability, validity, and friendship bias in peer evaluations in self-
directed work teams. The six specific research questions are summarized in Table
4.1 which follows:
Table 4.1
Summary of Research Questions
Peer Evaluation in Self-Directed Work Teams
Number   Research Question
1   What is the user reaction to peer evaluations?
2   How much of the variance in peer evaluations is due to rater, ratee, and relationship effects?
3   What is the level of bias in peer evaluations?
4   What is the stability of peer evaluations?
5   What is the level of validity in peer evaluations for the team skills, technical knowledge applied to the task, and effort applied to the task?
6   What is the impact of group membership on the level of consensus and stability in peer evaluations?
The framework for presenting the results of the study is shown in Figure 4.1.
First, the question addressing user acceptance is presented. This is followed by a
discussion of issues related to reliability (Research Questions 2, 3, and 4), and
validity (Research Question 5). Finally, the impact of group membership on the
level of consensus and stability (Research Question 6) is presented. The research
questions are addressed in numerical order.
Figure 4.1: Framework for presenting results of peer evaluations. The framework groups the questions as follows: User Reaction (Research Question 1, user acceptance of peer evaluations); Reliability (Research Question 2, variance partitioning into rater, ratee, and relationship effects; Research Question 3, bias; Research Question 4, stability of ratings); Validity (Research Question 5, validity of peer evaluations for knowledge applied to task and effort applied to task); and Research Question 6, impact of group membership.
User Reaction - Research Question 1
Two methods were employed to assess students' reaction to the peer evaluation
process: individual interviews, and an end of course survey. The following
sections present the results for each of these sources of information.
Interviews
As described in Chapter 3, each team was scheduled for a 15-minute review
with the management team (faculty and TAs) at the mid-point and at the end of
each project. Concurrently, selective individual interviews were conducted by a
third party. The TAs were instructed to select approximately ten individuals (one
from each team) at each of the four points in time (mid-point and end of each
project). At the mid-point and end of the first project, there were eight and nine individuals interviewed, respectively. At the mid-point and end of the second project, there were ten and eight individuals interviewed, respectively. There were a
total of 35 individual interviews conducted during the semester. Two individuals
were interviewed twice.
Because the anonymous team peer feedback was provided to each team member at the team review session, some team members were interviewed before receiving the anonymous peer evaluations, and others were interviewed after. The interview results are presented in the order of the interviews and are summarized in Tables 4.2-4.5.
Mid-Point First Project. The following question was asked during the mid-
point interview for the first project. For this round of interviews, the N/A for some
questions indicates that the student was interviewed before receiving the
anonymous confidential team peer feedback. This was not done for subsequent
interviews since all individuals had received at least one sample of the evaluations
they were receiving from their peers.
Interview Question 1: Do you think the peer assessment process was fair?
Why or why not? Please give examples if applicable.
Table 4.2
User Reaction to Peer Assessment
Mid-Point First Project (n = 8)
Question
Interview Question 1: Do you think the peer assessment process was fair? Why or why
not? Please give examples if applicable.
Discipline Gender ESL Comments
PE Male No 1. Dr. Thompson restricted how you can rate on effort - must be an average of '3'. I wanted to give more than an average of three, but I didn't.
PE Female No 1. N/A
PE Male Yes 1. Feedback was fair.
PE Male No 1. Yeah.
PE Male No 1. Yeah, it's a valuable asset. It helped, but some anchors on the scales are too general, not enough detail, too vague.
GE Male No 1. I think it will be useful.
GE Male Yes 1. Nothing is ever completely fair.
PE Female No 1. N/A
Note: PE = Petroleum Engineer, GE = Geological Engineer, ESL = English is Second Language
End First Project. The following two questions were asked during the end of
first project interview:
Interview Question 1: Do you think the peer assessment process is fair?
Why or why not? Please give examples if applicable.
Interview Question 2: From your perspective, have you benefited from the
peer assessment process? If so, how? Did the feedback make any
difference? If not, how could the process be changed to be more beneficial?
Table 4.3
User Reaction to Peer Assessment
End First Project (n = 9)
Questions
Interview Question 1: Do you think the peer assessment process is fair? Why or why not?
Please give examples if applicable.
Interview Question 2: From your perspective, have you benefited from the peer assessment
process? If so, how? Did the feedback make any difference? If not, how could the process be
changed to be more beneficial?
Discipline Gender ESL Comments
PE Male No 1. Peer assessment is fair, but there is a gray area. Good to give feedback in the middle of the project. 2. Hasn't changed anything. Our group has been really good. We had high ratings on our entire team.
PE Female No 1. Pretty fair, it incorporates essential characteristics needed to be part of team leadership. It did a really good job of determining how well teams work together and for yourself. 2. Really has...to see how peers evaluate you and what you think of yourself. I ranked myself lower than my peers on most, but peers did not rate me as high on some dimensions as I thought. The peer assessment process relates to the real world (worked for a company that used peer assessment).
GE Male No 1. Fair enough; most feedback was about the team. 2. For me, no. For others, hard to say. When poor performance came back for one individual, there was no change in behavior of that individual.
PE Male No 1. Pretty fair. I don't like being assigned a number. Sometimes team members don't realize skill sets; sometimes whichever team member is the loudest is recognized. 2. I have a pretty low evaluation from team members. I did a lot of work Friday and Saturday on my own and they were pissed. Getting that team balance is hard to do.
Table 4.3 (Cont.)
User Reaction to Peer Assessment
End First Project (n = 9)
Discipline Gender ESL Comments
PE Male No 1. I think peer assessment is fair. I don't think it can be represented in this class; students hanging out with each other for 4 years. I like the idea but not with friends trying to get grades. 2. For me personally, no. My peer ratings were higher than self-ratings. I think we need mandatory team meetings and more structure. We need to meet with the profs and share feelings about who's doing what.
PE Female No 1. It is fair, asks the right questions. 2. I guess so. I got marked down but I agreed with the ratings. I knew I was going to be low in Technical Skills...it's fair. I don't see the benefit. I think it affected the dynamics of the group negatively or, in some cases, people didn't care.
PE Male No 1. It's fair, yes. Not sure if it is helpful; I got graded better than I thought. 2. To me, not really.
PE Female Yes 1. Overall it's fair. Due to the fact that we've been together for four years....then if someone doesn't like you, you get screwed. 2. Yes, from the feedback what peers think I am lacking in. I was doing a little of everything and I got dinged for that, because I'm not an expert in any one thing.
PE Male No 1. The first peer assessment occurred so early in the game. It should have occurred now instead of so early. If it affected grades I would say it's not fair because there are too many friends. 2. No, not to me. It was done too early in the game and on my team they're all friends except for me. They would schedule things at the last minute. Under normal circumstances, you might think you are doing well because of the high ratings when really you're not.
Note: PE = Petroleum Engineer, GE = Geological Engineer, ESL = English is Second
Language
Mid-Point Second Project. The following two questions were asked during the
mid-point interview for the second project. A question addressing the timing of the
peer evaluation was added to the first question; it appears as the final sentence of Interview Question 1 below. This
question was added because of comments from some students that they did not
have enough information to make an informed peer evaluation. The faculty also
observed that there was an apparent decrease in the variability in the peer ratings.
Interview Question 1: Do you think the peer assessment process is fair?
Why or why not? Please give examples if applicable. Was it premature
in the project?
Interview Question 2: From your perspective, have you benefited from the
peer assessment process? If so, how? Did the feedback make any
difference? If not, how could the process be changed to be more beneficial?
Table 4.4
User Reaction to Peer Assessment
Mid-Point Second Project (n = 10)
Questions
Interview Question 1: Do you think the peer assessment process is fair? Why or why not? Please
give examples if applicable. Was it premature in the project?
Interview Question 2: From your perspective, have you benefited from the peer assessment
process? If so, how? Did the feedback make any difference? If not, how could the process be
changed to be more beneficial?
Discipline Gender ESL Comments
PE Male No 1. Yes definitely, peer assessment process is fair, but premature
in project. I was guessing by the performance of that person in
other classes. I mainly wrote 3's.
2. Not at all. (my peer ratings were at the bottom).
PE Female No 1. Yes, fair. Probably premature.
2. My ratings were positive; a nice reinforcement.
PE Male No 1. Yes, it is fair. If we would have been on schedule and had
something more to rate, it wouldn't have been premature. But
since we weren't, it was premature.
2. Yes definitely yes. There are things I need to improve on.
PE Male No 1. Yes (it is fair). Yes it was premature!
2. I benefited from the first one especially because you see what
the team is not seeing from you.
PE Male Yes 1. Yes, for a normal person. Except if there is a lack of
communication then you don't know what that person can do.
Yes, it was premature. We haven't done anything yet. The
presentation was premature, too.
2. Did look at my rating. I am trying to improve myself and learn
something new everyday. It helped me a lot.
PE Male No 1. Yeah, I think so. It was premature.
2. I don't think so, there's not much variance in ratings. I found
myself giving 5's. It was hard to judge leadership on others.
Only one rating at the end of each project would be better.
PE Male Yes 1. Most times, yes, it's fair as long as it does not go into the
grading. If it did, I would be concerned. Yes, very premature.
Hardly got into this project.
2. Yes, I have looked at my ratings. I have thought about where
I could improve.
GE Male No 1. The process is not entirely fair. I never give out 5's. I give
3's and 4's a lot. Other than that I have no problem with it.
Yes, actually, it was premature. We are only 1 1/2 weeks into
the project and I hardly know my team members.
2. Yes, I tried to improve areas I'm lacking in.
PE Male No 1. Yeah, in general. There are always cases where a person is
likable and gets rated highly and yet doesn't do much work.
Yeah, way premature. I based my ratings on assumptions of
what team members would do.
2. I look at my ratings, but I would rather have people tell me
directly during the project how they think I'm doing. The peer
assessment has not benefited me.
PE Female No 1. Yeah, I do. Generally people take it seriously and are honest.
It helps you stay on your toes. Yes, I do think it was premature,
definitely.
2. Yeah, I have benefited from it. I have more of an idea of how
to contribute to the group.
Note: PE = Petroleum Engineer, GE = Geological Engineer, ESL = English is Second
Language
End Second Project. The following two questions focusing on user reaction to
the peer evaluation process were asked at the end of the second project:
Interview Question 1: Do you think the peer assessment process is fair?
Why or why not? Please give examples if applicable.
Interview Question 2: From your perspective, have you
benefited from the peer assessment process? If so, how? Did
the feedback make any difference? If not, how could the
process be changed to be more beneficial?
Table 4.5
User Reaction to Peer Assessment
End Second Project (n = 8)
Questions
Interview Question 1: Do you think the peer assessment process is fair? Why or why not? Please
give examples if applicable.
Interview Question 2: From your perspective, have you benefited from the peer assessment
process? If so, how? Did the feedback make any difference? If not, how could the process be
changed to be more beneficial?
Discipline Gender ESL Comments
PE Female No 1. Yes as long as kids are honest.
2. Yes I looked at it and I have made improvements. Feedback
did make a difference in some areas overall.
PE Female No 1. No, because you take into account that someone ticked you off and that's not fair. We had five people and my whole team did nothing and I did almost everything. The process doesn't represent how it actually works. 2. No. I found the feedback to be frustrating. You know the ratings should be something else and it's not fair.
PE Male No 1. Yes. 2. I don't know. As far as feedback, sure.
PE Female Yes 1. Not really, I still think friendship overlays how they rate. If you don't like someone, it's hard to give them high ratings. 2. I don't think so. I was not benefited nor harmed. The feedback was helpful. You could actually see what others think of you.
PE Male No 1. Not really, you get people who are subjective. It's not objective if you are friends. 2. No. Feedback no. I look at the process as depending on the objectivity or subjectivity of others. Being in the military, a few graphs aren't going to change anything for me. As far as improving, you could assign a point value to make it significant, but then all the points would go up. My view of most of this process is that it's a waste of time.
GE Female No 1. Yes I do, you get honest feedback. 2. Yeah, I guess so. As far as feedback, I learned I have to talk more with team members.
PE Male No 1. Yes. More balanced view of what you've done. It's an easy form of assessment. 2. Didn't make much difference because of the team dynamics. We had personality conflicts.
PE Male No 1. Yeah I wish it were more anonymous. I was too close to my peers this time when I was filling out the assessment, as in sitting right next to each other. I felt really self-conscious about being totally honest with my ratings. The other times it was no problem because I was not right next to my team members. 2. No, didn't get any feedback that surprised me and the feedback didn't affect the way I did my second project. People are too unwilling to say how they feel especially people who didn't do well tend to give high ratings and hope that they get them in return. It's also hard when you are rating your friends.
Note: PE = Petroleum Engineer, GE = Geological Engineer, ESL = English is Second
Language
Survey
A survey questionnaire incorporating the interview questions was completed by
all students (49) at the end of the course. In a few cases, some questions were left
unanswered. The exceptions are noted. The following questions focused on the
issue of user reaction to the peer evaluation process:
Survey Question 1: Overall, do you think the peer assessment process is fair?
Survey Question 2a: From your perspective, have you benefited from the
peer assessment process?
Survey Question 2b: Did the feedback make any difference?
The following figures summarize the results for these questions.
Figure 4.2: Survey Question 1, is the peer assessment process fair? (Bar chart of Yes/No responses; 48/49 responded.)
Figure 4.3: Survey Question 2a, did you personally benefit from the process? 47/49 responded.
Figure 4.4: Survey Question 2b, did the feedback make any difference? 48/49 responded.
These survey results indicate that students generally felt the process was fair and
that, in many cases, they benefited from it.
Summary
Overall, from the perspective of the students, the peer assessment process was
perceived to be fair. Approximately 77% of the 48 students responding to the
survey question about fairness felt that the process was fair. There were concerns
over fairness expressed in the interviews and final survey as discussed in the
following paragraphs.
First, there is a clear pattern of concern over friendship as a bias in the peer
assessment process. Students also expressed concern over using peer assessment as
part of the course grade. As presented in the methodology chapter, the peer data in
this course were used for developmental purposes. From the survey, students for
whom English is a second language were inclined to feel that the peer evaluation
process was unfair. Finally, one student expressed concern over different grading
standards among the team. Kenny, Lord, and Garg (1983) discuss this problem.
These concerns (taken from Interview Question 1 in the summary tables) are
captured in the following responses made during the individual interviews:
Nothing is ever completely fair (mid-point first project).
I think peer assessment is fair. I don't think it can be represented in this
class students hanging out with each other for 4 years. I like the idea but
not with friends trying to get grades (end first project).
Overall it's fair. Due to the fact that we've been together for four
years....then if someone doesn't like you, you get screwed (end first
project).
If it affected grades I would say it's not fair because there are too many
friends (end first project).
Most times, yes, it's fair as long as it does not go into the grading. If it did,
I would be concerned (mid-point second project).
The process is not entirely fair. I never give out 5's. I give 3's and 4's
a lot. Other than that I have no problem with it (mid-point second project).
Yeah, in general. There are always cases where a person is likable and gets
rated highly and yet doesn't do much work (mid-point second project).
Responses to the interview questions addressing the benefit of peer assessment
varied from very positive to neutral, with only one comment indicating a potential
negative impact on the team dynamics. Judging from the interview responses, the
benefits of the peer assessment and feedback process outweigh any potential
negative aspects. Generally, students either benefited from the process or were
neutral toward it.
Some examples of the responses (taken from Interview Question 2 in the
summaries presented earlier) are presented below:
Really has...to see how peers evaluate you and what you think of yourself.
I ranked myself lower than my peers on most, but peers did not rate me as
high on some dimensions as I thought. The peer assessment process relates
to the real world (worked for a company that used peer assessment) (end
first project).
For me no. For others, hard to say. When poor performance came back for
one individual, there was no change in behavior of that individual (end first
project).
I don't see the benefit. I think it affected the dynamics of the group
negatively or in some cases, people didn't care (end first project).
Not at all. (my peer ratings were at the bottom) (end first project).
Yeah, I have benefited from it. I have more of an idea of how to contribute
to the group (mid-point second project).
I benefited from the first one especially because you see what the team is
not seeing from you (mid-point second project).
No. I found the feedback to be frustrating. You know the ratings should be
something else and it's not fair (end second project).
Yeah, I guess so. As far as feedback, I learned I have to talk more with
team members (end second project).
Finally, the end-of-course survey data confirm the interview responses. The
responses to Survey Question 2a indicate that over 50% of the 47 students responding
to this question felt they had benefited from the peer assessment process. In
response to Survey Question 2b, nearly 60% of the 48 students responding to this
question indicated that the peer assessment process made a difference in some way.
Responses to this question were interpreted as referring to a difference at either
the individual or the team level.
Variance Partitioning Research Question 2
This section discusses reliability from the perspective of variance partitioning
into rater, ratee, and relationship effects. Relative rater effect is a measure of the
similarity of the ratings given by a particular rater: if each rater makes little
distinction among the ratees he or she rates, but different raters use different
overall rating levels, the rater effect is high. The relative ratee effect is a
measure of consensus among the raters; a high ratee variance indicates a high degree
of consensus about how each ratee performed. Finally, the relative relationship
effect is the variance left unaccounted for by the rater and ratee effects; it
measures the uniqueness of rater i's rating of ratee j. The following sections
present the results of the relative variance partitioning and the reliability of the
rater and ratee effects.
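In generic notation, this partitioning follows the usual Social Relations Model decomposition. The sketch below is a simplified statement, with measurement error folded into the relationship term, rather than the exact formulation used by the SRM program:

```latex
% alpha_i: rater effect, beta_j: ratee effect,
% gamma_ij: relationship component (with error folded in)
X_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}

% proportion of total variance attributed to raters; the relative
% ratee and relationship effects are defined analogously
\text{relative rater effect} =
  \frac{\sigma^{2}_{\alpha}}{\sigma^{2}_{\alpha} + \sigma^{2}_{\beta} + \sigma^{2}_{\gamma}}
```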
Relative Variance Partitioning
The variance in the variable estimates is partitioned into rater, ratee, and
relationship (interaction between rater and ratee) effects. Relative variance is the
portion of the variance due to each effect. The relative variance partitioning for the
mid-point and end of each project is summarized in Tables 4.6 through 4.9. These
results indicate a wide range in values for the average rater effect (0.21 to 0.65),
ratee effect (0.12 to 0.36), and relationship effect (0.24 to 0.51). The average
rater, ratee, and relationship effects over the four evaluations were 0.38, 0.23,
and 0.39, respectively.
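To make the partitioning concrete, the sketch below decomposes a small rater-by-ratee matrix into the three components using a simple method-of-moments, two-way random-effects calculation. It is illustrative only: the 5 x 5 matrix is invented, and the results reported here were produced with the SRM round-robin analysis, which additionally excludes self-ratings and models dyadic effects.

```python
# Minimal sketch (not the SRM program used in this study) of partitioning a
# rater x ratee matrix into rater, ratee, and residual ("relationship")
# variance with a method-of-moments, two-way random-effects decomposition.
import numpy as np

ratings = np.array([            # rows = raters, columns = ratees (hypothetical data)
    [4, 3, 5, 4, 4],
    [3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4],
    [4, 3, 4, 4, 3],
    [4, 4, 5, 4, 4],
], dtype=float)

n_raters, n_ratees = ratings.shape
grand_mean = ratings.mean()
rater_means = ratings.mean(axis=1)          # how leniently each rater rates
ratee_means = ratings.mean(axis=0)          # how highly each ratee is rated

# Mean squares for a crossed design with one observation per cell
ms_rater = n_ratees * ((rater_means - grand_mean) ** 2).sum() / (n_raters - 1)
ms_ratee = n_raters * ((ratee_means - grand_mean) ** 2).sum() / (n_ratees - 1)
residuals = ratings - rater_means[:, None] - ratee_means[None, :] + grand_mean
ms_resid = (residuals ** 2).sum() / ((n_raters - 1) * (n_ratees - 1))

# Method-of-moments variance components (negative estimates truncated at zero)
var_resid = ms_resid
var_rater = max((ms_rater - ms_resid) / n_ratees, 0.0)
var_ratee = max((ms_ratee - ms_resid) / n_raters, 0.0)
total = var_rater + var_ratee + var_resid

print(f"relative rater effect:        {var_rater / total:.2f}")
print(f"relative ratee effect:        {var_ratee / total:.2f}")
print(f"relative relationship effect: {var_resid / total:.2f}")
```

The three relative proportions sum to one (up to rounding), which is how each row of Tables 4.6 through 4.9 should be read.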
Table 4.6
Portion of Variance Due to Rater, Ratee, and Relationship Effects
Mid-Point First Project
Variable Rater Effect Ratee Effect Relationship Effect
Backup 0.19 * 0.31 * 0.50
Communication 0.34 * 0.20 * 0.46
Coordination 0.24 * 0.42 * 0.34
Feedback 0.40 * 0.36 * 0.24
Leadership 0.23 * 0.42 * 0.35
Team Orientation 0.31 * 0.31 0.37
Effort 0.17 * 0.42 * 0.41
Know 0.26 * 0.42 * 0.32
Average 0.27 0.36 0.37
Note: * = p < 0.05
Table 4.7
Portion of Variance Due to Rater, Ratee, and Relationship Effects
End First Project
Variable Rater Effect Ratee Effect Relationship Effect
Backup 0.20 * 0.22 0.58
Communication 0.27 0.20 * 0.53
Coordination 0.25 * 0.22 * 0.53
Feedback 0.27 * 0.27 * 0.46
Leadership 0.16 * 0.33 * 0.51
Team Orientation 0.34 * 0.24 * 0.42
Effort 0.03 0.39 * 0.58
Know 0.14 * 0.40 * 0.46
Average 0.21 0.28 0.51
Note: * = p < 0.05
Table 4.8
Portion of Variance Due to Rater, Ratee, and Relationship Effects
Mid-Point Second Project
Variable Rater Effect Ratee Effect Relationship Effect
Backup 0.61 * 0.16 * 0.23
Communication 0.66 * 0.09 0.25
Coordination 0.77 * 0.09 0.14
Feedback 0.67 * 0.07 0.26
Leadership 0.55 * 0.17 * 0.29
Team Orientation 0.74 * 0.05 0.20
Effort 0.53 * 0.14 * 0.34
Know 0.62 * 0.16 * 0.22
Average 0.65 0.12 0.24
Note: * = p < 0.05
Table 4.9
Portion of Variance Due to Rater, Ratee, and Relationship Effects
End Second Project
Variable Rater Effect Ratee Effect Relationship Effect
Backup 0.50 0.08 * 0.42
Communication 0.43 * 0.13 * 0.44
Coordination 0.51 * 0.12 * 0.37
Feedback 0.51 * 0.10 0.39
Leadership 0.31 * 0.17 * 0.52
Team Orientation 0.56 * 0.02 0.41
Effort 0.05 0.37 * 0.59
Know 0.44 * 0.26 * 0.30
Average 0.41 0.16 0.43
Note: * = p < 0.05
Reliability of Rater and Ratee Effects
The SRM statistical model also calculates an estimate of the reliability of the
mean rater and ratee effects. The reliability represents the percentage of the
variation that is attributable to variation in true scores. The reliability coefficient is
calculated by taking the ratio of the obtained variance to the expected variance for
the effect (rater or ratee). The reliability provides a
sense as to whether one can meaningfully interpret the rater and ratee effects for a
given variable (Kenny et al., 1983). The reliability estimates generated by the
SRM program are summarized in Tables 4.10 through 4.13. The average reliability of
the rater and ratee effects ranges from a low of 0.51 to a high of 0.90, indicating a
moderate to high level of reliability in the rater and ratee effect estimates.
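The ratio described above has the familiar form of a reliability coefficient. Written generically (the exact error term used by the SRM program depends on group size and the other estimated variance components, so this is a sketch of the form of the ratio rather than the program's formula):

```latex
% true-effect variance over the expected variance of the estimated effect
\mathrm{reliability}(\text{rater or ratee effect}) =
  \frac{\hat{\sigma}^{2}_{\text{effect}}}
       {\hat{\sigma}^{2}_{\text{effect}} + \hat{\sigma}^{2}_{\text{error of estimate}}}
```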
Table 4.10
Reliability of Rater and Ratee Effects
Mid-Point First Project
Variable Rater Ratee
Back-up 0.56 0.68
Communication 0.72 0.60
Coordination 0.71 0.81
Feedback 0.85 0.84
Leadership 0.70 0.81
Team Orientation 0.74 0.74
Effort 0.60 0.79
Knowledge 0.75 0.83
Average 0.70 0.76
Table 4.11
Reliability of Rater and Ratee Effects
End First Project
Variable Rater Ratee
Back-up 0.54 0.56
Communication 0.62 0.55
Coordination 0.60 0.57
Feedback 0.67 0.66
Leadership 0.50 0.68
Team Orientation 0.74 0.66
Effort 0.15 0.69
Knowledge 0.50 0.74
Average 0.54 0.64
Table 4.12
Reliability of Rater and Ratee Effects
Mid-Point Second Project
Variable Rater Ratee
Back-up 0.90 0.71
Communication 0.91 0.57
Coordination 0.96 0.71
Feedback 0.90 0.47
Leadership 0.87 0.67
Team Orientation 0.93 0.48
Effort 0.85 0.59
Knowledge 0.91 0.72
Average 0.90 0.62
Table 4.13
Reliability of Rater and Ratee Effects
End Second Project
Variable Rater Ratee
Back-up 0.80 0.40
Communication 0.76 0.50
Coordination 0.83 0.54
Feedback 0.82 0.47
Leadership 0.69 0.55
Team Orientation 0.82 0.17
Effort 0.22 0.69
Knowledge 0.85 0.78
Average 0.72 0.51
Bias Research Question 3
Three methods and data sources were used to investigate the extent of bias in
the peer evaluations. These methods and data sources include the end of course
survey, the individual interviews, and the SRM statistical model incorporating the
round-robin design of the peer assessment data. The results, summarized in Tables
4.14 through 4.17, are presented in the following sections.