New Tests, New Questions

New teacher evaluations include state test scores, raising questions about fairness for teachers in non-tested grades and subjects. What happens when teachers create their own assessments?

The challenge:

Delaware adopted a new teacher-evaluation system that tied ratings to student performance. Because many teachers teach subjects that are not traditionally tested, educators across the state came together to create hundreds of new local assessments to measure how much students learned in class. The state needed to know whether the assessments were fair.

The intervention:

SDP Fellow Shanna Ricketts investigated the reliability and validity of the new assessments to determine whether they served as a good measure of what students knew before and after their time in a teacher’s class.

The impact:

Ricketts conducted an in-depth statistical analysis and found that the vast majority of assessments were reliable and valid for measuring both student growth and teacher effectiveness. She identified key reasons for their success: Delaware teachers and experts had worked together to develop the assessments, and the state set an ambitious schedule of continually reviewing and revising them.

The Challenge:

New Tests, New Questions

Like many states, Delaware has overhauled its professional evaluation system for public-school teachers to determine which teachers are more and less effective in the classroom. Since 2011, the state’s 10,000 teachers have been assessed annually under the Delaware Performance Appraisal System II, which combines classroom observations with evidence of student learning to rate teachers’ effectiveness.

Evaluators had a few different sources of information to measure student learning: statewide test results, student learning objectives, and pre- and post-tests for a variety of subjects that had been developed by teams of Delaware teachers, experts at the state department of education, and consultants.

When teachers were surveyed about their impressions of the new system, many raised questions about the reliability of those homegrown pre- and post-tests. The tests had been designed to align with classroom content (there were more than 100, for subjects ranging from German Level III to Sheet Metal I), but questions remained. Were the tests a valid measure of an individual teacher’s contribution to student learning? How could their results be used to set goals for teachers and students?

The Intervention:

A Validity Check

SDP Fellow Ricketts reviewed the growth assessments and their results to measure their reliability. Through in-depth statistical analysis, she found that the tests were internally consistent and consistent with other measures of student learning, such as state tests of the same subjects.
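For readers unfamiliar with the term, internal consistency asks whether the items on a test tend to rise and fall together across students. This summary does not say which statistic Ricketts used; Cronbach's alpha is one common choice, and the sketch below is offered only as an illustration of the idea. The function name and the sample scores are hypothetical and do not come from the Delaware analysis.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Return Cronbach's alpha for a table of item scores.

    scores: list of rows, one row per student, one score per test item.
    Illustrative only; not the method documented in the capstone report.
    """
    n_students = len(scores)
    n_items = len(scores[0])
    if n_students < 2 or n_items < 2:
        raise ValueError("need at least two students and two items")

    # Sample variance of each item across students.
    item_variances = [
        variance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Sample variance of each student's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (
        1 - sum(item_variances) / total_variance
    )


if __name__ == "__main__":
    # Hypothetical data: 5 students, 4 items scored 0 (wrong) or 1 (right).
    sample = [
        [1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
    ]
    print(f"alpha = {cronbach_alpha(sample):.2f}")
```

Values closer to 1 indicate that the items move together. What counts as "reliable enough" depends on how the scores will be used, which is a judgment the analysis itself has to make.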

In her review, she looked for statistical outliers that would reveal an out-of-range question or an aspect of the tests that did not match their intended purpose and audience. Her review determined that the tests were at the right level of difficulty, were equally difficult for different subsets of students, and were sequenced properly, with students earning higher scores overall on the post-test than on the pre-test.

In addition, Ricketts reviewed whether the tests were valid in light of their intended purpose: to assess individual teacher performance. Technically, the tests were reliable enough on their own to show differences in student performance between pre-tests and post-tests; and because they were just one part of a multiple-measures system, she found them to be a valid source of information for decisions about teacher performance.

The Impact:

Lessons Learned

SDP Fellow Ricketts’ work at the Delaware Department of Education yielded lessons with implications for states across the country, drawn not only from her statistical analysis but also from an overall look at how next-generation evaluation systems can address questions of fairness.

First, the history of the tests’ development was key in establishing their validity. They had been developed in large part by teachers, which was also essential to their acceptance by teachers and principals. Second, Delaware’s commitment to continually review and revise the tests was critical.

Finally, the tests’ strong reliability ratings were supported by their inclusion in a multiple-measures system. Other states have found disparities among sources of data about teacher performance, such as value-added and classroom observation ratings. It is imperative that many data points be included in evaluations, so that no single measure unjustifiably influences teacher ratings.

SDP Resources:

Find the full capstone report by Ricketts, along with case studies with SDP fellows in Indiana and South Carolina.