I appreciate the recent blog discussion of student value-added methods for teacher evaluation, and I hope it can help bridge some of the “insider/outsider” rift on this topic.

My point in noting problems with value-added test score data is not to derail those efforts but to improve them. Using these data in some manner is certainly better than having no information at all about how teachers influence student achievement. But we have to be careful, because the data have many flaws: they are necessarily only a sample of a teacher’s “true” quality, which is very hard to know when we don’t observe her in class for 180 days per year, six hours per day. That, of course, is impossible.

My concern (and a problem with the LA Times publication of such data) is that some people now think we have ironclad, precise data on teacher influence on student achievement, which we can now just plug in to evaluate the teacher. The recent EPI report, and nearly all other recent research, point out very real problems with using these data, not just overly wonky anxiety.

When you sample, you always have implicit or explicit “confidence intervals” around the estimate. A principal observing teacher quality via classroom activities is sampling. If that principal watches a particular teacher only once during the year, for 10 minutes, that is a very imprecise sample. The teacher could be terrible most of the time, but the principal happened to catch part of a very good day, and (wrongly) records that she is an excellent teacher. Obviously, the more observations a principal does, the more likely the sample is an accurate reflection of “true” teacher quality (Mike Miles in Harrison 2 has addressed this well, with 8-16 observations required per year).

(We know that flipping a fair coin gives a 50/50 heads/tails split, in a large sample. But flip it only twice, and 25 percent of the time you get two tails, 25 percent of the time two heads, and 50 percent of the time a mix. So two random observations of a “good” teacher (“heads”) leave a 25 percent chance of wrongly concluding that she is not a good teacher (two “tails”).)
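
The two-flip arithmetic in the parenthetical can be checked with a few lines of Python. This is only a sketch of the analogy as stated, taking the 50/50 “heads means she looks good that day” framing at face value:

```python
import random

# Treat each random 10-minute observation of a "good" teacher as a fair
# coin flip, where "heads" means she looks good that day. With only two
# observations, how often do both come up "tails" (both look bad)?
random.seed(1)

trials = 100_000
both_tails = sum(
    1 for _ in range(trials)
    if random.random() < 0.5 and random.random() < 0.5  # tails, then tails
)
print(both_tails / trials)  # clusters tightly around the exact value, 0.25
```

In other words, with just two observations, one in four genuinely good teachers would look bad both times, exactly as the comment says.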

Note that this also requires “random” observation – if the teacher knows the principal is coming to watch, she will obviously improve her performance at that time (a corollary to teaching to the test).

It is equally important to note that using test score data is also a sampling of “real student learning.” If the tests are valid, reliable, and measure the learning we want students to have, they are better than tests that lack those characteristics (which, unfortunately, describes most of our current tests). But one test a year (or two, or a few) is only a sample of student learning. If many of a teacher’s students are ill for March CSAPs, the sampling of learning won’t be terribly accurate. The better the tests and the more tests we give, the more likely the sample is accurate, and, in statistical terms, the smaller the confidence interval around the estimate, meaning we could say with more precision that teacher X is very good, and be correct about that.
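
The “more tests, smaller confidence interval” point follows from the standard square-root shrinkage of the standard error when independent measurements are averaged. A sketch with purely illustrative numbers (the 10-point standard deviation is assumed, not taken from any real assessment):

```python
import math

# Illustrative only: suppose a single test measures "true" learning with a
# standard deviation of 10 scale-score points. Averaging n independent
# tests shrinks the standard error by sqrt(n), and with it the width of a
# 95% confidence interval (+/- 1.96 * SE).
single_test_sd = 10.0  # assumed value, not from any real test

for n in (1, 2, 4, 9):
    se = single_test_sd / math.sqrt(n)
    print(f"{n} test(s): 95% CI is +/- {1.96 * se:.1f} points")
```

So going from one test to four halves the interval; it takes nine tests to cut it to a third. Precision improves, but slowly.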

Once-a-year CSAPs provide very imprecise estimates for individual teachers, which, along with other problems noted in other blog posts, should caution us against high-stakes use of these data.

Even with these caveats, the data can be used in some basic ways, probably helping to sort teachers into three categories. I agree with Kevin Welner that we can think about using test score (and other) data to sort teachers into a small category of “excellent teachers” who seem to drive high student achievement over several years and also rank high on principal and/or peer evaluations. At the other end, consistently low student achievement scores and low observational ratings should identify truly poor teachers. The vast majority of teachers will be in a middle ground, and trying to sort them more precisely would go well beyond the ability of our sampling techniques.


Thanks for the post, Paul. It helps to move the conversation from the margins.

I sincerely hope the Gov’s Council charged with establishing the criteria for teacher and principal effectiveness is watching and listening. Mike Miles has done some great work with the Harrison School District around the issue of teacher evaluation, and I hope we can glean some insight from their work.

Part of my frustration comes because the current conversation focuses too much on possible punitive responses to poor performance rather than on the positive side: identifying struggling teachers and getting them necessary resources. If we can identify excellent teachers, we can learn from them. If we can identify struggling teachers, we can help them. I hope this is the focus of our work on teacher evaluation.

Paul, one of the issues that comes up for me with the VAM research (and I’ll be honest here, I have not read it all) is this: if we cannot assign a value to a teacher’s impact on a student, what role do assessments play whatsoever in the classroom? When I give a formative assessment in the classroom, I know that it gives me (with limited validity) a snapshot of what a student knows at a given time. I analyze the data to see what remediation, if any, needs to take place. But why bother if I cannot assign some effect of my teaching to the students? Should I just shrug and say my students’ performance may be due to my teaching practice or to some source outside my classroom? How do I adjust my practice if I cannot identify what impact I have on the students?

Mark, I think the use of tests and quizzes for formative assessment is very important and useful. It is just that the results aren’t going to be very precise. But if you learn that most of the class really didn’t understand Lesson X, that should guide you to spend more time on the topic and/or present it differently. And you can probably determine that John is getting it better than Fred, and focus more attention on Fred for that topic.

It just wouldn’t be suited for very high-stakes judgments, like deciding whether to retain Fred in that grade, because the tests just aren’t accurate and precise enough. However, combining many regular test results with your own observations and other information about a student should certainly provide enough information to make judgments.

Sounds like good guidance for a teacher’s evaluation.

Paul, a question (hopefully relevant),

One of the primary critiques of VAM is its variability: a relatively high percentage of teachers will rank high one year but drop significantly the next. I am wondering why (or to what extent) this shows the tests to be unreliable. For a simple example (from watching a few innings of the Rockies earlier), I was reminded of the high variation in batting averages for players from year to year. In a lot of professions it is not surprising to have high performers with significant variability.

Do you have any sense of how much variability is random, and if it was controlled for in the study? And/or what would be an expected level of variability?

Alex, this is a very good question. I think the implicit assumption of most of these models is that a teacher has a “true” type (good, bad, average, etc.) that doesn’t vary much year to year (of course, in the real world it might vary in a given year, if a teacher has a family crisis or is just getting burned out). We are trying to measure that true type with student data. My sense is that, while teachers probably do vary to some degree, as do baseball hitters or pitchers, there is far greater variability in our current precision in measuring their quality than there is actual variation in their performance. That is, the measurement technique is far more variable than the underlying performance. Certainly that seems true of a one-shot CSAP test taken by a small number of students (as compared to 400 at-bats per year, all against major league pitching, which is a statistically more valid measure).

If I might interject a bit….

There is a missing piece between the administrator’s sampling observation, where they record a judgment from bad to great, and student learning outcomes as a measure of successful teaching: objective, data-based observation. That is, the relatively simple process of gathering data on teaching practices and student behaviors related to learning (which includes other in-class behaviors). One can gather data on the fidelity of implementation of whatever teaching practice or intervention the teacher chooses, to determine whether it is in fact being implemented as intended (often it’s not), and on the effectiveness of that intervention or practice on the behavior of the students. For example, a teacher wants the low-reading boys to ask more content-related questions, and decides to call on that set of students twice as often as other students. Tracking the number of questions asked of those students in comparison to all questions asked may well surprise the teacher; likewise, tracking the number of questions answered by that target group can provide evidence of the effectiveness of the effort.
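
The question-tracking John describes is, at bottom, a running tally, and it works the same whether done with pencil and paper or a few lines of code. A sketch with a made-up observation log (the labels and data here are hypothetical, not from eCOVE):

```python
from collections import Counter

# Hypothetical observation log: each entry records whom the teacher called
# on, in order. "target" marks the low-reading boys the teacher meant to
# call on twice as often.
events = ["other", "target", "other", "other", "target",
          "other", "other", "other", "target", "other"]

tally = Counter(events)
share = tally["target"] / len(events)
print(f"{tally['target']} of {len(events)} questions "
      f"({share:.0%}) went to the target group")
```

In this made-up log the target group got only 3 of 10 questions, well short of the teacher’s two-to-one intention, which is exactly the kind of surprise the objective count is meant to surface.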

The disclaimer: after 30 years in the education profession, I wrote software to aid in collecting such data, called eCOVE Observation Software (the COVE stands for Collaborate, Observe, Value, Empower). Whether you use eCOVE or pencil and paper, the important step is to gather the objective timer/counter data as a basis for professional discussion and reflection. The objective data will go far in changing the dynamic between peers or between administrator and teacher. I have created over 200 data collection tools, and you are free to explore them for ideas. Search for eCOVE and the website will come up with video, papers, all the tools created to date, etc. If I can be of assistance in designing specific data collection tools or techniques for collecting and using the objective data, don’t hesitate to email me at john@ecove.net.

Peace, John