I appreciate the recent blog discussion on student value-added methods for teacher evaluation, and I hope it can perhaps help bridge some of the “insider/outsider” rift on this topic.
My point in noting problems with value-added test score data is not to derail those efforts but to improve them. Using these data in some manner is certainly better than having no information at all about how teachers influence student achievement. But we have to be careful, because the data have many flaws. They are necessarily only a sample of the “true” quality of a teacher, which is very hard to know unless we observe that teacher in class for 180 days per year, six hours per day. That, of course, is impossible.
My concern (and a problem with the LA Times publication of such data) is that some people now think we have ironclad, precise data on teacher influence on student achievement, which we can now just plug in to evaluate the teacher. The recent EPI report, and nearly all other recent research, point out very real problems with using these data, not just overly wonky anxiety.
When you sample, you always have implicit or explicit “confidence intervals” around the estimate. A principal observing teacher quality via classroom activities is sampling. If that principal only watches a particular teacher one time during the year, for 10 minutes, that is a very imprecise sample. That teacher could be terrible most of the time, but the principal happened to catch her on part of a very good day, and (wrongly) writes down that she is an excellent teacher. Obviously, the more observations a principal does, the more likely that the sample is an accurate reflection of the “true” teacher quality (Mike Miles in Harrison 2 has addressed this well, with 8-16 observations required per year).
(We know that flipping a fair coin comes up heads about half the time over a large number of flips. But flip it only 2 times, and 25 percent of the time you get two tails, 25 percent two heads, and 50 percent a mix. So if any single observation of a “good” teacher were as likely to catch an off day (“tails”) as a good one (“heads”), two random observations would have a 25 percent chance of both catching off days and leading us to believe that she is not a good teacher.)
Note that this also requires “random” observation – if the teacher knows the principal is coming to watch, she will obviously improve her performance at that time (a corollary to teaching to the test).
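The coin-flip arithmetic above can be checked with a few lines of Python. This is only a sketch under a deliberately crude assumption: each observation is independent and has the same chance (`p_good`, a made-up parameter set to one half) of catching the teacher on a good day. Real classroom observations are not coin flips, so the numbers illustrate the sampling logic and nothing more.

```python
from fractions import Fraction
from itertools import product


def outcome_probabilities(n_obs, p_good=Fraction(1, 2)):
    """Map 'number of good-day observations' to its probability.

    Assumes n_obs independent observations, each catching a good day
    with probability p_good (an illustrative assumption, not data).
    """
    probs = {}
    # Enumerate every possible sequence of good (True) / off (False) days.
    for seq in product([True, False], repeat=n_obs):
        prob = Fraction(1)
        for good in seq:
            prob *= p_good if good else 1 - p_good
        k = sum(seq)  # how many observations caught a good day
        probs[k] = probs.get(k, Fraction(0)) + prob
    return probs


def all_off_day_probability(n_obs, p_good=Fraction(1, 2)):
    """Chance that every one of n_obs observations catches an off day."""
    return (1 - p_good) ** n_obs


two_visits = outcome_probabilities(2)
print(two_visits[0], two_visits[1], two_visits[2])  # 1/4 1/2 1/4

for n in (1, 2, 8, 16):
    print(n, "observations:", all_off_day_probability(n))
```

Under this toy assumption, two visits miss entirely a quarter of the time, while 8 visits would all land on off days only 1 time in 256, which is the intuition behind requiring many observations per year.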
It is equally important to note that test score data are also a sample of “real student learning.” If the tests are valid, reliable and measure the learning we want students to have, they are better than tests that don’t have those characteristics (which fits most of our current tests). But one test a year (or two, or a few) is only a sample of student learning. If many of a teacher’s students are ill for March CSAPs, the sampling of learning won’t be terribly accurate. The better the tests and the more tests we give, the more likely the sample is accurate and, in statistical terms, the smaller the confidence interval around the estimate, meaning we could say with more precision that teacher X is very good, and be correct about that.
Once-a-year CSAPs provide very imprecise estimates for individual teachers, which, along with other problems noted in other blog posts, should make us cautious about high-stakes use of these data.
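To make the link between the number of tests and the width of the confidence interval concrete, here is a rough sketch. It assumes independent test scores and the textbook normal approximation, in which the margin of error is about 1.96 standard deviations divided by the square root of the number of tests. The standard deviation of 10 points is a hypothetical value chosen for illustration, not a figure from CSAP data.

```python
import math


def margin_of_error(score_sd, n_tests, z=1.96):
    """Half-width of an approximate 95% confidence interval
    for the mean of n_tests independent scores."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return z * score_sd / math.sqrt(n_tests)


HYPOTHETICAL_SD = 10.0  # made-up spread of scores, in points

for n in (1, 4, 16):
    moe = margin_of_error(HYPOTHETICAL_SD, n)
    print(f"{n:2d} test(s): estimate +/- {moe:.1f} points")
```

In this simplified picture, quadrupling the number of tests only halves the margin of error, so a single annual test leaves a wide band of uncertainty around any one teacher's estimate.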
Even with these caveats, the data can be used in some basic ways, probably helping to sort teachers into three categories. I agree with Kevin Welner that we can think about using test score (and other) data to sort teachers into a small category of “excellent teachers” who seem to drive high student achievement over several years and also rank high on principal and/or peer evaluations. At the other end, consistently low student achievement scores and low observational ratings should identify truly poor teachers. The vast majority of teachers will be in a middle ground, and trying to sort them more precisely would go well beyond the ability of our sampling techniques.