For fairness in blogging, I feel compelled to write about the relatively new Brookings report “Evaluating Teachers: The Important Role of Value-Added” that takes a different perspective on using (value-added) student achievement data than the EPI report (“Problems with the use of student test scores to evaluate teachers”) released a few months ago.
Both studies were co-authored by a large group of distinguished researchers, so there is lots of food for thought in this debate. But let’s be clear that this is not just another case of “top researchers disagree on facts, so I can just ignore all of that confusing research and support what my gut says.”
The EPI researchers point out many flaws in the current technologies of using student value-added achievement data, and therefore recommend against its use in high-stakes decisions (like teacher rewards or firing). The Brookings researchers agree that there are many flaws in value-added data, but ask the reasonable question “compared to what,” noting that other current methods of evaluating teachers are not very good either.
So it is more an issue of interpretation than what the facts are.
Perhaps the most interesting thing the Brookings researchers do is take one problem with value-added data, the low correlation for a given teacher across years (the same teacher compared in years 1, 2 and 3), and compare it with parallel correlations in other domains, like baseball batting averages, insurance salesperson rankings, mutual fund performance rankings, etc.
The relatively low correlation of teachers’ year-to-year rankings (in the 0.3-0.4 range) has been cited by critics as a reason not to use value-added data, since it appears to have too much measurement “noise” to be as accurate as we would like (that is, in a disturbingly large number of cases, the same teacher is ranked as “good” one year and “bad” the next).
The Brookings researchers suggest that the “noise” is not greater than what we see in these other domains and that this is an argument to use value-added for high-stakes decisions, even with its flaws. In some sense, this is argument by analogy, the accuracy of which we should examine.
First we have to think about whether we believe a teacher varies greatly in her performance over time. Does a teacher have a “true type” (good, bad, average) and our challenge is to sample and measure that (pretty consistent) “type” correctly, or does the teacher actually vary greatly in her ability over time? (see my earlier blog on this topic.)
If we think the teacher type is relatively “fixed,” then the low correlation is a big problem: it represents “noise” and our inability to sample well enough to find the “true type.” Personally, that perspective makes more sense to me than a belief that teacher quality varies greatly from year to year.
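To make the “fixed type plus noise” reading concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration, not anything taken from the Brookings or EPI reports: each teacher gets a fixed true quality, each year's score is that quality plus independent random noise, and the noise level is chosen so that the year-to-year correlation lands near 0.35. The function names are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_two_years(n_teachers=20000, target_corr=0.35, seed=0):
    """Assumed model: score = fixed true quality + independent yearly noise.

    With true-quality variance 1, the year-to-year correlation is
    1 / (1 + noise_variance), so we solve for the noise that gives target_corr.
    """
    rng = random.Random(seed)
    noise_sd = math.sqrt(1.0 / target_corr - 1.0)
    true_quality = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    year1 = [q + rng.gauss(0.0, noise_sd) for q in true_quality]
    year2 = [q + rng.gauss(0.0, noise_sd) for q in true_quality]
    return year1, year2


def share_top_then_bottom_half(year1, year2, top_frac=0.2):
    """Share of year-1 top scorers who fall below the median in year 2."""
    cutoff = sorted(year1)[int(len(year1) * (1 - top_frac))]
    median2 = statistics.median(year2)
    top = [i for i, score in enumerate(year1) if score >= cutoff]
    flipped = sum(1 for i in top if year2[i] < median2)
    return flipped / len(top)


if __name__ == "__main__":
    y1, y2 = simulate_two_years()
    print("year-to-year correlation:", round(pearson(y1, y2), 3))
    print("top-quintile teachers below median next year:",
          round(share_top_then_bottom_half(y1, y2), 3))
```

In this toy model no teacher's quality ever changes, yet a sizable share of one year's top-ranked teachers land in the bottom half the next year. That is the sense in which a low correlation, under the fixed-type view, is purely a measurement problem.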
Second, let’s examine the analogies more carefully. Hitting a baseball thrown 60 feet at 90 miles per hour is a notoriously difficult and fickle skill. Concentration and mental state do seem to matter a great deal, and perhaps so do the quality of pitching and the non-random choices about which hitter to put up against which pitcher (lefty against righty, fastball versus curveball), etc.
It makes sense to me that top hitters one year might be less effective the next year, when they might be facing a divorce, or a contract year, or whatever. Also, baseball players are well-known to be highly compensated, perhaps partly for the risky elements of their performance over time.
This seems to me a weak analogy to a teacher who has six hours and 180 days a year in a more self-controlled environment to perform “good, bad or average.”
Next, let’s look at mutual fund performance. Here, from living 15 years in NYC, where the financial markets are eaten for breakfast, I am quite confident that the top investors in boom times (buy more tech stocks in 1997!) are also likely to be among the worst investors in downturns (buy more tech stocks in 2000!) or slow growth periods (when the top performers are cautious investors with diverse portfolios).
Academic studies of financial markets strongly suggest “random walks” and give us very little reason to expect high correlations across years among top performers, making this, again, a poor analogy.
Third, insurance sales seem similar to financial investment to me. Perhaps during good economic times a particular type of salesperson is more easily able to get families to spend money on insurance, while it might require a different set of skills to be a high sales performer in a recession. If so, the low correlation is in fact caused by factors outside the salespersons’ control (as with the financial markets, and probably partly in baseball too).
Their best analogy might seem to be the use of ACT/SAT scores for college admission despite a relatively low correlation with student GPAs (and the fact that no other measurable admissions factor has a higher correlation). While I first found this argument more compelling, it is severely flawed: they are now correlating two different things (actual student course achievement and a specific test), not the same thing over multiple years. A statistician would expect more “noise” when correlating two different things.
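The statistical point can be sketched with the standard attenuation relationship from classical test theory, which I am bringing in here as an assumption rather than as anything the Brookings authors use. If measurement errors are independent, the observed correlation between two different measures is the correlation between the underlying traits, shrunk by the reliabilities of both measures; for the same measure repeated on a stable trait, the correlation is simply that measure's reliability. The function names and numbers below are hypothetical, chosen only for illustration.

```python
import math


def same_measure_correlation(reliability):
    """Expected correlation of one measure with itself across occasions,
    assuming the underlying trait is stable and errors are independent."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return reliability


def cross_measure_correlation(true_corr, reliability_x, reliability_y):
    """Expected observed correlation between two different measures
    (classical attenuation: true correlation times the square root of
    the product of the two reliabilities)."""
    for value in (reliability_x, reliability_y):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must be between 0 and 1")
    if not -1.0 <= true_corr <= 1.0:
        raise ValueError("true_corr must be between -1 and 1")
    return true_corr * math.sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    # Hypothetical numbers: a fairly reliable test, a noisier GPA,
    # and underlying traits that overlap but are not identical.
    print(cross_measure_correlation(true_corr=0.8,
                                    reliability_x=0.9,
                                    reliability_y=0.7))
```

Even with a reliable test, the cross-measure correlation falls below the reliability of either measure whenever the two underlying traits are not identical, so a modest SAT-to-GPA correlation says little about how noisy either measure is on its own.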
So it seems that the Brookings researchers can’t have it both ways. Either you believe teacher type is fairly fixed, in which case the low correlation reflects a weakness in our measurement technologies and is a true problem with using the data.
Or you believe that teachers’ quality varies as much as that of professionals in these other fields, where it seems quite clear that much of the variation is a function of the external environment. Many teachers have argued that their lack of control over their environment is a major barrier to how much they can move student achievement: the non-random assignment of teachers to groups of students, the variation in students from year to year, variations in other supports in the school, changing curricula, etc.
So where does this leave us in a discussion that is not just theoretical? After all, Colorado’s State Council for Educator Effectiveness is working, even as we blog, on implementing the SB 191 requirement that 50 percent of a teacher’s high-stakes evaluation be based upon student achievement. Researchers agree that value-added has problems. They disagree about how severe those problems are, and about whether focusing too much on these flaws, and not on problems with other forms of evaluation, makes “the perfect the enemy of the good.”
My belief is that there are many current problems with using value-added – some are fixable, with more and better tests, with better administration of tests (to avoid outright cheating and overly teaching to the tests), and with a clear understanding of the challenges created.
But other problems are not easily fixable, and perhaps we don’t want to fix them because they have value within schools: the non-random assignment of students to teachers within a school, the group nature of teaching (especially in higher grades), and the difficulty of assessing performance improvement in the arts, physical education, and other domains.
These issues suggest using value-added data very carefully, and as only one component in a meaningful, high-stakes teacher evaluation system.