You are viewing the EdNews Blog archives.
These archives contain blog posts from before June 7, 2011

Proceed with caution on value-added

Posted Nov. 29, 2010.

For fairness in blogging, I feel compelled to write about the relatively new Brookings report, “Evaluating Teachers: The Important Role of Value-Added,” which takes a different perspective on using value-added student achievement data than the EPI report (“Problems with the use of student test scores to evaluate teachers”) released a few months ago.

Both studies were co-authored by a large group of distinguished researchers, so there is lots of food for thought in this debate. But let’s be clear that this is not just another case of “top researchers disagree on facts, so I can just ignore all of that confusing research and support what my gut says.”

The EPI researchers point out many flaws in the current technology of using value-added student achievement data, and therefore recommend against using it in high-stakes decisions (like teacher rewards or dismissal). The Brookings researchers agree that value-added data have many flaws, but ask the reasonable question “compared to what?”, noting that other current methods of evaluating teachers are not very good either.

So it is more an issue of interpretation than of disagreement over the facts.

Perhaps the most interesting thing the Brookings researchers do is compare one problem with value-added data, the low correlation for a given teacher across years (the same teacher measured in years 1, 2 and 3), with parallel correlations in other domains: baseball batting averages, insurance salesperson rankings, mutual fund performance rankings, and so on.

The relatively low year-to-year correlation of teachers’ rankings (in the 0.3-0.4 range) has been cited by critics as a reason not to use value-added data, since it appears to have too much measurement “noise” to be as accurate as we would like (that is, in a disturbingly large number of cases, the same teacher is ranked as “good” one year and “bad” the next).

The Brookings researchers suggest that the “noise” is no greater than what we see in these other domains, and that this is an argument for using value-added data in high-stakes decisions, even with its flaws. In some sense, this is argument by analogy, and we should examine how well the analogies hold.
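To make the debate concrete, here is a small simulation (my own illustrative sketch, not from either report) of what a year-to-year correlation of 0.35, in the range cited for teachers, implies for rankings: it draws two years of scores with that correlation and counts how often a top-quintile teacher stays on top or falls to the bottom two quintiles.

```python
import numpy as np

# Illustrative sketch: two years of teacher scores with an assumed
# year-to-year correlation of 0.35, then quintile rankings each year.
rng = np.random.default_rng(0)
n = 100_000
r = 0.35
year1 = rng.normal(0, 1, n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.normal(0, 1, n)

# Assign quintiles each year (0 = bottom, 4 = top).
q1 = np.digitize(year1, np.quantile(year1, [0.2, 0.4, 0.6, 0.8]))
q2 = np.digitize(year2, np.quantile(year2, [0.2, 0.4, 0.6, 0.8]))

top = q1 == 4
stay_top = np.mean(q2[top] == 4)   # top-quintile teachers who stay on top
fall_low = np.mean(q2[top] <= 1)   # top-quintile teachers who drop to the bottom two
print(f"stay on top: {stay_top:.2f}, fall to bottom 40%: {fall_low:.2f}")
```

Under these assumptions, only a minority of top-quintile teachers repeat on top, and a substantial share land in the bottom two quintiles the next year, which is the kind of churn critics point to.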

First, we have to think about whether we believe a teacher varies greatly in her performance over time. Does a teacher have a “true type” (good, bad, average), so that our challenge is to sample and measure that fairly consistent “type” correctly, or does the teacher actually vary greatly in her ability over time? (See my earlier blog on this topic.)

If we think the teacher’s type is relatively “fixed,” then the low correlation is a big problem: it represents “noise” and our inability to sample well enough to find the “true type.” Personally, that perspective makes more sense to me than a belief that teacher quality varies greatly from year to year.
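The “fixed type plus noise” story is easy to check with a simulation (again an illustrative sketch, with an assumed noise level rather than anything estimated from real data): give every teacher a constant true effectiveness, add independent measurement noise each year, and look at the resulting year-to-year correlation.

```python
import numpy as np

# Sketch of "fixed type plus noise" (noise level is an assumption):
# each teacher's true effectiveness is constant; each year's observed
# value-added adds independent measurement noise.
rng = np.random.default_rng(0)
n = 10_000
true_type = rng.normal(0, 1, n)

# noise_sd = 1.35 is chosen so the fixed signal is roughly 35% of the
# observed variance in any one year.
noise_sd = 1.35
year1 = true_type + rng.normal(0, noise_sd, n)
year2 = true_type + rng.normal(0, noise_sd, n)

r = np.corrcoef(year1, year2)[0, 1]
print(f"year-to-year correlation: {r:.2f}")
```

A perfectly stable teacher type is fully consistent with a correlation in the 0.3-0.4 range, as long as each year's measurement is noisy enough, which is exactly the interpretation sketched above.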

Second, let’s examine the analogies more carefully. First, baseball: hitting a ball thrown 60 feet at 90 miles per hour is a notoriously difficult and fickle skill. Concentration and mental state seem to matter a great deal, and so, perhaps, do the quality of the pitching and non-random choices about which hitter to put up against which pitcher (lefty against righty, fastball versus curveball).

It makes sense to me that top hitters one year might be less effective the next, when they might be facing a divorce, or a contract year, or whatever. Also, baseball players are famously highly compensated, perhaps partly for the riskiness of their performance over time.

This seems to me a weak analogy to a teacher who has six hours a day, 180 days a year, in a more self-controlled environment in which to perform “good, bad or average.”

Second, let’s look at mutual fund performance. Here, from living 15 years in NYC, where the financial markets are eaten for breakfast, I am quite confident that the top investors in boom times (buy more tech stocks in 1997!) are also likely to be among the worst investors in downturns (buy more tech stocks in 2000!) or slow-growth periods (when the top performers are cautious investors with diverse portfolios).

Academic studies of financial markets strongly suggest “random walks” and very little reason to expect high correlations in top performers across years, making this, again, a poor analogy.

Third, insurance sales seem similar to financial investment to me. Perhaps during good economic times a particular type of salesperson can more easily get families to spend money on insurance, while it might require a different set of skills to be a high sales performer in a recession. Thus, the low correlation is in fact caused by factors outside the salesperson’s control (as with the financial markets, and probably partly in baseball too).

Their best analogy might seem to be the use of ACT/SAT scores for college admission despite a relatively low correlation with student GPAs (and the fact that no other measurable admissions factor has a higher correlation). While I at first found this argument more compelling, it is severely flawed: they are now correlating two different things (actual student course achievement and a specific test), not the same thing over multiple years, and a statistician would expect more “noise” when correlating two different things.
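That expected attenuation is easy to demonstrate with a toy simulation (all numbers here are assumptions chosen for illustration, not estimates from admissions data): two administrations of the same test correlate more strongly than the test correlates with grades, which reflect a partly different construct.

```python
import numpy as np

# Toy illustration (all parameters assumed): a test measures latent
# ability with noise; grades reflect ability plus distinct factors
# (effort, course selection, grading practices).
rng = np.random.default_rng(0)
n = 10_000
ability = rng.normal(0, 1, n)

test_score = ability + rng.normal(0, 0.7, n)        # noisy measure of ability
retest     = ability + rng.normal(0, 0.7, n)        # same instrument, fresh noise
grades     = 0.7 * ability + rng.normal(0, 1.0, n)  # a different construct

same_thing = np.corrcoef(test_score, retest)[0, 1]
different  = np.corrcoef(test_score, grades)[0, 1]
print(f"test-retest: {same_thing:.2f}, test-vs-grades: {different:.2f}")
```

Even with the same underlying ability driving both measures, the test-vs-grades correlation comes out noticeably lower than the test-retest correlation, so a low SAT-GPA correlation tells us less about year-to-year stability than the analogy implies.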

So it seems that the Brookings researchers can’t have it both ways. Either you believe teacher type is fairly fixed, in which case the low correlation really is a problem with our measurement technologies and a true problem with using the data.

Or you believe that teachers’ quality varies as much as that of professionals in these other fields, where it seems quite clear that much of the variation is a function of the external environment. Many teachers have argued that their lack of control over their environment is a major barrier to how much they can move student achievement: the non-random assignment of teachers to groups of students, the variation in students from year to year, variations in other supports in the school, changing curricula, etc.

So where does this leave us in a discussion that is not just theoretical? After all, Colorado’s State Council for Educator Effectiveness is working, even as we blog, on implementing the SB 191 requirement that 50 percent of a teacher’s high-stakes evaluation be based upon student achievement. Researchers agree that value-added has problems; they disagree about how severe those problems are, and about whether focusing too much on these flaws, rather than on the problems with other forms of evaluation, makes the perfect the enemy of the good.

My belief is that there are many current problems with using value-added data. Some are fixable with more and better tests, with better administration of tests (to avoid outright cheating and excessive teaching to the test), and with a clear understanding of the challenges involved.

But other problems are not easily fixable, and perhaps we don’t want to fix them because they have value within schools: the non-random assignment of students to teachers within a school, the group nature of teaching (especially in the higher grades), and the difficulty of assessing performance improvement in the arts, physical education, and other domains.

These issues suggest using value-added data very carefully, and as only one component in a meaningful, high-stakes teacher evaluation system.


3 Responses to “Proceed with caution on value-added”

  1. Alexander Ooms says:

    I’d be interested in comparisons to professions that are not hyper-competitive (like professional sports and mutual fund management). To play in MLB, you have already excelled at baseball at a number of levels. Among the roughly 750 players in the major leagues, you are probably in the top 1% of all US adults for those skills. There is variance within that 1%, but the differences are small overall. Mutual fund managers somewhat less so, but it’s still a highly difficult profession with significant barriers to entry.

    In contrast, what’s remarkable about teaching is the extreme variation: the standard deviation in quality (I assume) is far greater. And no surprise, as there are roughly 3 million K-12 public school teachers, with comparatively lower barriers to entry than in other professions.

    I also think we need to be rational about our ability to differentiate somewhere. I don’t know that you can isolate a significant difference between two guys on the Rockies batting .250 and .260, but if you were to throw MLB, AAA, AA and all the other professional and semi-pro leagues together, you’d sure see a difference between the best and the worst. Let’s see if value-added measurements can help discriminate between the top and bottom quintiles, and worry less about separating the 49th-50th percentile from the 50th-51st.

  2. paul teske says:

    Alex, this is a very good point: highly skilled, highly paid baseball players and financial managers are drawn from the top echelon of people in those arenas, whereas having 3-4 million teachers means they are drawn from across the whole spectrum of ability. And, yes, common sense would suggest that year-to-year variation might be larger among those highly skilled, specialized talents.

    But that makes the point stronger here: the data show that teachers move around a lot, from high quintiles to low quintiles, even when we focus on a very large group. They move around just as much as the highly skilled folks, as measured by year-to-year correlations. That means there are serious problems with “noise” in the measurement, unless you believe that a teacher’s abilities truly vary enormously from year to year. I find that far less plausible than the alternative: that we presently have too few, and too poor, tests to make the assessments we want to make.

    • Alexander Ooms says:

      I think I’d disagree with this latter point somewhat, as I would assume that the variation in a smaller, more selective group (MLB), while similar on a relative scale, is different on an absolute scale. But I also think we are talking about differences of degree in our respective opinions, not of kind.

      Paul and others are right to be skeptical, but to me the far more difficult question is this: what error rate (or noise) are we prepared to accept when evaluating teachers? (I have written on this elsewhere.) The easiest way to make sure we never evaluate a teacher incorrectly is to never evaluate teachers, which is not that far from the current system. Are we prepared to have a system in which the error rate is 10%? 25%?

      All diagnostic systems, in medicine, law and other parts of daily life, have error rates. We can and should work to reduce them, but we also need to acknowledge that some error rate will exist, and we need more discussion and consensus on what is acceptable.
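The error-rate question raised in this exchange can be made concrete with a back-of-the-envelope simulation (the noise level is an assumption, purely for illustration): flag the bottom quintile of a single noisy yearly score, then ask how many flagged teachers are in fact above average in true effectiveness.

```python
import numpy as np

# Back-of-the-envelope illustration (noise level assumed): a policy that
# flags the bottom 20% of a single year's noisy score, checked against
# each teacher's (unobservable) true effectiveness.
rng = np.random.default_rng(0)
n = 100_000
true_q = rng.normal(0, 1, n)                  # true effectiveness
observed = true_q + rng.normal(0, 1.35, n)    # one year's noisy score

flagged = observed < np.quantile(observed, 0.2)   # bottom 20% by observed score
false_flag = np.mean(true_q[flagged] > 0)         # flagged but truly above average
print(f"share of flagged teachers who are above average: {false_flag:.2f}")
```

Under these assumed noise levels, a meaningful fraction of flagged teachers are actually above average, which is exactly the kind of error rate the discussion asks us to decide whether we can accept.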

