
The shortcomings of value-added modeling

Posted by Aug 31st, 2010.

Hello EdNews readers. I’ll be checking in occasionally with blog entries here focused on new and worthwhile research. For this first blog, I want to point you to a research brief published on Sunday by the Economic Policy Institute.

The piece, with the dry but informative title of “Problems with the Use of Student Test Scores to Evaluate Teachers,” is authored by an extremely impressive collection of accomplished researchers. If you read nothing else about education this week, please read the three-page executive summary (then continue on and read the rest!).

Before discussing this research brief, I want to re-introduce myself. I teach school policy and law at the CU Boulder School of Education, where I also direct the Education and the Public Interest Center. In my blog entries here, I will try to point readers to useful resources on the EPIC website in addition to resources – like the new Economic Policy Institute brief – from other places.

The main point of the EPI research brief is straightforward: while value-added modeling (VAM) is a technical advancement that highlights student growth, the numbers generated are nevertheless too inaccurate to be used as a primary factor in making high-stakes decisions about teachers. That is, if someone tells you that a teacher is good or bad based on a VAM calculation, you are wise to take the judgment with a sizeable grain of salt. This is the same warning that I — with far less impressive credentials — issued a couple years ago, as did the National Academy of Sciences earlier this year.

The full EPI research brief does a great job explaining how and why high-stakes VAM policies cannot be supported by VAM itself. But there’s one quote and one illustration/study that I want to pull out of the brief, to hopefully entice you to read the entire thing.

First the quote:

“There is simply no shortcut to the identification and removal of ineffective teachers” (p. 20).

As I write these blog entries throughout the year, I could probably begin each one with, “There is simply no shortcut to…” In part, this reflects the complex nature of schooling, but it also reflects the sad state of policymaking, where politicians and others are so easily enticed by the quick fix.

The replacement of ineffective teachers with effective ones is unquestionably a worthwhile policy goal, but it’s much easier said than done. A policy intended to accomplish this goal would have to reliably (a) identify the ineffective teachers (without wrongly targeting the effective ones), and (b) identify and recruit effective replacement teachers. Also, the policy should accomplish this in a more cost-effective way than alternative possibilities (but given the problems with the first part of this puzzle, we’re not yet at the point where we should worry about such comparisons).

Now for the illustration. I’ll quote from page 2 of the executive summary:

One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time ….
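To get a feel for what "4% to 16% of the variation" means, here is a minimal simulation sketch. It is not the method used in the studies the brief cites. It assumes, purely for illustration, that a teacher's ratings in two consecutive years are standard normal scores with a positive correlation r, so that the share of variation explained is r squared (4% corresponds to r = 0.2, and 16% to r = 0.4). The function names, the number of simulated teachers, and the seed are all hypothetical.

```python
import math
import random


def correlation_from_r2(r_squared):
    """Correlation implied by a share of explained variation (simple linear case).

    Assumes the year-to-year relationship is positive.
    """
    return math.sqrt(r_squared)


def top_group_persistence(r, n_teachers=20000, top_share=0.2, seed=0):
    """Simulate two years of ratings with correlation r (an assumed normal model).

    Returns the fraction of year-1 top-group teachers who are also in the
    year-2 top group.
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Year-2 score shares only a fraction r of the year-1 signal.
        y = r * x + math.sqrt(1.0 - r * r) * noise
        year1.append(x)
        year2.append(y)

    k = int(n_teachers * top_share)
    cutoff1 = sorted(year1, reverse=True)[k - 1]
    cutoff2 = sorted(year2, reverse=True)[k - 1]

    top_year1 = [i for i in range(n_teachers) if year1[i] >= cutoff1]
    stayed = sum(1 for i in top_year1 if year2[i] >= cutoff2)
    return stayed / len(top_year1)


if __name__ == "__main__":
    for r_squared in (0.04, 0.16):
        r = correlation_from_r2(r_squared)
        share = top_group_persistence(r)
        print(f"R^2 = {r_squared:.2f}, r = {r:.1f}: "
              f"{share:.0%} of the top 20% stay in the top 20%")
```

Under these assumptions, most teachers rated in the top fifth one year land outside it the next, which is the kind of churn the quoted passage describes.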

This is scary stuff, but only if used unwisely – only if policy makers give too much credence to the scores. VAM approaches do tell us something; a teacher (or school) whose VAM scores are consistently at the extreme high end of the distribution is very likely of higher quality than one whose scores are consistently at the extreme low end. So here’s my alternative proposal: use VAM approaches as a first-stage, cost-effective tool that will help inform a more in-depth, second-stage quality analysis. A teacher or school at the bottom (e.g., the bottom 5 percent) in a state or district should be identified for classroom observations, principal evaluation, and other hands-on information-gathering that can lead to a determination of professional development needs or removal/turnover. Similarly, a teacher or school at the top might be identified for further study that might help us learn from successes. This approach has three major advantages:

  1. The ultimate evaluations of teachers and schools will not be made based on the test scores; they will be more thorough and reliable.
  2. The use of VAM here is supportable, since it is not being used to make fine-grained distinctions among teachers, but it does serve as a good tool that allows for a cost-effective use of hands-on evaluation tools.
  3. A teacher will not feel extreme pressure to teach to the test, since his or her career would ultimately be determined by hands-on observations and other information, not by students’ test scores.
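The first-stage screen in this proposal can be pictured with a short sketch. It is only an illustration under stated assumptions: "consistently" low or high is approximated here by averaging each teacher's scores across years, the 5 percent cutoff is the example figure from the proposal, and the function name and data layout are hypothetical. The output is a list of who gets a closer look, not a judgment about anyone.

```python
def flag_for_second_stage(scores_by_teacher, share=0.05):
    """Pick teachers for in-depth, hands-on review (not for a final rating).

    scores_by_teacher maps a teacher id to a list of yearly VAM scores.
    Returns (lowest, highest): ids whose multi-year mean falls in the
    bottom or top `share` of the distribution.
    """
    # Average across years so one unusual year does not drive the flag.
    means = {
        teacher: sum(scores) / len(scores)
        for teacher, scores in scores_by_teacher.items()
        if scores
    }
    ranked = sorted(means, key=means.get)  # lowest mean first
    k = max(1, round(len(ranked) * share))
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    # Hypothetical data: 20 teachers, three years of scores each.
    example = {f"t{i}": [i - 1.0, float(i), i + 1.0] for i in range(20)}
    lowest, highest = flag_for_second_stage(example)
    print("Observe and support:", lowest)
    print("Study for lessons learned:", highest)
```

The design point is that the scores only decide where observers spend their time; the observations and other hands-on evidence decide everything else.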

Sadly, the approaches being considered and implemented in Colorado and elsewhere rely far too much on test scores and VAM approaches. We are rushing toward a system of teacher evaluation that is sure to wrongly identify teachers as good or bad, and it will likely be years before policymakers realize and correct the mistake.

9 Responses to “The shortcomings of value-added modeling”

  1. And factor into the equation the discrepancies found within the special education population with testing results. These test results have been skewed for years because of many factors that I won’t list here. I still advocate that, whether in special ed or regular ed, you need to teach based on the student’s individual learning style and unique needs, which is relevant today for all students. Now school districts need to spend the money to train the teachers to meet these student needs. That is not being done, and so both students and teachers suffer.

    • Kevin Welner says:

      Yes, and there are similar problems with the results of tests administered to English Learners. If one administers a math test in English to a student with limited English skills, will the resulting low score tell you that the student is poor in math or poor in her skills regarding written academic English? How well would you do on a math test written in Khmer, for instance? What if you had a couple years of instruction in Khmer? Would you be ready then?

  2. Mark Sass says:

    Isn’t the research that is being presented to show that VAM does not work based on “old” data? Let’s say that the new evaluation system (based on SB 191) does improve student data; that because principals are now evaluated on student growth they begin to rush resources to the neediest teachers; that teachers use the VAM to collaborate and improve their practices; that the theory behind SB 191 works (I know that this is a huge leap of faith for some). Wouldn’t this impact the results that we currently see using VAM? I am probably mangling some statistics concept that you can better articulate, but aren’t we assessing results based on old causes?

    I thought about this when I read the quote from the EPI summary referring to the “effective” teachers who saw inconsistent results. If the students entering the effective teacher’s classroom year after year consistently performed at high levels (again due to an implemented SB 191-style teacher evaluation), wouldn’t this change the teacher’s results?

    I worry that the value-added component of SB 191 is overshadowing the entire approach that the bill expressed.

    • Kevin Welner says:

      What I think you’re suggesting, Mark, is a theory of action concerning the ideal causal chain following from SB 191. Let me put that aside for a moment though and first address your question about the research.

      The EPI study and the other research mentioned are not directly about SB 191. (That research is largely also not about the potential motivational effects that you consider in your note.) Instead, it’s about the strengths and weaknesses of the technology (of the statistical modeling approach). VAM is a technological improvement — success depends on whether students’ test scores increase, as opposed to just requiring scores above an adequate yearly progress threshold. That’s the VAM appeal. But the problem is that although there’s a clear improvement in this regard, many problems remain. Think about how to account for student mobility, how to account for ongoing out-of-school effects on student scores, how to account for the effects of other teachers on students’ scores in a given subject (e.g., the quality of learning in a social studies class could easily affect the student’s language arts test scores), how to account for weaknesses in the tests themselves (tests are an imperfect measure of the content/skills tested and an awful measure of the content/skills not tested). At the end of the day, differences between one teacher’s VAM scores and another teacher’s VAM scores just don’t tell us a lot. And as the study mentioned earlier points out, a given teacher’s VAM scores one year are likely to be very different from her scores the next year.

      That’s essentially what this research focuses on.

      As for the likely effects, the switch from a proficiency-threshold system to a growth model does not address core concerns about test-based accountability, such as narrowed curriculum, teaching to the test, and (as noted above) measurement error in the tests themselves. What SB 191 tries to do is to include multiple indicators — measures of teacher/principal quality beyond a VAM score. But my understanding is that SB 191 calculates the VAM scores as 50% of those quality determinations, right? If so, that seems far too high given the limitations of the technology and the likely unintended consequences of the incentives. I included at the end of my post what I think was a reasonable way to use the VAM scores, but of course that horse left the barn last legislative session, and SB 191 is the law we all have to make work as best we can unless it’s changed or repealed.

  3. jeff says:

    VAM overshadows the entire approach of the bill because the law mandates that it becomes a huge part of a teacher’s evaluation. I get what you see in this new law but IMHO it’s so ham handed that I have a hard time feeling positive about the potential benefits. I wish the Governor’s Council all the best in their undertaking to make this “comprehensive.” The proof of the pudding is in the eating.

    That said, I had serious doubts about the School Innovation Act too and here I am a founding partner of a teacher led Innovation School. We already waived the tenure law and replaced it with a professional practice model of accountability and school governance. If you can look at something through a new lens (distributed leadership in this case), the game changes and new vistas open.

    I have not found a lens on 191 that gives me anything but angst. I really do not believe this is the way to go about R&D.

  4. Jason Glass says:

    Dear Kevin (and commentators),

    This EPI study simply reinforces what practically all value-added researchers have been saying for decades: That this is ONE tool in determining educator effectiveness that should be used IN COMBINATION with other measures. Further, it becomes more stable and increasingly accurate with more data.

    It is also important to note that this EPI study is not the culmination of, nor the final word on, value-added analysis. William Sanders (the inventor of the method) counters that value-added analysis can be used to make stable, valid, and reliable inferences about educator effectiveness. In thinking about “a good way to go about R & D,” we should hold off on rushing to a conclusion based on the result of one study out of 20+ years of work.

    Contrary to some posts here, SB 191 does not require the value-added model be used for the 50% student achievement component. However, as it is the most powerful way of analyzing assessment data the human race has developed, it should most definitely be considered as one important and viable option for satisfying that component in the law for some teachers in some grades.

    SB191 resolves (legislatively) the question of “if” we are going to define and measure educator effectiveness, and it resolves the question of “if” we are going to do anything with the information.

    Now the questions move to “how” this should be done. Value-added analysis is ONE important part of that conversation.

  5. jeff says:

    Jason is correct. The law does not specify 50% VAM. What it does say is:

    ONE OF THE STANDARDS FOR
    MEASURING TEACHER EFFECTIVENESS SHALL BE DIRECTLY RELATED TO
    CLASSROOM INSTRUCTION AND SHALL REQUIRE THAT AT LEAST FIFTY
    PERCENT OF THE EVALUATION IS DETERMINED BY THE ACADEMIC GROWTH
    OF THE TEACHER’S STUDENTS. (caps in original, I’m not shouting)

    So, if VAM is “the most powerful way of analyzing assessment data the human race has developed”, presumably all other ways of looking at these data have more serious problems in terms of making high stakes decisions.

    So we can use a value added model or we can use something we already know works worse. I think most people can connect those dots and I would assume that’s why commenters have made the jump to 50% of a teacher’s evaluation coming from VAM.

    • Jason Glass says:

      I think it is the case that all the other models have more serious methodological issues than VAM does. I also think that it is the case that the debate over how serious the methodological issues with VAM really are depends on the political angle of the person making the argument, the quality of the assessment data being fed into the analysis, and how decisions are being made as a result of the information.

      Value-added is certainly a superior method of looking at data than what schools typically use, which is a pure attainment-based approach (like 43% of kids are proficient or advanced). In my humble opinion, the attainment-based way of looking at student data has done more to demonize and damage public schools than any other single effort. These approaches fail to acknowledge that many schools get tremendous growth out of kids.

      If there is a better way of looking at student assessment data than VAM, I’m open to hearing it and will happily stand corrected if someone can provide a superior approach.

      Jeff, I think you are correct that VAM will probably end up being a big part of the discussion on 191. However, it can’t be all of the discussion, as the model only works for teachers who teach in assessed areas. That’s only roughly 30% of teachers.

      The bigger, more vexing, and more interesting question (at least in my opinion) is how we will innovate and learn to handle all the untested subjects and grades. The methodological questions around value-added are small beans compared to that.

  6. jeff says:

    Jason, we certainly agree about the effects of an “attainment-based approach” to accountability.

    You also bring up a great point in terms of what percent of “teachers” are actually in tested subjects. One of the elements of ProComp, the performance pay system in DPS, uses CSAP growth and it applies to teachers of math and language arts in grades 4 through 10. Third grade is tested but there are no baseline data so growth cannot be measured. Some of the new testing the district is implementing will cover more than just these folks but will not solve the whole problem.

    Another gap that I have not heard come up so far is that in DPS, “teacher” means not only classroom teachers but also nurses, psychologists, social workers, and all the other student services professionals. They are all part of the DCTA bargaining unit and covered by the same contract as classroom teachers. We did develop evaluations that focus on these particular jobs (instead of trying to use a standard teacher evaluation).

    I assume the new law applies equally to these employees so I wonder what measures of student growth will make up at least 50% of their evaluations.
