A testing industry whistleblower speaks

Posted by Jan 6th, 2011.

Editor’s note: Holly Yettick is a doctoral student in the Educational Foundations, Policy and Practice program at the School of Education at the University of Colorado in Boulder.

I am not a test basher. Although I have many, many concerns about standardized testing and the way in which it is interpreted and used in our country today, you will not hear me say that it is the scourge of the earth or that test results are entirely without meaning or value.

Part of the reason is that I learned a lot about the contemplation and care that can go into the development and scoring of tests during the two summers I interned at the Educational Testing Service, whose employees include some of the smartest and most thoughtful people I have ever met. Another part is that even those who believe that standardized exams are discriminatory would probably have to admit that human judgment is even more so.

A standardized exam does not know that Ava’s father is in jail and her mother is on the street and thus treat her, if unintentionally, like she is already a lost cause. A standardized exam does not cut Jacob slack because he is the school’s star quarterback. At least with a standardized exam, whatever discrimination is going on is, well, standard and, as such, can be pinned down, studied and, hopefully, corrected.

Well, maybe. A book that I read over break made me question the very foundations of that idea. Frankly, I found it shocking and I am surprised that it has not gotten more attention—as in I think the author is someone who should be on Oprah.

The book is Making the grades: My misadventures in the standardized testing industry. The author is Todd Farley. During his 15 years as a test scorer, supervisor and developer, Farley worked on important tests (the National Assessment of Educational Progress, or NAEP, and high stakes exit exams for states). He worked for major companies, including Pearson and ETS. ETS is the only one he describes sympathetically; he says he’d trust the education of his new son to them.

Maybe his book has failed to attract much attention because its small publisher, PoliPoint Press, is so obviously partisan, with other titles including Why I’m a Democrat and The Eliminationists: How hate talk radicalized the American right. But this press is in many ways a poor fit: Farley himself is not overtly political and neither is his book.

He got into testing because he was an aspiring writer and self-described slacker who needed some way to support himself. He did it for the money—he wasn’t out to make any sort of ideological point. His specialty was scoring and supervising the scoring of essay questions and word problems that could not be scored by machines. These types of questions have grown increasingly popular in recent years. NAEP is full of them as is CSAP and just about any other state exam.  Here are just a few of Farley’s revelations:

  • When the numbers didn’t add up, scoring supervisors routinely made them up. All the time. On every project: When dozens of people are reading responses from thousands of children, it is important that there is a certain level of agreement among the scores. This agreement is measured by reliability statistics. The level of reliability requested by the client or promised by the company was virtually never attainable, not even close. So supervisors simply made up numbers. Even quality control measures instituted to ensure this did not occur could be easily thwarted: Farley and his co-workers would simply pull Mr. Smith’s scoring decision from the system if it was too different from the score given by Mrs. Jones. Numbers were also routinely changed so that reliability statistics from, say, 2004, would not differ significantly from those from 2005.
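For readers who have not met reliability statistics before, here is a rough sketch of two common ones: the exact agreement rate and Cohen's kappa, which adjusts agreement for chance. Farley's book does not say which statistics his employers reported, so treat the choice of measures as an assumption. The scores, the scorer names borrowed from the example above, and the `drop_discrepant` helper are all made up for illustration. The last step shows why pulling discrepant scores out of the system makes the numbers look better without making the scoring any better.

```python
from collections import Counter


def exact_agreement(a, b):
    """Fraction of responses on which both scorers gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level.

    Undefined (division by zero) if chance agreement is exactly 1.0,
    e.g. when both scorers give one identical score to every response.
    """
    n = len(a)
    p_observed = exact_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same score independently.
    p_chance = sum(count_a[s] * count_b[s] for s in count_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


def drop_discrepant(a, b, max_gap):
    """Hypothetical helper: keep only pairs whose scores differ by <= max_gap."""
    kept = [(x, y) for x, y in zip(a, b) if abs(x - y) <= max_gap]
    return [x for x, _ in kept], [y for _, y in kept]


# Invented scores on a 1-4 rubric from two scorers reading the same ten essays.
smith = [4, 3, 2, 1, 3, 2, 4, 1, 2, 3]
jones = [4, 2, 2, 1, 3, 3, 4, 2, 2, 1]

print("honest agreement:", exact_agreement(smith, jones))
print("honest kappa:", round(cohens_kappa(smith, jones), 3))

# Quietly remove every pair that disagrees, then recompute.
smith_kept, jones_kept = drop_discrepant(smith, jones, max_gap=0)
print("agreement after pulling scores:", exact_agreement(smith_kept, jones_kept))
```

In this invented data the two scorers agree on six essays out of ten. Once the four disagreeing pairs are removed, the remaining six agree perfectly, even though nothing about the scoring itself has changed.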

Given the economic downturn, a temporary job that pays $10 an hour to college graduates will probably attract some excellent candidates. But Farley worked in the industry during both economic booms and busts and, even during the busts, he despaired at the idea that his motley crews were making important decisions about people’s fates. These employees included non-native speakers who had trouble reading English, retirees who were battling senility, and misfits and zealots of all types who would give the same score to every question while they read a book under the desk, or refuse to follow the scoring guidelines, or fail to understand the guidelines in the first place.

So why not fire them, then? They’re not, after all, unionized. This rarely happened because scoring happens on a tight deadline: If you took the time to recruit and train yet another group of scorers, that deadline would be missed. Of course, after a training period, scorers are supposed to be “qualified” by passing a test that demonstrates that their scores are in line with the scoring guidelines. In Farley’s experience, upwards of 50 percent of recruited scorers would be fired after flunking this test. Yet deadlines had to be met. The flunkies were routinely hired back the day after being shown the door.

4 points: The response is clear, focused, and developed for the purpose specified in the prompt.

3 points: The response is clear and focused.

2 points: The response does not maintain focus or organization throughout.

1 point: The response does not maintain focus or organization throughout.

This is, of course, just an excerpt from a much more detailed explanation. But imagine trying to get agreement from several dozen individuals on interpreting “focused” or “clear” or “organization”. Often, as scoring progressed, scoring rules progressed as well. (Should a child get credit for responding to a question about a favorite food that rocks were her favorite food? What about dirt? Water?) From one year to the next, the committees of teachers convened annually to review scoring would demand to know who, exactly, had made up these crazy rubrics only to be told, “The rubric was written by last year’s range-finding committee of your state’s teachers in this very hotel.”

Interestingly, the testing industry has not responded to Farley’s claims. But I believe that taxpayers should. I’m not sure what we should do. Given what Farley has exposed, should for-profit companies be scoring tests? Should teachers be paid to score essay exams, perhaps from schools other than their own? Should all essay exams be machine-scored, especially as technology develops? Should we even attempt to assess large groups of students with such an ambiguous format?  What do you think?

3 Responses to “A testing industry whistleblower speaks”

  1. I have been a certified Chief Examiner and Coordinator of Testing for Mesa State College in Grand Junction, Colorado for over 20 years. I have seen pretty much everything from one aspect to another. I truly believe that with the technology we have today, ALL essays should be holistically scored in order to leave the human aspect out of the equation.

  2. Mark Newton, MJE says:

    Is anyone really surprised at this? What’s even more disheartening, however, is the lack of response from those who legislate such exams. Is there any politician or educrat who will truthfully answer these facts? Didn’t think so.

  3. Ed Augden says:

    To add to Mr. Newton’s remarks, it’s even more disheartening that some of the self-styled “reformers” have not responded. Or, will many of them simply dismiss these revelations as propaganda from liberals? When I began teaching in 1968, those critical of public education were mostly southern Democrats protesting integration and conservative Republicans claiming that public schools have failed because of teachers’ unions and should be replaced by vouchers. The rhetoric is basically the same in 2011. Only now, the cry that schools are failing and should be replaced by charter schools and high stakes testing is being led by the likes of U.S. Secretary of Education Arne Duncan. From my perspective, I see not reform but only a reinforcing of the authoritarian status quo, with Duncan, supported by Obama, leading the charge for high stakes testing and more authoritarianism in public education. Many public school teachers I have spoken to see very little difference between Democrats and Republicans. Is there? More important, why do the “reformers” rely primarily on personal observations and anecdotal information and ignore attributed research on poverty’s detrimental effect on student achievement, especially among the poor?

