You are viewing the EdNews Blog archives.
These archives contain blog posts from before June 7, 2011

Archive for the ‘Testing’ Category

Why we opt out of CSAP

Monday, March 28th, 2011

Angela Engel is the author of the book Seeds of Tomorrow: Solutions for Improving our Children’s Education and the director of Uniting4Kids, a new national non-profit promoting quality neighborhood schools through parent, teacher and student leadership.

CSAP won’t help your kids get jobs. According to a recent report, “Are They Really Ready To Work? Employers’ Perspectives on the Basic Knowledge and Applied Skills of New Entrants to the 21st Century U.S. Workforce,” employers identified professionalism, teamwork, oral communication, ethics and social responsibility as the most important skills.

These are the very skills NOT being measured by CSAP.

CSAP won’t help your kids succeed in college. Recent reports say 50 percent of college students require remediation. In college, they don’t supply questions and force-feed answers. You’re expected to think critically, demonstrate inquiry, and solve real problems.

CSAP won’t help your kids read and write better. Shaded bubbles don’t teach kids, people do. Children emerge as lifelong learners when reading is for reading’s sake and writing has value to the writer, not only to the data collector or the report drafter.

CSAP won’t help your kids grow into good mothers and fathers or husbands and wives. Life’s most important jobs require confidence, cooperation, and compassion. High-stakes tests produce stress and anxiety and promote competition. Children today are reporting stomach aches, headaches, and fatigue in growing numbers. The percentage of children being medicated is increasing at alarming rates.

CSAP won’t help your kids be good citizens. Realizing the promise of democracy requires empowered voters willing and able to challenge fraud and corruption. Standardization is the antithesis of American values. Government’s attempts to micro-manage education are a violation of the fundamental and inalienable rights of the individual. The United States was founded on the principle that government is responsible to, and derives its power from, its citizens, not the other way around.

CSAP cannot be trusted to indicate anything – especially student achievement, teacher effectiveness or school quality.

Answers not graded by a computer are graded by temporary workers with no education experience and little training. They examine hundreds of student responses during a single shift. Their judgments of your children’s answers are not scientific, but subjective.

CSAP won’t improve your child’s education. Findings show that increasing standardized testing does not improve student achievement. While the costs are never calculated, class sizes are increasing, curriculum is narrowing, and breadth of content has replaced depth of learning. Many districts have lost music and art. Schools have less money for field trips and after-school activities. Practice booklets have replaced books, computers, and lab equipment. Colorado has spent more than $50 million a year developing, administering, grading, summarizing, and reporting all those numbers with no return.

CSAP doesn’t create a positive and respectful school climate.

Income has the highest correlation to CSAP scores. Researchers can predict schools’ summative test scores with 83% accuracy based on the number of children on free and reduced lunch. The test is particularly damaging to children with disabilities, English-as-a-second-language learners, and children from low-income families. CSAP punishes teachers, pressures students, and pits administrators against parents.

CSAP scores don’t get schools more money. Funding is based on pupil enrollment. According to the Colorado Department of Education, “The Education Accountability Act of 2009 (SB 09-163) repealed previous SAR law. Negative weights for Unsatisfactory and No Score percentages are not in effect anymore.” In looking at this year’s budget cuts perhaps it’s the tests we should consider cutting.

I’m a mom. I used to be a teacher. I really care about kids. I know, as you know, that these measurement tools are counter to who we are as learners and everything we value about human development, engagement, and equitable opportunity.

This year marked the sixth year we sent a refusal note the day before CSAP testing began. Administrators have always acknowledged our parental authority and respected our family’s decision. Non-conformity and challenging the status quo make many people uncomfortable.

I simply explain that CSAP is like gambling – it’s designed to make you lose. The wealthier you are, the more you can afford to risk. Dr. Martin Luther King said, “There comes a time when silence is betrayal.” In the game of CSAP and high-stakes testing, our children are losing. Refusing to let them take the test won’t make them win; it will just change the game.

A testing industry whistleblower speaks

Thursday, January 6th, 2011

Editor’s note: Holly Yettick is a doctoral student in the Educational Foundations, Policy and Practice program at the School of Education at the University of Colorado in Boulder.

I am not a test basher. Although I have many, many concerns about standardized testing and the way in which it is interpreted and used in our country today, you will not hear me say that it is the scourge of the earth or that test results are entirely without meaning or value.

Part of the reason is that I learned a lot about the contemplation and care that can go into the development and scoring of tests during the two summers I interned at the Educational Testing Service, whose employees include some of the smartest and most thoughtful people I have ever met. Another part is that even those who believe that standardized exams are discriminatory would probably have to admit that human judgment is even more so.

A standardized exam does not know that Ava’s father is in jail and her mother is on the street and thus treat her, if unintentionally, like she is already a lost cause. A standardized exam does not cut Jacob slack because he is the school’s star quarterback. At least with a standardized exam, whatever discrimination is going on is, well, standard and, as such, can be pinned down, studied and, hopefully, corrected.

Well, maybe. A book that I read over break made me question the very foundations of that idea. Frankly, I found it shocking and I am surprised that it has not gotten more attention—as in I think the author is someone who should be on Oprah.

The book is Making the grades: My misadventures in the standardized testing industry. The author is Todd Farley. During his 15 years as a test scorer, supervisor and developer, Farley worked on important tests (National Assessment of Educational Progress or NAEP, high-stakes exit exams for states). He worked for major companies, including Pearson and ETS. ETS is the only one he describes sympathetically; he says he’d trust the education of his new son to them.

Maybe his book has failed to attract much attention because its small publisher, PoliPoint Press, is so obviously partisan, with other titles including Why I’m a Democrat and The Eliminationists: How hate talk radicalized the American right. But this press is in many ways a poor fit: Farley himself is not overtly political and neither is his book.

He got into testing because he was an aspiring writer and self-described slacker who needed some way to support himself. He did it for the money—he wasn’t out to make any sort of ideological point. His specialty was scoring and supervising the scoring of essay questions and word problems that could not be scored by machines. These types of questions have grown increasingly popular in recent years. NAEP is full of them as is CSAP and just about any other state exam.  Here are just a few of Farley’s revelations:

  • When the numbers didn’t add up, scoring supervisors routinely made them up. All the time. On every project: When dozens of people are reading responses from thousands of children, it is important that there is a certain level of agreement among the scores. This agreement is measured by reliability statistics. The level of reliability requested by the client or promised by the company was virtually never attainable, not even close. So supervisors simply made up numbers. Even quality control measures instituted to ensure this did not occur could be easily thwarted: Farley and his co-workers would simply pull Mr. Smith’s scoring decision from the system if it was too different from the score given by Mrs. Jones. Numbers were also routinely changed so that reliability statistics from, say, 2004, would not differ significantly from those from 2005.

Given the economic downturn, a temporary job that pays $10 an hour to college graduates will probably attract some excellent candidates. But Farley worked in the industry during both economic booms and busts and, even during the busts, he despaired at the idea that his motley crews were making important decisions about people’s fates. These employees included non-native speakers who had trouble reading English, retirees who were battling senility, and misfits and zealots of all types who would give the same score to every question while they read a book under the desk, or refuse to follow the scoring guidelines, or fail to understand the guidelines in the first place.

So why not fire them, then? They’re not, after all, unionized. This rarely happened because scoring happens on a tight deadline: If you took the time to recruit and train yet another group of scorers, that deadline would be missed. Of course, after a training period, scorers are supposed to be “qualified” by passing a test that demonstrates that their scores are in line with the scoring guidelines. In Farley’s experience, upwards of 50 percent of recruited scorers would be fired after flunking this test. Yet deadlines had to be met. The flunkies were routinely hired back the day after being shown the door.

4 points: The response is clear, focused, and developed for the purpose specified in the prompt.

3 points: The response is clear and focused.

2 points: The response does not maintain focus or organization throughout.

1 point: The response does not maintain focus or organization throughout.

This is, of course, just an excerpt from a much more detailed explanation. But imagine trying to get agreement from several dozen individuals on interpreting “focused” or “clear” or “organization.” Often, as scoring progressed, scoring rules progressed as well. (Should a child get credit for responding to a question about a favorite food that rocks were her favorite food? What about dirt? Water?)  From one year to the next, the committees of teachers convened annually to review scoring would demand to know who, exactly, had made up these crazy rubrics, only to be told, “The rubric was written by last year’s range-finding committee of your state’s teachers in this very hotel.”

Interestingly, the testing industry has not responded to Farley’s claims. But I believe that taxpayers should. I’m not sure what we should do. Given what Farley has exposed, should for-profit companies be scoring tests? Should teachers be paid to score essay exams, perhaps from schools other than their own? Should all essay exams be machine-scored, especially as technology develops? Should we even attempt to assess large groups of students with such an ambiguous format?  What do you think?

Leaning tower of PISA

Tuesday, December 7th, 2010

As I write this, 2009 results of the Programme for International Student Assessment, an internationally standardized triennial test of 15-year-old students’ performance in reading, math and science, have just been released.

Here is some language from a news release issued by several national education organizations:

“15-year-old students in the United States continue to rank at average or below average in international comparisons of reading, math, and science.

“The bottom line of the 2009 results is: The U.S. ranked 14 out of 65 countries in reading, virtually the same ranking as the 2003 test; 17th in science, which is an improvement from 21st in 2006; and 25th in mathematics, the same ranking as 2006. The good news is that U.S. students, especially those with the lowest performance, have significantly improved in science since 2006.”

Like so many education matters these days, your reaction to these numbers depends in large part on your educational ideology.

On one side, we will hear people proclaiming that the sky continues to fall. In a new analysis of 2006 PISA scores that appeared in the current issue of The Atlantic, Stanford economist Eric Hanushek said when it comes to international comparisons, the news for the U.S. is all bad:

Even…relatively privileged students do not compete favorably with average students in other well-off countries. On a percentage basis, New York state has fewer high performers among white kids than Poland has among kids overall. In Illinois, the percentage of kids with a college-educated parent who are highly skilled at math is lower than the percentage of such kids among all students in Iceland, France, Estonia, and Sweden.

In other words, Hanushek says, our education system doesn’t just fail low-income kids of color in urban districts. It fails everyone. Expect more handwringing of this sort later today.

On the other side, we will hear people who decry standardized testing as a soul-killing and one-dimensional measurement tool that tells us almost nothing useful. And international comparisons are especially pernicious, they say. In her Washington Post blog, Valerie Strauss quotes the late, lamented Gerald Bracey:

But really, does the fate of the nation rest on how (kids) bubble in answer sheets? I don’t think so. Neither does British economist, S. J. Prais. We look at the test scores and worry about the nation’s economic performance. Prais looks at the economic performance and worries about the validity of the test scores: “That the United States, the world’s top economic performing country, was found to have school attainments that are only middling casts fundamental doubts about the value and approach of these [international assessments].”

How politicized has this debate become in a politicized and polarized era? Here’s a brief sampling of comments that followed Hanushek’s article on The Atlantic website:

Reporters need to stop taking Hanushek’s (quiet, gentle) word for it and actually question his logic. They will find it to be quite ideological, and not bound by his data. There is a lot of sophisticated hand waving, but it masks an ideological agenda not based on the data. A simple understanding of what an effect size is, what portion of variance explained means, and basic economic (and psychological) research methods would help journalists be more skeptical of listening to “that guy you go to for What’s the other side of the story?” This sort of false equivalency is what makes many scientists hate the majority of science reporting, and social science reporting (which is what this is) is no exception.


It is also worth noting that STANDARDIZED TESTS have been showing these discrepancies for THREE DECADES, since we were proclaimed “A Nation at Risk” as a result, in the report of that title. And yet the generation raised in those schools has ACED “the real test,” further extending America’s lead in virtually every area of human endeavor. To me, this discrepancy between test results and REAL-WORLD results suggests something completely different–that we have not yet figured out how to measure WHAT REALLY MATTERS in an information-based society. Until we do, the results of standardized tests whose results do not correspond to real world performance must be taken with a LARGE grain of salt…

Then there is this:

Math education at the grammar school/middle school level in the US is mostly about time wasting. Teachers ignorant of mathematics teach kids that basic arithmetic is a concept rather than a process.

The reality is that any average 8 year old can master fractions and decimals and any average 10 year old can master algebra. US schools spend a decade teaching kids two years worth of material in math. Don’t even get me started on foreign languages. Nowadays many kids have had three years of foreign language by the time they enter high school, and they have absolutely no skills- they can’t speak, or write or read. One summer immersion program would be more effective and cheaper if the goal were for the student to become proficient. Clearly that is not the goal.

In general the US system is about punching the clock, wasting time, filling up the day. The European and Asian systems emphasize actual proficiency in key concepts and demand more work and more proficiency in their students in shorter periods of time. Americans- if you want your kid to learn math, sign them up for Kumon or other similar programs that emphasize continual practice and allow students to proceed at their own pace. Your kid will know that 2+2 is not a “concept”, it is “4”. They will perform well on standardised tests and be ready when the time comes for real concepts (i.e. mathematics rather than arithmetic).

Even the UK system (not one of the strongest in Europe), puts the US to shame. The average kid with good A levels (essentially a pre-University qualification) is better educated than the average US graduate of a four year state university.

It is not that US students are less capable. They are simply less educated. Standards are lower and true proficiency is not the object of the US educational system. The object is to get the kids through and keep teachers employed. Ironically, at the PhD and higher graduate levels in technical fields, US graduates start to shine. By then these students have made up for their poor primary/secondary schooling and combine knowledge with understanding and insight (as US higher institutions emphasize creativity and problem solving over rote processes and obedience, unlike many countries in Asia).

I could go on excerpting, but you get the point.

Proceed with caution on value-added

Monday, November 29th, 2010

For fairness in blogging, I feel compelled to write about the relatively new Brookings report “Evaluating Teachers: The Important Role of Value-Added” that takes a different perspective on using (value-added) student achievement data than the EPI report (“Problems with the use of student test scores to evaluate teachers”) released a few months ago.

Both studies were co-authored by a large group of distinguished researchers, so there is lots of food for thought in this debate. But let’s be clear that this is not just another case of “top researchers disagree on facts, so I can just ignore all of that confusing research and support what my gut says.”

The EPI researchers point out many flaws in the current technologies of using student value-added achievement data, and therefore recommend against its use in high-stakes decisions (like teacher rewards or firing).  The Brookings researchers agree that there are many flaws in value-added data, but ask the reasonable question “compared to what,” noting that other current methods of evaluating teachers are not very good either.

So it is more an issue of interpretation than what the facts are.

Perhaps the most interesting thing the Brookings researchers do is compare one problem with value-added data, the low correlation for a given teacher across years (the same teacher compared in year 1, 2 and 3), with parallel correlations in other domains, like baseball batting averages, insurance salesperson rankings, mutual fund performance rankings, etc.

The relatively low correlation of teachers’ year-to-year rankings (0.3-0.4 range) has been cited by critics as a reason not to use value-added data, since it appears to have too much measurement “noise” to be as accurate as we would like (that is, in a disturbingly large number of cases, the same teacher is ranked as “good” one year, and then “bad” the next year).

The Brookings researchers suggest that the “noise” is not greater than what we see in these other domains and that this is an argument to use value-added for high-stakes decisions, even with its flaws. In some sense, this is argument by analogy, the accuracy of which we should examine.

First we have to think about whether we believe a teacher varies greatly in her performance over time.  Does a teacher have a “true type” (good, bad, average) and our challenge is to sample and measure that (pretty consistent) “type” correctly, or does the teacher actually vary greatly in her ability over time?  (see my earlier blog on this topic.)

If we think the teacher type is relatively “fixed,” then the low correlation is a big problem, and it represents “noise” and our inability to sample well enough to find the “true type.”  Personally, I find that perspective makes more sense than a belief that teacher quality varies greatly from year to year.
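To make the “fixed type plus noise” reading concrete, here is a minimal sketch in Python. It rests on assumptions that are mine, not the Brookings or EPI authors’: each teacher has a true quality that never changes, each year’s score is that quality plus independent random noise, and everything is normally distributed. Under those assumptions the year-to-year correlation is simply the share of score variance that comes from true quality, so a correlation of 0.35 would mean roughly 65 percent of the variation in a single year’s score is noise. The function names and the 0.35 figure (the middle of the 0.3-0.4 range cited above) are illustrative.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_year_to_year_correlation(signal_share, n_teachers=20000, seed=0):
    """Simulate two years of scores for teachers whose true type is fixed.

    signal_share is the (assumed) fraction of score variance that reflects
    true teacher quality; the remainder is independent yearly noise.
    """
    rng = random.Random(seed)
    true_sd = signal_share ** 0.5
    noise_sd = (1 - signal_share) ** 0.5
    year1, year2 = [], []
    for _ in range(n_teachers):
        true_quality = rng.gauss(0, true_sd)  # same in both years
        year1.append(true_quality + rng.gauss(0, noise_sd))
        year2.append(true_quality + rng.gauss(0, noise_sd))
    return pearson(year1, year2)


# Hypothetical input: 35% of score variance is "true type", 65% is noise.
r = simulate_year_to_year_correlation(0.35)
print(f"simulated year-to-year correlation: {r:.2f}")
```

The simulated correlation lands close to the signal share that was fed in. The sketch says nothing about whether teacher type really is fixed; it only shows what a correlation in the 0.3-0.4 range would imply if it were.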

Second, let’s examine the analogies more carefully.  Start with baseball. Hitting a baseball thrown 60 feet at 90 miles per hour is a notoriously difficult and fickle skill. Concentration and mental state do seem to matter a great deal, and perhaps so do the quality of pitching and non-random choices about which hitter to put up against which pitcher (lefty against righty, fastball versus curveball).

It makes sense to me that top hitters one year might be less effective the next year, when they might be facing a divorce, or a contract year, or whatever.  Also, baseball players are well-known to be highly compensated, perhaps partly for the risky elements of their performance over time.

This seems to me a weak analogy to a teacher who has six hours a day and 180 days a year in a more self-controlled environment to perform “good, bad or average.”

Second, let’s look at mutual fund performance.  Here, from living 15 years in NYC, where the financial markets are eaten for breakfast, I am quite confident that the top investors in boom times (buy more tech stocks in 1997!) are also likely to be among the worst investors in downturns (buy more tech stocks in 2000!) or slow growth periods (when the top performers are cautious, diverse portfolio investors).

Academic studies of financial markets strongly suggest “random walks” and very little likelihood that we should expect high correlations across years in top performers, making this, again, a poor analogy.

Third, insurance sales seem similar to financial investment to me – perhaps during good economic times a particular type of salesperson is more easily able to get families to spend money on insurance – it might require a different set of skills to be a high sales performer  in a recession.   Thus, the low correlation is in fact caused by factors outside the salespersons’ control (as with the financial markets, and probably partly in baseball too).

Their best analogy might seem to be the use of ACT/SAT scores for college admission despite a relatively low correlation with student GPAs (and the fact that no other measurable admissions factor has a higher correlation).  While I first found this argument  more compelling, it is severely flawed by the fact that they are now correlating two different things – actual student course achievement and a specific test – not the same thing over multiple years – a statistician would expect more “noise” in correlating two different things.

So it seems that the Brookings researchers can’t have it both ways.  Either you believe teacher type is fairly fixed, in which case the low correlation is really a problem with our measurement technologies and a true problem with using the data.

Or you believe that teachers’ quality varies as much as that of professionals in these other fields, where it seems quite clear that much of this variation is a function of the external environment.   Many teachers have argued that their lack of control over their environment is a major barrier to how much they can move student achievement – the non-random assignment of teachers to groups of students, the variation in students from year to year, variations in other supports in the school, changing curricula, etc.

So where does this leave us in a discussion that is not just theoretical? After all, Colorado’s State Council for Educator Effectiveness is working, even as we blog, on implementing the SB 191 requirement that 50 percent of a teacher’s high-stakes evaluation be based upon student achievement.   Researchers agree that value-added has problems – they disagree about how severe those problems are, and whether focusing too much on these flaws, and not on problems with other forms of evaluation, makes “the perfect the enemy of the good.”

My belief is that there are many current problems with using value-added – some are fixable, with more and better tests, with better administration of tests (to avoid outright cheating and overly teaching to the tests), and with a clear understanding of the challenges created.

But other problems are not easily fixable, and perhaps we don’t want to fix them because they have value within schools – the non-random assignment of students to teachers within a school, the group nature of teaching, especially in higher grades, the difficulty assessing performance improvement in the arts, physical education, and other domains.

These issues suggest using value-added data very carefully, and as only one component in a meaningful, high-stakes teacher evaluation system.

The tension between intentions and outcomes

Monday, October 18th, 2010

In case you missed this, Sabrina Shupe and Alexander Ooms had a great thread discussion on a blog post this past week.  I highly suggest you read it.  It resonated with me because it articulated an internal rumination that I have had for years now.

The discourse reminded me about why I entered the teaching profession.  It was to make a difference; it was to continue the social justice work that I had always been prepared for by my childhood experience in Chicago: working with migrant workers who struggled to make a living and going to Operation Push gatherings to listen to Jesse Jackson encourage people to continue the good fight.  I wanted to, and still want to, promote education reform as a civil rights issue, the civil rights issue of our time.

As a new teacher I embraced my passion and used it to get through those tough first years of teaching.  But as I gained more insight into the day-to-day challenges of teaching I began to question my efficacy as a teacher.  As a teacher I knew I was doing the right work; my intentions were good.  But was I being an effective teacher?  How would I know?

CSAP came along, as did other standardized assessments.  These assessments allowed me to assess the work of my students.  But they still didn’t satiate my desire to know if I was being effective.  Seven years ago my school began to implement the ideas of professional learning communities.

PLCs are driven by collaboration:  collaboration that relies on comparing the work of teachers of similar courses.  Today, common course teachers will publicly show results of common assessments to identify outstanding work and to identify substandard work.  This work of collaboration allowed me to judge my effectiveness, but more importantly it gave me access to the highly effective teaching practices of my colleagues.  To do this I had to rely on some standardized assessments, like CSAP and the ACT, as well as common assessments produced by our common course team.  I knew the pitfalls of CSAPs and other standardized assessments, but I balanced these concerns with the positive aspects of how they could improve my practice.

For me, Alexander and Sabrina’s dialogue highlighted the contradictions and the wonderful tension between intentions and outcomes.  Neither can operate alone, just as theory and practice have to negotiate one’s reality.  I hope I have not misrepresented Alexander’s and Sabrina’s discussion; read it for yourself.  But it struck a chord with me.

Today, public discourse is more about public discord.  We need more thoughtful exchanges, exchanges that provoke one to weigh one’s own bias against another’s, exchanges that allow room for one to depart from an atrophied partisan position.  I hope this site continues to produce these examples of thought-provoking exchanges.

How to evolve the School Performance Framework

Monday, October 11th, 2010

Ooms is a member of the West Denver Preparatory Charter School board and several other boards involved in education reform.

The recent results of Denver’s School Performance Framework (SPF) were fairly minor news. That’s encouraging, because it means that evaluating schools, with a premium on student academic growth, is more and more part of the lexicon. No one will, or should, claim that the SPF is the only metric that matters, but it is pretty hard to argue that the data is not useful (although I’ll offer even money that someone in the comments may take up this challenge).

At the same time, after spending considerable time with the SPF, I also think it needs to evolve. Now I come to praise the SPF, not to bury it: in my opinion, the Colorado Growth Model (the engine of the SPF) is one of the most important developments in recent memory. However, let’s take the SPF seriously enough to acknowledge its limitations and look for ways to improve it.

There are three main ways I think the SPF could evolve to include and sort data to provide a fuller view of school achievement. It’s been true for too long that some board members actively resist comparative data, which allows them to support pet projects and political agendas when a hard look shows their programs to be underperforming. Moving to a data-informed opinion is critical to make any significant changes in the way we educate our children.  The data I would add include a confidence interval, the share of selective admissions, and a comparison by FRL.  These are all highly important variables in school evaluation. Let me explain each.

First, SPF academic data is based on the CSAP, which is administered only in grades 3-10, so the percentage of students whose scores count toward a school’s ranking varies considerably. For example, elementary schools offer 6 grades (K-5), in which academic growth data is available only for 4th and 5th graders (a growth score needs a prior-year result, which 3rd graders lack).  This means that, assuming every grade has an equal number of students, only 2 of 6 grades (or just 33% of students) are counted in the growth score, which is the single largest component of the SPF. There is a similar problem in high schools, in which academic data is available for only roughly 50% of the student body (9th and 10th grades).

Assuming even distribution across grades, the percentage of students whose scores are included in the growth data varies considerably by type: elementary schools (33%); high schools (50%); K-8 (56%); 6-12 (71%) and middle schools (100%). Particularly for smaller schools, which are most often elementary schools, this means that a pretty small cohort of kids can determine the academic growth score for the whole school.
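For anyone who wants to check those percentages, here is a minimal sketch in Python. It assumes, as the text does, equal enrollment in every grade and growth data for grades 4 through 10 only. The grade spans for middle school (6-8) and high school (9-12), and the coding of kindergarten as grade 0, are my assumptions for illustration.

```python
# Grades with growth data: CSAP is given in grades 3-10, and growth scores
# are treated here as available for grades 4-10 only (assumption carried
# over from the text: 3rd graders have no prior-year score).
GROWTH_GRADES = set(range(4, 11))

# Grade spans per school type; kindergarten is coded as grade 0.
# The 6-8 and 9-12 spans are assumed, not taken from DPS data.
SCHOOL_TYPES = {
    "elementary (K-5)": range(0, 6),
    "high school (9-12)": range(9, 13),
    "K-8": range(0, 9),
    "6-12": range(6, 13),
    "middle school (6-8)": range(6, 9),
}


def growth_coverage(grades):
    """Share of grades (and, with equal enrollment, of students) counted."""
    grades = list(grades)
    counted = sum(1 for grade in grades if grade in GROWTH_GRADES)
    return counted / len(grades)


coverage = {name: growth_coverage(span) for name, span in SCHOOL_TYPES.items()}
for name, share in coverage.items():
    print(f"{name}: {share:.0%}")
```

Real schools do not have equal enrollment in every grade, so actual coverage will differ somewhat from these round figures.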

What I’d like to see the SPF do is two-fold: first, there needs to be a confidence interval for each school. Now, as Paul Teske has pointed out, data is often based on sampling, and this alone does not invalidate the results.  However, at a minimum one should be wary of comparisons between schools where 100% of the students contributed academic data versus only 33%. The required math here is not that hard (here is an online calculator): for a school of 300 students, to get 95% confidence that the growth score is +/- 5 percentage points, you need a sample size of about 170 students.  I don’t believe there is an elementary school in DPS that comes anywhere close to that standard, and my guess is that most have a possible swing on academic growth data of +/- 8 percentage points (so a mean growth score of 50% could be anywhere from 42% to 58%, which spans 3 SPF categories). That’s significant.
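The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough illustration, not the method of the online calculator or of DPS: it assumes the standard sample-size formula for a proportion with a finite population correction, the most conservative proportion (p = 0.5), and that the tested grades behave like a simple random sample of the school, which they are not. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level


def required_sample_size(population, margin, p=0.5, z=Z_95):
    """Sample size needed for a given margin of error in a finite population."""
    n_infinite = z ** 2 * p * (1 - p) / margin ** 2
    n_finite = n_infinite / (1 + (n_infinite - 1) / population)
    return math.ceil(n_finite)


def margin_of_error(population, sample, p=0.5, z=Z_95):
    """Margin of error for a sample drawn from a finite population."""
    standard_error = math.sqrt(p * (1 - p) / sample)
    finite_correction = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * finite_correction


# A school of 300 students, targeting +/- 5 percentage points.
needed = required_sample_size(300, 0.05)

# The same school when only a third of students (100) contribute data.
swing = margin_of_error(300, 100)

print(f"students needed: {needed}")
print(f"margin with 100 of 300 students: +/- {swing * 100:.1f} points")
```

Under these assumptions the school of 300 needs 169 students, in line with the figure of about 170, and a school where only 100 of 300 students contribute data has a margin of roughly 8 percentage points.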

So in recognition of what will be very different confidence intervals, schools should be compared primarily by grades served (apples, meet apples).  Compare K-8 programs first among themselves and the median of their group score, and then among all schools. Maintain the overall ranking, but acknowledge the significant difference between the data sets of different grades served by setting them apart (example to follow).

Secondly, I’d like to see the percentage of students in each school that are selective admissions — students who are awarded places based on academic ability or skill.  This would include both entire magnet schools as well as selective admissions programs within larger student bodies. Simply put, it is deeply unfair to compare schools that can hand-pick students with those that do not. With few exceptions, the percentage of selective enrollment seats within many DPS programs is lost in the statistical bureaucratic muck, and badly deserves some transparent light. I’ve written about this previously, and I remain at a complete loss at a system in which schools with these different enrollment policies are ranked as if they are equal when they are clearly not.

Third is to more explicitly consider the percentage of students in poverty (or FRL). The correlation between subpar student academic achievement and poverty remains high, and particularly if we are serious about addressing the achievement gap, we need to look more closely at schools that have FRL higher than the district average (of about 65%), and less at those schools whose demographics only resemble those of our city when inverted.

What might this new SPF look like? Here is some of the data for DPS high schools (chosen because sample size is large enough to be interesting and small enough to be manageable):

Now I don’t have a confidence interval here — which is most useful in comparing schools who serve different grades — but given that all of these schools are relying on academic data from roughly 50% of their students, I’d sure like one.  Selective admissions reveals one school: GW, whose 28% selective enrollment is from their web site and may be slightly dated, but I’d bet it’s pretty close.

Note that the four lowest scoring schools (who are in the two “danger” categories) all have FRL above 85%, while of the top four (in the second highest category), only one does. Which leads us to the second part: a graph comparing the SPF score with the percentage of students who are FRL (red is the regression line):

What is telling here is the easily discernible pattern through the lens of FRL and achievement. The median point score – for the high school category alone – is 45%.  Three schools scored significantly above that median: CEC, East and GW.  East has open enrollment and an FRL of 35% (note that the latter is neither a pejorative nor discredits their high SPF score); GW has 52% FRL and a selective admissions policy for over a quarter of their students, which makes a considerable difference (my guess is that without these students GW would drop a category). The high school that is most impressive is CEC, with high SPF points, FRL of 81%, and an open-enrollment policy,* as befits its isolated position in the top right.

Somewhat appealing are Lincoln and Manual – both received SPF scores just over the high school median, but did so with large numbers of FRL students. TJ had a somewhat higher score, but their relatively small FRL population shows them far below the trendline.  Kennedy looks remarkably average or below; South, West and North all disappointing, and laggards Montbello and DVS are already (and rightly) undergoing programmatic changes.

Now this view is largely lost in the overall SPF, which gave CEC an overall ranking of 24th and placed them in the second-highest category of “Meets Expectations.” But if you are a parent searching for a good high school program, you care a lot less about the comparison to elementary, K-8 and middle schools.  And you should take a hard look at the impressive results at CEC.

So while I believe it remains important to show the relative performance of all schools, this is how I would like to see the SPF evolve.  For the combination so evident in CEC is, to me, the rare trifecta that narrows the achievement gap: academic growth (hopefully with a strong confidence interval); open-enrollment policies;* and serving a large FRL population.

This trifecta is also really hard to do.  Last year I wrote a depressing post on the SPF which was more specific about the truly lousy prospects for high-poverty, open-enrollment students. The results this year were just not that different – the worst schools have somewhat narrowed, but there is still a long way to go at the top, particularly in grades 6-12.

However, we should acknowledge the achievements that are being made: for high schools that is East and especially CEC, who are deserving of recognition not easily apparent in the overall SPF. My guess is there are similar schools in each of the different grade structures.  It would benefit all of us to have a clearer picture of who they are. Hopefully the SPF will take some tender steps towards this evolution.

*Update: I regretfully spoke too soon about CEC’s enrollment policy.  The school does not have geographic enrollment, and instead accepts students based on an application process that requests transcripts and grades, awards received, attendance data, and three recommendations.  This clearly places CEC (and they somewhat self-identify) as a magnet school with 100% selective admission.  To operate as a magnet with 81% FRL is commendable, but this is not a school with open-enrollment and their achievements should include this qualification.

Advance and hindsight with CSAPs

Thursday, September 16th, 2010

Back in March, as students were filling in the last of their CSAP ovals, I wrote a post encouraging a discussion of what to look for with 2010 CSAP scores — which were then still 6 months away.  And while I agree with Mark that CSAPs are an autopsy and do next to nothing to help teachers gauge student progress and deficiency during the school year, like an autopsy they do provide valuable insight into overall trends at a broad level.

While not so useful to teachers, CSAPs and the comparisons in the Colorado Growth Model can help both a district and individual schools see where they are making progress, and where they are not.  In Denver, we now also have the 2010 School Performance Framework, for which CSAPs are the primary engine, which adds a little more color and multiple measures of assessment.

Usually CSAP scores are used in hindsight to justify existing positions (um, like the end of this post). So last March, I identified four areas where I thought CSAP results would be particularly illuminating — well before anyone knew what those scores would be.  Now we do.

Here are those same areas revisited, and what we might discern from the results:

1.  DPS Academic Growth

March: So when the 2010 CSAPs come out, start here: how much real academic growth has the district achieved?

September: When you look at the district’s results on the basis of one year’s growth, the petty pace of progress is a little underwhelming.  But it increasingly appears — as an intentional strategy or not — that DPS is pursuing a path of slow but steady operational improvement instead of trying more comprehensive and radical reforms in search of more rapid change.  As most reports noted, DPS has shown growth in excess of almost all Colorado school districts two years in a row – even if the absolute growth in proficiency has been minimal.

For overall proficiency in the core subjects of math, reading and writing, DPS has seen an average annual increase of 1.7 percentage points over the past five years. Now, that does not sound like much, but the cumulative impact has been an 8.3 percentage point gain.

And now consider the size of the ship: a perfectly distributed increase in a steady 75,000 student district would mean over 6,200 more students are now proficient, or an overall gain equal to roughly 15 new quality schools of 400 students each — or three new schools each year.  Now there is some considerable double counting here with the gains at specific charter and innovation schools, but over a five-year term, this is clearly a positive trend and comprises better results than in almost all urban districts nationally.
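The back-of-the-envelope arithmetic here is easy to check. The sketch below uses only the figures quoted in this post (a steady 75,000-student district, an 8.3 percentage point gain, schools of 400 students, five years); the function name `additional_proficient` is just for illustration.

```python
def additional_proficient(enrollment, gain_points):
    """Students newly proficient if a gain of `gain_points` percentage
    points were spread evenly across a district of steady size."""
    return enrollment * gain_points / 100


extra_students = additional_proficient(75_000, 8.3)
equivalent_schools = extra_students / 400   # schools of 400 students each
schools_per_year = equivalent_schools / 5   # spread over five years

print(f"{extra_students:,.0f} more proficient students")
print(f"{equivalent_schools:.1f} schools of 400, or {schools_per_year:.1f} per year")
```

The result is about 6,225 students, or a little over 15 schools of 400, which works out to roughly three per year.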

I’m also somewhat hopeful at the increase in 10th grade proficiency, which is the last test before a student graduates. Academic gains in early years, if not sustained, are problematic — the proper goal of any school district is to have proficient high school graduates, not just proficient elementary students. Here again, the long-term results are solid, if unspectacular: 10th grade proficiency has improved from 26.8% in 2005 to 32.2% in 2010 — there is a smaller gain in 10th grade than overall, but many districts are only seeing improvements in the early years, which are reversed in middle and high school.

Again, a school system where just one-third of 10th graders are proficient can in no way be considered a success, but the steady, incremental progress of the District should be seen for what it is. Many people (including me) will argue for policies to accelerate this growth, but even we should pause to acknowledge the gains. One can still plausibly argue over glasses half-empty or half-full, but we should all now agree that the water is rising.

2. District turnarounds: Cole/CASA, Trevista/Horace Mann, Gilpin

March: These three schools were all part of transformation plans in the 2007-2008 reform efforts, the 2010 scores will show if they are on track after a transitional year.

September: There are really six separate growth scores here as each school has both an elementary and a middle program.  What we see from the scores is that overall, the track that these schools are now on is heading straight to the undistinguished shed of average (in what is still an underperforming district).  Of the six programs, Trevista’s elementary school looks like it is still cause for concern, with a growth score of just 39%, 11 percentage points from the median.  Grouped near the median growth score of 50% were CASA’s elementary program (47%), Trevista’s middle school (52%) and both programs at Gilpin (51% and 52%).

CASA’s middle program did the best, with a growth score of 68%. The SPF shows a similar level of mediocrity: CASA and Trevista are both in the middle category of “Accredited on Watch” while Gilpin is in the penultimate category of “Priority Watch.”

It’s hard to know if this constitutes success or not: these were programs that were previously failing miserably, and the improvement to average could well be seen as an accomplishment. But if so, it is deeply limited, as median growth will not help DPS close its achievement gap with the rest of the state. The shift in students also largely prevents an apples-to-apples comparison. But my sense is that a set of average schools was not what DPS and community leaders envisioned during the wrenching turnaround process. It’s somewhat disappointing news.

3. Charter Expansions: West Denver Prep, DSST

March: The ability of these two schools to maintain their high academic standards while they grow is a critical test.

September: Any questions about quality replication should be banished, at least for now, as WDP and DSST combined for four of the top 10 schools in academic growth in the entire state, and were the only schools in the top 10 not serving elementary kids (full disclosure: I serve on the WDP board).  Both of the new schools finished slightly higher than their existing siblings.  Median growth percentiles for the new WDP campus were 89% (compared to 84% for the existing campus); the new DSST middle school came in at 80%, just ahead of the existing DSST high school at 77%.

The SPF combines scores for DSST’s middle and high schools, so WDP and DSST took three of the top five spots, and were the only schools not serving elementary students in the District’s top category of “Distinguished.”  This is remarkable success and bodes well for continued expansion (both schools just opened additional campuses this fall).

4. Program Expansions: Kunsmiller Creative Arts Academy (KCAA)

March: If the program can show clear academic growth while serving their local community, it could open the door for similar attempts with different district programs, and a movement to spread successful magnet programs to different demographic groups.

September: KCAA was opened with strong backing amid the hope that it could show success at least within shouting distance of the magnet Denver School of the Arts.  In its first year, it did not.  In fact, the Colorado Growth Model shows that KCAA managed proficiency of 47% and growth of 41% in their elementary program, and proficiency of 30% and growth of just 41% in middle school — well under the median in all areas. Also disenchanting were their SPF results, where KCAA ranked in the bottom category of “On Probation” as one of the lowest 15 schools in the district — which astonishingly showed worse scores than the school had in the 2009 SPF, before the redesign.

I was surprised at these scores, and my initial assumption was that there was probably considerable discrepancy by grade.  However, the five grades tested showed little difference: grade 8 was just above the median in two of three subjects (math and reading both 52%); grade 7 was above median growth in one (reading 57%), as was grade 5 (writing 53%).  The other 11 grade and subject score areas, which make up almost 75% of the 15 total, all tested below median growth.  Oddly enough, KCAA’s 8th grade, which consisted of legacy students from the former program and was least affected by the redesign, did the best.  If you remove the 8th grade scores, KCAA managed above median growth in just two of 12 areas.

This is a deeply inauspicious start for a program that carries the considerable hope of its supporters. While I have no doubt the school’s passionate defenders will claim that KCAA’s intangible benefits are superior to the annoying application of quantitative data and comparison, I’d be hard pressed to believe that anyone in the planning stages would have voiced their support for a future if it included this backwards trajectory.

So these were the four areas I identified last March, but as always, there are some terrific surprises in the data.  One school deserves a special shout-out: Beach Court Elementary.  On the Colorado Growth Model, Beach Court had the highest overall growth in the state, with an average score of 91%. Compared to other DPS programs, Beach Court was tops in two subjects: writing (96%) and reading (92%) and tied for third in math (86%).

This is not a sudden outlier, as last year’s growth compared to other DPS programs was similar: highest in writing, second in reading, and tied for third in math. Beach Court ranked 6th on the SPF, and with a FRL population of over 90% was just one of two traditional schools managing the rank of “Distinguished” with poverty rates above the DPS average.  That is simply phenomenal work – and Frank Roti and his entire team deserve ample congratulations, and a lot more of our attention.

The shortcomings of value-added modeling

Tuesday, August 31st, 2010

Hello EdNews readers. I’ll be checking in occasionally with blog entries here focused on new and worthwhile research. For this first blog, I want to point you to a research brief published on Sunday by the Economic Policy Institute.

The piece, with the dry but informative title of “Problems with the Use of Student Test Scores to Evaluate Teachers,” is authored by an extremely impressive collection of accomplished researchers. If you read nothing else about education this week, please read the three-page executive summary (then continue on and read the rest!).

Before discussing this research brief, I want to re-introduce myself. I teach school policy and law at the CU Boulder School of Education, where I also direct the Education and the Public Interest Center. In my blog entries here, I will try to point readers to useful resources on the EPIC website in addition to resources – like the new Economic Policy Institute brief – from other places.

The main point of the EPI research brief is straightforward: while value-added modeling (VAM) is a technical advancement that highlights student growth, the numbers generated are nevertheless too inaccurate to be used as a primary factor in making high-stakes decisions about teachers. That is, if someone tells you that a teacher is good or bad based on a VAM calculation, you are wise to take the judgment with a sizeable grain of salt. This is the same warning that I — with far less impressive credentials — issued a couple years ago, as did the National Academy of Sciences earlier this year.

The full EPI research brief does a great job explaining how and why high-stakes VAM policies cannot be supported by VAM itself. But there’s one quote and one illustration/study that I want to pull out of the brief, to hopefully entice you to read the entire thing.

First the quote:

“There is simply no shortcut to the identification and removal of ineffective teachers” (p. 20).

As I write these blog entries throughout the year, I could probably begin each one with, “There is simply no shortcut to…” In part, this reflects the complex nature of schooling, but it also reflects the sad state of policymaking, where politicians and others are so easily enticed by the quick fix.

The replacement of ineffective teachers with effective ones is unquestionably a worthwhile policy goal, but it’s much easier said than done. A policy intended to accomplish this goal would have to reliably (a) identify the ineffective teachers (without wrongly targeting the effective ones), and (b) identify and recruit effective replacement teachers. Also, the policy should accomplish this in a more cost-effective way than alternative possibilities (but given the problems with the first part of this puzzle, we’re not yet at the point where we should worry about such comparisons).

Now for the illustration. I’ll quote from page 2 of the executive summary:

One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time ….

This is scary stuff, but only if used unwisely – only if policy makers give too much credence to the scores. VAM approaches do tell us something; a teacher (or school) whose VAM scores are consistently at the extreme high end of the distribution is very likely of higher quality than one whose scores are consistently at the extreme low end. So here’s my alternative proposal: use VAM approaches as a first-stage, cost-effective tool that will help inform a more in-depth, second-stage quality analysis. A teacher or school at the bottom (e.g., the bottom 5 percent) in a state or district should be identified for classroom observations, principal evaluation, and other hands-on information-gathering that can lead to a determination of professional development needs or removal/turnover. Similarly, a teacher or school at the top might be identified for further study that might help us learn from successes. This approach has three major advantages:

  1. The ultimate evaluations of teachers and schools will not be made based on the test scores; they will be more thorough and reliable.
  2. The use of VAM here is supportable, since it is not being used to make fine-grained distinctions among teachers, but it does serve as a good tool that allows for a cost-effective use of hands-on evaluation tools.
  3. A teacher will not feel extreme pressure to teach to the test, since his or her career would ultimately be determined by hands-on observations and other information, not by students’ test scores.

Sadly, the approaches being considered and implemented in Colorado and elsewhere rely far too much on test scores and VAM approaches. We are rushing toward a system of teacher evaluation that is sure to wrongly identify teachers as good or bad, and it will likely be years before policymakers realize and correct the mistake.

A teachable moment?

Tuesday, August 31st, 2010

Complaints about some evaluators not fully understanding our true value …  a sense that points were taken away unfairly, despite reviewer training in the appropriate rubrics  ….  evaluators not understanding, and not crediting us, for the things we do well… a sense that someone in a higher position should reverse the injustice.   It all feels unfair.

Yes, but, most of these Colorado complaints about the round two R2T scoring could also be applied to premature teacher evaluation based upon the inappropriate use of faulty test score data.

Isn’t there some irony in the fact that some of the folks complaining about unfair R2T scoring of Colorado’s application are also among the ones who turned a deaf ear to, or brushed aside, some of the legitimate concerns about using current test scores to evaluate teachers?

My colleague Robert Reichardt made a similar point in April, after Colorado lost round 1 of R2T.  Now we feel twice the pain.

Let me be clear.  I support better teacher evaluation and we need to move in that direction, using multiple measures of better and more frequent principal and peer evaluation, and some appropriate use of student test scores.

There are certainly some individuals and groups who have looked for any reason not to advance real teacher evaluation, because they want to preserve the status quo (which is basically no useful teacher evaluation), and I don’t want to support that position.  At the same time, there are lots of others who see legitimate problems with the current technology that ties student test results to specific teacher evaluations, and want to proceed carefully, in order to do this right.  I was surprised how little attention policy makers gave to that latter group this spring.

As SB 191 moves forward into the implementation stage, but now without federal funding to support it, we should keep these concerns in mind.

There are at least four reasons why we can’t now validly and reliably link teacher evaluations to student test scores.  When we address some of these elements, we will be able to more fairly and more effectively evaluate teachers.

First, we don’t have good value-added tests.  An annual March CSAP test is not good enough (you need valid beginning- and end-of-year tests given to the same students whose gain you want to assess), and more than half of Colorado grades/subjects don’t even have the annual CSAP available anyway.

Second, students are probably not randomly assigned to teachers, as this evaluation process requires.  If teacher Jane is known by her principal to be good at teaching students with serious family problems, and thus gets assigned a group of difficult students, and moves their knowledge forward by 0.75 grade levels, while teacher Joan is known to not be good with difficult students, and gets all of the easier ones, and advances their knowledge by 1.0 grade level, who has done a better job?  (It isn’t clear that we can, or want to, “fix” this, but it is a reality that skews the data).

Third, one year of data is not a large enough sample to use for a teacher – you probably need 3.  Classes of 26 students, with 50% mobility levels that are not uncommon in urban areas, leave 13 students with a particular teacher all year – that is not enough data to make a reliable judgment about teacher quality.

Fourth, lots of good teaching is joint and collaborative, especially at the secondary level.   The social science teacher may be as responsible for improved student writing as is the English teacher.  We don’t want teaching to only be a solitary practice with no sharing and collaboration.
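To put rough numbers on the third point above, here is a small sketch. It assumes each student's result carries independent noise of similar size, so that uncertainty in a class average shrinks with the square root of the number of students; that is a textbook simplification, not a description of how any actual evaluation model works. The function names are illustrative.

```python
import math


def full_year_students(class_size, mobility_rate):
    """Students who stay with the same teacher for the whole year."""
    return round(class_size * (1 - mobility_rate))


def relative_standard_error(n_students):
    """Standard error of a class average, in units of one student's
    noise, assuming independent and equally noisy results."""
    return 1 / math.sqrt(n_students)


one_year = full_year_students(26, 0.50)   # 13 students
three_years = 3 * one_year                # 39 students pooled over 3 years

shrink = relative_standard_error(three_years) / relative_standard_error(one_year)
print(one_year, three_years, f"{shrink:.2f}")
```

Under that assumption, pooling three years of 13 students each cuts the uncertainty to a little under 60% of the one-year figure, which is an improvement but still rests on only 39 students.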

Added to these concerns, making student test scores very high-stakes will greatly increase the likelihood of outright cheating, as well as more subtle “teaching to the test” (and not the good kind, where people teach the subjects they are supposed to teach, but the overly narrowing kind where you only ask the types of questions known to be on the test).

I won’t try to make this post double-ironic, but part of the beauty of Denver’s own ProComp is that it was put together by and with teachers, and advanced by a teacher vote, and it incorporates multiple measures, to recognize that we can’t really nail down a single dimension of teaching to assess and reward.  It is disappointing that we couldn’t summon that kind of process at the state level.

To see a different way of handling this issue, Chad Aldeman of the Quick and the Ed blog (a strongly pro-reform voice) recently contrasted LA’s handling of teacher data with Tennessee’s approach:

“In contrast, Tennessee has been using a value-added model since the late 1980’s, and every year since the mid-1990’s every single eligible teacher has received a report on their results. When these results were first introduced, teachers were explicitly told their results would never be published in newspapers and that the data may be used in evaluations. In reality, they had never really been used in evaluations until the state passed a law last January requiring the data to make up 35 percent of a teacher’s evaluation. This bill, and 100% teacher support for the state’s Race to the Top application that included it, was a key reason the state won a $500 million grant in the first round.”

Unanswered questions on CSAP protocol

Wednesday, August 11th, 2010

Last week we read about the Colorado Virtual Academy (COVA) “mishap” that invalidated more than 6,000 CSAP test scores. This week’s release of CSAP data by the Department of Education keeps the story in the forefront. But when it comes to the whole COVA incident, I must confess to having some unanswered questions. (And I must also confess to working closely with a COVA board member, as well as having both CDE employees and COVA parents as friends.)

EdNews describes administering CSAP tests to students from different grades in the same room as “a violation of state testing protocol.” But if action is going to be taken as severe as throwing out thousands of assessment scores (resulting in failure to make AYP under federal law), it would help to know more about the origin of the protocol. It’s not in state law. It doesn’t originate from rules adopted by the State Board.

Hence it would be valuable to know: When was the policy adopted? By whom? With what rationale?

Others have brought attention to the lack of clarity in the way the “protocol” is presented. In the EdNews story, a COVA official correctly observed that the regulation appears nowhere in the 100-plus page student assessment procedures manual (PDF). Additionally, only one small notice was added in the 2010 proctors manual (PDF) — where it hadn’t appeared in 2009. A closer look at the proctors manual reveals no trace under the section labeled “Standard Conditions for a Standardized Test.” I also perused CDE’s 53-slide official presentation for CSAP administration training and couldn’t find a reference to the procedure.

So yes, a rule is a rule (or in this case, maybe a procedure or “protocol”). COVA is not entirely without fault. But to what extent do we adhere rigidly to bureaucratic norms? If the procedure is suddenly so crucial as to be considered a “misadministration,” why not include it in the procedures manual? If it’s worthy only of one small mention in a test proctor’s manual, why not also provide for commensurate flexibility?

Better understanding the origins of the procedure not only could add important context to this discussion but also could provide clarity to avoid a possible repeat of such an incident in the future.
