Thursday, March 29, 2007

How Standardized Testing is Killing American Education: Reason #6

5th in a series.

6) Ranked scoring doesn't tell you anything about learning: If every student in the state dramatically improved their learning and scored much higher on the test, would you expect the average scores to go up? Likewise, if every student in the state were accidentally given the test in Arabic instead of English, would you expect the average scores to go down? Guess what? They wouldn't! Not in either case. In both cases, the average score would be "600" no matter how many more questions were answered correctly (or incorrectly), or even if the test was an incomprehensible graduate-level neuroscience test given to 3rd graders.

How can this be? Well, the "scores" that students get are not directly based on the number of answers they get right. The highest possible score is "1000", but that doesn't mean that a "600" score denotes 60% of the questions answered correctly. A student with a score of 600 may have gotten 10%, or 50%, or even 90% of the questions correct. The scores that are published for these tests are "scaled scores." Scaled scores are given based on how many students you scored better than on the test, not how many questions you got correct.

Imagine that there are 100 9th graders in California. After taking the test, the students are lined up in order based on the number of questions they answered correctly. The person at the beginning of the line might have answered 2%, or 20% (which is what you'd expect a student randomly guessing to get), or even 80% of the questions correct. Similarly, the students at the end of the line might have answered 50%, or 75%, or 98% of the questions correctly. It doesn't matter: they're just lined up in order. After they are in order, they are divided into 5 equal groups. The first 20% of the students (20 students in our example) all receive a score of "200". The next 20 students all receive a score of "400", then "600" for the next 20, "800" for the next 20, and "1000" for the 20 students with the most questions answered correctly.
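
To make the line-up concrete, here is a minimal Python sketch of the scoring described above. The function names and the three sets of raw scores are made up for illustration, and ties are broken by position in the list, which is a simplification. It shows that the average of the quintile scores over the whole group is 600 whether everyone's raw score rises or collapses.

```python
def quintile_scores(raw_scores):
    """Assign 200/400/600/800/1000 by rank, as described above.

    Assumes the number of students divides evenly into 5 groups and
    breaks ties by original position (a simplification for illustration).
    """
    n = len(raw_scores)
    group_size = n // 5
    # Indices of students, ordered from fewest to most questions correct.
    order = sorted(range(n), key=lambda i: raw_scores[i])
    scores = [0] * n
    for rank, student in enumerate(order):
        scores[student] = 200 * (rank // group_size + 1)
    return scores


def average(values):
    return sum(values) / len(values)


# 100 hypothetical students with raw percent-correct from 1 to 100.
baseline = list(range(1, 101))
# Everyone learns more: each raw score closes half the remaining gap to 100.
improved = [s + (100 - s) / 2 for s in baseline]
# Everyone does far worse (for example, a test nobody can read).
collapsed = [s / 10 for s in baseline]

for label, raw in [("baseline", baseline), ("improved", improved), ("collapsed", collapsed)]:
    # The raw average changes a lot; the quintile average is 600.0 every time.
    print(label, round(average(raw), 2), average(quintile_scores(raw)))
```

The raw averages differ widely across the three cases, but the ranking never changes, so each student keeps the same quintile score.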

A school's API (Academic Performance Index) score is the average of these "quintile" scores from all of its students. So what's the problem?

Well, let's look at our imaginary 100 9th graders. Student #3 could have answered 4% of the questions correctly, and student #18 could have answered 41% of the questions correctly, but they both get the same score: 200. Likewise, student #20 could have answered 42% of the questions correctly and student #21 could have answered 43% of the questions correctly, but student #20 will get a score of 200 while student #21 will get a score of 400... twice as many points! Can you see how these numbers can be misleading?

Another problem: schools receive a score of 1-10 based on their ranking, similar to the students. The bottom 10% get a "1", the top 10% get a "10" and so on. But the number of students that would need to move up one quintile for a school to move from a "1" to a "2" is significantly higher than the number of students that would have to move up one quintile for a school to move from a "4" to a "5". The schools in the middle are bunched together very closely, and movement between the rankings doesn't necessarily indicate a large number of students scoring differently. Movement at the bottom (and at the top), on the other hand, requires large numbers of students to improve their scores, and the improvement in learning that moving from a "1" to a "2" represents is significantly greater than that of a school that moves from a "5" to a "6"... yet the numerical value given to this change is the same.
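
The bunching in the middle can be sketched numerically. The Python below assumes, purely for illustration, that school averages follow a bell curve (a standard normal distribution); real API distributions may look different. The function name is hypothetical. It measures how far a school sitting in the middle of each decile is from the cutoff for the next decile, in standard deviations.

```python
from statistics import NormalDist

# Assumption: school averages follow a standard normal (bell) curve.
school_curve = NormalDist(mu=0.0, sigma=1.0)


def gap_to_next_decile(decile):
    """Distance from the middle of a decile to the cutoff for the next one."""
    midpoint_percentile = (decile - 0.5) / 10  # decile 1 -> 5th percentile
    cutoff_percentile = decile / 10            # decile 1 -> 10th percentile
    return school_curve.inv_cdf(cutoff_percentile) - school_curve.inv_cdf(midpoint_percentile)


for decile in range(1, 10):
    # Gaps are largest at the two ends and smallest around the middle.
    print(decile, "->", decile + 1, round(gap_to_next_decile(decile), 2))
```

Under this assumption, the gap a typical "1" school has to close to become a "2" is more than twice the gap a typical "4" school has to close to become a "5".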

Most importantly, the only way for a student to move up in the rankings is for them to improve disproportionately more than a student who previously scored higher than they did. There will always be 20% of the students in the bottom quintile: that's how the system works. If a student moves up, another student has to move down. Therefore, we are requiring schools with low scores to teach their students more than schools with high scores. The system is basically competitive: you don't necessarily have to improve your students' learning. Rather, you need to hope for some other school to do a worse job than your school does. This puts educators (and students and parents, for that matter) in a position of hoping for other schools to educate their students poorly. I don't know about you, but I find that reprehensible. We should never put our children in a position where their success is measured in such a way that they are dependent on the failure of others. If our schools are training our children to engage in life as a zero-sum game where their well-being is predicated on the misfortune or failure of others, we are setting them up to take the messed up world they've inherited from us and make it all the more hellish.

- "'Welcome to Hell.' 'Oh, thanks. That means a lot, coming from you.'"

Friday, March 16, 2007

How Standardized Testing is Killing American Education: Reason #7

Fourth in a series.

7) English Learners unfairly penalized: One would expect someone who is learning English to score lower on a test of English proficiency than a native English speaker. Test scores bear that out, and it's not surprising or unexpected.
There is no reason to assume, however, that someone who is learning English will necessarily be less proficient in math, or science, or general social studies (apart from US History) than native English speakers. Indeed, a student who received more education in their home country before coming to the US than a native speaker would be expected to be more proficient. At the very least, we should assume that the scores in these non-English subjects should be roughly equivalent. A more realistic assumption based on international research would be that foreign-educated students might actually score better than native English speakers educated in the American public-education system.
So why do English learners consistently score lower in math than native speakers? A cursory glance at the test reveals the reason immediately: the format of the questions requires a level of English proficiency just to understand what is being asked, a level of English proficiency that many English learners have yet to attain.
Math questions are almost all contextualized word problems. A problem given as "4+7=___" could be accessed by anyone, regardless of English proficiency. When it is given instead as "Farmer Brown has 4 chickens and 7 ducks. How many birds does he have?" we run into problems. First, vocabulary: Farmer, chicken, duck, bird... if a student doesn't know these words, the question becomes more difficult, and not because of any deficiency in mathematical skills. Add to that the problem that comes with not realizing that "chickens" and "ducks" are both subsets of the larger category "birds" (if a student has heard of chickens but not ducks, they might legitimately answer "4"), and the various conjugations of the irregular verb "to have" (you know that "has" and "have" mean the same thing in this sentence... does an English learner know that?) and a student might get the answer wrong for several reasons that have nothing to do with their level of math proficiency, and math proficiency is what this test is supposed to be assessing.
So what do we do about it? More accurate results could come by letting students use a translating dictionary for the non-English portions of the test. This would of course have to be tied to an increase in the time allowed, since looking up 3 or 4 words per question will most likely more than double the time necessary to finish the test. Even without the dictionary, English learners need more time to read and comprehend English texts, so a time extension, or even unlimited time, would go a long way toward redressing this inequity. Unfortunately, both of these suggestions make up the "axis of evil" for standardized test makers: the argument is that the point of the test being "standardized" is that scores can be compared fairly because every student takes the exact same test under the exact same conditions. Any variation in conditions destroys standardization by this view. I would argue that for all subjects other than English, the opposite is true, and a fair and equitable measuring stick for students' proficiency in subjects other than English cannot be attained until we sacrifice these sacred cows on the altar of equal opportunity.

(Resources: 1 2)

-"Medium-Head Boy!.... You see, he doesn't know!"

How Standardized Testing is Killing American Education: Reason #8

Third in a series.

8) "Scaled scores" don't tell you anything about student learning: Standardized test scores are given as "scaled scores." This means that your score is not based directly on how many questions you got right: a student who answered 25% of the questions correctly would not receive a score half that of a student who answered 50% of the questions correctly. Rather, the scores tell you how many of the other students who took the test scored worse than you did. A student who scores in the 35th percentile did not necessarily get 35% of the questions correct. What happened is that 35% of the students who took the same test got fewer questions correct than that student did. It's possible that they got 35% of the questions correct, but it's just as possible that they got 5% of the questions correct, or 75%, or even 90%. A scaled score doesn't tell us anything about the number of questions answered correctly.
Likewise, improvement on a scaled score doesn't necessarily indicate improvement in learning. A student could answer 45% of the questions correctly one year and 55% the next year. Their scaled score could improve, or drop, or stay the same, depending on whether other students improved similarly or not. Ideally, we want all students to improve, don't we? Well, if that happens at the same rate, our scaled scores will not change at all, and will give no indication that the outcome we most desire is actually taking place!
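
A minimal Python sketch of that point, using eleven made-up students and a hypothetical `percentile_rank` helper (one common way to define a percentile; the test makers' exact formula may differ):

```python
def percentile_rank(raw_scores, student):
    """Percent of the other test-takers who answered fewer questions correctly."""
    others = raw_scores[:student] + raw_scores[student + 1:]
    below = sum(1 for s in others if s < raw_scores[student])
    return 100 * below / len(others)


# Hypothetical percent-correct for eleven students; our student is index 4.
year_one = [20, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90]
# Year two: every student gets 10 more points' worth of questions correct.
year_two = [s + 10 for s in year_one]
# Alternative year two: only our student improves by 10 points.
only_one_improves = year_one[:4] + [55] + year_one[5:]

print(percentile_rank(year_one, 4))           # 40.0
print(percentile_rank(year_two, 4))           # 40.0 (everyone improved, rank unchanged)
print(percentile_rank(only_one_improves, 4))  # 50.0 (same raw gain, rank moves)
```

The student's raw improvement is identical in both versions of year two; only what the other students did changes the percentile.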
Scaled scores are deceptive on several counts. First of all, it is not uncommon for someone to think that a student with a scaled score under 50% has mastered less than 50% of the material. That is not true. Someone with a scaled score of 50% scored higher than 50% of the students who took the same test. In other words, this is a totally average student. Right in the middle. Typical of American students in general. This student's actual score could tell us a lot about the state of American education: if an average student has an actual score of 20%, we would be disappointed; likewise, an "average" actual score of 85% would be very encouraging. Unfortunately, the only score we're ever exposed to is the scaled score, which doesn't tell us a lot about what (or whether) students are actually learning.

(Resources: 1 2)

-"USA Today has come out with a new survey: Apparently three out of four people make up 75 percent of the population."

How Standardized Testing is Killing American Education: Reason #9

Second in a series.

9) Norming is biased: "Norming" refers to comparing one student's results against all other students to determine how they compare to the population at large. Most of the time, however, the "population at large" that scores are compared to is actually a smaller sample which is judged to be representative of the entire population. This smaller sample is given the test early, and those results are used to set up a virtual spread of scores.
So, you have two problems. First, how do you ensure that your sample is truly representative of the larger population? You can select for race, number of years in the country, socio-economic status, parents' education, region of the country, gender, age, and a host of other variables that may or may not have some bearing on results, but no matter how big your sample is, you're always going to have sampling error. Choosing a representative sample is also really hard and expensive, so instead, samples tend to be less representative in favor of choosing students from the same geographical area, often close to the location of the test makers' offices. For the SAT, that meant that upper-middle class, predominantly white students were the sample that the test was normed against for years. Remember a few years back when they "rescaled" the scores and people complained that they were lowering the bar by making grading "easier"? What actually happened was that a more representative sample was used and the College Board realized that their sample had been skewing the Norm high for years. The new scores are more accurate because they're based on a more representative sample.
The second big problem is just regular old sampling error. You can't get away from it. When you compound the sampling error inherent in choosing test questions with the sampling error from the group used to set the Norm, the reliability of the test results becomes shakier and shakier.
Several years ago, as Reformed Math made Integrated courses more popular, California debuted Integrated Math Standardized tests as options for schools. For several years, the results were impossible to norm: that is to say, results did not fit a normal distribution as you would expect from an unbiased test. Results had to be fiddled with and forced artificially into a normal curve. You'd think that this would reveal a flaw in the testing (even more than the normal level of error which is considerable) and states and districts might hold back on making major decisions based on these scores. No such luck. Bureaucracy reigns supreme, and the wheels of progress have too much inertia to stop turning, even if it means innocent students are crushed underneath.

(Resources: 1 2)

- "He uses statistics as a drunken man uses lampposts—for support rather than for illumination."

The Annual Standardized Testing Rant: First in a series!

Welcome back to my favorite topic: how standardized tests are killing American education. I've tackled this topic before, so this year I'm going to go through the main reasons I detest standardized testing so much in the form of a "top ten" list. Here we go (drum roll, please!):

10) Sampling error makes it impossible to get accurate results: "Sampling Error" refers to the inherent error that exists when you choose a small sample of all possible items to evaluate mastery of the entire set. For standardized tests, there are millions of possible questions that could be asked to assess students' mastery of the standards that students are supposed to learn in a given year. To create a usable test, a small number of those possible questions must be chosen. The assumption is that the questions are chosen carefully enough so that they are representative of all possible questions. In other words, if a student answers 70% of the sample questions correctly, the assumption is that they would have answered 70% of all possible questions correctly.
"Sampling error" is a mathematical term that refers to the probability that the sample score is close (usually 90% or 95% accuracy is checked for) to the actual score the student would have received if tested on all questions. You see this number when political poll results are reported; it's called the "margin of error." So if candidate A is polled at 40% and candidate B is polled at 45% but the margin of error is + or - 7%, you would say that they are in a statistical tie. The margin of sampling error is greater than the difference between the results, meaning that the poll doesn't really indicate a clear advantage for either candidate.
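
As a rough illustration of where a number like "+ or - 7%" comes from, here is a Python sketch using the standard textbook approximation for a proportion from a simple random sample at 95% confidence. The poll size of 196 is a made-up figure chosen because it yields a margin of about 7%, and the tie check is the simple rule of thumb from the paragraph above rather than a formal significance test.

```python
import math


def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def is_statistical_tie(share_a, share_b, margin):
    """True when the gap between two results is no larger than the margin of error."""
    return abs(share_a - share_b) <= margin


# Hypothetical poll of 196 people; a proportion of 0.5 gives the widest margin.
margin = margin_of_error(0.5, 196)
print(round(margin, 3))                        # 0.07
print(is_statistical_tie(0.40, 0.45, margin))  # True
```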
For the California standards tests, students' scores are grouped into "quintiles," where a student in the bottom 20% is in quintile 1, students in the next 20% (21% to 40%) are in quintile 2, etc. A student who is in the 3rd percentile is in quintile 1 and receives a score of 200. A student in the 19th percentile is also in the 1st quintile and also receives a score of 200. A student in the 22nd percentile would be in quintile 2 and receives a score of 400. Quintile 3 gets 600, 4 gets 800 and 5 gets 1000. The problem is, if you look at the average number of questions correct for a student in quintile 3 and the average number of questions correct for a student in quintile 4, the difference is less than the margin of error due to sampling error! Students could go up or down 1 quintile just because different questions were chosen for the test, without any additional learning or skills on the students' part.
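
To see how question selection alone could bump a student up a quintile, here is a Python sketch under a simple binomial model: each question on the test is treated as an independent random draw from all possible questions. Every number in it (65 questions, 55% true mastery, a 63% average for the next quintile) is hypothetical and not taken from any real test.

```python
import math


def chance_of_scoring_at_least(threshold_share, true_mastery, num_questions):
    """Probability of answering at least threshold_share of the questions correctly,
    treating each question as an independent draw (a binomial model)."""
    needed = math.ceil(threshold_share * num_questions)
    return sum(
        math.comb(num_questions, k)
        * true_mastery**k
        * (1 - true_mastery) ** (num_questions - k)
        for k in range(needed, num_questions + 1)
    )


# Hypothetical numbers, for illustration only.
NUM_QUESTIONS = 65
TRUE_MASTERY = 0.55         # share of ALL possible questions the student could answer
NEXT_QUINTILE_SHARE = 0.63  # assumed average share correct in the next quintile up

lucky = chance_of_scoring_at_least(NEXT_QUINTILE_SHARE, TRUE_MASTERY, NUM_QUESTIONS)
print(f"chance of looking like the next quintile by luck of the draw: {lucky:.1%}")
```

With these assumed figures the chance is far from negligible, even though the student's actual mastery has not changed at all.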
It seems immoral to me to attach such high stakes to tests that suffer from this tragic flaw from the outset. I think that we can use these tests as long as we acknowledge their limited ability to give us accurate data. When we make major funding decisions as if these results are objective fact and not broadly fallible approximations, we are playing Russian Roulette with our kids' education and future. Our kids deserve better than that.

(Resources: 1 2)

-"Definition of Statistics: The science of producing unreliable facts from reliable figures."