**Assessment: Why Item Analyses Are So Important**

by **Grant Wiggins**, Authentic Education

As I have often written, the Common Core Standards are just common sense, but the devil is in the details of implementation. And in light of the excessive secrecy surrounding the test items and their months-later analysis, educators are in the unfortunate and absurd position of having to guess what the opaque results mean for instruction. It might be amusing if there weren’t high personal stakes of teacher accountability attached to the results.

So, using the sample of released items in the NY tests, I spent some time this weekend looking over the 8th grade math results and items to see what was to be learned – and I came away appalled at what I found.

Readers will recall that the whole point of the Standards is that they be embedded in complex problems that require both content and practice standards. But what were the hardest questions on the 8th grade test? *Picayune, isolated, and needlessly complex calculations of numbers using scientific notation*. And in one case, an item is patently invalid in its convoluted use of the English language to set up the prompt, as we shall see.

As I have long written, there is a sorry record in mass testing of sacrificing validity for reliability. This test seems like a prime example. Score what is easy to score, regardless of the intent of the Standards. There are 28 8th grade math standards. Why do such arguably less important standards have at least 5 items related to them? (Who decided which standards were most important? Who decided to test the standards in complete isolation from one another simply because that is psychometrically cleaner?)

Here are the released items related to scientific notation:

It is this last item that put me over the edge.

**The item analysis.** Here are the results from the BOCES report to one school on the item analysis for questions related to scientific notation. The first number, cast as a decimal, reflects the percentage of correct answers statewide in NY. So, for the first item, question #8, only 26% of students in NY got this one right. The following decimals reflect regional and local percentages for a specific district. Thus, in this district 37% got the right answer, and in this school, 36% got it right. The two remaining numbers thus reflect the difference between the state score and the district and school scores (.11 and .10, respectively).
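To make the arithmetic in such a report concrete, here is a minimal sketch in Python of how the last two columns could be derived from the first three. The function name `item_gaps` is hypothetical, and the only figures used are those quoted above for question #8; nothing here reflects how BOCES actually produces its reports.

```python
def item_gaps(state, district, school):
    """Return (district - state, school - state) as proportions.

    Inputs are proportions of correct answers (0.26 means 26%).
    Results are rounded to two decimals to match the report's format.
    """
    return round(district - state, 2), round(school - state, 2)


# Question #8, using the proportions quoted in the text:
# 26% correct statewide, 37% in the district, 36% in the school.
district_gap, school_gap = item_gaps(state=0.26, district=0.37, school=0.36)
print(district_gap, school_gap)  # prints: 0.11 0.1
```

A positive gap means the district or school outperformed the state on that item, which is the case for both numbers here.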

Notice that, on average, only **36%** of New York State 8th graders got these 5 questions right, pulling down their overall scores considerably.

Now ask yourself: given the poor results on all 5 questions – questions that involve isolated and annoying computations, hardly central to the import of the Standards – would you be willing to consider this as a valid measure of the Content and Process Standards in action? And would you be happy if your accountability scores went down as a teacher of 8th grade math, based on these results? Neither would I.

There are 28 Standards in 8th grade math. Scientific notation accounts for 4 of the Standards. Surely from an intellectual point of view the many standards on linear relationships and the Pythagorean theorem are of greater importance than scientific notation. But the released items and the math suggest each standard was assessed 3-4 times in isolation prior to the few constructed response items. Why 5 items for this Standard?

It gets worse. In the introduction to the released tests, the following reassuring comments are made about how items will be analyzed and discussed:

Fair enough: you cannot read the student’s mind. At least you DO promise me helpful commentary on each item. But note the third sentence: *The rationales describe why the wrong answer choices are plausible but incorrect and are based on common errors in computation.* (Why only computation? Is this an editorial oversight?) Let’s look at an example for arguably the least valid question of the five:

Oh. It is a valid test of understanding because you say it is valid. Your proof of validity comes from simply reciting the standard and saying this item assesses that.

Wait, it gets even worse. Here is the “rationale” for the scoring, with commentary:

Note the difference in the rationales provided for wrong answers B and C: “may have limited understanding” vs. “may have some understanding… but may have made an error when obtaining the final result.”

This raises a key question unanswered in the item analysis and in the test specs. Does computational error = lack of understanding? Should Answers B and C be scored equal? (I think not, given the intent of the Standards). The student “may have some understanding” of the Standard or may not. Were Answers B and C treated equally? We do not know; we can’t know given the test security.

So, all you are really saying is: wrong answer.

*Answers A, B, C are plausible but incorrect. They represent common student errors made when subtracting numbers expressed in scientific notation.* Huh? Are we measuring subtraction here or understanding of scientific notation? (Look back at the Standard.)

Not once does the report suggest an equally plausible analysis: *students were unable to figure out what this question was asking!!!* The English is so convoluted, it took me a few minutes to check and double-check whether I parsed the language properly:

*Plausible but incorrect…* The wrong answers are “plausible but incorrect.” Hey, wait a minute: that language sounds familiar. That’s what it says under every other item! For example:

All they are doing is copying and pasting the *same* sentence, item after item, and then substituting in the standard being assessed!! Aren’t you then merely saying: we like all our distractors equally because they are all “plausible” but wrong?

**Understanding vs. computation.** Let’s look more closely at another set of rationales for a similar problem, to see if we see the same jumbling together of conceptual misunderstanding and minor computational error. Indeed, we do:

Look at the rationale for B, the correct answer: it makes no sense. Yes, the answer is 4 squared, which is an equivalent expression to the prompt. But then they say: “The student *may* have correctly added the exponents.” That very insecure conclusion is then followed, inexplicably, by great confidence: “A student who selects this response understands the properties of integer exponents…” – which is, of course, just the Standard restated. Was this blind recall of a rule or is it evidence of real understanding? We’ll never know from this item and this analysis.

In other words, all the rationales are doing, really, is claiming that the item design is valid – without evidence. We are in fact learning nothing about student understanding, the focus of the Standard.

Hardly the item analysis trumpeted at the outset.

**Not what we were promised.** More fundamentally, these are not the kinds of questions the Common Core promised us. Merely making the computations trickier is cheap psychometrics, not an insight into student understanding. They are testing what is easy to test, not necessarily what is most important.

By contrast, here is an item from the test that assesses for genuine understanding:

This is a challenging item – perfectly suited to the Standard and the spirit of the Standards. It requires understanding the hallmarks of linear and nonlinear relations and doing the needed calculations based on that understanding to determine the answer. But this is a rare question on the test.

Why should the point value of this question be the same as the scientific notation ones?

**In sum: questionable.** This patchwork of released items, bogus “analysis” and copy-and-paste “commentary” gives us little insight into the key questions: where are my kids in terms of the Standards? What must we do to improve performance against these Standards?

My weekend analysis, albeit informal, gives me little faith in the operational understanding of the Standards in this design, without further data on how item validity was established, whether any attempt was made to carefully distinguish computational from conceptual errors in the design and scoring, and whether the test makers even understand the difference between computation and understanding.

It is thus inexcusable for such tests to remain secure, with item analysis and released items dribbled out at the whim of the DOE and the vendor. We need a robust discussion as to whether this kind of test measures what the Standards call for, a discussion that can only occur if the first few years of testing lead to a release of the whole test after it is taken.

New York State teachers deserve better.

*This article first appeared on Grant’s personal blog; image attribution flickr user anthonypbruce*

Thank you for this important summary of the rigor required in constructing high-fidelity multiple choice exams. In my experience, best-practice item analyses are rarely carried out (except maybe by textbook publishers? And that is a whole other conversation). If I can take this a step further…I’m challenged by how this assessment format fits with constructivist models of learning and teaching: http://educateria.com/2012/11/29/true-or-false-are-multiple-choice-exams-relevant-to-21st-century-education/

Great overview – thanks again!

At 70, I’m a lot older, and with both a military career and 30 years in electrical engineering under my belt a lot more practiced than 8th graders, but I don’t see a problem with the question that has the writer up in arms. Even at their age I knew enough to cast both terms in the same power of 10 (here 8) and subtract the smaller (.93) from the larger (9.1). How is this difficult? I could even say the other choices were so far off what had to be correct (.93 * 10^8) that I could omit the arithmetic.

That may have been the one question that accurately evaluated what students taking the test knew about scientific notation; I’ve caught out electronics techs and degree’d engineers with similar questions when interviewing them.

Datum: I never got a degree and never took college courses in engineering after I’d dropped out of high school. Working as an EE, I asked questions like this at interviews because (among other things) an engineer who makes 20 decibel mistakes can kill people.