Scoring Different Versions of the Same Test

Many, perhaps all, of you are looking at a vastly different teaching situation in the fall. A teacher wrote to me recently asking how he could give several versions of a test on the same material to his online classes and scale them so as to be fair to everyone, even though one version may unintentionally be more or less difficult than another.

The AP program is able, as well as anyone, to give different tests from year to year and within the same year and adjust the scores (5, 4, 3, 2, 1) so they indicate the same knowledge of the subject even though the questions are different. The process is described here, and the handling of different versions of the same exam is described here.

AP Exam questions, including multiple-choice questions, take several years to develop and are pre-tested on students before they ever appear on an AP Exam. A few questions are reused from the previous year to help compare the difficulty of the two exams. I do not see how a teacher with a small number of students (compared to the AP Calculus numbers) can use a similar approach with their classes.

Nevertheless, the concern about testing outside of a traditional classroom exists. Teachers are rightly concerned about students collaborating on take-home tests. Giving several versions will help to prevent this. But how do you grade them fairly?

No two tests on the same material can be of equal difficulty, whether written by the College Board or a teacher. The blog post On Scaling, which appears in the gallery below, discusses various ways of scaling tests. It occurred to me that the “Kennedy Scale” can be adapted to the current situation.

The Kennedy Scale – used to handle multiple versions of the same test.

In Assessing True Academic Success [1], Dan Kennedy suggests this method. The entire article is worth reading every year; it discusses a lot about assessment besides just scaling.

Kennedy writes of his method, “Mathematically, the effect of scaling is to adjust the mean, a primary goal, and reduce the standard deviation, a secondary effect that helps me keep the entire class engaged.” “[Teachers] can challenge [their] students to do just about anything, then see how far they can go. …[Students] are freed from the burden of getting a certain percent right, so they can concentrate on doing as much as they can as well as they can.”

Here is how the method works. First, determine the class mean you desire. Kennedy suggests a class average of 82 for regular classes, 85 for electives, and 90 for advanced classes. These are based on his school-wide empirical (historical) data. You may use your own data or just what you think is reasonable.

Using the two data points (class mean, desired mean) and (highest score, 99 or 100), write the equation of the line through these points, expressing the scaled score as a function of the number of points a student earns. Use this function to scale the test.
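As a rough illustration, the line through those two points can be written as a short Python function. This is only a sketch of the calculation described above; the function name kennedy_scale, its parameter names, and the sample numbers are my own assumptions, not something taken from Kennedy's article.

```python
def kennedy_scale(raw, mean_raw, desired_mean, max_raw, desired_max=100.0):
    """Scaled score on the line through (mean_raw, desired_mean)
    and (max_raw, desired_max). All names here are illustrative."""
    if max_raw == mean_raw:
        # The two points coincide horizontally, so no line is determined.
        raise ValueError("highest score must differ from the class mean")
    slope = (desired_max - desired_mean) / (max_raw - mean_raw)
    return desired_mean + slope * (raw - mean_raw)


# Made-up class: raw mean 68, highest raw score 94,
# desired mean 82, and the top score mapped to 100.
for raw in (50, 68, 80, 94):
    print(raw, round(kennedy_scale(raw, 68, 82, 94, 100), 1))
```

The class mean always lands on the desired mean and the highest score on the desired maximum; every other score falls on the line between or beyond them.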

Making the “desired mean” the same for each version of the test should go a long way toward making the students’ scores indicate the same knowledge of the material. This method may be used even if the possible points on a version do not total 100.
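To see what this might look like with two versions, here is a small Python sketch. The raw scores are invented, the helper name scale_version is hypothetical, and I am assuming each version is scaled separately using its own class mean and highest score, with one version out of 100 points and the other out of 60.

```python
from statistics import mean


def scale_version(raw_scores, desired_mean=82.0, desired_max=100.0):
    """Scale one version's raw scores with the line through
    (class mean, desired_mean) and (highest score, desired_max)."""
    m, top = mean(raw_scores), max(raw_scores)
    # Assumes at least one score differs from the mean (top > m).
    slope = (desired_max - desired_mean) / (top - m)
    return [desired_mean + slope * (x - m) for x in raw_scores]


# Hypothetical raw scores: version A is out of 100, version B out of 60.
version_a = [55, 62, 70, 74, 81, 90]
version_b = [28, 33, 35, 41, 44, 52]

for name, scores in (("A", version_a), ("B", version_b)):
    scaled = scale_version(scores)
    print(name, [round(s) for s in scaled], "mean:", round(mean(scaled), 1))
```

Because the scaling is linear, the scaled mean of each version equals the desired mean, no matter how many raw points the version was worth.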


Doing the calculations.

The On Scaling post contains a TI-8x program that will do the calculations for you.

Dan Anderson, another reader of this blog, sent this link to a Desmos graph he made that will calculate the Kennedy scale score for your tests. Once you open it, save it to your Desmos files. It works like this:

  • Enter the 4 numbers in the left column:
    • AverageRawScore, for the version you are scaling,
    • DesiredAverage, the same for all versions,
    • MaxRawScore, for the current version, and
    • DesiredMax, probably 99 or 100.
  • The scaled scores will appear in the table in the lower left.
  • The graph shows all the scores from 0 to 100. To graph just your scores, delete everything in the xi column and enter your scores (in any order). The scaled scores appear in the second column of the table and the pairs are graphed.
  • The two highlighted points are (AverageRawScore, DesiredAverage) and (MaxRawScore, DesiredMax). These may be dragged to see the effect of changing them.

A final caution: If the AverageRawScore is greater than or equal to the DesiredAverage (or even close to it), then some scores may be scaled down. You probably want to avoid this (although it is consistent with the idea).
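One way to check for this before returning papers is to list the scores that would come out lower than they went in. The sketch below is my own; the function name lowered_scores and the numbers are made up, and the comparison only makes sense when the raw scores are on the same scale as the scaled ones (here, out of 100).

```python
def lowered_scores(raw_scores, mean_raw, desired_mean, max_raw, desired_max):
    """Return (raw, scaled) pairs where the scaled score is below the raw score."""
    slope = (desired_max - desired_mean) / (max_raw - mean_raw)
    pairs = [(x, desired_mean + slope * (x - mean_raw)) for x in raw_scores]
    return [(x, round(s, 1)) for x, s in pairs if s < x]


# Hypothetical class: the raw mean of 84 already exceeds the desired mean of 82.
print(lowered_scores([70, 84, 90, 96],
                     mean_raw=84, desired_mean=82, max_raw=96, desired_max=100))
# prints [(70, 61.0), (84, 82.0)]
```

In this made-up case the line has slope 1.5, so scores at or below the mean drop while the top scores still rise. If the list is not empty, you might lower the bar for scaling at all on that version or adjust the desired numbers.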

And a final thought: There is no need for each version to have the same number of points or for 100 to be the maximum possible. Include a question that will stretch their thinking. If some or most students miss it, the scale will take care of that and not hurt them. This may make them feel more confident and prepare them for the AP Calculus Exam, where most students, even those who earn a 5, do not answer all the questions.

Stay well and hang on!

[1] Assessing True Academic Success by Dan Kennedy, The Mathematics Teacher, September 1999, pages 462–466.