Scoring Different Versions of the Same Test

Many, perhaps all, of you are looking at a vastly different teaching situation in the fall. A teacher wrote to me recently asking how he could give several versions of a test on the same material to his online classes and scale them so as to be fair to everyone, even though one version may unintentionally be more or less difficult than another.

The AP program is able, as well as anyone, to give different tests from year to year and within the same year and adjust the scores (5, 4, 3, 2, 1) so they indicate the same knowledge of the subject even though the questions are different. The process is described here, and how different versions of the same exam are handled is described here.

AP Exam questions, including multiple-choice questions, take several years to develop and are pre-tested on students before they ever appear on an AP Exam. A few questions are reused from the previous year to help compare the difficulty of the two exams. I do not see how a teacher with a small number of students (compared to the AP Calculus numbers) can use a similar approach with their classes.

Nevertheless, the concern about testing outside of a traditional classroom exists. Teachers are rightly concerned about students collaborating on take-home tests. Giving several versions will help to prevent this. But how do you grade them fairly?

No two tests on the same material can be of equal difficulty, whether written by the College Board or a teacher. The blog post On Scaling, which appears in the gallery below, discusses various ways of scaling tests. It occurred to me that the “Kennedy Scale” can be adapted to the current situation.

The Kennedy Scale – used to handle multiple versions of the same test.

In Assessing True Academic Success [1], Dan Kennedy suggests this method. The entire article is worth reading every year and discusses a lot about assessment, besides just scaling.

Kennedy writes of his method, “Mathematically, the effect of scaling is to adjust the mean, a primary goal, and reduce the standard deviation, a secondary effect that helps me keep the entire class engaged.” “[Teachers] can challenge [their] students to do just about anything, then see how far they can go. …[Students] are freed from the burden of getting a certain percent right, so they can concentrate on doing as much as they can as well as they can.”

Here is how the method works. First, determine the class mean you desire. Kennedy suggests a class average of 82 for regular classes, 85 for electives, and 90 for advanced. These are based on his school wide empirical (historical) data. You may use your own data or just what you think is reasonable.

Using the two data points (class mean, desired mean) and (highest score, 99 or 100), write the equation of the line through these points, expressing the scaled score as a function of the number of points a student earns. Use this function to scale the test.

Making the “desired mean” the same for each version of the test should go a long way toward making the students’ scores indicate the same knowledge of the material. This method may be used even if the number of possible points on a version is not 100.
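The two-point line described above can be written as a short function. The following is a minimal Python sketch, not code from Kennedy's article; the function names `kennedy_scale` and `scale_version`, the parameter names, and the raw scores for the two versions are all made up for illustration.

```python
def kennedy_scale(raw, raw_mean, desired_mean, raw_max, desired_max=99):
    """Scale one raw score along the line through
    (raw_mean, desired_mean) and (raw_max, desired_max)."""
    if raw_max == raw_mean:
        raise ValueError("highest score must differ from the class mean")
    slope = (desired_max - desired_mean) / (raw_max - raw_mean)
    return desired_mean + slope * (raw - raw_mean)


def scale_version(raw_scores, desired_mean, desired_max=99):
    """Scale every score on one version, using that version's own mean and maximum."""
    raw_mean = sum(raw_scores) / len(raw_scores)
    raw_max = max(raw_scores)
    return [
        round(kennedy_scale(r, raw_mean, desired_mean, raw_max, desired_max), 1)
        for r in raw_scores
    ]


# Hypothetical raw scores: version A is out of 100 points, version B out of 60.
version_a = [48, 55, 62, 70, 85]
version_b = [25, 31, 36, 40, 48]

# The same desired mean (82) is used for both versions.
scaled_a = scale_version(version_a, desired_mean=82)
scaled_b = scale_version(version_b, desired_mean=82)
print(scaled_a)
print(scaled_b)
```

Each version is scaled with its own mean and highest score, so both sets of scaled scores end up with the same mean and the same top score even though the raw point totals differ.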

Doing the calculations.

The On Scaling post contains a TI-8x program that will do the calculations for you.

Dan Anderson, another reader of this blog, sent this link to a Desmos graph he made that will calculate the Kennedy scale score for your tests. Once you open it, save it to your Desmos files. It works like this:

• Enter the 4 numbers in the left column:
• AverageRawScore for the version you are scaling,
• DesiredAverage the same for all versions,
• MaxRawScore for the current version, and
• DesiredMax probably 99 or 100.
• The scaled scores will appear in the table in the lower left.
• The graph shows all the scores from 0 to 100. To graph just your scores, delete everything in the x1 column and enter your scores (in any order). The scaled scores appear in the second column of the table and the pairs are graphed.
• The two highlighted points are (AverageRawScore, DesiredAverage) and (MaxRawScore, DesiredMax). These may be dragged to see the effect of changing them.

A final caution: If the AverageRawScore is greater than or equal to the DesiredAverage (or even close), then some scores may be scaled down. You probably want to avoid this (although, it is consistent with the idea).
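To see why, here is a worked case with made-up numbers (they are not from Kennedy's article). Suppose AverageRawScore = 88, DesiredAverage = 82, MaxRawScore = 98, and DesiredMax = 99. Writing P for the raw score and S for the scaled score, the line through (88, 82) and (98, 99) is

$\displaystyle S=82+\frac{99-82}{98-88}\left( P-88 \right)=82+1.7\left( P-88 \right)$

A raw 88 becomes an 82 and a raw 80 becomes 82 + 1.7(–8) = 68.4. The raw and scaled scores agree only where S = P, that is, where 0.7P = 67.6, or P ≈ 96.6. Because the slope is greater than 1, every raw score below that value is scaled down.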

And a final thought: There is no need for each version to have the same number of points or for 100 to be the maximum possible. Include a question that will stretch your students’ thinking – if some or most miss it, the scale will take care of that and not hurt them. This may make them feel more confident and prepare them for the AP Calculus Exam where most students, even those who earn a 5, do not answer all the questions.

Stay well and hang on!

[1] Assessing True Academic Success by Dan Kennedy, The Mathematics Teacher, September 1999, pages 462–466.

On Scaling

Why “scaling” is necessary

No teacher can make two tests on the same topics equal in difficulty. No two teachers, even if they collaborate, can make two tests on the same topic equal in difficulty. No two teachers in different schools, districts, or states can make two tests on the same subject equal in difficulty. Even professional testing companies, such as the Educational Testing Service (ETS) that writes the AP exams, cannot write two tests on the same course of equal difficulty.

Scaling is needed to account for the difference in difficulty. Scaling attempts to make the scores on different forms of a test indicate that a student writing the test has the same amount of knowledge as another student with a similar score.

The ETS does this by pre-testing its items on college students and including several questions from previous years to help judge the difficulty from year to year. They do a great deal of statistics on each item each year. But they do not pretend that this year’s test is the same difficulty as last year’s test. After their computations and consultations with colleges are done, they scale the test. Their goal is to make the score indicate the same amount of knowledge from test to test and year to year.

A teacher cannot do that in his or her class. They don’t have the resources or the time. Yet, there are ways to even out the difficulty of your classroom tests and quizzes.

Some poor ways to scale

In what follows, P will represent the percentage of the total points available on a test that a student earns, and S will equal the score the student is given for that percentage.

Percentage scaling (S = P): For many years I, and I expect most teachers, simply let S = P. But sometimes the scores were kind of low: the test was too hard, or the students didn’t do well (or maybe the teacher didn’t do well). What to do? Among the usual solutions are (1) give a make-up test, (2) let the students make corrections to earn back some of the points, (3) scale the test by raising all the grades arbitrarily, or (4) make sure the next test is “easy.” I’ve tried all of them.

Doesn’t make too much sense, does it?

Categories: For quite a few years, I listed the percentages from highest to lowest and looked for natural breaks to separate the scores into 90, 80, 70, etc. Intermediate scores were spread between the cut points. If you don’t need a number to put on the report cards, the categories become A, B, C, etc. with perhaps a “+” or a “–” attached.

Comic Interlude – the “Square Root Scale”

The “square root scale” is $S=10\sqrt{P}$. So, a 36 is scaled to a 60, an 81 to a 90, and a 70 to an 84. What this accomplishes is to raise everyone’s score for no reason other than to raise the score. See the graph below.

The Square Root Curve, $S=10\sqrt{P}$, in red and the Percentage Curve, S = P, in blue

Compared to the percentage grade, the low scores get raised more than the higher scores. Everyone wins big time, but what does it tell you? I can see no justification for this, except maybe the “complicated” algebra involved fools the students, administrators, and parents into thinking that something really scientific is going on. It’s not.
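A few values make the effect concrete. The short Python sketch below is only an illustration; the function name `square_root_scale` and the sample percentages were chosen for this example. It tabulates the scaled score and the points gained for several percentages, including the three mentioned above.

```python
import math


def square_root_scale(p):
    """Square root scale: S = 10 * sqrt(P), for a percentage P from 0 to 100."""
    return 10 * math.sqrt(p)


# Compare the scaled score with the plain percentage score (S = P).
for p in (0, 25, 36, 70, 81, 100):
    s = square_root_scale(p)
    print(f"P = {p:3d}  S = {s:5.1f}  increase = {s - p:5.1f}")
```

The increase column shows how unevenly the points are handed out: nothing at either end and the most in the lower part of the range.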

(Since this is a calculus blog, there is a calculus exercise in the appendix below that analyzes this scheme.)

A Better Choice for Scaling – the Kennedy Scale

While no method is perfect, this method suggested in Assessing True Academic Success by Dan Kennedy [1] is a reasonable and easy one. The entire article is worth reading every year and discusses a lot about assessment, besides just scaling.

He writes of his method, “Mathematically, the effect of scaling is to adjust the mean, a primary goal, and reduce the standard deviation, a secondary effect that helps me keep the entire class engaged.” “[Teachers] can challenge [their] students to do just about anything, then see how far they can go. …[Students] are freed from the burden of getting a certain percent right, so they can concentrate on doing as much as they can as well as they can.”

I used this method for BC Calculus and 8th grade Algebra 1 in the year I came out of retirement and was happy with the results.

Here’s how the method works. First, determine the class mean you desire. Kennedy suggests a class average of 82 for regular classes, 85 for electives, and 90 for advanced. These are based on his school wide empirical (historical) data. You may use your own data or just what you think is reasonable.

Using the two data points (class mean, desired mean) and (highest score, 99), write the equation of the line through these points (P, S), expressing S as a function of P. (The 99 could be adjusted as you see fit.) Use this function to scale the test.
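Written out in point-slope form, the line looks like this. The symbols $\bar{P}$ for the class mean, $P_{\max }$ for the highest score, and D for the desired mean are introduced here for convenience; they are not Kennedy's notation.

$\displaystyle S=D+\frac{99-D}{P_{\max }-\bar{P}}\left( P-\bar{P} \right)$

For example, with a made-up class mean of 64, a highest score of 85, and D = 82, the line is

$\displaystyle S=82+\frac{17}{21}\left( P-64 \right)$

so a 64 scales to 82, an 85 scales to 99, and a 70 scales to about 86.9.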

This TI-8x program, from the same article, will easily compute the scores for you. (There is a typo in the fourth line; it should read 0->Ymin:126->Ymax.)

Update Excel Spread Sheet for Kennedy Scale.

At the suggestion of a reader, here is an Excel spreadsheet you may download for the Kennedy Curve. Enter the four values at the top left and the scores will be calculated.

Updated December 8, 2020

Update Desmos Program for Kennedy Score

Dan Anderson sent a comment (see below) with a link to a Desmos graph he made that will calculate the Kennedy scale for your tests. You can access the graph here. Once you’ve opened it, save it to your Desmos files.

It works like this: enter the 4 numbers in the left column AverageRawScore, DesiredAverage, MaxRawScore, and DesiredMax as they apply to your test. The scaled scores will appear in the table in the lower left.

To scale your exam, delete everything in the x1 column and enter your scores (in any order, with duplicates). The scaled scores appear in the second column of the table and the pairs are graphed.

The two highlighted points are (AverageRawScore, DesiredAverage) and (MaxRawScore, DesiredMax). These may be dragged to see the effect of changing them.

A final caution: If the AverageRawScore is greater than or equal to the DesiredAverage (or even close), then some scores may be scaled down. You probably want to avoid this (although, it is consistent with the idea).

Updated October 13, 2018

Update October 19, 2020

Remember, by scaling, you are not giving away free points; you are trying to account for the difference in difficulty from one test to the next.

Scaling Different Versions of the Same Test: How to adapt the Kennedy method when using different versions of the same test in your class.

Update August 24, 2021

Appendix: An analysis of the Square Root Curve – A Calculus Exercise

For the function $S=10\sqrt{P}$:

1. Determine the percentage score(s), P, which receives the least points using this method. Justify your answer.
2. Determine the percentage score(s), P, which receives the most points using this method. Justify your answer.
3. At the value found in 2, what is the slope of the line tangent to the graph of $S=10\sqrt{P}$?
4. Compare your answer for 3 to the slope of S = P. Why must this be so? Is it related to the MVT?

Solution

1. Since the Square Root curve lies above the percentage curve, all the values receive some increase except the endpoints (P = 0 and P = 100), which receive no increase.
2. Let I = the increase in  the score, then

$I=10\sqrt{P}-P$

$\displaystyle \frac{{dI}}{{dP}}=\frac{{10}}{{2\sqrt{P}}}-1$

$\displaystyle \frac{{10}}{{2\sqrt{P}}}-1=0,\text{ when }P=25$

This is the maximum since it is the only place where dI/dP changes from positive to negative. At P = 25 the score is raised by 25 points to a 50.

3. $\displaystyle \frac{{dS}}{{dP}}=\frac{{10}}{{2\sqrt{P}}}$. At P = 25, dS/dP = 1. The slope of the tangent line is 1.

4. At P = 25 the slope of the tangent line to the square root scale is 1: the tangent is parallel to the percentage graph. To the left of P = 25 the square root scale is rising faster than S = P, so its slope is greater than 1. After P = 25 the slope of the square root scale keeps decreasing and falls below the slope of S = P. P = 25 is the place where the slope changes from steeper to less steep and thus where the slopes are equal. This is the point farthest vertically above the percentage graph. Both graphs pass through (0, 0) and (100, 100), so the average slope of the square root scale on [0, 100] is 1; P = 25 is therefore also the point guaranteed by the MVT on that interval.
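The answers to parts 2 through 4 can also be checked numerically. This is a rough Python sketch rather than part of the exercise; the helper name `increase` is made up here, and only whole-number percentages are searched.

```python
import math


def increase(p):
    """Points gained under the square root scale: I = 10*sqrt(P) - P."""
    return 10 * math.sqrt(p) - p


# Search the whole-number percentages 0..100 for the largest increase.
best = max(range(0, 101), key=increase)

# Slope of the tangent to S = 10*sqrt(P) at that percentage.
tangent_slope = 10 / (2 * math.sqrt(best))

# Average slope of S = 10*sqrt(P) over [0, 100], the value the MVT refers to.
average_slope = (10 * math.sqrt(100) - 10 * math.sqrt(0)) / (100 - 0)

print(best, increase(best), tangent_slope, average_slope)
```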

[1] Assessing True Academic Success by Dan Kennedy, The Mathematics Teacher, September 1999, pages 462–466.