A Brief Note about Grade Statistics

or

How a Curve is Computed

[A whole lot more than you probably want to know about how grade curves are computed and used to determine your grade.]

This page describes what the statistics of the exam scores mean, and describes in slightly technical detail how I compute my grade curve.

What do the overall exam statistics mean?

For the purposes of this discussion, I will present an actual grade curve from a 50-question exam given in Astronomy 161 in 2001.

For each exam, I create a grade curve: a plot of the number of students who got each possible score (in this example, between 0 and 50 points). The plot is in the form of a histogram, or bar plot.
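As a rough illustration of what such a histogram contains, here is a minimal Python sketch. It is not the grading program used for the course; the function names and the sample scores are hypothetical.

```python
from collections import Counter

def score_histogram(scores, max_score):
    """Count how many students earned each possible score from 0 to max_score."""
    counts = Counter(scores)
    return [counts.get(s, 0) for s in range(max_score + 1)]

def print_histogram(histogram):
    """Print one text bar for every score that at least one student earned."""
    for score, count in enumerate(histogram):
        if count:
            print(f"{score:3d} | {'#' * count}")

# Hypothetical scores on a 10-question quiz (not data from the exam above).
sample_scores = [7, 8, 8, 6, 9, 7, 8, 5, 10, 8]
histogram = score_histogram(sample_scores, max_score=10)
print_histogram(histogram)
```

Each entry of the list is the height of one bar in the plot.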

From this data, I compute the mean score, the median score, and the "spread" (or standard deviation) of scores. These are reported in the first part of each exam result report like this:

Statistics [Raw (Percent out of 50)]: Mean: 35.3 (70.2%); Median: 36 (72%); Spread: 5.93 (11.86%)

What does each of these numbers mean?

The Mean

The Mean of the curve is the average (the "arithmetic mean") of all of the exam scores. It is computed by adding up the scores for all of the students and dividing by the number of students. In the example above, the average score was 35.3 questions out of 50 correct, or 70.2%. This is pretty good compared to the expected score of 68% (my typical goal in composing an exam is 66-68% for the target mean, assuming a traditional C+ curve).
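The arithmetic can be sketched in a few lines of Python. The helper names and the five sample scores are hypothetical and are not taken from the exam discussed here.

```python
def mean_score(scores):
    """Arithmetic mean: add up all the scores and divide by the number of students."""
    return sum(scores) / len(scores)

def as_percent(raw, max_score):
    """Express a raw score (or a raw mean) as a percentage of the points possible."""
    return 100.0 * raw / max_score

# Hypothetical raw scores on a 50-question exam.
sample_scores = [30, 35, 40, 38, 32]
average = mean_score(sample_scores)
print(f"Mean: {average:.1f} ({as_percent(average, 50):.1f}%)")
```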

The "Spread"

The Spread in the curve is a measurement of the distribution of scores above and below the mean. In simple terms, it is the characteristic width of the grade curve, defined mathematically as the "standard deviation" of the scores.

A large spread in a grade curve means that the scores were spread over a large range, making the curve wide and shallow. A large "tail" of low scores will also result in a larger spread in scores. A tall, narrow curve (small spread) means most people scored pretty close to the mean grade.

In the above example, the spread is 5.93 questions (11.9%), meaning a fairly normal grade curve (not too narrow, not too broad). The spread is usually 5-7 questions for a typical test.
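To see how the spread separates a narrow curve from a wide one, here is a small Python sketch. It assumes the population standard deviation (`statistics.pstdev`); whether the reported spread uses the population or the sample formula is not stated here. Both score lists are hypothetical.

```python
import statistics

def spread(scores):
    """Standard deviation of the scores (population form, an assumption)."""
    return statistics.pstdev(scores)

# Two hypothetical classes with the same mean (35) but different widths.
narrow = [34, 35, 35, 36, 35, 34, 36, 35]
wide = [20, 28, 35, 42, 48, 25, 45, 37]

for label, scores in (("narrow", narrow), ("wide", wide)):
    print(f"{label}: mean = {statistics.mean(scores)}, spread = {spread(scores):.2f}")
```

The two classes have identical means, so only the spread reveals how differently the scores are distributed.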

The Median

The Median is the score that divides the grade curve down the middle (think about the so-called "median-line" in the road: the line painted down the middle). Half of the students score at or above the median, and the other half at or below the median.

In the example above, the median was 36 correct out of 50 (or 72%). This means that half of the students scored a 36 or higher (and the other half scored less than 36). For those of you who are familiar with "percentiles" (like on the SAT), the median gives the 50th-percentile score.

The median is another way of judging the class performance, since the arithmetic mean can be skewed slightly by having a number of very high or very low scores. If the curve has a long tail towards lower scores, as is the case here, then the median is a better measure of class performance than the mean. If the curve is a symmetric bell curve, the mean and median are the same.

In this example, the grade distribution curve is slightly lopsided towards low scores. This is why the median (36) is larger than the mean (35.3) by nearly a whole question. The more lopsided a curve is (on either side), the greater the difference between the median and mean.
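The effect of a low tail on the mean and median can be checked with a short Python sketch. Both score lists are hypothetical and chosen only to make the effect obvious.

```python
import statistics

def median_minus_mean(scores):
    """Positive when a tail of low scores drags the mean below the median."""
    return statistics.median(scores) - statistics.mean(scores)

# Hypothetical scores: one symmetric set, one with a tail of low scores.
symmetric = [30, 33, 35, 37, 40]
low_tail = [12, 18, 35, 36, 37, 38, 39]

for label, scores in (("symmetric", symmetric), ("low tail", low_tail)):
    print(f"{label}: median = {statistics.median(scores)}, "
          f"mean = {statistics.mean(scores):.1f}, "
          f"gap = {median_minus_mean(scores):.1f}")
```

The two low scores pull the mean down by several points but leave the median where the bulk of the class sits.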

How is a Grade Curve computed?

A "grade curve" is the way that an exam's raw score (say, 38/50) is converted into a grade-point score on the standard 4-point scale (i.e., A = 3.85 and above, A- = 3.50 to 3.84, etc.). The conversion is based on the overall statistical performance of the class, and defines the boundaries between A, B, C, etc. This approach is fairer than strict percentage-point grading, which sets arbitrary boundaries (e.g., 90%=A, 80%=B, etc.) without regard to the overall performance of the class.

A grading curve is set up by defining which raw scores correspond to which letter grades. The divisions between the major letter grades are given in the Quiz Results Reports as the "Grade Cutoffs" that appear in the second half of each report. For example:

Grade Cutoffs [Raw (Percent out of 50)]: A: 42-50 (84-100%); B: 35-41 (70-82%); C: 28-34 (56-68%); D: 23-27 (46-54%); E: 0-22 (0-44%)

This table shows the range of raw scores corresponding to each letter grade (with percentage-point equivalents given in parentheses). Different exams usually have slightly different cutoffs between grades, depending on the overall performance, but rarely more than a question or two of difference from exam to exam.
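Reading a letter grade off the cutoff table amounts to a simple lookup. The Python sketch below uses the cutoffs from the example report; the function and variable names are hypothetical, and it returns only the major letter, without plus or minus.

```python
# Lowest raw score for each major letter grade, from the example report above.
GRADE_CUTOFFS = [("A", 42), ("B", 35), ("C", 28), ("D", 23), ("E", 0)]

def letter_grade(raw_score, cutoffs=GRADE_CUTOFFS):
    """Return the first (highest) letter whose lowest raw score is met."""
    for letter, lowest in cutoffs:
        if raw_score >= lowest:
            return letter
    raise ValueError(f"score {raw_score} is below every cutoff")

for raw in (45, 38, 30, 25, 10):
    print(f"{raw}/50 -> {letter_grade(raw)}")
```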

Letter grade cutoffs are determined by defining the mean score to be a C+ (a so-called "C+ curve"), and then subdividing the curve into regions for each grade (ABCDE). If the curve is very symmetric and bell-shaped, the width of each of the letter-grade "bins" will usually turn out to be about as wide as the curve spread (standard deviation). This usually works out to about 14-16% wide in most cases (as it does here).
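A first, purely statistical pass at the cutoffs might look like the Python sketch below. It rests on two assumptions that are not stated in the text: that "mean = C+" places the B/C boundary at the mean, and that every bin is exactly one spread wide. The function name and the sample numbers are hypothetical, and the result is only a starting point before any adjustment.

```python
import math

def rough_cutoffs(mean, spread):
    """Lowest whole raw score for each letter grade, before any hand adjustment.

    Assumptions (not from the text): the B/C boundary sits at the mean,
    and each letter-grade bin is one standard deviation wide.
    """
    return {
        "A": math.ceil(mean + spread),
        "B": math.ceil(mean),
        "C": max(0, math.ceil(mean - spread)),
        "D": max(0, math.ceil(mean - 2 * spread)),
        "E": 0,
    }

# Hypothetical exam statistics, in raw questions.
print(rough_cutoffs(mean=30.5, spread=6.0))
```

Rounding up to a whole question reflects the fact that raw scores cannot be fractional.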

However, I don't do this blindly using just the statistics. This is where the curving process becomes more involved...

One piece of guidance I have in setting up the curve is to consult the running statistics of historical performance on questions in our multiple-choice question collection (some of these data go back more than 25 years). If the class performs very well on the exam compared to previous classes, I usually elect to compute the curve more favorably by setting the mean score equal to a high C+ or (in really good cases) a low B-. If performance is about average, I use a C+ curve. This is the first complication. In this example, the class scored about 4% above the historical mean, and the high median (72%) means the median score was a B- instead of a C+. This means the class on average did rather well, with more than half the class getting between B- and A on the exam.

A source of complication arises if the point-spread is large (>16%). The most common cause is a curve with a long "tail" of low scores below the median, giving the curve a lopsided appearance. To diminish the impact of this long low tail on the scores in the central "core" of the curve, I adjust the widths of the bins. This is what goes into determining the rough dividing lines between letter grades. In all cases, I round the final divisions to a given raw score (you cannot get fractional points: an answer is either right or wrong, with no in-between). I get some guidance in how to make these divisions by examining the percentile statistics (e.g., 10% of the students get less than the 10th-percentile score, 90% get less than the 90th-percentile score, etc.). These provide an additional quantitative handle on the spread of data within the curve. Usually, I find that a judgment based on the statistical mean and spread in the curve is more favorable to students than making blind percentile cuts, but the percentiles do serve as a decent sanity check on my grade divisions.
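One way to picture the percentile sanity check is to ask what share of the class falls below each proposed cutoff. The Python sketch below does that; the function name and the 20 sample scores are hypothetical, while the cutoffs are the ones from the example report.

```python
def percent_below(scores, cutoff):
    """Percentage of students who scored strictly below the given raw score."""
    below = sum(1 for s in scores if s < cutoff)
    return 100.0 * below / len(scores)

# Hypothetical class of 20 students on a 50-question exam.
sample_scores = [18, 24, 27, 29, 31, 33, 34, 35, 36, 36,
                 37, 38, 39, 40, 41, 42, 43, 44, 46, 48]

# Lowest raw score for each grade, from the example report.
cutoffs = {"A": 42, "B": 35, "C": 28, "D": 23}

for letter, lowest in cutoffs.items():
    share = percent_below(sample_scores, lowest)
    print(f"{share:.0f}% of the class scored below the lowest {letter} ({lowest})")
```

If a proposed cutoff left an implausibly large share of the class below it, that would be a signal to revisit the division.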

The final step in determining the grade curve is to review the detailed responses to each question. The grading program we use prints out the statistics for each question (i.e., how many students gave each of the different answers for a given question). What I am looking for here is whether a particular question "threw" most of the class. In general, any question that has response statistics much worse than the overall exam average will be scrutinized.
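The kind of per-question screening described above can be sketched in Python as follows. This is not the grading program itself: the function names, the 0.25 margin used for "much worse", and the tiny response table are all hypothetical.

```python
def fraction_correct_by_question(responses):
    """responses[s][q] is True if student s answered question q correctly."""
    n_students = len(responses)
    n_questions = len(responses[0])
    return [
        sum(1 for student in responses if student[q]) / n_students
        for q in range(n_questions)
    ]

def flag_questions(responses, margin=0.25):
    """Indices of questions answered correctly far less often than average.

    The margin is an arbitrary illustrative threshold, not a documented rule.
    """
    fractions = fraction_correct_by_question(responses)
    overall = sum(fractions) / len(fractions)
    return [q for q, f in enumerate(fractions) if f < overall - margin]

# Hypothetical results: 4 students, 4 questions.
responses = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, True],
    [True, True, True, False],
]
print(fraction_correct_by_question(responses))
print("Questions to scrutinize:", flag_questions(responses))
```

A flagged question is only a candidate for review; whether it was actually misleading is still a judgment call.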

On rare occasions this review has helped to identify truly bogus or otherwise misleading questions. In those cases I will usually elect to summarily reject the question and recompute the exam grades retroactively. In the unlikely event that throwing out the bogus question would lower someone's score, the original higher score is retained. This is a fairly rare occurrence, maybe once every 5 years, but I check nonetheless.
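The "keep the higher score" rule can be sketched in Python as below. It assumes that rescoring means taking the percentage correct over the remaining questions, which the text does not spell out; the function names and the five-question answer sheet are hypothetical.

```python
def percent_correct(answers):
    """Percentage of questions answered correctly (answers is a list of booleans)."""
    return 100.0 * sum(answers) / len(answers)

def rescore_without(answers, dropped):
    """Score with one question thrown out, never lower than the original score.

    Assumption: the rescored exam is graded as percent correct over the
    remaining questions.
    """
    remaining = [ok for q, ok in enumerate(answers) if q != dropped]
    return max(percent_correct(answers), percent_correct(remaining))

# Hypothetical answer sheet: 3 of 5 correct (60%).
answers = [True, True, False, True, False]
print(rescore_without(answers, dropped=2))  # missed the dropped question: score rises
print(rescore_without(answers, dropped=0))  # got it right: original score is kept
```

Under this assumption, a student who answered the rejected question correctly would otherwise see their percentage drop, which is exactly the case the rule protects against.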

I then assemble all the pieces and compute a final grade for all of the students. After a final review by hand of the calculations to check for problems, imbalances, etc., I assign the final grades.

If you want more details, see this worked example of the grading process.

Final Remarks

In the end, I do not use grade statistics as a substitute for exercising my judgment as an instructor. I simply do not believe that one should "hide behind the curve" when it comes to assigning grades. Students take their grades seriously, and so do I.

As such, I make minor adjustments with rough quantitative guidance from the performance statistics so that the curves will be as fair and consistent as possible. My goal is to make the process as fair as possible so as to mitigate the otherwise impersonal situation in a large GEC lecture class.
