Rubric-based Grading Tool
An ongoing NLP project that asks a very specific question: can an LLM grade reasoning-heavy student answers the same way a professor does?
Built around CS302: Algorithms and Theory of Computing, this project treats the professor grade as ground truth and evaluates whether a rubric-based grading tool can reproduce that judgment in a structured and reliable way.

Overview
This project is centered on a simple but difficult goal: I trust the professor grade as ground truth, so I want to test whether my grading tool can give the same grade on reasoning-heavy student answers. What makes the project difficult is not only the model, but the entire structure around it: preprocessing messy inputs, building a usable rubric for varied answers, comparing multiple prompt designs, and evaluating where LLM grading aligns with professor judgment and where it still fails.
What drew me to this project is that grading looks simple from the outside, but it becomes a much harder NLP problem once the answers are open-ended. It is not enough for the model to detect a few keywords. It has to recognize partial reasoning, separate real understanding from shallow pattern matching, and allocate points in a way that stays faithful to how a human instructor would grade.
That is why I did not want to work on short factual responses. I wanted to work on the difficult part: reasoning.
Why this course
I chose Algorithms and Theory of Computing because the class has many reasoning questions, which makes grading it a genuinely challenging NLP problem. The answers are not just short facts. They often involve proof ideas, structured explanations, reductions, justification, and partial correctness. That makes the grading task much closer to real academic judgment.
This is also why I think the course is a strong benchmark. If a grading tool only works on very short or factual answers, that is not especially interesting. But if it can begin handling true/false questions with explanation, NP-style reasoning, grammar ambiguity, proofs, or reductions, then it becomes a much more meaningful tool.
Why the task is hard
Reasoning answers are varied. A student can be concise but correct, partially correct, or logically close without matching the reference wording at all.
Why CS302 fits
The course contains exactly the kinds of questions that make grading interesting to study: structured reasoning, explanation, and judgment beyond surface matching.
Pipeline
The pipeline matters because I do not want grading to be a black box prompt. I want it to be a structured process that can be inspected, compared, and improved step by step. The tool does not just ask an LLM for a score. It moves through several stages before the final output is evaluated against the professor grade.
Question + rubric/reference solution + student answer → preprocessing → structured grading prompt → parsed score and feedback → comparison against professor grade
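The stages above can be sketched as a small driver. This is a minimal sketch, not the actual implementation: the helper names (`call_llm`, the `SCORE:` output convention, `GradingResult`) are hypothetical stand-ins for whatever the real tool uses.

```python
from dataclasses import dataclass

@dataclass
class GradingResult:
    score: float
    feedback: str
    delta_vs_professor: float  # signed gap against the ground-truth grade

def grade(question: str, rubric: str, student_answer: str,
          professor_score: float, call_llm) -> GradingResult:
    """Run one answer through the pipeline sketched above."""
    # Preprocessing (simplified): collapse whitespace so noisy
    # extraction does not leak into the prompt.
    answer = " ".join(student_answer.split())

    # Structured grading prompt: question, rubric, and answer
    # are kept in clearly labeled sections.
    prompt = (
        f"Question:\n{question}\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Student answer:\n{answer}\n\n"
        "Return a line 'SCORE: <points>' followed by feedback."
    )

    raw = call_llm(prompt)

    # Parse the score line and treat the rest as feedback.
    first, _, feedback = raw.partition("\n")
    score = float(first.replace("SCORE:", "").strip())

    # Compare against the professor grade (ground truth).
    return GradingResult(score, feedback.strip(), score - professor_score)
```

Keeping each stage as a separate step is what makes the process inspectable: the prompt, the raw model output, and the parsed score can each be logged and compared across runs.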
Preprocessing
Preprocessing took much more time than I expected. In theory, the system begins with a question, a rubric, and a student answer. In practice, those inputs are often buried inside messy PDFs, inconsistent formatting, or handwritten responses that are hard to read.
Some handwritten text is difficult to read, so the challenge begins before grading does. Once the extraction becomes noisy, the problem is no longer just grading. It becomes OCR, normalization, parsing, and only then evaluation.
This changed how I think about the project. I originally thought the model would be the main source of difficulty, but the input layer turned out to be one of the biggest bottlenecks. If the extracted text is wrong or incomplete, even a strong grading prompt will fail for the wrong reason.
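A small part of that input layer can be shown concretely. The sketch below covers only post-extraction text cleanup under simple assumptions; the OCR step itself, and the harder cases like math notation and column order, are not shown.

```python
import re
import unicodedata

def normalize_extracted_text(raw: str) -> str:
    """Clean noisy PDF/OCR extraction before grading.

    A minimal sketch: real preprocessing would also need to handle
    math notation, reading order, and handwriting artifacts.
    """
    # Unify Unicode forms (e.g. the ligature 'fi' becomes plain 'fi').
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across extracted line breaks.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse whitespace runs left over from page layout.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```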
Rubric building
The most challenging part of the project has been building the rubric. This cost me a lot of time because student answers are varied even when they express the same underlying idea. Some answers are concise but valid. Some are partially correct in a way that deserves partial credit. Some are structurally strong but incomplete.
That means the rubric cannot just describe the ideal answer. It also has to encode how judgment works across many valid answer patterns. This is where grading becomes more than matching keywords. It becomes a problem of formalizing academic judgment.
In this project, I trust the professor grade as ground truth. So the goal is not whether the model can produce a reasonable grade in isolation. The goal is whether my tool can reproduce the same grade the professor would give.
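One way to encode that judgment is to break a question into independently creditable criteria, so partial reasoning maps to partial credit instead of an all-or-nothing match. The structure and the example criteria below are illustrative, not the rubric actually used in the course.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One independently creditable idea in an answer."""
    description: str
    points: float

@dataclass
class Rubric:
    question_id: str
    max_points: float
    criteria: list[Criterion] = field(default_factory=list)

    def score(self, satisfied: list[bool]) -> float:
        """Sum points for criteria the grader marked as satisfied,
        capped at max_points so overlapping credit cannot overflow."""
        total = sum(c.points for c, ok in zip(self.criteria, satisfied) if ok)
        return min(total, self.max_points)

# Illustrative example: a reduction question where partial
# reasoning earns partial credit.
rubric = Rubric("hw3-q2", 10.0, [
    Criterion("States the correct source problem for the reduction", 3.0),
    Criterion("Gives a valid mapping of instances", 4.0),
    Criterion("Argues correctness in both directions", 3.0),
])
```

The point of the structure is that a concise-but-valid answer and a verbose-but-incomplete one can both be scored against the same criteria, rather than against one ideal reference wording.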
Prompt strategy
My strategy is to test multiple prompts rather than trust a single grading setup. Prompt wording changes grading behavior, especially when the model has to decide how strict or permissive it should be with partial reasoning and partial credit.
I also classify questions into categories to see which types of questions the LLM performs well on and which it does not. This matters because not all reasoning questions are equally hard. A true/false question with explanation is different from a proof, and a proof is different from a reduction.
Multiple prompts
I compare prompt variants to see how strictness, rubric framing, and instruction style affect score alignment with professor grading.
Question categories
I group questions by type to identify where LLM grading performs well and where certain reasoning formats remain difficult.
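Combining the two ideas, each prompt variant can be evaluated per question category. The variant templates and category names below are placeholders, and the alignment measure is mean absolute error against the professor score.

```python
# Hypothetical prompt variants differing in strictness and rubric framing.
PROMPT_VARIANTS = {
    "strict": "Grade strictly: award points only for criteria fully met.\n{body}",
    "lenient": "Grade generously: award partial credit for partial reasoning.\n{body}",
    "rubric_first": "Apply the rubric below criterion by criterion.\n{body}",
}

# Illustrative question categories.
QUESTION_CATEGORIES = ("true_false_explain", "proof", "reduction", "grammar")

def compare_variants(grade_fn, items):
    """Mean absolute error of each prompt variant, split by category.

    `items` are (category, body, professor_score) tuples; `grade_fn`
    maps a fully formatted prompt to a numeric model score.
    """
    errors = {(v, c): [] for v in PROMPT_VARIANTS for c in QUESTION_CATEGORIES}
    for category, body, professor_score in items:
        for name, template in PROMPT_VARIANTS.items():
            score = grade_fn(template.format(body=body))
            errors[(name, category)].append(abs(score - professor_score))
    # Keep only (variant, category) cells that actually had questions.
    return {key: sum(e) / len(e) for key, e in errors.items() if e}
```

A table of these per-cell errors is what makes it possible to say, for example, that a stricter prompt helps on proofs but hurts on true/false explanations, rather than averaging everything into one number.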
Evaluation
Evaluation is the center of the project because the system is only useful if I can compare its behavior against professor grading in a careful way. I use several metrics because no single number fully captures grading quality.
Some metrics tell me whether the scores are numerically close. Others tell me whether the model preserves the same ranking pattern across students. That matters because a model can be close on average while still misjudging relative answer quality.
| Metric | Why it matters |
|---|---|
| MAE | Measures how close the model score is to the professor score on average. |
| Exact Match | Shows how often the tool gives exactly the same score as the professor. |
| Pearson Correlation | Checks whether the model follows the same scoring trend as the professor. |
| Spearman Correlation | Checks whether the model preserves the same ranking across student answers. |
This combination helps me ask a much better question than whether the LLM seems good at grading. It lets me ask under what conditions the grading tool stays aligned with professor judgment and where that alignment starts to break.
Where this is going
Starting next semester, I hope the professor can begin applying the tool across multiple courses and disciplines. My larger hope is that if the instructor formats the question and rubric clearly, the tool should perform well without being redesigned from scratch for every class.
I do not want this to remain a one-course demo. I want it to become a grading framework that can be reused across different academic settings, as long as the input structure is clear enough.
At the same time, this project is still ongoing, so I do not want to write a full conclusion yet. There are still many approaches I have not applied, including in-context prompting, chain-of-thought style grading, and other ways of aligning model reasoning more tightly with rubric logic.
So for now, I think of this page as a progress report. The core benchmark is already clear, the evaluation structure is much better defined, and the main weaknesses are visible enough that the next steps feel concrete.