The reference benchmark for AI grading in K-12. Every major foundation model and every commercial AI grading vendor, evaluated on the same yardstick of human-rated student work. Public, open, independent.
How AI grading systems compare on K-12 student work.
Quadratic-weighted kappa against credentialed human raters, percent within one rubric point, and mean absolute error; each metric is defined in a short code sketch below the table. Updated as new models and vendors are submitted.
#   Model             Provider / notes                                              QWK vs. human   Within 1 pt   MAE (pts)
1   cograder-2.0 *    Production system, validated in the Success Academies study   0.97            97.9%         0.23
2   Claude Opus 4.7   Anthropic                                                     0.71            89.2%         0.41
3   GPT-4 Turbo       OpenAI                                                        0.68            87.5%         0.46
4   Gemini 2.5 Pro    Google DeepMind                                               0.65            85.1%         0.51
5   Llama 3.1 405B    Meta                                                          0.61            82.3%         0.58

* Listed as a participant, not the evaluator; see Transparency below.
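For readers who want exact definitions behind the three columns, here is a minimal, self-contained Python sketch of each metric, assuming integer rubric scores on a 0-4 scale. The bench's open-source scoring scripts remain the authoritative implementation.

```python
# A minimal, self-contained sketch of the three leaderboard metrics,
# assuming integer rubric scores on a fixed scale (0..4 here). The
# bench's published open-source scoring scripts are authoritative.

def quadratic_weighted_kappa(human, model, min_score=0, max_score=4):
    """Agreement with quadratic penalties for large disagreements.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    n = max_score - min_score + 1
    # Observed score-pair counts.
    obs = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        obs[h - min_score][m - min_score] += 1
    # Expected counts come from the two raters' marginal distributions.
    h_marg = [sum(row) for row in obs]
    m_marg = [sum(col) for col in zip(*obs)]
    total = len(human)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic weight
            num += w * obs[i][j]
            den += w * h_marg[i] * m_marg[j] / total
    return 1.0 - num / den  # den == 0 only in the degenerate constant case

def within_one_point(human, model):
    """Fraction of items where the model is within one rubric point."""
    return sum(abs(h - m) <= 1 for h, m in zip(human, model)) / len(human)

def mean_absolute_error(human, model):
    """Mean absolute score gap, in rubric points."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)
```

For example, human scores [3, 2, 4, 1] against model scores [3, 3, 4, 2] give a within-1 rate of 1.0 and an MAE of 0.5.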
Transparency
The model used internally by k12eval-bench's scoring pipeline, and any conflicts of interest involving listed vendors, are disclosed in the methodology document. cograder-2.0 is listed on this leaderboard as a participant, not the evaluator.
Methodology
How a number gets onto the leaderboard.
Every result on k12eval-bench is reproducible, open, and scored against a gold standard rated by credentialed humans, not by another AI. The full spec is published alongside each version of the bench.
01
Stratified item pool
Items are sampled across grade band, subject, language, and student demographics; a sampling sketch follows step 04. v0.1 covers 2,400 items; v1 expands to 10,000+ with bilingual coverage and fairness audits.
02
3-rater human gold standard
Each item is independently scored by three credentialed K-12 teachers and reconciled to consensus; a reconciliation sketch follows step 04. Raters are paid, calibrated, and recruited as a standing research panel.
03
Public eval suite
Open-source scoring scripts, a deterministic prompt protocol, and a submission API; a protocol sketch follows step 04. Any vendor or research lab can submit a model and get the same numbers back.
04
Versioned, datasheeted, citable
Each release ships with a datasheet, a methodology paper, and a permanent DOI. v0.1 is interim; v1 becomes the canonical evaluation set for the field once funded and built.
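Step 01, made concrete: a hedged sketch of proportional stratified sampling. The field names (grade_band, subject, language) and the allocation rule are assumptions for illustration, not the bench's actual item schema.

```python
# Step 01, sketched: proportional stratified sampling. Field names and
# the allocation rule are illustrative assumptions, not the real schema.
import random
from collections import defaultdict

def stratified_sample(items, n_total,
                      keys=("grade_band", "subject", "language"), seed=0):
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[tuple(item[k] for k in keys)].append(item)
    sample = []
    for pool in strata.values():
        # Proportional allocation, with at least one item per stratum;
        # rounding means the final count can differ slightly from n_total.
        k = max(1, round(n_total * len(pool) / len(items)))
        sample.extend(rng.sample(pool, min(k, len(pool))))
    return sample
```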
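Step 02, made concrete: one plausible reconciliation rule, shown only for intuition. The actual consensus procedure is the one defined in the methodology paper.

```python
# Step 02, sketched: median of three rater scores, with wide
# disagreements flagged for live adjudication. This rule is an
# assumption; the bench's procedure is in the methodology paper.
from statistics import median

def reconcile(scores, max_spread=1):
    """Return (consensus_score, needs_adjudication) for three rater scores."""
    assert len(scores) == 3, "gold standard uses exactly three raters"
    needs_adjudication = max(scores) - min(scores) > max_spread
    return int(median(scores)), needs_adjudication
```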
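Step 03, made concrete: what a deterministic prompt protocol can look like. The template, greedy decoding, and pinned seed are illustrative; the real protocol ships with the open-source eval suite.

```python
# Step 03, sketched: every submitted model receives the same frozen
# template under the same decoding settings. All values are illustrative.
PROMPT_TEMPLATE = """You are grading one student response.
Rubric:
{rubric}

Student response:
{response}

Return only an integer score from {min_score} to {max_score}."""

def build_request(item, model_id):
    """Build an identical grading request for every submitted model."""
    return {
        "model": model_id,
        "temperature": 0.0,  # greedy decoding: no sampling randomness
        "seed": 12345,       # pinned where the serving API supports it
        "prompt": PROMPT_TEMPLATE.format(**item),
    }
```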
For vendors and labs
Submit your model to k12eval-bench.
Run on the same items, against the same human-rated answer key, under the same prompt protocol. Results land on the public leaderboard, and submissions are reviewed and re-scored on each release cycle.
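To make the flow concrete, here is a hypothetical submission payload. The endpoint URL and field names are placeholders; the actual submission API is documented in the published eval suite.

```python
# A hypothetical submission payload. The URL, field names, and auth-free
# POST are placeholders, not the bench's real submission API.
import json
from urllib.request import Request, urlopen

submission = {
    "model_name": "your-model-v1",   # your identifier on the leaderboard
    "bench_version": "v0.1",
    "contact": "team@example.com",
    "predictions": [                 # one entry per bench item
        {"item_id": "k12-000001", "score": 3},
    ],
}

req = Request(
    "https://example.org/k12eval-bench/submit",  # placeholder endpoint
    data=json.dumps(submission).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(resp.read().decode())  # scored results land on the leaderboard
```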
The bench, the methodology, and the leaderboard are stewarded by the K12Eval Project as public infrastructure. Funding for v1 is detailed on the funders page.