The reference benchmark for AI grading in K-12. Every major foundation model and every commercial AI grading vendor, evaluated on the same yardstick of human-rated student work. Public, open, independent.
How AI grading systems compare on K-12 student work.
Quadratic-weighted kappa against credentialed human raters, percent within one rubric point, and mean absolute error; each metric is defined in a short code sketch below the table. Updated as new models and vendors are submitted.
#   Model             Provider / notes                                              QWK vs. human   Within 1 pt   MAE (pts)
1   cograder-2.0 *    Production system, validated in the Success Academies study   0.97            97.9%         0.23
2   Claude Opus 4.7   Anthropic                                                     0.71            89.2%         0.41
3   GPT-4 Turbo       OpenAI                                                        0.68            87.5%         0.46
4   Gemini 2.5 Pro    Google DeepMind                                               0.65            85.1%         0.51
5   Llama 3.1 405B    Meta                                                          0.61            82.3%         0.58

* Listed as a participant, not the evaluator; see Transparency below.
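For readers who want exact definitions behind the three columns, here is a minimal, self-contained Python sketch of each metric, assuming integer rubric scores on a 0-4 scale. The bench's open-source scoring scripts remain the authoritative implementation.

```python
# A minimal, self-contained sketch of the three leaderboard metrics,
# assuming integer rubric scores on a fixed scale (0..4 here). The
# bench's published open-source scoring scripts are authoritative.

def quadratic_weighted_kappa(human, model, min_score=0, max_score=4):
    """Agreement with quadratic penalties for large disagreements.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    n = max_score - min_score + 1
    # Observed score-pair counts.
    obs = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        obs[h - min_score][m - min_score] += 1
    # Expected counts come from the two raters' marginal distributions.
    h_marg = [sum(row) for row in obs]
    m_marg = [sum(col) for col in zip(*obs)]
    total = len(human)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic weight
            num += w * obs[i][j]
            den += w * h_marg[i] * m_marg[j] / total
    return 1.0 - num / den  # den == 0 only in the degenerate constant case

def within_one_point(human, model):
    """Fraction of items where the model is within one rubric point."""
    return sum(abs(h - m) <= 1 for h, m in zip(human, model)) / len(human)

def mean_absolute_error(human, model):
    """Mean absolute score gap, in rubric points."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)
```

For example, human scores [3, 2, 4, 1] against model scores [3, 3, 4, 2] give a within-1 rate of 1.0 and an MAE of 0.5.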
Transparency
The model used internally by k12eval-bench's scoring pipeline, and any conflicts of interest involving listed vendors, are disclosed in the methodology document. cograder-2.0 is listed on this leaderboard as a participant, not the evaluator.
Methodology
How a number gets onto the leaderboard.
Every result on k12eval-bench is reproducible, open, and scored against a gold standard rated by credentialed humans, not by another AI. The full spec is published alongside each version of the bench.
01
Stratified item pool
Items are sampled across grade band, subject, language, and student demographics; a sampling sketch follows step 04. v0.1 covers 2,400 items; v1 expands to 10,000+ with bilingual coverage and fairness audits.
02
3-rater human gold standard
Each item is independently scored by three credentialed K-12 teachers and reconciled to consensus; a reconciliation sketch follows step 04. Raters are paid, calibrated, and recruited as a standing research panel.
03
Public eval suite
Open-source scoring scripts, a deterministic prompt protocol, and a submission API; a protocol sketch follows step 04. Any vendor or research lab can submit a model and get the same numbers back.
04
Versioned, datasheeted, citable
Each release ships with a datasheet, a methodology paper, and a permanent DOI. v0.1 is interim; v1 becomes the canonical evaluation set for the field once funded and built.
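Step 01, made concrete: a hedged sketch of proportional stratified sampling. The field names (grade_band, subject, language) and the allocation rule are assumptions for illustration, not the bench's actual item schema.

```python
# Step 01, sketched: proportional stratified sampling. Field names and
# the allocation rule are illustrative assumptions, not the real schema.
import random
from collections import defaultdict

def stratified_sample(items, n_total,
                      keys=("grade_band", "subject", "language"), seed=0):
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[tuple(item[k] for k in keys)].append(item)
    sample = []
    for pool in strata.values():
        # Proportional allocation, with at least one item per stratum;
        # rounding means the final count can differ slightly from n_total.
        k = max(1, round(n_total * len(pool) / len(items)))
        sample.extend(rng.sample(pool, min(k, len(pool))))
    return sample
```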
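Step 02, made concrete: one plausible reconciliation rule, shown only for intuition. The actual consensus procedure is the one defined in the methodology paper.

```python
# Step 02, sketched: median of three rater scores, with wide
# disagreements flagged for live adjudication. This rule is an
# assumption; the bench's procedure is in the methodology paper.
from statistics import median

def reconcile(scores, max_spread=1):
    """Return (consensus_score, needs_adjudication) for three rater scores."""
    assert len(scores) == 3, "gold standard uses exactly three raters"
    needs_adjudication = max(scores) - min(scores) > max_spread
    return int(median(scores)), needs_adjudication
```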
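Step 03, made concrete: what a deterministic prompt protocol can look like. The template, greedy decoding, and pinned seed are illustrative; the real protocol ships with the open-source eval suite.

```python
# Step 03, sketched: every submitted model receives the same frozen
# template under the same decoding settings. All values are illustrative.
PROMPT_TEMPLATE = """You are grading one student response.
Rubric:
{rubric}

Student response:
{response}

Return only an integer score from {min_score} to {max_score}."""

def build_request(item, model_id):
    """Build an identical grading request for every submitted model."""
    return {
        "model": model_id,
        "temperature": 0.0,  # greedy decoding: no sampling randomness
        "seed": 12345,       # pinned where the serving API supports it
        "prompt": PROMPT_TEMPLATE.format(**item),
    }
```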
For vendors and labs
Submit your model to k12eval-bench.
Run on the same items, against the same human-rated answer key, under the same prompt protocol. Results land on the public leaderboard, and submissions are reviewed and re-scored on each release cycle.
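To make the flow concrete, here is a hypothetical submission payload. The endpoint URL and field names are placeholders; the actual submission API is documented in the published eval suite.

```python
# A hypothetical submission payload. The URL, field names, and auth-free
# POST are placeholders, not the bench's real submission API.
import json
from urllib.request import Request, urlopen

submission = {
    "model_name": "your-model-v1",   # your identifier on the leaderboard
    "bench_version": "v0.1",
    "contact": "team@example.com",
    "predictions": [                 # one entry per bench item
        {"item_id": "k12-000001", "score": 3},
    ],
}

req = Request(
    "https://example.org/k12eval-bench/submit",  # placeholder endpoint
    data=json.dumps(submission).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(resp.read().decode())  # scored results land on the leaderboard
```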
The bench, the methodology, and the leaderboard are stewarded by the K12Eval Project as public infrastructure. Funding for v1 is detailed on the funders page.