The Professional Skill Benchmark — measuring AI work quality

Everyone ships "AI skills." Almost nobody measures whether the output is any good. This is an open benchmark that does — think of it as a SWE-bench for professional work: PRDs, OKRs, roadmaps, postmortems, GTM plans and more, scored on a fixed rubric by an LLM judge.


How it's scored

For each skill, the harness runs a representative held-out case, then a judge model rates the output from 1 to 5 on four dimensions:

Structure — follows the expected shape for this kind of artifact.

Completeness — covers what the task needs; nothing important missing.

Usefulness — specific and genuinely useful to a professional, not generic.

Grounding — stays grounded in the input; no invented facts or metrics.
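
As a rough illustration of how four 1–5 ratings could roll up into a single figure like the 4.75/5 quoted on this page, here is a minimal sketch. It assumes the overall score is the unweighted mean of the four dimensions, which this page does not state; the names DIMENSIONS and overallScore are hypothetical and are not taken from the harness.

```javascript
// Assumption: the overall score is the unweighted mean of the four dimension
// ratings. The page does not say how the judge's ratings are combined.
const DIMENSIONS = ["structure", "completeness", "usefulness", "grounding"];

// Hypothetical helper: validates each rating and returns the mean.
function overallScore(ratings) {
  for (const dim of DIMENSIONS) {
    const value = ratings[dim];
    if (!Number.isFinite(value) || value < 1 || value > 5) {
      throw new RangeError(`${dim} must be a number from 1 to 5, got ${value}`);
    }
  }
  const total = DIMENSIONS.reduce((sum, dim) => sum + ratings[dim], 0);
  return total / DIMENSIONS.length;
}

console.log(
  overallScore({ structure: 5, completeness: 5, usefulness: 4, grounding: 5 })
); // 4.75
```

Under this assumption, and if the ratings are whole numbers, overall scores move in steps of 0.25, so a 4.75 would correspond to three 5s and one 4.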

Why it's credible

Reproducible & open. The harness, cases, and results are in the repo — run it yourself with node evals/run-evals.mjs.
Blind to the author. The judge sees the skill's stated purpose and the output, not the model or author name.
It catches real failures. A recent run flagged three skills at ~2.0/5; a fix lifted them to 4.75/5. Scores move when quality moves.
Honest coverage. 15 of 180 skills are scored today and the set is growing — we don't claim a number we can't show.
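
The "blind to the author" property can be sketched as an allow-list applied before anything reaches the judge. This is only an illustration: the record shape and the name toJudgeInput are hypothetical, and the actual harness may do this differently.

```javascript
// Hypothetical run record: { skillName, purpose, output, author, model }.
// Only the two fields this page says the judge sees are copied across.
function toJudgeInput(run) {
  return { purpose: run.purpose, output: run.output };
}

const run = {
  skillName: "example-skill",      // hypothetical value
  purpose: "Draft a postmortem from incident notes.",
  output: "Summary: ...",
  author: "someone",               // must not reach the judge
  model: "some-model",             // must not reach the judge
};

console.log(Object.keys(toJudgeInput(run))); // [ 'purpose', 'output' ]
```

An allow-list is used here rather than deleting known identifying fields, so that any field added to the record later stays hidden from the judge by default.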

Submit your skill to the benchmark

The benchmark is open. To get a skill scored:

1. Open a PR adding skills/<name>/SKILL.md and a case in evals/cases.json (or use the submit-a-skill form).
2. The PR check validates structure and scores the changed skill automatically, posting the result on your PR.
3. Skills that clear the bar (aim for ≥ 4.0/5) get merged and appear on the leaderboard.
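
The steps above can be sketched as two small checks: working out which skills a PR touched from the skills/<name>/SKILL.md path pattern, and comparing a score against the 4.0 bar. The helper names and the example skill name are hypothetical, and the real PR check's logic is not shown on this page, so treat this as a sketch under those assumptions.

```javascript
// Path pattern taken from the submission step: skills/<name>/SKILL.md
const SKILL_PATH = /^skills\/([^/]+)\/SKILL\.md$/;

// Hypothetical helper: returns the skill name, or null for any other path.
function skillNameFromPath(path) {
  const match = SKILL_PATH.exec(path);
  return match ? match[1] : null;
}

// Hypothetical helper: unique skill names among the files changed in a PR.
function changedSkills(changedPaths) {
  const names = changedPaths.map(skillNameFromPath).filter((n) => n !== null);
  return [...new Set(names)];
}

// The bar quoted above: aim for an overall score of at least 4.0 out of 5.
function clearsBar(overall, bar = 4.0) {
  return overall >= bar;
}

// "prd-writer" is a made-up skill name used only for illustration.
console.log(changedSkills(["skills/prd-writer/SKILL.md", "evals/cases.json"]));
console.log(clearsBar(4.75), clearsBar(2.0));
```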
