Measuring the quality of AI-produced professional work.
Everyone ships "AI skills." Almost nobody measures whether the output is any good. This is an open benchmark that does — think of it as a SWE-bench for professional work: PRDs, OKRs, roadmaps, postmortems, GTM plans and more, scored on a fixed rubric by an LLM judge.
For each skill, a representative held-out case is run, then a judge model rates the output 1–5 on four dimensions:
node evals/run-evals.mjs

The benchmark is open. To get a skill scored:
Add skills/<name>/SKILL.md and a case in evals/cases.json (or use the submit-a-skill form).
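
To make the scoring scheme concrete, here is a minimal sketch of how a judge's four ratings for one case might be validated and combined into a single number. It is not taken from evals/run-evals.mjs. The function names, the placeholder dimension keys, the integer-only check, and the use of a plain mean are all assumptions made for illustration; the benchmark may name its dimensions and aggregate them differently.

```javascript
// Hypothetical helpers, not part of the benchmark's actual code.
// Assumes the judge returns one integer score from 1 to 5 per dimension.
const MIN_SCORE = 1;
const MAX_SCORE = 5;
const EXPECTED_DIMENSIONS = 4;

// Check that exactly four dimensions are present and each score is in range.
function validateScores(scores) {
  const entries = Object.entries(scores);
  if (entries.length !== EXPECTED_DIMENSIONS) {
    throw new Error(
      `expected ${EXPECTED_DIMENSIONS} dimensions, got ${entries.length}`
    );
  }
  for (const [dimension, value] of entries) {
    if (!Number.isInteger(value) || value < MIN_SCORE || value > MAX_SCORE) {
      throw new Error(
        `score for "${dimension}" must be an integer from ${MIN_SCORE} to ${MAX_SCORE}`
      );
    }
  }
  return entries;
}

// Combine the four ratings with an unweighted mean (an assumed aggregation).
function meanScore(scores) {
  const entries = validateScores(scores);
  const total = entries.reduce((sum, [, value]) => sum + value, 0);
  return total / entries.length;
}

// Placeholder dimension names; the real rubric's names are not shown here.
const example = { dimension1: 4, dimension2: 3, dimension3: 5, dimension4: 4 };
console.log(meanScore(example)); // 4
```

Validating before averaging means a malformed judge response (a missing dimension, or a score outside the range) fails loudly instead of silently skewing a skill's result.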