
AI Benchmark Design

/design-ai-benchmarking
Data & Study Design

What it does

Design and validity review for studies that benchmark AI systems against a human-expert panel. Covers arm definitions, multi-dimensional rubrics with calibration probes, reviewer panels, inter-rater agreement targets, and the choice between LLM-as-judge and human adjudication.

Highlights

  • AI-vs-expert evaluation design
  • Calibration probes & inter-rater targets
  • LLM-as-judge vs human adjudication
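
To make the idea of an inter-rater target concrete, here is a minimal sketch that computes Cohen's kappa for two reviewers and compares it against a threshold. This is an illustration only, not the skill's own implementation: the function name `cohens_kappa`, the `TARGET_KAPPA` value of 0.6, and the ratings are all made up for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical target; a real study would fix this in its protocol beforehand.
TARGET_KAPPA = 0.6

# Made-up ratings from two reviewers on eight items.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
meets_target = kappa >= TARGET_KAPPA
print(f"kappa={kappa:.2f}, meets target: {meets_target}")
```

Kappa corrects raw agreement for agreement expected by chance, which is why it is a common choice for two-rater panels. Panels with more than two reviewers or ordinal rubric scores would likely call for a different statistic.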

Install this skill

git clone https://github.com/aperivue/medsci-skills.git
mkdir -p ~/.claude/skills
cp -r medsci-skills/skills/design-ai-benchmarking ~/.claude/skills/
