AI Benchmark Design

/design-ai-benchmarking (NEW)

What it does

Design and validity review for studies benchmarking AI systems against a human-expert panel. Covers arm definitions, multi-dimensional rubrics with calibration probes, reviewer panels, inter-rater targets, and LLM-as-judge vs human adjudication.
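As an illustration of what an inter-rater target can look like in practice, here is a minimal sketch that computes Cohen's kappa between a human reviewer and an LLM judge. It is not taken from the skill itself: the function name `cohen_kappa`, the sample labels, and the 0.6 target are all assumptions made for this example, and the skill may use a different agreement statistic or threshold.

```python
from collections import Counter

# Hypothetical target; the skill's actual inter-rater threshold is not stated here.
TARGET_KAPPA = 0.6


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        # Kappa is undefined when both raters use one identical label throughout;
        # this sketch treats that case as perfect agreement.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up labels for eight items, for illustration only.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

kappa = cohen_kappa(human, judge)
meets_target = kappa >= TARGET_KAPPA
print(f"kappa={kappa:.2f}, meets target: {meets_target}")
```

With these made-up labels the raters agree on 6 of 8 items, chance agreement is 0.5, and kappa comes out at 0.5, which falls below the assumed target. For rubrics with ordinal scores or more than two raters, a weighted kappa or an intraclass correlation may be a better fit.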

Highlights

✓ AI-vs-expert evaluation design
✓ Calibration probes & inter-rater targets
✓ LLM-as-judge vs human adjudication

Install this skill

git clone https://github.com/aperivue/medsci-skills.git
cp -r medsci-skills/skills/design-ai-benchmarking ~/.claude/skills/


Related skills

Study Design (/design-study)

Identifies analysis unit, cohort logic, data leakage risks, and validation strategy.

Sample Size Calculator (/calc-sample-size)

Interactive sample size calculator with decision-tree-guided test selection. Covers 11 designs, including events-per-variable (EPV) calculations for Cox regression.

Data Cleaning (/clean-data)

Standardize, validate, and transform raw research datasets. Handles missing data, outlier detection, and variable recoding.

De-identification (/deidentify)

De-identify clinical research data before LLM-assisted analysis. Standalone Python CLI with 10 country locale packs. No LLM involved.