AI Benchmark Design
/design-ai-benchmarking (NEW)
What it does
Design and validity review for studies benchmarking AI systems against a human-expert panel. Covers arm definitions, multi-dimensional rubrics with calibration probes, reviewer panels, inter-rater targets, and LLM-as-judge vs human adjudication.
Highlights
- AI-vs-expert evaluation design
- Calibration probes & inter-rater targets
- LLM-as-judge vs human adjudication
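To make "inter-rater targets" concrete: agreement between two reviewers scoring the same items is commonly summarized with Cohen's kappa, which corrects raw agreement for chance. The sketch below is illustrative only and is not taken from the skill itself; the function name `cohen_kappa` and the example scores are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, computed from each rater's marginal label counts.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up rubric scores (1-3) from two reviewers on six responses.
reviewer_1 = [3, 2, 3, 1, 2, 3]
reviewer_2 = [3, 2, 2, 1, 2, 3]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

A study would typically fix a target value for such a statistic before scoring begins. For ordinal rubrics with more than two raters, a weighted kappa or an intraclass correlation may be the better fit.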
Install this skill
git clone https://github.com/aperivue/medsci-skills.git
cp -r medsci-skills/skills/design-ai-benchmarking ~/.claude/skills/
Related skills
- Study Design (/design-study): Identifies analysis unit, cohort logic, data leakage risks, and validation strategy.
- Sample Size Calculator (/calc-sample-size): Interactive sample size calculator with decision-tree guided test selection. Covers 11 designs, including Cox regression EPV.
- Data Cleaning (/clean-data): Standardize, validate, and transform raw research datasets. Handles missing data, outlier detection, and variable recoding.
- De-identification (/deidentify): De-identify clinical research data before LLM-assisted analysis. Standalone Python CLI with 10 country locale packs. No LLM involved.