LLM/MLLM Evaluation

/mllm-eval (new)
Model Engineering & Validation

What it does

A model-agnostic evaluation harness for an LLM or MLLM on a clinical task such as report generation, visual question answering, or clinical text extraction. It covers the adjudicated reference, clinical-efficacy metrics (RadGraph-F1 and CheXbert-F1), faithfulness, contamination, prompt sensitivity, and a reader study.

Highlights

  • Clinical-efficacy metrics beyond BLEU/ROUGE
  • Contamination + prompt-sensitivity checks
  • Reader study + MLLM reviewer probe
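
To make two of these highlights concrete, here is a minimal Python sketch, not the skill's actual implementation. It assumes finding labels have already been extracted from each generated and reference report by some labeler (CheXbert-F1 is computed over labels of this kind), and it treats prompt sensitivity as the spread of one metric across prompt variants. The names `micro_f1`, `prompt_sensitivity`, and all of the toy data are hypothetical.

```python
from statistics import mean, pstdev


def micro_f1(pred_labels, ref_labels):
    """Micro-averaged F1 over per-report sets of positive finding labels.

    Hypothetical helper: each element is the set of findings a labeler
    marked positive for one report.
    """
    tp = fp = fn = 0
    for pred, ref in zip(pred_labels, ref_labels):
        tp += len(pred & ref)   # findings present in both
        fp += len(pred - ref)   # findings only in the generated report
        fn += len(ref - pred)   # findings only in the reference
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def prompt_sensitivity(scores_by_prompt):
    """Summarize how much one metric moves across prompt variants."""
    values = list(scores_by_prompt.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "range": max(values) - min(values),
    }


# Toy data (invented for illustration): three reports, two prompt variants.
reference = [{"edema", "effusion"}, {"pneumonia"}, set()]
outputs = {
    "prompt_a": [{"edema", "effusion"}, {"pneumonia"}, set()],
    "prompt_b": [{"edema"}, {"pneumonia", "effusion"}, set()],
}

scores = {name: micro_f1(preds, reference) for name, preds in outputs.items()}
summary = prompt_sensitivity(scores)
print(scores)
print(summary)
```

A large range or standard deviation across prompt variants suggests the headline score depends on phrasing rather than on the model's clinical ability, which is why the spread is worth reporting alongside the metric itself.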

Install this skill

git clone https://github.com/aperivue/medsci-skills.git
mkdir -p ~/.claude/skills
cp -r medsci-skills/skills/mllm-eval ~/.claude/skills/
