LLM/MLLM Evaluation

/mllm-eval (NEW)

What it does

Model-agnostic evaluation harness for an LLM or MLLM on a clinical task (report generation, visual question answering, or clinical text extraction). It covers the adjudicated reference, clinical-efficacy metrics (RadGraph-F1 / CheXbert-F1), faithfulness, contamination, prompt sensitivity, and a reader study.

Highlights

✓ Clinical-efficacy metrics beyond BLEU/ROUGE
✓ Contamination + prompt-sensitivity checks
✓ Reader study + MLLM reviewer probe
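
As a rough illustration of what the contamination and prompt-sensitivity checks listed above might involve, here is a minimal Python sketch. It is not taken from the skill itself: the function names (ngrams, contamination_rate, prompt_sensitivity), the whitespace tokenization, and the choice of n-gram overlap as a contamination signal are all assumptions made for this example. The skill's actual method may differ.

```python
from statistics import mean, pstdev


def ngrams(text, n=8):
    """Return the set of lowercase whitespace-token n-grams in text (hypothetical helper)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, corpus_docs, n=8):
    """Fraction of test items sharing at least one n-gram with a reference corpus.

    Assumption: verbatim n-gram overlap is used as a crude contamination proxy.
    """
    if not test_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / len(test_items)


def prompt_sensitivity(scores_by_prompt):
    """Summarize how a metric varies across paraphrased prompts.

    scores_by_prompt maps a prompt label to the per-case scores obtained with it.
    Returns the mean, population standard deviation, and range of the per-prompt means.
    """
    means = [mean(scores) for scores in scores_by_prompt.values()]
    return {
        "mean": mean(means),
        "std": pstdev(means),
        "range": max(means) - min(means),
    }


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    rate = contamination_rate(
        ["no acute cardiopulmonary process", "small left pleural effusion"],
        ["findings show no acute cardiopulmonary process today"],
        n=3,
    )
    print(f"contamination rate: {rate:.2f}")
    print(prompt_sensitivity({"prompt_a": [0.5, 0.7], "prompt_b": [0.4, 0.4]}))
```

A large range across prompts would suggest that a reported score depends heavily on prompt wording, and a non-zero contamination rate would flag test items worth inspecting by hand before trusting the metric.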

Install this skill

git clone https://github.com/aperivue/medsci-skills.git
mkdir -p ~/.claude/skills
cp -r medsci-skills/skills/mllm-eval ~/.claude/skills/


Related skills

Architecture Zoo (/architecture-zoo)

"Which architecture for which research question" decision tool: maps task, modality, data scale, and class imbalance to a paper-grounded architecture shortlist — each with the source paper, when-to-use, medical-imaging use, reference implementation, and the matching scaffold template.

Model Scaffold (/model-scaffold)

Generate a reproducible, runnable PyTorch training repo for a medical-imaging task — segmentation, classification, detection, synthesis, or self-supervised pretraining — with a patient-level seed-locked split, train/evaluate scripts, and a Methods stub. Integrates MONAI / nnU-Net, never reimplements them.

Model Validation (/model-validation)

Design or audit the clinical-validation study for an engineer-built medical-imaging model: patient-level split disjointness, the data-leakage taxonomy, internal vs external validation, comparator design, and task-correct metric selection — with a deterministic split-leakage gate.

Model Card & Datasheet (/model-card)

Generate the documentation an engineer-built model must carry — a Model Card (Mitchell et al. 2019), a Datasheet for its dataset (Gebru et al. 2021), and a METRIC data-quality pass — filled only from user-supplied facts, then verify every required section is present with a completeness gate.