Three Demos, Three Study Types: MedSci Skills End-to-End Pipeline

6 min read
Tags: medsci-skills, demo, diagnostic-accuracy, meta-analysis, epidemiology, open-source, STARD, PRISMA, STROBE


Most AI writing tools can draft a paragraph. Few can run the statistics correctly, generate figures at journal resolution, audit reporting compliance, and build a slide deck — all from the same dataset in a single session.

We built three end-to-end demos using only public data and MedSci Skills. Each demo covers a different study type, uses different statistical methods, and produces a different set of outputs. The goal: prove that 20 skills working together can handle the full research pipeline, not just the easy parts.


Demo 1: Diagnostic Accuracy — Wisconsin Breast Cancer

Input: One line of Python.

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()  # 569 samples, 30 features

What the pipeline produced:

The analyze-stats skill generated a Table 1 with automatic normality testing (Kolmogorov-Smirnov for n >= 50) and appropriate test selection — t-test for normal distributions, Mann-Whitney U otherwise. No manual statistical decisions required.
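That decision rule is simple enough to sketch. A minimal version in Python, assuming NumPy-array inputs and a 0.05 normality threshold (the skill's actual helper names and thresholds are not published here, so treat them as assumptions):

```python
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """Pick a two-sample test based on normality (sketch).
    a, b: NumPy arrays of the two groups' values."""
    def is_normal(x):
        # Kolmogorov-Smirnov against a fitted normal for n >= 50,
        # Shapiro-Wilk for smaller samples
        if len(x) >= 50:
            _, p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
        else:
            _, p = stats.shapiro(x)
        return p > alpha

    if is_normal(a) and is_normal(b):
        _, p = stats.ttest_ind(a, b)
        return "t-test", p
    _, p = stats.mannwhitneyu(a, b)
    return "Mann-Whitney U", p
```

The same pattern extends to each row of a Table 1: test normality once per variable, then report mean (SD) with a t-test or median (IQR) with Mann-Whitney U accordingly.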

Three classifiers were compared: Logistic Regression (AUC 0.995), SVM (AUC 0.994), and Random Forest (AUC 0.987). All confidence intervals use the DeLong method, not bootstrap. The DeLong test caught a significant difference between SVM and Random Forest (p = 0.043) that point estimates alone would have missed.
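DeLong's method gets the CI analytically from placement values instead of resampling. A sketch of the single-AUC version (the paired SVM-vs-RF comparison adds a covariance term between the two models' placement values, not shown here):

```python
import numpy as np
from scipy import stats

def delong_auc_ci(y_true, scores, level=0.95):
    """AUC with a DeLong-style CI (sketch implementation).
    y_true: 0/1 labels; scores: predicted scores. Ties get 0.5 credit."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    m, n = len(pos), len(neg)
    # psi(X_i, Y_j): 1 if positive outranks negative, 0.5 on ties
    psi = (pos[:, None] > neg[None, :]).astype(float)
    psi += 0.5 * (pos[:, None] == neg[None, :])
    auc = psi.mean()
    v10 = psi.mean(axis=1)   # placement values over positives
    v01 = psi.mean(axis=0)   # placement values over negatives
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    half = stats.norm.ppf(0.5 + level / 2) * np.sqrt(var)
    return auc, max(0.0, auc - half), min(1.0, auc + half)
```

Unlike bootstrap CIs, this is deterministic and cheap: one pass over the positive-negative pairs.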

| Metric | Value |
| --- | --- |
| Best AUC | 0.995 (95% CI: 0.990-1.000) |
| Figures | 4 at 300 dpi (ROC, confusion matrix, calibration, threshold) |
| Manuscript | ~1,600 words, IMRAD structure |
| STARD audit | 19/30 PRESENT, 5 PARTIAL, 6 MISSING, with fix recommendations |
| Slides | 12 with speaker notes |

The STARD compliance audit is worth highlighting. The check-reporting skill audited all 30 STARD 2015 items and supplied specific fix text for each missing one. For example:

Item 7 (Sampling): Add: "The dataset comprised a convenience series of FNA specimens collected at a single academic center."

This is what typically takes a reviewer 30+ minutes — done in seconds, with actionable fixes.


Demo 2: Meta-Analysis — BCG Vaccine Efficacy

Input: One R dataset.

library(metafor)
data(dat.bcg)  # 13 RCTs, 357,347 participants

What the pipeline produced:

The classic Colditz et al. (1994) BCG vaccine dataset. One R script handled: random-effects modeling (REML), forest plot, funnel plot, meta-regression, and a three-test publication bias battery.

Pooled result: RR = 0.49 (95% CI: 0.34-0.70) — BCG reduced TB risk by 51%.

But heterogeneity was massive: I-squared = 92.2%. The meta-regression identified absolute latitude as the key moderator, explaining 75.6% of between-study variance (p < 0.001). BCG works better at higher latitudes. This is the textbook finding — reproduced automatically with the correct bubble plot.
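metafor's rma() fits this model with REML; the pooling arithmetic can be sketched in Python with the closely related DerSimonian-Laird estimator (the effect sizes below are made up for illustration, not the BCG values):

```python
import numpy as np

def dersimonian_laird(yi, vi):
    """Random-effects pooling of log-scale effects yi with variances vi.
    DerSimonian-Laird tau^2; metafor's default REML estimate will differ slightly."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    w = 1 / vi                                   # fixed-effect weights
    ybar = (w * yi).sum() / w.sum()
    q = (w * (yi - ybar) ** 2).sum()             # Cochran's Q
    df = len(yi) - 1
    c = w.sum() - (w ** 2).sum() / w.sum()
    tau2 = max(0.0, (q - df) / c)                # between-study variance
    wr = 1 / (vi + tau2)                         # random-effects weights
    mu = (wr * yi).sum() / wr.sum()
    se = np.sqrt(1 / wr.sum())
    i2 = (max(0.0, (q - df) / q) * 100) if q > 0 else 0.0
    return mu, se, tau2, i2

# Hypothetical log risk ratios and variances, for illustration only
mu, se, tau2, i2 = dersimonian_laird([-0.9, -1.6, -0.4, -0.2],
                                     [0.04, 0.09, 0.03, 0.05])
rr = np.exp(mu)   # pooled RR back on the ratio scale
```

Meta-regression on latitude is the same model with a moderator column added to the weighted fit.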

Publication bias assessment:

| Test | Result |
| --- | --- |
| Egger's regression | p = 0.189 (no asymmetry) |
| Begg's rank correlation | p = 0.952 |
| Trim-and-fill | 1 study imputed, adjusted RR = 0.52 (still significant) |
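Egger's test is just a weighted regression under the hood: regress the standardized effect on precision and test whether the intercept differs from zero. A sketch, assuming SciPy 1.7+ (for `intercept_stderr`); the inputs are whatever log-scale effects and standard errors your meta-analysis produced:

```python
import numpy as np
from scipy import stats

def egger_test(yi, sei):
    """Egger's regression test for funnel-plot asymmetry (sketch).
    A non-zero intercept suggests small-study effects."""
    yi, sei = np.asarray(yi, float), np.asarray(sei, float)
    res = stats.linregress(1 / sei, yi / sei)
    t = res.intercept / res.intercept_stderr
    p = 2 * stats.t.sf(abs(t), df=len(yi) - 2)
    return res.intercept, p
```

Begg's test swaps the regression for a rank correlation between effects and variances; trim-and-fill iteratively imputes the "missing" mirror-image studies.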

Leave-one-out sensitivity analysis confirmed no single study drove the overall result.

| Metric | Value |
| --- | --- |
| Studies | 13 RCTs |
| Participants | 357,347 |
| Figures | 4 at 300 dpi (forest, funnel, trim-and-fill, bubble) |
| Manuscript | ~1,800 words with PRISMA compliance |
| PRISMA audit | Full 27-item checklist |
| Slides | 12 with speaker notes |

Demo 3: Epidemiology — NHANES Obesity and Diabetes

Input: Real CDC data.

# Download 3 XPT files from CDC (free, no registration)
# DEMO_J.XPT (demographics), BMX_J.XPT (body measures), GHB_J.XPT (glycohemoglobin)

What the pipeline produced:

NHANES 2017-2018 data — 4,866 US adults after exclusions. Two Python scripts handled: data merging, BMI recoding (WHO categories), diabetes classification (ADA HbA1c >= 6.5%), survey weight application, and adjusted logistic regression.
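Those recoding steps can be sketched with pandas. SEQN, BMXBMI, and LBXGH are the standard NHANES column names, but verify them against the current codebooks; the read/merge lines are left as comments since they need the downloaded files:

```python
import pandas as pd

# pd.read_sas reads the CDC .XPT transport files directly, e.g.:
# demo = pd.read_sas("DEMO_J.XPT"); bmx = pd.read_sas("BMX_J.XPT"); ghb = pd.read_sas("GHB_J.XPT")

def recode_bmi(bmi):
    """WHO BMI categories."""
    if pd.isna(bmi):
        return None
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def has_diabetes(hba1c):
    """ADA definition: HbA1c >= 6.5%."""
    return None if pd.isna(hba1c) else bool(hba1c >= 6.5)

# Merge on the NHANES participant ID, then recode:
# df = demo.merge(bmx, on="SEQN").merge(ghb, on="SEQN")
# df["bmi_cat"] = df["BMXBMI"].map(recode_bmi)
# df["diabetes"] = df["LBXGH"].map(has_diabetes)
```

The logistic regression then runs on the recoded columns, with the survey weights discussed below.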

Key finding: Obesity was associated with 4.5 times the odds of diabetes (adjusted OR 4.50, 95% CI: 4.49-4.51), controlling for age, sex, race/ethnicity, and education.

The critical insight most tools miss: survey weights. NHANES uses a complex survey design. Without weights, diabetes prevalence was 14.9%. With proper survey weights, it dropped to 10.2%. If you skip weights, your estimates are biased. MedSci Skills computes both and shows why this matters.
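The weighted estimator itself is one line; the toy numbers below show how the crude and weighted figures diverge when cases carry smaller weights (real NHANES standard errors also need the strata and PSU design variables, which this sketch omits):

```python
import numpy as np

def prevalence(y, weights=None):
    """Prevalence of a 0/1 outcome, optionally survey-weighted (sketch)."""
    y = np.asarray(y, float)
    if weights is None:
        return y.mean()
    w = np.asarray(weights, float)
    return (w * y).sum() / w.sum()

# Toy illustration: cases (y=1) under-weighted relative to non-cases
y = np.array([1, 1, 0, 0, 0])
w = np.array([0.5, 0.5, 2.0, 2.0, 2.0])
crude = prevalence(y)          # 0.4
weighted = prevalence(y, w)    # 1/7 ~= 0.143
```

Same data, different answer: exactly the 14.9% vs 10.2% gap described above, in miniature.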

| Metric | Value |
| --- | --- |
| Participants | 4,866 US adults |
| Data source | CDC (free, no registration) |
| Figures | 4 at 300 dpi (prevalence bar, OR forest, HbA1c density, subgroup) |
| Manuscript | ~1,700 words with STROBE compliance |
| STROBE audit | Full 22-item checklist |
| Slides | 12 with speaker notes |

Side-by-Side Comparison

| | Demo 1: WBC | Demo 2: BCG | Demo 3: NHANES |
| --- | --- | --- | --- |
| Study type | Diagnostic accuracy | Meta-analysis | Cross-sectional |
| Language | Python | R | Python |
| Key statistic | AUC 0.995 | RR 0.49 | OR 4.50 |
| CI method | DeLong | Wald (log-scale) | Survey-weighted |
| Figures | 4 | 4 | 4 |
| Reporting guideline | STARD 2015 | PRISMA 2020 | STROBE |
| Manuscript | ~1,600 words | ~1,800 words | ~1,700 words |
| Slides | 12 | 12 | 12 |
| Adversarial review | PASS | PASS | PASS |

Each demo used 5-6 of the 20 available skills. The pipeline chain: clean-data → analyze-stats → make-figures → write-paper → check-reporting → present-paper.


What Makes This Different

Statistical rigor. DeLong CIs for AUC, not bootstrap. Wilson score intervals for proportions. Survey weights for NHANES. Prediction intervals for meta-analysis. These are details that generic AI tools consistently get wrong.
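As one concrete example, the Wilson score interval mentioned above is short enough to show in full (a generic sketch, not the skill's own code):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion; better behaved than the
    naive Wald interval near 0, near 1, and at small n."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Note what happens at 0/10 successes: the Wald interval collapses to (0, 0), while the Wilson interval correctly stays open above zero.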

Anti-hallucination. Every citation in every manuscript is tagged [UNVERIFIED] unless verified against PubMed or CrossRef APIs. The system forces manual checking rather than generating plausible-looking fake DOIs.

Reporting compliance built in. STARD, PRISMA, and STROBE audits are not afterthoughts — they are part of the pipeline. Each audit returns item-by-item assessment with specific fix recommendations.

Reproducibility. Fixed random seeds, version headers, full parameter logging. Every output can be regenerated from the same input.


The Numbers

| Metric | Total across 3 demos |
| --- | --- |
| Skills used | 6 of 20 |
| Scripts | 7 (6 Python, 1 R) |
| Figures | 12 (all 300 dpi) |
| Manuscript words | ~5,100 |
| Reporting items checked | 79 (30 STARD + 27 PRISMA + 22 STROBE) |
| Presentation slides | 36 (with speaker notes) |
| Hallucinated citations | 0 |
| Cost | $0 (open source, MIT license) |

Try It Yourself

git clone https://github.com/Aperivue/medsci-skills.git
cp -r medsci-skills/skills/* ~/.claude/skills/

Each demo is self-contained in the demo/ directory:

  • demo/01_wisconsin_bc/ — Diagnostic accuracy
  • demo/02_metafor_bcg/ — Meta-analysis
  • demo/03_nhanes_obesity/ — Epidemiology

Run the Python/R scripts, then use the Claude Code skills to generate the manuscript, figures, compliance audit, and slides.


MedSci Skills is open source, MIT licensed, and free forever. Built by a radiologist who actually writes papers.

View on GitHub | All 20 Skills | How I Built This