Beyond Fluency: A Clinical Benchmark and Anomaly-Enhanced Baseline for Spine MRI Report Generation
Jun 2026
Palau B., Vogt F., Laslo D., Li H., Konukoglu E., Maria Monzon (shared last authorship), Jutzeler C.R. (shared last authorship)
Abstract
Radiology reporting is time-consuming and subject to inter-rater variability, making automated report generation an attractive clinical application for Vision-Language Models (VLMs). We benchmark state-of-the-art VLMs on lumbar spine MRI with a focus on diagnostic accuracy and demonstrate that standard lexical and semantic metrics poorly reflect clinical correctness: fluent, well-structured reports can score highly while containing clinically meaningful diagnostic errors. To address this failure mode, we propose an architecture-agnostic framework that augments VLM inputs with spatially localized, disc-level anomaly heatmaps generated by a semi-supervised U-Net++ model. These heatmaps both improve anatomical sensitivity through explicit visual grounding and provide an independent interpretability output for clinical oversight, moving us closer to diagnostically reliable, visually grounded VLMs for lumbar spine MRI interpretation.
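The gap between lexical overlap and clinical correctness described in the abstract can be illustrated with a toy example. The sketch below is not the paper's evaluation pipeline: `unigram_f1` is a hypothetical stand-in for lexical metrics, and the two report sentences are invented for illustration. It shows how dropping a single negation flips the diagnosis while the overlap score stays high.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between two texts (a simplified lexical metric)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumption: not taken from any dataset).
reference = "no disc herniation at L4-L5 and mild canal stenosis at L5-S1"
candidate = "disc herniation at L4-L5 and mild canal stenosis at L5-S1"

score = unigram_f1(reference, candidate)
print(f"{score:.3f}")  # 0.952, although the herniation finding is reversed
```

A clinically oriented metric would instead compare the extracted findings (here, herniation present versus absent at L4-L5), which is the kind of correctness that overlap-based scores can miss.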
Type
Publication
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops — CV4Clinic

Authors
Maria Monzon
(she/her)
Computer Vision & Medical AI Researcher
PhD candidate at ETH Zurich developing robust and trustworthy deep learning for medical image analysis — spine and cardiac MRI, multimodal biomedical data, and uncertainty quantification. Previously a computer-vision researcher at BASF, where I deployed models to production in regulated, GLP-certified environments. I care about efficient code and reproducible research.