Surgeons do not simply observe surgical scenes; they interpret them. Experts understand not only which instrument is used, but why it is chosen, what risks it introduces, and what actions are likely to follow. Current surgical AI systems struggle with such reasoning because scalable training data capturing expert surgical intent is lacking.
We introduce SUREON, a large-scale video question-answering dataset designed to capture surgical reasoning from expert lecture videos. SUREON defines 12 reasoning-focused question categories spanning safety assessment, decision rationale, and procedural forecasting. Using a multi-agent extraction pipeline, we curate supervision from 134.7K clips covering 170 procedure types, producing 206.8K QA pairs and an expert-validated benchmark of 354 examples.
To study reasoning capabilities, we present SureonVLM, a vision-language model trained with supervised fine-tuning, and SureonVLM-R1, a reasoning model trained using Group Relative Policy Optimization. Both models demonstrate strong surgical reasoning ability, achieving over 84% accuracy on the SUREON benchmark and outperforming larger general-domain models on surgical perception tasks.
@article{sureon2026,
  title   = {{SUREON}: A Benchmark and Vision-Language Model for Surgical Reasoning},
  author  = {Perez, Alejandra and Rau, Anita and White, Lee and Mlambo, Busisiwe
             and Nwoye, Chinedu and Jamal, Muhammad Abdullah and Mohareri, Omid},
  journal = {arXiv preprint arXiv:2603.0657},
  year    = {2026}
}