LungCURE — Benchmarking Multimodal Real-World Clinical Reasoning

Overview

Abstract

Lung cancer clinical decision support demands precise reasoning across complex, multi-stage oncological workflows. Existing multimodal large language models (MLLMs) fail to handle guideline-constrained staging and treatment reasoning. We formalize three oncological precision treatment (OPT) tasks for lung cancer: TNM staging, treatment recommendation, and end-to-end clinical decision support. We introduce LungCURE, the first standardized multimodal benchmark built from 1,000 real-world, clinician-labeled cases collected from more than 10 hospitals. We further propose LCAgent, a multi-agent framework that enforces guideline-compliant lung cancer clinical decision-making by suppressing cascading reasoning errors across the clinical pathway. Experiments reveal large differences among large language models (LLMs) in complex medical reasoning under precise treatment requirements. We further verify that LCAgent, a simple yet effective plugin, improves the reasoning performance of LLMs in real-world medical scenarios.

Benchmark Design

Tasks

LungCURE formalizes three oncological precision treatment (OPT) tasks that span the full clinical workflow for lung cancer diagnosis and treatment. All tasks share the same multimodal patient input.

Task 1

TNM Staging

Given multimodal clinical data (imaging reports, pathology reports, clinical records, supplementary materials), predict the complete TNM pathological stage. A prediction is correct only if the T, N, and M stages are all correct.

Metrics: Staging Accuracy, Reasoning Quality (RQ)
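The all-or-nothing scoring rule for Task 1 can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: `TNM` and `staging_accuracy` are hypothetical names, and predictions are assumed to be already parsed into separate T, N, and M strings.

```python
from typing import NamedTuple, Sequence


class TNM(NamedTuple):
    """One staging label; the string format (e.g. "T2a") is an assumption."""
    t: str
    n: str
    m: str


def staging_accuracy(preds: Sequence[TNM], golds: Sequence[TNM]) -> float:
    """Fraction of cases whose T, N, and M components all match the gold label."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    # Tuple equality compares all three components at once, so a single
    # wrong component makes the whole prediction count as incorrect.
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)


golds = [TNM("T2a", "N1", "M0"), TNM("T1b", "N0", "M0")]
preds = [TNM("T2a", "N1", "M0"), TNM("T1b", "N1", "M0")]  # second case: N is wrong
print(staging_accuracy(preds, golds))  # 0.5
```

Under this rule, a prediction that gets T and M right but N wrong earns no partial credit.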

Task 2

Treatment Recommendation

Given multimodal inputs and the ground-truth TNM stage as a conditioning signal, generate a guideline-compliant treatment plan. This task tests the model's treatment-planning capability independently of its staging accuracy.

Metrics: Precision (%), BERT-F1

Task 3

End-to-End Decision Support

Given only the multimodal clinical materials (no staging input), generate a complete clinical decision. This mirrors real-world deployment, in which a patient uploads records and receives full decision support without manual staging.

Metrics: Precision (%), BERT-F1
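The three tasks differ only in what accompanies the shared patient input, which the sketch below tries to make concrete. It is an illustration under assumptions: `PatientCase`, its field names, and `build_task_input` are hypothetical and do not describe the benchmark's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatientCase:
    """Shared multimodal input for all tasks; field names are hypothetical."""
    imaging_reports: List[str] = field(default_factory=list)
    pathology_reports: List[str] = field(default_factory=list)
    clinical_records: List[str] = field(default_factory=list)
    supplementary: List[str] = field(default_factory=list)


def build_task_input(case: PatientCase, task: int,
                     gold_stage: Optional[str] = None) -> dict:
    """Assemble the model input for Task 1, 2, or 3."""
    if task not in (1, 2, 3):
        raise ValueError("task must be 1, 2, or 3")
    payload = {"task": task, "case": case}
    if task == 2:
        # Task 2 is conditioned on the ground-truth TNM stage.
        if gold_stage is None:
            raise ValueError("Task 2 requires the ground-truth TNM stage")
        payload["tnm_stage"] = gold_stage
    elif gold_stage is not None:
        # Tasks 1 and 3 must not see the gold stage.
        raise ValueError("Tasks 1 and 3 take no staging input")
    return payload


case = PatientCase(imaging_reports=["chest CT report text"])
print(sorted(build_task_input(case, 2, "T2aN1M0").keys()))
print(sorted(build_task_input(case, 3).keys()))
```

Rejecting a gold stage for Tasks 1 and 3 guards against label leakage when the same harness runs all three tasks.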

Dataset

The LungCURE Benchmark

1,000 real-world multimodal clinical cases collected from 10+ hospitals in China (2019–2025). All cases are fully de-identified, and the study was approved by the Ethics Committee of Peking Union Medical College Hospital.

🏥

Multi-Center

Sourced from 10+ hospitals across China for geographic and clinical diversity

📁

Multimodal

Imaging reports, pathology reports, clinical records, and genomic testing results

👨‍⚕️

Expert Annotated

Two-stage annotation protocol by senior clinicians with evidence-based TNM staging

🌐

Bilingual

Supports both Chinese (ZH) and English (EN) evaluation for cross-lingual assessment

🔒

Privacy-Safe

Fully de-identified; all patients provided informed consent before enrollment

⚖️

Guideline-Grounded

Gold standards based on AJCC, NCCN, and CSCO clinical oncology guidelines

Data Collection Pipeline

📂

Step 1: Data Collection

Imaging, pathology, clinical records, and gene testing data from hospitals

🛡️

Step 2: Anonymization

Privacy de-sensitization and unified management of case files

✅

Step 3: Annotation & QC

Interdisciplinary team annotation with expert TNM staging and quality control

Evaluation Metrics

TNM Staging Accuracy · Reasoning Quality (RQ) · Precision (%) · BERT-F1

Method

LCAgent Framework

A simple yet effective multi-agent plugin that boosts MLLM performance on LungCURE by decomposing the clinical pathway into structured, guideline-grounded stages.

🎯

Anatomical Dimension Isolation

Three concurrently executed agents independently extract T, N, and M evidence; a deterministic rule-based node then aggregates the final stage, removing stochastic errors from the aggregation step.
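A minimal sketch of this pattern is shown below, under stated assumptions. The three agent functions are stand-ins that read pre-filled fields (in practice each would be a model call), and `STAGE_GROUPS` is a tiny illustrative excerpt, not a complete or authoritative stage-grouping table; a real rule node would encode the full table of the guideline edition in use. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative excerpt only (assumption): verify every entry against the
# guideline edition in use before relying on it.
STAGE_GROUPS = {
    ("T1a", "N0"): "IA1",
    ("T2a", "N0"): "IB",
    ("T3", "N0"): "IIB",
}


def t_agent(case: dict) -> str:
    """Stand-in for an agent that extracts the T category."""
    return case["t"]


def n_agent(case: dict) -> str:
    """Stand-in for an agent that extracts the N category."""
    return case["n"]


def m_agent(case: dict) -> str:
    """Stand-in for an agent that extracts the M category."""
    return case["m"]


def aggregate(t: str, n: str, m: str) -> str:
    """Deterministic rule node: identical inputs always yield the same stage."""
    if m.startswith("M1"):
        return "IV"  # distant metastasis; sub-grouping omitted in this sketch
    return STAGE_GROUPS.get((t, n), "unknown")


def stage_case(case: dict) -> dict:
    """Run the three extraction agents concurrently, then aggregate by rule."""
    agents = {"t": t_agent, "n": n_agent, "m": m_agent}
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fn, case) for name, fn in agents.items()}
        t, n, m = (futures[name].result() for name in ("t", "n", "m"))
    return {"tnm": f"{t}{n}{m}", "stage_group": aggregate(t, n, m)}


print(stage_case({"t": "T2a", "n": "N0", "m": "M0"}))
```

Because no agent sees another agent's output, an error in one dimension cannot bias the other two, and the final mapping involves no sampling.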

🗺️

Feature Routing for Guideline-Grounded Treatment

Decision variables map to a structured feature vector that dynamically activates a scenario-specific expert agent under locally injected guideline subsets as hard constraints.
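The routing idea can be sketched as a lookup from a feature vector to a scenario and its guideline subset. Everything here is an assumption made for illustration: the `Features` fields, the scenario names, the routing rules, and the placeholder guideline text are hypothetical and are not taken from LCAgent or from any clinical guideline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    """Hypothetical decision variables; the real feature set is not specified here."""
    stage_group: str       # e.g. "IB", "IIIA", "IVA"
    driver_mutation: bool  # actionable driver alteration found
    resectable: bool       # judged surgically resectable


# Placeholder guideline subsets keyed by scenario (not real guideline text).
GUIDELINE_SUBSETS = {
    "resectable": "<guideline excerpt for resectable disease>",
    "unresectable_locally_advanced": "<guideline excerpt for unresectable disease>",
    "advanced_driver_positive": "<guideline excerpt for driver-positive stage IV>",
    "advanced_driver_negative": "<guideline excerpt for driver-negative stage IV>",
}


def route(f: Features) -> str:
    """Map the feature vector to exactly one scenario (illustrative rules)."""
    if f.stage_group.startswith("IV"):
        if f.driver_mutation:
            return "advanced_driver_positive"
        return "advanced_driver_negative"
    if f.resectable:
        return "resectable"
    return "unresectable_locally_advanced"


def build_expert_prompt(f: Features) -> str:
    """Inject only the routed guideline subset as hard constraints."""
    scenario = route(f)
    return (
        f"Scenario: {scenario}\n"
        "Follow ONLY the rules below when recommending treatment:\n"
        f"{GUIDELINE_SUBSETS[scenario]}"
    )


print(route(Features("IVA", driver_mutation=True, resectable=False)))
```

Injecting only the matching subset keeps the expert agent's context small and prevents it from drawing on rules that do not apply to the scenario.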

🔗

Cascading Error Suppression

Strict decision boundaries between stages prevent reasoning errors from propagating across the clinical pathway — a core failure mode of direct MLLM prompting.
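One way to picture such a boundary is a validation gate between staging and treatment planning, sketched below. This is an assumption about how a boundary could be enforced, not a description of LCAgent's implementation: the pattern covers only a simplified label format such as "T2aN1M0", and all names are hypothetical.

```python
import re
from dataclasses import dataclass

# Simplified label grammar (assumption); not a full guideline grammar.
_TNM_PATTERN = re.compile(
    r"^T(?:x|is|[0-4][a-c]?)N(?:x|[0-3])M(?:0|1[a-c]?)$", re.IGNORECASE
)


@dataclass(frozen=True)
class ValidatedStage:
    """Immutable hand-off object; downstream stages cannot alter it."""
    tnm: str


def staging_boundary(raw_output: str) -> ValidatedStage:
    """Reject malformed staging output instead of passing it downstream."""
    label = raw_output.strip()
    if not _TNM_PATTERN.match(label):
        raise ValueError(f"staging output rejected at boundary: {raw_output!r}")
    return ValidatedStage(label)


def treatment_stage(stage: ValidatedStage) -> str:
    """Downstream stage accepts only a validated hand-off object."""
    if not isinstance(stage, ValidatedStage):
        raise TypeError("treatment stage requires a ValidatedStage")
    return f"plan treatment for {stage.tnm}"


print(treatment_stage(staging_boundary("T2aN1M0")))

try:
    staging_boundary("probably T2 or T3, nodes unclear")
except ValueError as err:
    print(err)
```

A free-text or hedged staging answer fails at the gate, so the treatment stage never reasons from an ambiguous stage.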

🔌

Plug-and-Play Compatibility

Model-agnostic: as a plug-in, it consistently improves Qwen3.5-397B, Kimi-K2.5, GPT-5.2, and other models, with gains across all three tasks.

Key Performance Gains with LCAgent · Qwen3.5-397B, MLLM Image Input, ZH

+4.84%

TNM Staging Accuracy

+24.07%

Treatment Recommendation Precision

+30.38%

End-to-End Decision Precision

+42.91%

E2E Precision (OCR+LLM)

Reference

Citation

@misc{hao2026lungcurebenchmarkingmultimodalrealworld,
      title={LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment}, 
      author={Fangyu Hao and Jiayu Yang and Yifan Zhu and Zijun Yu and Qicen Wu and Wang Yunlong and Jiawei Li and Yulin Liu and Xu Zeng and Guanting Chen and Shihao Li and Zhonghong Ou and Meina Song and Mengyang Sun and Haoran Luo and Yu Shi and Yingyi Wang},
      year={2026},
      eprint={2604.06925},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2604.06925}, 
}