Think Twice to See More: Iterative Visual Reasoning in Medical VLMs

¹Fudan University  ²Shanghai AI Laboratory  ³Rutgers University

ViTAR's framework of visual thinking and action-centric reasoning. In Stage I, ViTAR is trained with supervised fine-tuning on structured instructions to mimic expert-like reasoning patterns and region-marking behaviors. In Stage II, ViTAR is further optimized with reinforcement learning rewards, shifting from imitation to autonomous decision refinement.

Abstract

Medical vision-language models (VLMs) excel at image-text understanding but typically rely on single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive cognitive objects, enabling models to engage in multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors, along with a 16K visual question answering training set geared toward fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the "think" to the "rethink" round, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both the performance and trustworthiness of medical AI systems.
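To make the cognitive chain concrete, the following minimal sketch shows how a "think-act-rethink-answer" loop could be orchestrated at inference time. It is illustrative only: the vlm.generate / vlm.mark_region interface, the bounding-box format, and the prompts are assumptions, not the released ViTAR API.

import re

def parse_region(text):
    # Extract a bounding box like [x0, y0, x1, y1] from the first-round output.
    # This tag format is a hypothetical convention used only for this sketch.
    m = re.search(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", text)
    return tuple(int(v) for v in m.groups()) if m else None

def iterative_diagnosis(vlm, image, question):
    # Round 1 ("think"): initial reasoning over the full image, ending with an
    # "act" step in which the model proposes a region of interest.
    round1 = vlm.generate(image=image, prompt=question)

    # "Act": mark the proposed region on the image (hypothetical helper).
    bbox = parse_region(round1)
    marked = vlm.mark_region(image, bbox) if bbox else image

    # Round 2 ("rethink"): reason again, conditioned on the marked image and the
    # first-round reasoning, then emit the final "answer".
    prompt2 = f"{question}\nPrevious reasoning: {round1}\nRe-examine the marked region and answer."
    return vlm.generate(image=marked, prompt=prompt2)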

Main Results


ViTAR demonstrates robust generalization in basic visual perception and vision-language reasoning across seven medical benchmarks. Against non-reasoning medical VLMs, ViTAR leads in performance on most benchmarks. On VQA-RAD, ViTAR surpasses all of these models, exceeding the Lingshu baseline (78.9%) by nearly 8 points. On the two reasoning-intensive benchmarks, MMMU-Med and MedXpertQA, ViTAR achieves 72.0% on MMMU-Med, surpassing both medical-specific non-reasoning and reasoning VLMs and closely matching certain proprietary models, reflecting its enhanced ability in multi-turn vision-language reasoning. On the most challenging benchmark, MedXpertQA, ViTAR achieves a score of 26.9%, consistently outperforming all other open-source models of comparable parameter scale.

Why "think–act–rethink–answer" cognitive processes enhance reasoning?


Compared with the first "think" stage (round 1), the second "rethink" stage (round 2) achieves more precise alignment with annotated lesion regions and allocates a higher proportion of attention to visual tokens. ViTAR also outperforms Lingshu by sustaining focused attention on verifiable regions and allocating more attention to visual tokens.
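As a rough illustration of how such alignment can be quantified, the sketch below measures the share of an attention map that falls inside an expert-annotated lesion mask; the array shapes, the upsampling to image resolution, and the head averaging are assumptions rather than the paper's exact protocol.

import numpy as np

def lesion_attention_share(attn_map, lesion_mask):
    # attn_map:    (H, W) attention over the image, averaged across heads and
    #              upsampled to image resolution (assumed preprocessing).
    # lesion_mask: (H, W) binary mask of the annotated lesion region.
    attn = attn_map / (attn_map.sum() + 1e-8)   # normalize to a distribution
    return float((attn * lesion_mask).sum())    # attention mass inside the lesion

Comparing this score between the "think" and "rethink" rounds gives one way to express the tighter alignment described above.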


As shown in the figures, the average visual-attention ratio (over layers 0-15) increases from 5.84% in the first "think" round to 26.12% in the second "rethink" round. ViTAR allocates 2.82× more attention to visual tokens than Lingshu (9.26%), reflecting enhanced visual attention.
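A minimal sketch of how such a visual-token attention ratio could be computed from decoder attention weights is given below; the tensor layout (per-layer tensors of shape heads × query × key, as returned with output_attentions=True in common transformer APIs) and the set of visual-token indices are assumptions.

import torch

def visual_attention_ratio(attentions, visual_token_ids, layers=range(0, 16)):
    # attentions:       list of per-layer tensors, each of shape (heads, Q, K).
    # visual_token_ids: indices of the image tokens along the key dimension.
    ratios = []
    for layer in layers:
        attn = attentions[layer]
        visual_mass = attn[..., visual_token_ids].sum()   # mass on image tokens
        ratios.append((visual_mass / attn.sum()).item())  # fraction of total mass
    return sum(ratios) / len(ratios)                      # average over layers 0-15

Evaluating this ratio separately on the "think" and "rethink" rounds yields the kind of per-round comparison (e.g., 5.84% vs. 26.12%) reported above.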

Reasoning and Execution Efficiency


ViTAR substantially reduces reasoning length and reasoning time while achieving the highest performance. This shorter reasoning length may underpin its high visual attention allocation. Despite adopting a two-round reasoning paradigm, ViTAR reaches a speedup of 2.46× over Chiron-o1 and 1.56× over Lingshu.

Introducing RL significantly improves the execution success rate from 85.7% to 99.9%, indicating substantial gains in execution stability and accuracy.


We observe evolving ability improvements in the training dynamics of RL. Three phases can be identified in the changing dynamics of response length, training loss, and format reward.
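To give a concrete sense of what a format reward can look like, the sketch below scores a response by whether the four stages appear exactly once and in order; the tag names and the binary 0/1 reward are assumptions, not the exact reward used in training.

# Hypothetical stage tags; the actual reward design may differ.
STAGE_TAGS = ["<think>", "<act>", "<rethink>", "<answer>"]

def format_reward(response: str) -> float:
    # Reward 1.0 only if every stage tag occurs exactly once and in the expected order.
    positions = []
    for tag in STAGE_TAGS:
        if response.count(tag) != 1:
            return 0.0
        positions.append(response.index(tag))
    return 1.0 if positions == sorted(positions) else 0.0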

Case Study


In the "act" stage, the model visually marks a region of interest in the left lower lung field. The "rethink" stage is guided by this explicit visual cue: ViTAR refines its reasoning and identifies the localized opacity as suggestive of pleural effusion. In the "answer" stage, the model returns a positive diagnosis, indicating that an abnormality is present.


ViTAR marks a region of interest in the lower left portion of the image. Based on this highlighted area, ViTAR identifies imaging features consistent with a mass, which constitutes an abnormal finding, and concludes with the answer.

BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}