Abstract
Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.
Results
Takeaway 1: MentisOculi is far from saturated
MLLMs and UMMs display similar failure patterns. Performance degrades consistently with difficulty and falls below chance at Level 5. This highlights the fundamental limitations of current state-of-the-art models in solving multi-step visual reasoning tasks.
Takeaway 2: Explicit visual thought is currently ineffective
We find no evidence that self-generated imagery improves on text-only reasoning. Latent visual reasoning (Mirage) offers only brittle gains, while UMMs often underperform their text-only counterparts. Video models (Veo-3) fail rapidly as complexity increases.
Takeaway 3: Models possess the competence to solve the tasks
When prompted with a precise text transcription rather than an image, MLLMs like Gemini 3 and GPT-5 can solve RushHour on par with humans. This proves that the failure stems from visual processing and planning, not a lack of logical reasoning capacity.
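For illustration, a text transcription of a RushHour board might look like the sketch below; the grid encoding, field names, and the board_to_text helper are hypothetical and are not the paper's actual prompt format.

```python
# Hypothetical sketch: serializing a 6x6 Rush Hour board into a plain-text
# transcription so a model can reason over it without any image input.
# The encoding and helper names are illustrative, not from the paper.

from dataclasses import dataclass


@dataclass
class Vehicle:
    name: str         # single-letter label, e.g. "X" for the target car
    row: int          # top-left cell row (0-indexed)
    col: int          # top-left cell column (0-indexed)
    length: int       # 2 for cars, 3 for trucks
    horizontal: bool  # orientation on the grid


def board_to_text(vehicles: list[Vehicle], size: int = 6) -> str:
    """Render the board as an ASCII grid plus an explicit vehicle list."""
    grid = [["." for _ in range(size)] for _ in range(size)]
    for v in vehicles:
        for i in range(v.length):
            r = v.row + (0 if v.horizontal else i)
            c = v.col + (i if v.horizontal else 0)
            grid[r][c] = v.name
    rows = ["".join(row) for row in grid]
    legend = [
        f"{v.name}: {'horizontal' if v.horizontal else 'vertical'}, "
        f"length {v.length}, top-left ({v.row}, {v.col})"
        for v in vehicles
    ]
    return "\n".join(rows) + "\n\nVehicles:\n" + "\n".join(legend)


if __name__ == "__main__":
    board = [
        Vehicle("X", 2, 0, 2, True),   # target car on the exit row
        Vehicle("A", 0, 3, 3, False),  # vertical truck blocking the exit
    ]
    print(board_to_text(board))
```

A transcription of this kind removes all perceptual demands, so any remaining errors can be attributed to reasoning rather than to reading the board.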
Why do UMMs fail? A dual issue.
Visual reasoning suffers from generation errors (bad images) and interpretation errors. Even when provided with correct "oracle" visuals, models often fail to use them as actionable evidence. This suggests that current architectures cannot yet effectively bridge the gap between generation and reasoning.
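As a rough sketch of what such an oracle-visual control can look like (not the paper's evaluation code), one can compare accuracy with no image against accuracy with a ground-truth visualization of the same state; query_model and the problem fields below are stand-ins, not a real API.

```python
# Hypothetical sketch of the "oracle visual" control: compare a model's answers
# when it receives only the text prompt versus when it is also handed a
# ground-truth visualization. `query_model` is a placeholder for whatever
# inference interface is being evaluated.

from typing import Callable, Sequence


def oracle_gap(
    query_model: Callable[[str, bytes | None], str],  # (prompt, optional image) -> answer
    problems: Sequence[dict],  # each with "prompt", "oracle_image", "answer"
) -> dict:
    """Accuracy without any image versus accuracy with the ground-truth image."""
    text_only = sum(
        query_model(p["prompt"], None) == p["answer"] for p in problems
    )
    with_oracle = sum(
        query_model(p["prompt"], p["oracle_image"]) == p["answer"] for p in problems
    )
    n = len(problems)
    return {"text_only_acc": text_only / n, "oracle_visual_acc": with_oracle / n}
```

If the oracle condition does not outperform the text-only condition, the bottleneck lies in interpreting the visual evidence rather than in generating it.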
BibTeX
@article{zeller2026mentisoculi,
  title={{MENTISOCULI}: Revealing the Limits of Reasoning with Mental Imagery},
  author={Zeller, Jana and Wiedemer, Thadd{\"a}us and Li, Fanfei and Klein, Thomas and Mayilvahanan, Prasanna and Bethge, Matthias and Wichmann, Felix and Cotterell, Ryan and Brendel, Wieland},
  journal={arXiv preprint arXiv:2602.02465},
  year={2026},
  note={Preprint. January 31, 2026},
  url={https://jana-z.github.io/mentis-oculis}
}
Collectively, the tasks require models to solve multi-step reasoning problems with geometric constraints. Success hinges on the ability to maintain a visual representation with high fidelity and consistent geometry under affine transformations. Each task is procedurally generated across five difficulty levels, scaling with the number of operations required from one (left) to five (right).
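As a loose illustration of this kind of stratified procedural generation (not the actual MentisOculi generator), difficulty can be indexed by the number of sampled operations; the operation names and the generate_instance interface below are hypothetical.

```python
# Hypothetical sketch of stratified procedural generation: difficulty level k
# determines how many elementary operations (here, simple affine transforms)
# are chained to produce a problem instance. Names and operations are
# illustrative only, not the paper's task generator.

import random

OPERATIONS = ["rotate_90", "reflect_horizontal", "reflect_vertical", "translate"]


def generate_instance(level: int, seed: int) -> dict:
    """Sample a problem whose solution requires `level` operations (1-5)."""
    assert 1 <= level <= 5, "difficulty levels range from 1 to 5"
    rng = random.Random(seed)
    ops = [rng.choice(OPERATIONS) for _ in range(level)]
    return {
        "level": level,
        "operations": ops,      # ground-truth sequence of transformations
        "num_steps": len(ops),  # difficulty scales with operation count
    }


def generate_suite(instances_per_level: int = 100) -> list[dict]:
    """Build a stratified suite with equal coverage of each difficulty level."""
    return [
        generate_instance(level, seed=level * 10_000 + i)
        for level in range(1, 6)
        for i in range(instances_per_level)
    ]


if __name__ == "__main__":
    suite = generate_suite(instances_per_level=2)
    for item in suite:
        print(item)
```

Fixing the seed per level keeps the suite reproducible while the stratification guarantees equal coverage of every difficulty level.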