Emergent Visual Mental Imagery in Large Language Models

Poster Presentation 56.310: Tuesday, May 19, 2026, 2:45 – 6:45 pm, Banyan Breezeway
Session: Visual Memory: Imagery

Morgan McCarty1, Jorge Morales1; 1Northeastern University

Imagine a "D", rotate it 90 degrees to the left, now add a "J" right under it. What does that look like? If your response was "umbrella", you probably arrived there by manipulating the letters and visualizing the resulting composite object in your mind's eye. However, could tasks like this be solved using language and reasoning alone? Large Language Models (LLMs) offer an ideal test case for this hypothesis. We adapted this classic task and extended it with dozens of novel instruction-sets, thus avoiding data contamination from the models' training sets. Multiple state-of-the-art LLMs and human participants carried out step-by-step transformations of letters and shapes with the goal of naming the resulting object. We found that the best LLMs performed at, or even above, average human level. Despite this similarity, humans and machines did not succeed and fail in the same way: in a large proportion of the instruction-sets, humans and LLMs showed radically different patterns of mistakes. LLMs provided more consistent answers (across models) than humans (across subjects), but individual human answers were often ranked higher. Finally, both humans and machines sometimes misconstrued (albeit somewhat differently) the intermediate steps of the instruction-sets. There were notable cases in which the most common answers were bimodal, indicating that some items were prone to occasional catastrophic failures or to systematic misunderstanding of the resulting object (e.g., a balloon with a string and bow may be misinterpreted as a stick person if the component proportions are wrong). Our results offer the enticing possibility that tasks long thought to require visual mental imagery can be solved exclusively via a (quasi-)propositional format. Our findings also offer a cautionary tale: comparable performance by human and artificial intelligence systems in visual and cognitive tasks may hide important underlying computational differences.

Acknowledgements: This work was funded, in part, by a Northeastern University Undergraduate Research and Fellowships PEAK Summit Award.