When Models See Wholes: A Mechanistic Account of Holistic Processing in Deep Vision Models
Poster Presentation 56.418: Tuesday, May 19, 2026, 2:45 – 6:45 pm, Pavilion
Session: Object Recognition: Models
Fenil Doshi1,2, Thomas Fel1,2, Talia Konkle1,2, George A. Alvarez1,2; 1Harvard University, 2Kempner Institute for the Study of Natural and Artificial Intelligence
Human object recognition relies on configural processing: the ability to integrate local parts into coherent global shapes. In contrast, contemporary deep neural networks often rely heavily on local texture cues, leaving open whether any current architecture supports human-like holistic shape perception. Here we introduce (i) a new measure of configural sensitivity based on visual anagrams and (ii) a mechanistic account of the representational transformations that give rise to holistic processing in models.

To isolate configural processing, we constructed visual anagrams: image pairs that depict different objects but are assembled from the exact same set of local patches, rearranged into different global layouts. We quantify a Configural Shape Score (CSS) as a model's ability to correctly classify both members of each anagram pair; because the local patches are matched, succeeding on both images requires sensitivity to global configuration. Across 86 pretrained models, we find a continuum of holistic competence: self-supervised and language-aligned ViTs show strong configural sensitivity, whereas supervised CNNs and supervised ViTs fail, even when matched in classification accuracy. CSS also predicts performance on other tasks, including noise robustness, foreground bias, and phase dependence.

To examine how high-CSS models implement configural processing, we developed a mechanistic interpretability method that disentangles positional structure from semantic content, and we tracked how these signals evolve across layers. In high-CSS models, positional structure remains geometrically intact into deeper layers, preserving the spatial scaffold needed for part–whole reasoning, while content representations become contextually enriched through long-range interactions. This contextual enrichment is supported by self-attention, with the intermediate blocks forming a causally indispensable stage: removing contextual influences in these blocks collapses CSS.
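The CSS computation described above can be sketched as follows. This is an illustrative reconstruction based only on the abstract's definition (the fraction of anagram pairs for which both members are classified correctly); function and variable names are hypothetical, not the authors' code.

```python
# Hypothetical sketch of the Configural Shape Score (CSS): the fraction of
# visual-anagram pairs for which a model classifies BOTH members correctly.
# Names (configural_shape_score, predictions, pairs) are illustrative.

def configural_shape_score(predictions, pairs):
    """predictions: dict mapping image id -> predicted label.
    pairs: list of ((img_a, label_a), (img_b, label_b)) anagram pairs."""
    both_correct = 0
    for (img_a, label_a), (img_b, label_b) in pairs:
        if predictions[img_a] == label_a and predictions[img_b] == label_b:
            both_correct += 1
    return both_correct / len(pairs)

# Toy example: two anagram pairs, one fully correct, one only half correct.
preds = {"a1": "dog", "a2": "chair", "b1": "cat", "b2": "lamp"}
pairs = [(("a1", "dog"), ("a2", "chair")), (("b1", "cat"), ("b2", "teapot"))]
print(configural_shape_score(preds, pairs))  # → 0.5
```

Because both members of a pair share the same local patches, per-pair (rather than per-image) scoring is what forces sensitivity to global configuration.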
Overall, our findings suggest that holistic perception arises when a system maintains precise spatial relationships while contextually enriching local feature representations from long-range interactions, offering a mechanistic account of the mid-level computations that assemble local parts into structured global percepts.
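The causal ablation described above (removing contextual influences from intermediate blocks) can be illustrated with a single-head toy: forcing each token to attend only to itself eliminates the long-range interactions that drive contextual enrichment. This is a minimal sketch under that interpretation, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of ablating "contextual influences" in self-attention:
# replacing the learned attention pattern with the identity matrix so each
# token attends only to itself. Single head, no residual/MLP, for clarity.

def self_attention(x, wq, wk, wv, ablate_context=False):
    q, k, v = x @ wq, x @ wk, x @ wv
    if ablate_context:
        attn = np.eye(len(x))                # each token attends only to itself
    else:
        scores = q @ k.T / np.sqrt(k.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)       # softmax over keys
    return attn @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out_full = self_attention(x, wq, wk, wv)
out_ablated = self_attention(x, wq, wk, wv, ablate_context=True)
# With identity attention, each token's output is just its own value vector:
assert np.allclose(out_ablated, x @ wv)
```

In this framing, the abstract's finding is that applying such an ablation to the intermediate blocks of a high-CSS model collapses its Configural Shape Score, identifying those blocks as a causally indispensable stage.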
Acknowledgements: NSF PAC COMP-COG 1946308 to GAA; Kempner Institute Graduate Fellowship to FD.