Holistic Face Processing Emerges Through Contextual Part Enrichment in Face-Trained Vision Transformers

Poster Presentation 33.460: Sunday, May 17, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Face and Body Perception: Wholes, parts, configurations, features

Srijani Saha1,2, Talia Konkle1,2, George Alvarez1,2; 1Harvard University, 2Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University

Seminal work in psychology has demonstrated that faces are recognized via holistic processing, where identities and face parts appear changed in different facial contexts, yet we lack formal image-computable models of these effects. Here we leverage face-trained and ImageNet-trained deep neural network models to provide a computational account of at least one form of holistic processing: updating local part representations based on global context (contextual enrichment of parts). First, we examined whether any of these networks showed classic signatures of human face processing, specifically the inversion effect. We computed identity prototypes (averaged upright exemplar embeddings) and compared the distances of upright versus inverted test images to these prototypes. Confirming prior studies, face-trained models (n=4), irrespective of architecture, showed robust evidence of the inversion effect, while non-face-trained models (n=3) did not. We next leveraged the transformer models, whose patch-based representations and self-attention mechanisms allow us to examine whether and how holistic processing operates. To do so, we locally perturbed one face part (the nose) and measured the resulting identity shifts. Critically, by recording activations to each local patch in the context of the original vs. modified nose, we could ask whether the identity shift was due to the change in the nose itself (new information) or to the impact of the new nose on neighboring parts (contextual influence). We found that contextual effects alone created identity shifts in face-trained, but not object-trained, models. This contextual effect decreased dramatically for inverted faces, linking the observed inversion effect to a likely disruption of contextual part integration. These findings reveal how face-identification fine-tuning supports a mechanism in which local face parts in upright configurations contextually update each other to build identity representations. This provides a computational formalization of holistic processing, offering a testing ground for other signatures such as the Thatcher and composite face effects.
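
As an illustration of the prototype-based inversion measure described above, the following is a minimal sketch, not the authors' implementation. It assumes embeddings are plain lists of floats and uses cosine distance, which the abstract does not specify. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity (assumed metric; the abstract names none)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identity_prototypes(upright_exemplars):
    """Average the upright exemplar embeddings of each identity.

    upright_exemplars: list of (identity, embedding) pairs.
    """
    by_identity = defaultdict(list)
    for identity, embedding in upright_exemplars:
        by_identity[identity].append(embedding)
    return {ident: mean_vector(embs) for ident, embs in by_identity.items()}


def mean_distance_to_own_prototype(test_items, prototypes):
    """Mean distance from each test embedding to its own identity prototype."""
    distances = [
        cosine_distance(embedding, prototypes[identity])
        for identity, embedding in test_items
    ]
    return sum(distances) / len(distances)


def inversion_effect(prototypes, upright_tests, inverted_tests):
    """Positive values mean inverted images sit farther from their prototype."""
    return (
        mean_distance_to_own_prototype(inverted_tests, prototypes)
        - mean_distance_to_own_prototype(upright_tests, prototypes)
    )


# Toy 2-D embeddings standing in for model outputs (hypothetical values).
exemplars = [
    ("A", [1.0, 0.0]), ("A", [0.8, 0.2]),
    ("B", [0.0, 1.0]), ("B", [0.2, 0.8]),
]
prototypes = identity_prototypes(exemplars)
upright_tests = [("A", [0.9, 0.1]), ("B", [0.1, 0.9])]
inverted_tests = [("A", [0.5, 0.5]), ("B", [0.5, 0.5])]
effect = inversion_effect(prototypes, upright_tests, inverted_tests)
print(f"inversion effect (inverted minus upright distance): {effect:.4f}")
```

In this sketch, a model showing the inversion effect yields a positive value, because inverted test images lie farther from their identity prototype than upright ones. A model without the effect would yield a value near zero.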

Acknowledgements: This work was supported by NSF PAC COMP-COG 1946308 to GAA.