Face and Body Perception: Mechanisms

Talk Session: Saturday, May 16, 2026, 10:45 am – 12:30 pm, Talk Room 2
Moderator: Philippe Schyns, University of Glasgow

Talk 1, 10:45 am

From Blur to Identity: Two-Stage Neural Dynamics in the Primate Face-Processing System

Wanru Li1, Yipeng Li1, MiYoung Kwon2, Pinglei Bao1; 1Peking University, Beijing, China, 2Northeastern University, Boston, MA, USA

Human observers can recognize faces under substantial blur, although this robustness requires additional viewing time. While coarse-to-fine dynamics are well documented in early visual cortex, far less is known about whether the macaque inferotemporal (IT) cortex exhibits a comparable temporal structure. Recent work indicates that IT cortex contains explicit spatial-frequency representations, yet how these representations contribute to face identity recognition under blur remains unclear. Using Neuropixels recordings from the macaque middle lateral (ML) and anterior lateral (AL) face patches, we examined how face-selective neurons respond to 120 identities across 11 levels of blur. We found that face-selective neurons are not homogeneous in their tuning. More than half (~60%) preferred moderately blurred faces, whereas the remainder preferred clear faces. Both subpopulations showed similar blur-induced increases in response latency, even after controlling for image contrast, indicating that blur broadly slows IT computations rather than merely reflecting low-level luminance differences. Analyses using high-pass- and low-pass-filtered images further confirmed that clear-preferring neurons respond preferentially to high-frequency detail, whereas blur-preferring neurons emphasize coarse structure, revealing that IT face patches contain distinct spatial-frequency channels. At the population level, representational similarity analysis (RSA) revealed a robust two-stage dynamic across the ML and AL face patches. After controlling for pixelwise similarity and contrast energy, blur-level information emerged rapidly as the dominant representational axis. Identity representations arose only later and became progressively delayed with increasing blur, revealing a temporal separation between coarse image structure and fine identity content. These results challenge models positing that invariant object recognition requires discarding low-level spatial-frequency information. Instead, IT face patches retain multiple spatial-frequency channels and exploit these signals to extract identity from degraded sensory inputs. Together, our findings highlight a coarse-to-fine transformation in primate IT and provide a mechanistic account of the behavioral delays observed when recognizing blurred faces.
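
As an illustration of the population-level analysis, a time-resolved RSA in which blur-level and identity model RDMs are compared against neural RDMs while partialling out pixelwise similarity and contrast energy might be sketched as follows. This is a minimal sketch, not the authors' pipeline; the array names and shapes (responses, blur_rdm, identity_rdm, pixel_rdm, contrast_rdm) are assumptions.

```python
# Illustrative time-resolved RSA with partial Spearman correlations.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import rankdata

def partial_spearman(x, y, covs):
    """Spearman correlation of x and y after regressing out covariate RDMs."""
    def resid(v):
        v = rankdata(v)
        X = np.column_stack([np.ones(len(v))] + [rankdata(c) for c in covs])
        beta, *_ = np.linalg.lstsq(X, v, rcond=None)
        return v - X @ beta
    rx, ry = resid(x), resid(y)
    return np.corrcoef(rx, ry)[0, 1]

# responses: (n_stimuli, n_neurons, n_timebins) firing rates (assumed layout)
# blur_rdm, identity_rdm, pixel_rdm, contrast_rdm: condensed model RDMs
def timecourse_rsa(responses, blur_rdm, identity_rdm, pixel_rdm, contrast_rdm):
    blur_r, id_r = [], []
    for t in range(responses.shape[2]):
        neural_rdm = pdist(responses[:, :, t], metric="correlation")
        blur_r.append(partial_spearman(neural_rdm, blur_rdm,
                                       [pixel_rdm, contrast_rdm]))
        id_r.append(partial_spearman(neural_rdm, identity_rdm,
                                     [pixel_rdm, contrast_rdm]))
    return np.array(blur_r), np.array(id_r)
```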

Talk 2, 11:00 am

Cortical face selectivity extends far into the visual periphery

Corey M Ziemba1, Alex Roseman1, Kenji W Koyano2,3, Carly Gregg2, Harish Katti2, David A Leopold2, Hendrikje Nienborg1; 1Laboratory of Sensorimotor Research, NEI, Bethesda, MD, 2Systems Neurodevelopment Laboratory, NIMH, Bethesda, MD, 3Advanced Neuroimaging Center, National Institutes for Quantum Science and Technology, Chiba, Japan

Although peripheral vision is less detailed than central vision, we can identify important stimuli, like faces, when presented far from the center of gaze. However, facial recognition and its neural mechanisms have mostly been studied near the fovea. Here, we combined neuronal population recordings from macaque cortical face patches, an advanced stimulus projection system, and naturalistic stimulus synthesis to investigate the strength and feature selectivity of face preference across nearly the entire visual field. We implanted microwire brush arrays in three fMRI-targeted face patches in two monkeys: the anterior medial face patch in inferotemporal cortex, the perirhinal face patch, and the prefrontal orbital face patch. We presented stimuli up to an eccentricity of 90 degrees using a projector, spherical mirror, and hemispheric dome screen. Monkeys fixated a central point while we briefly flashed images centered at different visual field locations spanning 0–80 degrees of eccentricity. Stimuli consisted of isolated face and nonface object images. We also created scrambled counterparts for each image using a synthesis procedure that disrupts complex object features while maintaining low- and mid-level features. These scrambled stimuli are known to elicit neural responses similar to those of natural images in early visual cortex and, under some circumstances, to match their peripheral appearance. We found that the face preference of neural populations from all areas was present in the periphery, sometimes beyond 50 degrees of eccentricity. However, selectivity was generally weaker compared with stimulus presentation at the fovea. Further, we found little evidence that face selectivity in peripheral vision can be accounted for by lower-level features preserved by our scrambling procedure. This result contrasts with perceptual studies showing that the appearance of peripheral stimuli can be matched between natural images and their scrambled counterparts and indicates that the cortex maintains highly selective responses to faces presented far in the visual periphery.
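
A face-preference profile across eccentricity of the kind described above could be quantified with a standard selectivity index; the sketch below is illustrative only, and the data layout (a dict keyed by eccentricity with face, object, and scrambled response arrays) is an assumption rather than the authors' format.

```python
# Illustrative face-preference index as a function of eccentricity.
import numpy as np

def face_selectivity_index(face_resp, obj_resp):
    """(face - object) / (face + object) on trial-averaged response rates."""
    f, o = np.mean(face_resp), np.mean(obj_resp)
    return (f - o) / (f + o + 1e-12)

# responses_by_ecc: dict mapping eccentricity (deg) -> dict with
# "face", "object", "face_scrambled", "object_scrambled" arrays (assumed layout)
def selectivity_profile(responses_by_ecc):
    profile = {}
    for ecc, r in sorted(responses_by_ecc.items()):
        profile[ecc] = {
            "intact": face_selectivity_index(r["face"], r["object"]),
            # If the preference survives scrambling, it may reflect preserved
            # low/mid-level features rather than genuine face selectivity.
            "scrambled": face_selectivity_index(r["face_scrambled"],
                                                r["object_scrambled"]),
        }
    return profile
```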

We acknowledge funding from the National Eye Institute Intramural Research Program at the National Institutes of Health (NIH, grant no. 1ZIAEY000570-01 to H.N.).

Talk 3, 11:15 am

Visual distortions reveal dissociable reference frames underlying face and object recognition in the ventral visual pathway

Antônio Mello1, Daniel Stehr1, Sarah Kerns1, Chandana Kodiweera1, Krzysztof Bujarski1, Brad Duchaine1; 1Dartmouth College

A reference frame (RF) is the coordinate system used to spatially represent a visual stimulus. Neurophysiological and psychophysical research has shown that the visual system relies on several RFs (e.g., retino-centered, body-centered), but evidence for object-centered RFs, where spatial information is encoded relative to the object itself rather than the viewer, in the ventral visual pathway (VVP) is limited. Here, we test a rare neuropsychological case across six experiments to identify RFs underlying visual recognition. Nagel is a 40-year-old man with hemi-prosopometamorphopsia who perceives distortions on the right half (RH) of faces and, less frequently, on the RH of objects. In Experiments 1-4, we presented 432 images of faces and hands while varying visual field location, visible region, viewpoint, and picture-plane orientation of the stimuli. Fisher’s exact tests revealed that Nagel’s distortions do not depend on where stimuli appear in his visual field (i.e., distortions are not retino-centered). However, the distortion location differed by category. For faces, distortions followed the RH of the face across orientations, affecting the same features even when faces were upside down (object-centered distortions). For hands, distortions instead remained on the RH of the stimulus image, meaning the distorted region shifted as hands rotated (stimulus-centered distortions). This dissociation could depend on whether a category has a canonical orientation: faces are almost always viewed upright, whereas hands are often seen in many orientations. To test this hypothesis, Experiment 5 presented 180 rotated images of a face, a hand, and 13 different objects with established canonical orientations, as confirmed by human raters. Distorted regions of nonface objects again shifted with stimulus rotation (stimulus-centered distortions), ruling out the canonical account. Experiment 6 replicated this pattern using 48 body images. Together, these findings provide robust evidence for object-centered RFs for faces and stimulus-centered RFs for objects within the VVP.
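
For concreteness, the kind of Fisher's exact test used in Experiments 1-4 can be illustrated as below; the 2x2 table layout and the counts are placeholders for illustration, not the study's data.

```python
# Illustrative test of whether reported distortion location depends on the
# visual-field location of the stimulus.
import numpy as np
from scipy.stats import fisher_exact

# rows: stimulus presented in the left vs right visual field
# cols: distortion reported on the left vs right half of the face
table = np.array([[18, 20],   # placeholder counts, not the case-study data
                  [17, 21]])
odds_ratio, p_value = fisher_exact(table)
# A non-significant p_value is consistent with distortions that do not follow
# the retinal location of the stimulus (i.e., not retino-centered).
```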

We thank the Hitchcock Foundation for funding this research.

Talk 4, 11:30 am

A Novel Explanation of the Inverted Face Effect

Garrison Cottrell1, Kira Fleischer1, Nikita Kachappilly1, Alexander Tahan1, Samuel Lee1, Xavier Chen1; 1UCSD

Vision researchers often assert that when shown an inverted face, subjects “revert to feature processing.” But how can they still use feature processing on inverted features? Eyebrows, eyes, noses and mouths are mono-oriented in everyday life. Furthermore, in the Thatcher illusion, subjects do not notice that the features are right-side up even though the face is inverted. This is a mystery: if subjects revert to feature processing, how would they use inverted features in one case, and not notice that they are upright in another? We suggest that, because of the log-polar mapping from the visual field to V1, in *both* cases, the features are *not* inverted when they enter the cortex. In this representation, rotation is just a vertical shift: features remain in the same orientation whether the face is upright or inverted. So why are inverted faces difficult to recognize or remember? In face processing, small changes to configuration make the face appear to be someone else. Because the cortex is flat and not a torus, it can’t represent that 270 degrees is continuous with -90 degrees. When the face is upright, the nose is above the left eye; inverted, the nose is below the right eye, disrupting the configuration of the features. We test this hypothesis with a DCNN trained on a log-polar version of faces. The model is disrupted by inverted faces, but it still recognizes 50% of familiar faces. A standard DCNN is nearly at chance, unlike humans. This effect is much smaller for inverted objects that the model is trained to recognize at the basic level, where configuration doesn't matter. When the model is trained to be a dog expert, it is again disrupted by inversion (Diamond & Carey, 1986). Hence, we have a novel explanation of the inverted face effect, based on the transformation that occurs in V1.
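
The log-polar transformation at the heart of this account is easy to visualize; the sketch below uses OpenCV's warpPolar as a stand-in (the authors' preprocessing may differ), with the fixation point and file name as assumptions. Rotating the input by 180 degrees shifts the log-polar image along the angular axis rather than flipping the features.

```python
# Illustrative log-polar transform of a face image.
import cv2

def log_polar(img, out_size=(128, 128)):
    h, w = img.shape[:2]
    center = (w / 2.0, h / 2.0)          # assumes fixation at the face center
    max_radius = min(center)
    return cv2.warpPolar(img, out_size, center, max_radius,
                         cv2.WARP_POLAR_LOG + cv2.INTER_LINEAR)

img = cv2.imread("face.png")             # hypothetical input image
upright = log_polar(img)
inverted = log_polar(cv2.rotate(img, cv2.ROTATE_180))
# 'inverted' is roughly 'upright' shifted by half of the angular (row) axis;
# because the array does not wrap around, that shift disrupts the configural
# relations among features, as the abstract proposes.
```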

This work was supported by NSF CRCNS grant #2208362.

Talk 5, 11:45 am

Face-trained deep neural network is severely misaligned with human perceptual judgments of face shape and texture

Virginia E Strehle1, Frank Tong1; 1Vanderbilt University

Recent studies have suggested that human perception of facial similarity can be accurately predicted by face-trained deep neural network (DNN) models (e.g., Jozwik et al., 2022; Dobs et al., 2023). However, it remains unclear whether human observers and DNN models are responding to the same dimensions of facial appearance. We explored this question by leveraging a 3D morphable model of face appearance (Paysan et al., 2009), which represented the shapes and textures of real human faces in separate principal component spaces. This model allowed us to independently and explicitly alter face shape and texture. In a forced-choice task, participants (n=26) viewed a target face and two modified faces that differed along a specific shape or texture principal component. Participants then chose the face that appeared more different. We conducted an analogous comparison on a ResNet-50 model trained on VGGFace2 by computing cosine distances between the penultimate-layer response patterns evoked by those faces. Whereas humans perceived shape alterations as more salient, the model treated texture alterations as more distinct. Next, in a method-of-adjustment task, we generated pairs of faces that differed along multiple dimensions of shape and/or texture. Participants (n=23) traversed these dimensions by adjusting a probe face until they perceived a just-noticeable difference in identity from a reference face. We then compared human adjustment thresholds to the cosine distances between the penultimate-layer response patterns evoked by the median human threshold faces and by the associated reference faces. Both the human identity thresholds and the DNN response patterns were more sensitive to face shape than to texture in this full dimensional space. However, human-to-human similarity far exceeded DNN-to-human similarity, especially in shape trials (human: r = 0.60; DNN: r = -0.31). Overall, our findings highlight major differences in how DNNs and humans process faces, particularly regarding face shape.
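
The model comparison described above (cosine distances between penultimate-layer activations) could be sketched roughly as follows. This is not the authors' code: torchvision's ImageNet-pretrained ResNet-50 stands in for their VGGFace2-trained model, and the image file names are placeholders.

```python
# Illustrative cosine-distance comparison of faces in a ResNet-50 feature space.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # expose the penultimate (pooled) features
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def embed(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return backbone(x).squeeze(0)

def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=0).item()

# e.g., which modified face the model treats as more different from the target
d_shape = cosine_distance(embed("target.png"), embed("shape_altered.png"))
d_texture = cosine_distance(embed("target.png"), embed("texture_altered.png"))
```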

This research was supported by NEI grants R01EY035157 to FT and P30EY008126 to the Vanderbilt Vision Research Center.

Talk 6, 12:00 pm

Deep Neural Networks Predict Human Social Trait Judgements from Dynamic Faces

Po-Yuan Alan Hsiao1, Suvel Muttreja2, Diego Rodriguez1, Matteo Visconti di Oleggio Castello3, James V. Haxby4, Maria Ida Gobbini5, Guo Jiahui1; 1The University of Texas at Dallas, 2University of Southern California, 3University of California, Berkeley, 4Dartmouth College, 5University of Bologna

Humans constantly infer social traits from faces, and these subjective judgments play an important role in guiding social interactions. Dynamic faces offer information beyond physical facial features, providing contextual cues to aid judgments and creating ecologically valid experiences. However, most previous research has focused only on modeling social judgments from static faces. Thus, it is still unclear whether deep neural networks (DNNs) can also capture the information humans use to make social judgments from dynamic faces. To answer this question, we collected social judgments from a dynamic stimulus set consisting of 707 naturalistic four-second video clips of unfamiliar individuals. In total, 171 subjects participated in the study and rated faces on a continuous 1–100 scale. We also included a static stimulus set from the Profile Image Dataset (1224 images, 160 raters, 1–9 scale). We used three face-trained DNNs (InsightFace, Alex-Face, VGG-Face) and two object-trained DNNs (AlexNet, VGG-16) to extract features from dynamic videos and static photos. Features were reduced using PCA, and prediction performance was evaluated by correlating model-predicted and human-rated traits. We found that face-trained DNNs successfully predicted human social judgments across all three traits from dynamic faces, consistently outperforming object-trained DNNs. Variance partitioning analysis showed that face-trained networks explained substantial unique variance in human social judgments, with only limited unique contributions from object-trained networks. Furthermore, models trained on one trait showed little generalization to other traits, indicating that the models captured trait-specific information. Most interestingly, generalization between models trained on dynamic and static faces was limited, suggesting that the shared representations between humans and DNNs differed for dynamic and static faces. These findings demonstrate that human social judgments from dynamic faces can be modeled successfully using DNNs trained with faces. The shared information between humans and DNNs is face-specific, trait-unique, and different from judgments based on static images.
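
The prediction pipeline (PCA-reduced DNN features regressed onto trait ratings, scored by correlating predicted and observed ratings) might look roughly like the sketch below. The ridge regression and cross-validation settings are assumptions; the abstract does not specify the regression model.

```python
# Illustrative cross-validated trait prediction from DNN features.
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def predict_trait(features, ratings, n_components=100, n_splits=5):
    """features: (n_stimuli, n_features) DNN activations; ratings: (n_stimuli,) trait scores."""
    preds = np.zeros(len(ratings), dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(features):
        k = min(n_components, len(train), features.shape[1])
        pca = PCA(n_components=k).fit(features[train])
        model = RidgeCV().fit(pca.transform(features[train]), ratings[train])
        preds[test] = model.predict(pca.transform(features[test]))
    r, _ = pearsonr(preds, ratings)       # prediction performance
    return r, preds
```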

This work is supported by startup funds provided by the School of Behavioral and Brain Sciences at the University of Texas at Dallas.

Talk 7, 12:15 pm

Dynamic Task-Dependent Feature Computations in Brain Pathways for Face Categorization

Yuening Yan1, Jiayu Zhan2, Yaocong Duan1, Oliver G. B. Garrod1, Robin A. A. Ince1, Chen Zhou1, Rachael E. Jack1, Philippe G. Schyns1; 1University of Glasgow, 2Peking University

Understanding how the brain extracts meaning from complex, dynamic visual input is a central goal of vision sciences. Faces are among the most informative stimuli we encounter, conveying stable identity, transient emotional state and a wide range of social information. Yet the computations that enable the brain to route, represent, and integrate these different sources of facial information remain only partially understood. We combined generative 4D face modelling, participant-specific millisecond-resolved MEG, data-driven information-theoretic analyses and explicit modelling to uncover how the brain gates, routes, and integrates facial features according to task demands: whether participants categorized identity, emotion, or both. This multimethod approach allowed us to ask three core questions: (i) How does the brain gate task-relevant features in dynamic faces? (ii) Through which cortical pathways are facial identity and emotion features routed? (iii) Where and when do these features converge to form an integrated semantic code (e.g. “happy Mary”)? Participants viewed the same 3,600 facial animations (dense-sampled design) and categorized them by identity, emotion or both (N = 8 per condition; all analyses performed per participant; p < 0.05, FWER-corrected). We found that task demands selectively gate and route facial information. During identity judgements, task-relevant shape features propagated through the ventral occipito-temporal pathway; during emotion judgements, dynamic movement features propagated through the lateral-social pathway. When these same features were task-irrelevant (e.g. movements during identity judgements), they were nonetheless represented but rapidly (< 140 ms) pruned in occipital cortex. The two pathways converged in temporal cortex, where MEG activity encoded synergistic interactions between identity and emotion features, forming an integrated, person-specific representation that emerged only for semantically known faces. Our findings reveal a dynamic computational mechanism through which the brain flexibly gates, routes and integrates perceptual features to construct social meaning, providing general principles for meaning formation in perceptual systems.
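
One simple way to quantify the kind of synergistic coding reported here is the interaction-information (co-information) measure sketched below. It uses coarse discretization and histogram-based mutual information, which is a simplification of the per-participant information-theoretic methods the authors employ, and all variable names are assumptions.

```python
# Illustrative synergy estimate between two stimulus feature variables and an
# MEG response, via interaction information on discretized variables.
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(x, n_bins=4):
    return np.digitize(x, np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))

def joint_code(a, b):
    return a * (b.max() + 1) + b          # unique label for each (a, b) pair

def synergy(identity_feat, emotion_feat, meg_resp, n_bins=4):
    x1, x2, y = (discretize(v, n_bins) for v in (identity_feat, emotion_feat, meg_resp))
    i_joint = mutual_info_score(joint_code(x1, x2), y)
    # Positive values indicate that the two features carry synergistic
    # information about the response beyond their separate contributions.
    return i_joint - mutual_info_score(x1, y) - mutual_info_score(x2, y)
```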

This work was funded by the Wellcome Trust (Senior Investigator Award, UK; 107802) and the Multidisciplinary University Research Initiative/Engineering and Physical Sciences Research Council (USA, UK; 172046-01) (P.G.S.); ERC [FACESYNTAX; 759796] (R.E.J.); the Wellcome Trust [214120/Z/18/Z] (R.A.A.I.).