Face and Body Perception

Talk Session: Tuesday, May 21, 2024, 10:45 am – 12:15 pm, Talk Room 1
Moderator: Chris Baker, National Institutes of Health

Talk 1, 10:45 am, 52.11

Third Social Pathway computes Dynamic Action Unit Features for Emotion Decision Behavior

Yuening Yan1, Jiayu Zhan2, Oliver Garrod1, Robin A.A. Ince1, Rachael Jack1, Philippe Schyns1; 1University of Glasgow, 2Peking University

Faces convey stable identity via static 3D shape/complexion features and transient emotions via dynamic movement features (i.e., Action Units, AUs). With a transparent generative Virtual Human (VH), we studied how brain pathways dynamically compute (i.e., represent, communicate, integrate) AUs and 3D identity features for emotion decisions. In a behavioral task, the generative VH presented randomly parametrized AUs applied to 2,400 random 3D identities. This produced a different animation per trial, which each participant (N=10) categorized as one emotion (happy, surprise, fear, disgust, anger, sad). Using each participant's responses, we modelled the AUs causing their perception of each emotion. In subsequent neuroimaging, each participant categorized their own emotion models applied to 8 new identities while we randomly varied each AU's amplitude and concurrently measured MEG. Using information-theoretic analyses, we traced where and when MEG source amplitudes represent each AU and how sources then integrate AUs for decisions. We compared these representations to covarying but decision-irrelevant 3D face identities. Our results replicate across all participants (p<0.05, FWER-corrected): (1) the Social Pathway (Occipital Cortex to Superior Temporal Gyrus) directly represents AUs with time lags, with no Ventral involvement; (2) AUs represented early are maintained until STG integrates them with later AUs. In contrast, emotion-irrelevant 3D identities are reduced early, within Occipital Cortex. In summary, we show that the third "Social" Brain Pathway (not the dorsal pathway) dynamically represents facial action units with time lags that are resorbed by the time they reach STG, where they are integrated for emotion decision behavior, while the irrelevant 3D face identity is not represented beyond OC.
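
The information-theoretic tracing described above could, in principle, be sketched as a per-time-point mutual-information estimate between trial-wise AU amplitude and MEG source amplitude. The snippet below is a minimal illustration on simulated data, not the authors' pipeline; the Gaussian-copula estimator, variable names, and data dimensions are assumptions.

```python
import numpy as np
from scipy.stats import rankdata, norm

def copnorm(x):
    """Rank-transform a 1-D variable to a standard normal (copula normalization)."""
    return norm.ppf(rankdata(x) / (len(x) + 1.0))

def gcmi(x, y):
    """Gaussian-copula mutual information (in nats) between two 1-D variables."""
    cx, cy = copnorm(x), copnorm(y)
    r = np.corrcoef(cx, cy)[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)

# Simulated example: 200 trials, 100 time points for one MEG source.
rng = np.random.default_rng(0)
n_trials, n_times = 200, 100
au_amplitude = rng.uniform(0, 1, n_trials)           # randomly varied AU amplitude per trial
source = rng.standard_normal((n_trials, n_times))     # source amplitude over time
source[:, 40:60] += 2.0 * au_amplitude[:, None]       # AU "represented" between samples 40-60

mi_timecourse = np.array([gcmi(au_amplitude, source[:, t]) for t in range(n_times)])
print("peak MI at time index", mi_timecourse.argmax())
```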

Acknowledgements: This work was funded by the Wellcome Trust (Senior Investigator Award, UK; 107802) and the Multidisciplinary University Research Initiative/Engineering and Physical Sciences Research Council (USA, UK; 172046-01), awarded to P.G.S.; and the Wellcome Trust (214120/Z/18/Z), awarded to R.I.

Talk 2, 11:00 am, 52.12

Large-scale Deep Neural Network Benchmarking in Dynamic Social Vision

Kathy Garcia1, Colin Conwell1, Emalie McMahon1, Michael F. Bonner1, Leyla Isik1; 1Johns Hopkins University

Many Deep Neural Networks (DNNs) with diverse architectures and learning objectives have yielded high brain similarity and hierarchical correspondence to ventral stream responses to static images. However, they have not been evaluated on dynamic social scenes, which are thought to be processed primarily in the recently proposed lateral visual stream. Here, we ask whether DNNs model processing in the lateral stream and the superior temporal sulcus as well as they model processing in the ventral stream. To investigate this, we employ large-scale deep neural network benchmarking against fMRI responses to a curated dataset of 200 naturalistic social videos. We examine over 300 DNNs with diverse architectures, objectives, and training sets. Notably, we find a hierarchical correspondence between DNNs and lateral stream responses: earlier DNN layers correlate best with earlier visual areas (including early visual cortex and middle temporal cortex), middle layers match best with mid-level regions (extrastriate body area and lateral occipital cortex), and later layers match best with the most anterior regions (along the superior temporal sulcus). Pairwise permutation tests further confirm significant differences in the average depth of the best-matching layer between each region of interest. Interestingly, we find no systematic differences between diverse network types in terms of either hierarchical correspondence or absolute correlation with neural data, suggesting that drastically different network factors (like learning objective and training dataset) play little role in a network's representational match to the lateral stream. Finally, while the best DNNs provided a representational match to ventral stream responses near the level of the noise ceiling, DNN correlations were significantly lower in all lateral stream regions. Together, these results provide evidence for a feedforward visual hierarchy in the lateral stream and underscore the need for further refinement in computational models to adeptly capture the nuances of dynamic, social visual processing.
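
As an illustration of the kind of layer-to-region benchmarking described above, the sketch below estimates the relative depth of the best-predicting DNN layer for one ROI using cross-validated ridge regression on toy data. It is not the authors' benchmarking code; the function names, scoring choice, and data shapes are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

def best_layer_depth(layer_activations, roi_responses, alphas=np.logspace(-1, 4, 10)):
    """Return the relative depth (0-1) of the layer that best predicts an ROI.

    layer_activations: list of (n_stimuli, n_features) arrays, ordered shallow to deep.
    roi_responses: (n_stimuli, n_voxels) array of responses to the same stimuli.
    """
    scores = []
    for acts in layer_activations:
        preds = cross_val_predict(RidgeCV(alphas=alphas), acts, roi_responses, cv=5)
        # mean correlation between predicted and measured responses across voxels
        r = [np.corrcoef(preds[:, v], roi_responses[:, v])[0, 1]
             for v in range(roi_responses.shape[1])]
        scores.append(np.nanmean(r))
    best = int(np.argmax(scores))
    return best / (len(layer_activations) - 1), scores

# Toy example with random data: 3 "layers", 100 stimuli, 20 voxels.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((100, 50)) for _ in range(3)]
roi = layers[2][:, :20] + 0.5 * rng.standard_normal((100, 20))  # ROI driven by the deepest layer
depth, layer_scores = best_layer_depth(layers, roi)
print(depth, np.round(layer_scores, 2))
```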

Acknowledgements: NIH R01MH132826

Talk 3, 11:15 am, 52.13

Hyper-realistic reverse correlation reveals a novel gender bias in representations of leadership across political orientation

Stefan Uddenberg1, Zen Nguyen1, Daniel Albohn1, Alexander Todorov1; 1University of Chicago Booth School of Business

Appearance influences election outcomes via leadership stereotypes: past work has shown that adults and even children can predict real-world elections with relatively high accuracy solely on the basis of perceived competence judged from photographs. What are our visual stereotypes of leadership? And how do they differ according to political orientation? Here we explored these questions using a novel reverse correlation technique powered by hyper-realistic generative face models (Albohn et al., 2022). Participants (N=300) viewed generated faces one at a time and judged whether they looked like a "good leader", a "bad leader", or "not sure". Applying a simple algorithm to the aggregated choices yielded visually compelling and interpretable mental representations at both the individual and group levels. While political group-averaged representations were similar along many subjective attributes (e.g., perceived "trustworthiness", "attractiveness"; Peterson et al., 2022), they revealed a novel gender bias: right-leaning participants' "good leaders" were more masculine than those of left-leaning participants. We directly replicated this result using richer latent face representations (N=300). We then validated individual participant models on new observers (N=150), probing their willingness to vote for different faces generated by past participants in an imaginary election. As predicted, these participants were not only more willing to vote for "good leader" faces, but were most willing to vote for faces generated by past participants who shared their political orientation. Taken together, our results demonstrate how political orientation is linked to a novel gender bias in leadership representations, showcasing the utility of our reverse correlation technique.
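
A minimal sketch of the reverse-correlation idea, assuming the generative face model exposes a latent vector per face: average the latents of faces a participant judged "good leader" and subtract the mean of all faces shown, yielding a prototype that the face model could render back into an image. This is illustrative only; the "simple algorithm" used by the authors may differ, and all names and dimensions below are hypothetical.

```python
import numpy as np

def leader_prototype(latents, judgments, target="good leader"):
    """Classification-image-style prototype in a generative face model's latent space.

    latents: (n_faces, n_dims) latent vectors of the faces a participant saw.
    judgments: one label per face ("good leader" / "bad leader" / "not sure").
    Returns the mean latent of chosen faces relative to the mean of all faces shown.
    """
    judgments = np.asarray(judgments)
    chosen = latents[judgments == target].mean(axis=0)
    baseline = latents.mean(axis=0)
    return chosen - baseline  # decode this vector with the face model to visualize it

# Toy example: 500 random faces in a 16-dimensional latent space.
rng = np.random.default_rng(1)
latents = rng.standard_normal((500, 16))
# Hypothetical rule: a participant calls faces high on dimension 3 "good leaders".
labels = np.where(latents[:, 3] > 0.5, "good leader", "bad leader")
proto = leader_prototype(latents, labels)
print(np.round(proto, 2))  # dimension 3 dominates, as expected
```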

Talk 4, 11:30 am, 52.14

The multidimensional representation of facial attributes

Jessica Taubert1,2, Shruti Japee2, Amanda Robinson1, Houqiu Long1, Tijl Grootswagers3, Charles Zheng2, Francisco Pereira2, Chris Baker2; 1The University of Queensland, QLD, Australia, 2The National Institute of Mental Health, MD, United States, 3Western Sydney University, NSW, Australia

As primates, our social behaviour is shaped by our ability to read the faces of the people around us. Our current understanding of the neural processes governing ‘face reading’ comes primarily from studies that have focused on the recognition of facial expressions. However, these studies have often used staged facial expressions, potentially disconnecting facial morphology from genuine emotion and circumstance. A reliance on staged stimuli might therefore be obscuring our understanding of how faces are perceived and recognised during everyday life. Here our goal was to identify the core dimensions underlying the mental representation of expressive facial stimuli using a data-driven approach. In two behavioural experiments (Experiment 1, N = 940; Experiment 2, N = 489), we used an odd-one-out task to measure perceived dissimilarity within two sets of faces: 900 highly variable, naturalistic, expressive stimuli from the Wild Faces Database (Long, Peluso, et al., 2023, Sci Reports, 13: 5383) and 670 highly controlled, staged stimuli from the NimStim database (Tottenham, Tanaka, et al., 2009, Psychiatry Res, 168: 3). Using Representational Similarity Analysis, we mapped the representation of the faces in the Wild and NimStim databases separately and compared these representations to behavioural and computational models. We also employed the state-of-the-art VICE model (Muttenthaler, Zheng, et al., 2022, Adv Neural Inf Process Syst) to uncover the dimensions that best explained behaviour towards each of the face sets. Collectively, these results indicate that the representation of the Wild Faces was best characterised by perceived social categories, such as gender, and by emotional valence. By comparison, facial expression category explained more of the perceived dissimilarity among the NimStim faces than among the Wild Faces. These findings underscore the importance of stimulus selection in visual cognition research and suggest that, under naturalistic circumstances, humans spontaneously use information about both social category and expression to evaluate faces.
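
As an illustration of how odd-one-out choices can be turned into a dissimilarity matrix and compared with a model via Representational Similarity Analysis, here is a minimal sketch; it is not the authors' analysis code, and the pair-counting scheme, function names, and toy data are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def rdm_from_odd_one_out(n_items, triplets, odd_indices):
    """Estimate a dissimilarity matrix from odd-one-out choices.

    On each trial, the two items NOT chosen as the odd one out are counted as similar.
    triplets: (n_trials, 3) item indices shown; odd_indices: chosen odd item per trial.
    """
    similar = np.zeros((n_items, n_items))
    shown = np.zeros((n_items, n_items))
    for (a, b, c), odd in zip(triplets, odd_indices):
        for i, j in [(a, b), (a, c), (b, c)]:
            shown[i, j] += 1; shown[j, i] += 1
        keep = [x for x in (a, b, c) if x != odd]
        similar[keep[0], keep[1]] += 1; similar[keep[1], keep[0]] += 1
    with np.errstate(invalid="ignore"):
        sim = np.where(shown > 0, similar / shown, np.nan)
    return 1.0 - sim  # dissimilarity; pairs never shown together stay NaN

def compare_rdms(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs, ignoring NaNs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    mask = ~np.isnan(rdm_a[iu]) & ~np.isnan(rdm_b[iu])
    return spearmanr(rdm_a[iu][mask], rdm_b[iu][mask])[0]

# Toy usage: 4 items, a handful of trials, compared against a random model RDM.
trip = np.array([[0, 1, 2], [0, 1, 3], [1, 2, 3]])
odd = np.array([2, 3, 1])
rdm = rdm_from_odd_one_out(4, trip, odd)
model_rdm = np.random.default_rng(0).random((4, 4)); model_rdm = (model_rdm + model_rdm.T) / 2
print(compare_rdms(rdm, model_rdm))
```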

Acknowledgements: This research was supported by the Intramural Research Program of the National Institute of Mental Health (ZIAMH002909 to CIB) and the Australian Research Council (FT200100843 to JT)

Talk 5, 11:45 am, 52.15

Comparing human eye-tracking heatmaps with DNN saliency maps for faces at different spatial frequencies

Michal Fux1, Joydeep Monshi2, Hojin Jang1, Charlotte H Lahey3, Suayb S Arslan1, Walter V Dixon III2, Matthew Groth1, Pawan Sinha1; 1The Department of Brain and Cognitive Sciences, MIT, 2AI, Machine Learning and Computer Vision, GE Research, Niskayuna, NY, USA, 3Keene State College

Deep neural network (DNN)-based face recognition (FR) models have improved greatly over the past decades, achieving or even exceeding human-level accuracy under certain viewing conditions, such as frontal face views. However, as we reported at last year's meeting (XXX et al., 2023), under challenging viewing conditions (e.g., large distances, non-frontal regard) humans outperform DNNs. To shed light on potential explanations for these differences in FR accuracy between humans and DNNs, we turned to eye-tracking paradigms to discern potentially important zones of information uptake for observers and compare them with DNN-derived saliency maps. Despite the conceptual similarity between human eye-tracking-based heatmaps and DNN saliency maps, the literature is sparse in terms of strategic efforts to quantitatively compare the two and translate human gaze and attention strategies into improved machine performance. We obtained gaze-contingent (GC) human eye-tracking heatmaps and DNN saliency maps for faces under three stimulus conditions: images filtered for low spatial frequencies, images filtered for high spatial frequencies, and full-resolution images. Human participants saw two sequentially presented faces and were asked to determine whether the individuals depicted were siblings (images from Vieira et al., 2014) or two images of the same person (Stirling face database). While human eye-tracking heatmaps were collected during each presentation of a face image (sibling/Stirling), DNN saliency maps were derived from differences in the similarity score between the machine-interpreted face embeddings of pairs of face images, using an efficient correlation-based explainable AI approach. We present the characterization and comparison of humans' and DNNs' use of spatial frequency information in faces, and propose a model-agnostic translation strategy for improved face recognition performance that uses an efficient training approach to bring DNN saliency maps into closer register with human eye-tracking heatmaps.
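
A minimal sketch of two pieces of this kind of approach, assuming grayscale images and 2D maps: Gaussian filtering to produce low- and high-spatial-frequency stimuli, and a simple pixelwise correlation between a human fixation heatmap and a DNN saliency map. This is illustrative only, not the authors' pipeline; the filter width and map sizes are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def low_high_pass(image, sigma=8.0):
    """Split a grayscale face image into low- and high-spatial-frequency versions."""
    low = gaussian_filter(image.astype(float), sigma)
    high = image.astype(float) - low
    return low, high

def heatmap_similarity(human_heatmap, dnn_saliency):
    """Pearson correlation between a fixation heatmap and a DNN saliency map."""
    a = (human_heatmap - human_heatmap.mean()) / human_heatmap.std()
    b = (dnn_saliency - dnn_saliency.mean()) / dnn_saliency.std()
    return float((a * b).mean())

# Toy example with random 64x64 maps (real maps would come from fixations and a model).
rng = np.random.default_rng(2)
img = rng.uniform(0, 255, (64, 64))
low, high = low_high_pass(img)
human_map = gaussian_filter(rng.random((64, 64)), 4)
dnn_map = gaussian_filter(rng.random((64, 64)), 4)
print(heatmap_similarity(human_map, dnn_map))
```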

Acknowledgements: This research is supported by ODNI and IARPA. The views are those of the authors and should not be interpreted as representing the official policies of ODNI, IARPA, or the U.S. Government, which is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Talk 6, 12:00 pm, 52.16

Bayesian Decoding Reveals Retinotopic Selectivity for Body Positions in Body-Selective Regions

Yu Zhao1, Arnab Biswas1, Matthew W. Shinkle1, Mark D. Lescroart1; 1University of Nevada, Reno

The Extrastriate Body Area (EBA) represents information about human bodies. Though EBA is not usually considered a retinotopic area, past work has demonstrated visual field biases in different parts of EBA. Here, we probe the retinotopic position-specificity of EBA. Past research has used relatively coarse tests of position sensitivity, including contrasts between body parts presented in isolation in a few fixed retinotopic locations. To address this limitation, we modeled BOLD fMRI responses to stimuli consisting of rendered bodies performing actions in different retinotopic positions. To minimize naturalistic confounds between visual field location and motion, we varied the camera trajectory and added moving textures to bodies and backgrounds. We then extracted features describing the presence and retinotopic location of body parts, and applied linear regression to map these features onto fMRI responses and predict responses to withheld stimuli. This model yielded accurate predictions across cortical regions in and around EBA. Variance partitioning against a motion energy model revealed unique variance explained in these voxels by body features. To explore retinotopic position sensitivity in body-selective regions, we computed contrasts between weights for body features reflecting different locations of bodies. As expected, these revealed left versus right visual field contralateral selectivity. We used two multivariate analyses to further quantify position selectivity. Principal component analysis on the model weights revealed a dominant dimension of horizontal selectivity, alongside a less pronounced dimension suggesting vertical selectivity. Consistent with this result, Bayesian decoding of body locations was more reliable than chance in the horizontal direction and, in some cases, in the vertical direction as well. Overall, our findings suggest that EBA has more position sensitivity than has previously been appreciated. Even coarse coding of retinotopic body location could reveal socially relevant information about the position of bodies relative to gaze.
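
As an illustration of the variance-partitioning step described above, the sketch below compares cross-validated R² of a joint (body features + motion energy) ridge encoding model against a motion-energy-only model on toy data; the difference approximates the unique variance explained by body features. Names, shapes, and regularization are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def cv_r2(model, X_train, X_test, y_train, y_test):
    """Held-out R^2 of a ridge encoding model, averaged across voxels."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    ss_res = ((y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((y_test - y_test.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

# Toy data: 400 stimuli, body-part-location features vs. motion-energy features, 10 voxels.
rng = np.random.default_rng(3)
body = rng.standard_normal((400, 30))      # presence/retinotopic location of body parts
motion = rng.standard_normal((400, 40))    # motion-energy features
voxels = (body @ rng.standard_normal((30, 10))
          + 0.1 * motion @ rng.standard_normal((40, 10))
          + rng.standard_normal((400, 10)))

idx = np.arange(400)
tr, te = train_test_split(idx, test_size=0.25, random_state=0)
joint = np.hstack([body, motion])
r2_joint = cv_r2(Ridge(alpha=10.0), joint[tr], joint[te], voxels[tr], voxels[te])
r2_motion = cv_r2(Ridge(alpha=10.0), motion[tr], motion[te], voxels[tr], voxels[te])
print("unique variance explained by body features:", r2_joint - r2_motion)
```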