Scene Perception: Neural mechanisms, representations

Talk Session: Wednesday, May 22, 2024, 11:00 am – 12:45 pm, Talk Room 1
Moderator: Danny Dilks, Emory University

Talk 1, 11:00 am, 62.11

A new scene-selective region in the superior parietal lobule and its potential involvement in visually-guided navigation

Hee Kyung Yoon1, Yaelan Jung1, Daniel Dilks1; 1Emory University

Growing evidence indicates that the occipital place area (OPA) – a scene-selective region in adult humans – is involved in “visually-guided navigation”. Here, we present evidence that there is a new scene-selective region located in the superior parietal lobule – henceforth called the “superior place area” (SPA) – that may also be involved in visually-guided navigation. First, using functional magnetic resonance imaging (fMRI), we found that SPA responds significantly more to scene stimuli than to face and object stimuli across two different sets of stimuli (i.e., “dynamic” and “static”) – establishing SPA as yet another scene-selective region. Second, we found that SPA, like OPA, responds significantly more to dynamic scene stimuli (i.e., video clips of first-person perspective motion through scenes, mimicking the actual visual experience of walking through a place) than to static scene stimuli (i.e., static images taken from the same video clips, rearranged such that first-person perspective motion could not be inferred) – suggesting that SPA, like OPA, is involved in visually-guided navigation. Such sensitivity to first-person perspective motion information through scenes cannot be explained by scene selectivity alone, domain-general motion sensitivity, or low-level visual information. And third, resting-state fMRI data revealed that SPA is preferentially connected to OPA, compared to other scene regions – again consistent with the hypothesis that the SPA, like OPA, is involved in visually-guided navigation. Taken together, these results demonstrate a new scene-selective region that may be involved in visually-guided navigation, and raise interesting questions about the precise role that SPA (compared to OPA) plays in visually-guided navigation.
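For readers who want a concrete picture of the contrast logic, the following is a minimal Python sketch (not the authors' code) of how a scene-selectivity contrast of this kind can be computed from per-run GLM beta estimates; the run count, voxel count, and simulated betas are hypothetical placeholders.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs, n_voxels = 8, 500  # hypothetical: 8 runs, 500 voxels in a superior parietal mask

# Hypothetical per-run GLM beta estimates for each condition (runs x voxels).
betas_scenes = rng.normal(2.0, 1.0, (n_runs, n_voxels))
betas_faces = rng.normal(0.2, 1.0, (n_runs, n_voxels))
betas_objects = rng.normal(0.2, 1.0, (n_runs, n_voxels))

# Scene-selectivity contrast: scenes > mean(faces, objects), tested across runs.
contrast = betas_scenes - (betas_faces + betas_objects) / 2.0
t_vals, p_vals = stats.ttest_1samp(contrast, popmean=0.0, axis=0)

# Voxels passing a (here uncorrected) threshold are candidate scene-selective voxels;
# the same logic applies to the dynamic > static scene contrast.
scene_selective = (t_vals > 0) & (p_vals < 0.001)
print(f"{scene_selective.sum()} of {n_voxels} voxels pass the contrast")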

Talk 2, 11:15 am, 62.12

Dynamic functional connectivity via iEEG-fMRI correlation maps

Zeeshan Qadir1, Harvey Huang1, Morgan Montoya1, Michael Jensen1, Gabriela Ojeda Valencia1, Kai Miller1, Gregory Worrell1, Thomas Naselaris2, Kendrick Kay3, Dora Hermes1; 1Mayo Clinic, 2University of Minnesota, 3Center for Magnetic Resonance Research, University of Minnesota

Understanding the neural computations of vision requires studying how different brain regions interact with one another. However, functional connectivity across brain regions is often computed as a stationary map, concealing the rich neural dynamics that unfold at finer timescales. To better understand how functional connectivity evolves over time, we propose a multimodal framework combining data from intracranial EEG (iEEG) and fMRI. We recorded iEEG data from early visual (V1/V2) electrodes in 4 patients. Each patient was shown a subset of 1000 stimuli from the NSD-fMRI dataset. Electrodes with significant broadband (70-170 Hz) power increases relative to baseline were considered for further analysis. From the NSD-fMRI dataset, we obtained average fMRI beta-weights for the 1000 stimuli that were repeated three times in each of the 8 subjects. Next, for each iEEG electrode and time point, we computed Pearson correlations with all fMRI vertices across the 1000 stimuli, giving us a time x vertices correlation matrix. This provided a brain-wide, temporally evolving correlation map for each electrode. In all 4 subjects, we observed that iEEG broadband significantly correlates with the fMRI beta-weights in V1, and with V2/V3 about 5-10 ms later, followed by the ventral temporal regions around 170 ms. Other parietal and frontal brain regions also showed significant correlations after 100 ms. Further, these correlations decline around 450 ms, even though the stimuli were presented for 800 ms. These temporally resolved correlation maps show that V1 representations are not stationary but come to share structure with higher-order visual areas over time. These results may suggest that connectivity to V1 evolves over time, revealing feedback inputs from higher-order ventral areas around 100-170 ms. Overall, we propose that our multimodal framework enables computing functional connectivity at high spatiotemporal resolution, reflecting the rich dynamics of interaction across brain regions.

Acknowledgements: We thank the patients in this study for their participation, Cindy Nelson and Karla Crockett for their assistance, and Peter Brunner for support with BCI2000. Research reported in this publication was supported by the NEI (R01EY035533, R01EY023384)
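As a concrete illustration of the correlation-map computation described above, here is a minimal Python sketch (not the authors' pipeline): for a single electrode, the trial-averaged broadband time course is correlated with fMRI beta weights across the shared stimuli, separately at every time point and cortical vertex. All array sizes and the random inputs are hypothetical placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_time, n_vertices = 1000, 400, 2000  # hypothetical sizes

# Hypothetical inputs: broadband power (stimuli x time) for one electrode, and
# fMRI betas averaged over repetitions (stimuli x vertices).
broadband = rng.normal(size=(n_stimuli, n_time))
fmri_betas = rng.normal(size=(n_stimuli, n_vertices))

# Z-score across stimuli so that a matrix product yields Pearson correlations.
bb_z = (broadband - broadband.mean(0)) / broadband.std(0)
fmri_z = (fmri_betas - fmri_betas.mean(0)) / fmri_betas.std(0)

# (time x stimuli) @ (stimuli x vertices) / n_stimuli -> time x vertices correlation map.
corr_map = bb_z.T @ fmri_z / n_stimuli
print(corr_map.shape)  # (n_time, n_vertices): one correlation per time point and vertex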

Talk 3, 11:30 am, 62.13

Top-down alpha dynamics mediate the neural representation of coherent visual experiences

Daniel Kaiser1,2, Lixiang Chen1,3, Radoslaw M Cichy3; 1Mathematical Institute, Justus Liebig University Giessen, 2Center for Mind, Brain and Behavior, Philipps University Marburg and Justus Liebig University Giessen, 3Department of Education and Psychology, Freie Universität Berlin

In order to create coherent visual experiences, our visual system needs to aggregate inputs across space and time in a seamless manner. Here, we combine spectrally resolved EEG recordings and spatially resolved fMRI recordings to characterize the neural dynamics that mediate the integration of multiple spatiotemporally coherent inputs into a unified percept. To unveil integration-related brain dynamics, we experimentally manipulated the spatiotemporal coherence of two naturalistic videos presented in the left and right visual hemifields. In a first study, we show that only when spatiotemporally consistent information across both hemifields affords integration, EEG alpha dynamics carry stimulus-specific information. Combining the EEG data with regional mappings obtained from fMRI, we further show that these alpha dynamics can be localized to early visual cortex, indicating that integration-related alpha dynamics traverse the hierarchy in the top-down direction, all the way to the earliest stages of cortical vision. In a second study, we delineate boundary conditions for triggering integration-related alpha dynamics. Such alpha dynamics are observed when videos are coherent in their basic-level category and share critical features, but not when they are coherent in their superordinate category, thus characterizing the range of flexibility in cortical integration processes. Together, our results indicate that the construction of coherent visual experiences is not implemented within the visual bottom-up processing cascade. Our findings rather stress that integration relies on cortical feedback rhythms that fully traverse the visual hierarchy.

Acknowledgements: This work is supported by the DFG (CI241/1-1, CI241/3-1, CI241/7-1, KA4683/5-1, SFB/TRR 135), the ERC (ERC-2018-STG 803370, ERC-2022-STG 101076057), the China Scholarship Council, and “The Adaptive Mind”, funded by the Hessian Ministry of Higher Education, Science, Research and Art.
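The claim that alpha dynamics carry stimulus-specific information can be illustrated with a minimal Python sketch of one common analysis pattern: band-pass filtering to the alpha range, taking the Hilbert envelope, and cross-validating a classifier on stimulus identity. This is an assumed stand-in, not the authors' EEG-fMRI fusion analysis, and the sampling rate, trial counts, and data are hypothetical.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
fs = 250  # hypothetical sampling rate (Hz)
n_trials, n_channels, n_samples = 200, 64, 500
X = rng.normal(size=(n_trials, n_channels, n_samples))  # hypothetical EEG trials
y = rng.integers(0, 2, n_trials)                        # which of two videos was shown

# Alpha band-pass (8-12 Hz) and Hilbert envelope per trial and channel.
b, a = butter(4, [8, 12], btype="bandpass", fs=fs)
alpha = filtfilt(b, a, X, axis=-1)
envelope = np.abs(hilbert(alpha, axis=-1))

# Use the mean alpha envelope per channel as features; chance level is 0.5.
features = envelope.mean(axis=-1)
scores = cross_val_score(LogisticRegression(max_iter=1000), features, y, cv=5)
print("alpha-band decoding accuracy:", scores.mean())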

Talk 4, 11:45 am, 62.14

Neural responses in space and time to a massive set of natural scenes

Peter Brotherwood1, Emmanuel Lebeau1, Mathias Salvas-Hébert1, Marin Coignard1, Shahab Bakhtiari1, Frédéric Gosselin1, Kendrick Kay2, Ian Charest1; 1CerebrUM, Université de Montréal, 2Center for Magnetic Resonance Research, University of Minnesota

Understanding how neurons in the visual system support visual perception requires deep sampling of neural responses across a wide array of visual stimuli. Part of this challenge has been met by a recent large-scale 7T fMRI dataset, termed the Natural Scenes Dataset (NSD). This dataset provides extensive high-resolution spatial sampling of brain activity in eight observers while they view complex natural scenes. Here, we present the NSD-EEG, a large-scale electroencephalography (EEG) dataset that provides detailed characterisation of brain activity from a temporal perspective, thereby completing the characterisation of visual processing in the human brain. For this dataset, we optimised data quality by choosing 8 participants from a larger pool based on empirical signal-to-noise metrics and by using a high-density (164 channels) EEG system within a shielded Faraday cage. NSD images were shown for a duration of 250 ms, followed by a variable interstimulus interval of 750-1000 ms. Each participant viewed 10000 images 10 times, with a subset of 1000 images (common across participants) repeated 30 times. Preliminary analyses reveal remarkably consistent event-related potentials (ERPs) for each stimulus, with high inter-trial reliability even at a rapid one stimulus per second pace (max Pearson R: 0.8, p<0.001). Additionally, split-half representational dissimilarity matrices exhibit strong reliability (max Spearman R: 0.4, p<0.001), further affirming the robustness of our data. We plan to publicly release the NSD-EEG dataset in the near future, alongside an exhaustive battery of complementary behavioural and psychophysical data. In combination with the NSD dataset, this will enable a comprehensive examination of neural responses in space and time to complex natural scenes. Altogether, this will support the ongoing movement using machine learning, artificial intelligence, and other computational methods to characterise and understand the neural mechanisms of vision.

Acknowledgements: This work was supported by a UNIQUE postgraduate research grant (to PB), a Courtois Chair in Neuroscience (to IC), and an NSERC discovery grant (to IC).
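The split-half representational-similarity reliability reported above can be sketched in a few lines of Python; this is not the released NSD-EEG code, and the image, repetition, and feature counts are hypothetical placeholders.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_images, n_reps, n_features = 200, 10, 164 * 25  # hypothetical: channels x time points

# Hypothetical single-trial EEG patterns: images x repetitions x features.
data = rng.normal(size=(n_images, n_reps, n_features))

# Average odd and even repetitions separately, then build one RDM per half.
half1 = data[:, 0::2].mean(axis=1)
half2 = data[:, 1::2].mean(axis=1)
rdm1 = pdist(half1, metric="correlation")  # condensed RDM of image-pair dissimilarities
rdm2 = pdist(half2, metric="correlation")

rho, p = spearmanr(rdm1, rdm2)
print(f"split-half RDM reliability: Spearman rho = {rho:.2f}")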

Talk 5, 12:00 pm, 62.15

Less is more: Aesthetic liking is inversely related to metabolic expense by the visual system

Yikai Tang1, Wil Cunningham1,2, Dirk Bernhardt-Walther1; 1University of Toronto, 2Vector Institute

What makes us like a particular scene or object and dislike another? A variety of visual properties, as well as the observer's experience, familiarity, processing fluency, and self-relevance, have been suggested to underlie aesthetic liking. Here we investigate whether the brain's goal of reducing energy costs (Olshausen and Field, 1997; Friston, 2010) explains the construction of aesthetic appreciation. We propose a simple, straightforward measure for explaining neural responses to visual stimuli with different levels of aesthetic preference: the total metabolic cost of neural firing within relevant regions of interest. We test this hypothesis both in an in-silico model of the visual system (VGG19) and in human observers, and find strong evidence in both. Specifically, we compare the metabolic cost incurred by 4914 object and scene images from the BOLD5000 dataset in a VGG19 network pretrained for object and scene categorization and in randomly initialized versions of VGG19. We find a strong inverse relationship between aesthetic preferences for the images and their metabolic cost, but only in the network trained for categorization. We then test the same hypothesis in the human visual system by comparing aesthetic liking of visual stimuli to metabolic activity measured with functional magnetic resonance imaging. Crucially, we find strong evidence for the hypothesized inverse relationship between metabolic expense and aesthetic liking in both early visual brain regions (V1 and V4) and high-level regions (FFA, OPA, PPA). These findings represent the first direct evidence for a physiological basis of visual aesthetics at the level of energy consumption by the visual system. Aesthetic pleasure may function as an adaptive homeostatic signal that helps conserve energy resources for survival. Our metabolic account of aesthetic liking unifies empirical evidence on visual discomfort with theories of processing fluency, image complexity, expertise, and prototypicality in a simple, physiologically plausible framework.

Acknowledgements: This work was supported by an NSERC Discovery Grant (RGPIN-2018-05946) to WC and an NSERC Discovery Grant (RGPIN-2020-04097) as well as a SSHRC Insight Grant (435-2023-0015) to DBW.
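A minimal Python sketch of the network-side metabolic-cost measure is given below, assuming summed ReLU activations in a pretrained VGG19 as the cost proxy. It uses the ImageNet-pretrained torchvision VGG19 as a stand-in for the authors' object- and scene-trained network, and the image batch and aesthetic ratings are random placeholders.

import torch
from torchvision.models import vgg19, VGG19_Weights
from scipy.stats import spearmanr

# Downloads ImageNet weights on first use; stands in for an object/scene-trained VGG19.
model = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).eval()

# Accumulate the summed activation of every ReLU layer via forward hooks.
totals = []
def accumulate(_module, _inp, out):
    totals.append(out.detach().abs().sum().item())

hooks = [m.register_forward_hook(accumulate)
         for m in model.modules() if isinstance(m, torch.nn.ReLU)]

images = torch.rand(16, 3, 224, 224)  # hypothetical preprocessed image batch
ratings = torch.rand(16)              # hypothetical aesthetic ratings

costs = []
with torch.no_grad():
    for img in images:
        totals.clear()
        model(img.unsqueeze(0))
        costs.append(sum(totals))     # summed firing = metabolic-cost proxy

for h in hooks:
    h.remove()

rho, p = spearmanr(costs, ratings.numpy())
print(f"cost vs. liking: Spearman rho = {rho:.2f}")  # the hypothesis predicts rho < 0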

Talk 6, 12:15 pm, 62.16

Is visual cortex really “language-aligned”? Perspectives from model-to-brain comparisons in humans and monkeys on the Natural Scenes Dataset

Colin Conwell1, Emalie McMahon1, Kasper Vinken3, Jacob S. Prince2, George Alvarez2, Talia Konkle2, Leyla Isik1, Margaret Livingstone3; 1Johns Hopkins University, Department of Cognitive Science, 2Harvard University, Department of Psychology, 3Harvard Medical School, Department of Neurobiology

Recent advances in multimodal deep learning, and in particular “language-aligned” visual representation learning, have re-ignited longstanding debates about the presence and magnitude of language-like semantic structure in the human visual system. A variety of recent works that map the representations of “language-aligned” vision models (e.g. CLIP) and even pure language models (e.g. GPT, BERT) to activity in the ventral visual stream have claimed that the human visual system itself may be “language-aligned”, much like these models. These claims are in part predicated on the surprising finding that pure language models in particular can predict image-evoked activity in the ventral visual stream as well as the best pure vision models (e.g. SimCLR, BarlowTwins). But what would we make of this claim if the same procedures worked in the modeling of visual activity in a species that doesn’t speak language? Here, we deploy controlled comparisons of pure-vision, pure-language, and multimodal vision-language models in prediction of human (N=4) AND rhesus macaque (N=6, 5:IT, 1:V1) ventral stream activity evoked in response to the same set of 1000 captioned natural images (the NSD1000 images). We find (as in humans) that there is effectively no difference in the brain-predictive capacity of pure vision and “language-aligned” vision models in macaque high-level ventral stream (IT). Further, (as in humans) pure language models can predict responses in IT with substantial accuracy, but perform poorly in prediction of early visual cortex (V1). Unlike in humans, however, we find that pure language models perform slightly worse than pure vision models in macaque IT, a gap potentially explained by differences in the recording method alone (fMRI versus electrophysiology). Together, these results suggest that language model predictivity of the ventral stream is not necessarily due to language per se, but rather to the statistical structure of the visual world as reflected in language.
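The core model-to-brain mapping step behind comparisons like these can be sketched as cross-validated ridge regression from model features to neural responses, scored as the correlation between predicted and held-out responses. The Python sketch below is a generic, assumed version of that step, not the authors' controlled comparison, with random placeholder features and responses.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_stimuli, n_features, n_sites = 1000, 512, 100      # hypothetical sizes
features = rng.normal(size=(n_stimuli, n_features))  # e.g. CLIP or GPT embeddings
responses = rng.normal(size=(n_stimuli, n_sites))    # e.g. IT sites or fMRI voxels

def predictivity(X, Y, n_splits=5):
    """Mean Pearson r between predicted and held-out responses, per site."""
    scores = np.zeros((n_splits, Y.shape[1]))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for i, (train, test) in enumerate(folds.split(X)):
        model = RidgeCV(alphas=np.logspace(-2, 5, 8)).fit(X[train], Y[train])
        pred = model.predict(X[test])
        # Correlate prediction and data column-wise (per site).
        pz = (pred - pred.mean(0)) / pred.std(0)
        yz = (Y[test] - Y[test].mean(0)) / Y[test].std(0)
        scores[i] = (pz * yz).mean(0)
    return scores.mean(0)

print("median predictivity:", np.median(predictivity(features, responses)))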

Talk 7, 12:30 pm, 62.17

The high-dimensional structure of natural image representations varies systematically across visual cortex

Raj Magesh Gauthaman1, Brice Ménard1, Michael Bonner1; 1Johns Hopkins University

The computational goal of the visual cortex is often described as systematic dimensionality reduction, where high-dimensional sensory input is gradually reduced to a low-dimensional manifold over multiple stages of processing. Recently, thanks to the unprecedented size of the Natural Scenes Dataset, we showed that the structure of human visual cortex representations is high-dimensional: we were able to reliably detect visual information encoded over many hundreds of latent dimensions. In an effort to reconcile these divergent theoretical predictions and empirical results, we set out to investigate, from a spectral perspective, how natural image representations are transformed along the visual hierarchy. Using a robust cross-decomposition approach, we estimated cross-validated covariance spectra of fMRI responses in several regions of interest in the visual cortex. In all of them, we observed power-law covariance spectra over hundreds of dimensions. Interestingly, we also noticed systematic trends: spectra decay more rapidly from earlier to later stages of visual processing. This could be seen from V1 to V4 and also from early to mid and late stages of processing within the ventral, dorsal, and lateral visual streams. High-level functionally localized regions of visual cortex, including face-, body-, scene-, and object-selective cortex, likewise showed covariance spectra that decay more rapidly. Our findings demonstrate that, while cortical representations of natural images are consistently high-dimensional across many stages of processing (thus using all available dimensions to encode visual information), there are nonetheless systematic regional variations in how information is concentrated along these dimensions. These differences in the representational structure of visual regions may provide insight into computational strategies in the human brain.
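One way to make the cross-validated covariance spectrum concrete is the split-half estimate sketched below in Python: latent dimensions are estimated from one half of the repetitions, both halves are projected onto them, and the per-dimension covariance between halves gives a noise-robust spectrum whose decay can be summarized by a log-log slope. This is an assumed simplification, not the authors' exact cross-decomposition method, and the data are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_images, n_voxels = 1000, 800                      # hypothetical sizes
signal = rng.normal(size=(n_images, n_voxels))      # shared "signal" component
half1 = signal + rng.normal(scale=2.0, size=signal.shape)  # repeat-half 1 (noisy)
half2 = signal + rng.normal(scale=2.0, size=signal.shape)  # repeat-half 2 (noisy)

half1 -= half1.mean(0)
half2 -= half2.mean(0)

# Latent dimensions estimated from half 1 (PCA via SVD on images x voxels).
_, _, vt = np.linalg.svd(half1, full_matrices=False)

# Project both halves onto those dimensions; the per-dimension cross-validated
# covariance retains signal while independent noise averages out.
proj1 = half1 @ vt.T
proj2 = half2 @ vt.T
cv_spectrum = (proj1 * proj2).mean(axis=0)

# The decay exponent is the slope in log-log coordinates (near zero for this
# isotropic placeholder data; the abstract reports power-law decay in real fMRI).
dims = np.arange(1, 201)
slope = np.polyfit(np.log(dims), np.log(np.clip(cv_spectrum[:200], 1e-12, None)), 1)[0]
print(f"spectral decay exponent over the first 200 dimensions: {slope:.2f}")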