Scene Perception

Talk Session: Saturday, May 16, 2026, 5:15 – 7:00 pm, Talk Room 1
Moderator: Miguel Eckstein, UC Santa Barbara

Talk 1, 5:15 pm, 25.11

LAION-fMRI: A densely sampled 7T-fMRI dataset providing broad coverage of natural image diversity

Josefine Zerbe1,2, Johannes Roth1,2, Maggie Mae Mell1,2, Peer Herholz1,2, Tomas Knapen3,4,5, Martin N. Hebart1,2,6; 1Justus Liebig University, Giessen, Germany, 2Max-Planck Institute CBS, Leipzig, Germany, 3Spinoza Centre for Neuroimaging, Amsterdam, Netherlands, 4Netherlands Institute for Neuroscience, Amsterdam, Netherlands, 5Vrije Universiteit, Amsterdam, Netherlands, 6Center for Mind, Brain, and Behavior, Universities of Marburg, Giessen, and Darmstadt, Marburg, Germany

Research in visual neuroscience aims to arrive at generalizable conclusions, yet the complexity and breadth of our everyday visual experience make exhaustive stimulus sampling impossible. Massive fMRI datasets, such as BOLD5000, NSD, or THINGS, have attempted broad coverage of the visual space by densely sampling a large-scale corpus of natural images from a small-scale cohort of human brains. Despite the tremendous effort involved in their collection, these datasets face several challenges: First, recent work has questioned whether their stimulus sampling is sufficiently broad to allow for true generalization, given strongly overlapping training and testing distributions. While some datasets provide out-of-distribution images, their scope remains limited. Further, existing datasets favor broad over deep sampling, making purely data-driven analyses of voxel selectivity challenging due to the noise inherent in the data. Addressing these challenges, we acquired a new multi-echo 7T functional MRI dataset (1.8 mm isotropic, 1.9 s TR) in which 5 individuals viewed 25,844 newly sampled unique images (2,284 shared) across 165 sessions. For broad representational sampling, most images were natural photographs derived from LAION-natural (120 million image-text pairs; Roth & Hebart, 2025), with an out-of-distribution subset including abstract shapes, visual illusions, and self-photographed images. Participants viewed centrally presented stimuli (9×9 DVA; 2.5 s on, 0.5 s off; repeated 4-12 times) while performing a memory task. Preprocessing yielded single-trial betas with consistently excellent noise ceilings and minimal head motion estimates across all participants. Extending the scope, LAION-fMRI was collected in individuals for whom we previously acquired extensive retinotopic mapping, diverse functional localizers, and precision diffusion data for u-fiber mapping, providing the basis for a detailed understanding of visual functional neuroanatomy. We believe LAION-fMRI will be of broad utility for testing and evaluating the generalizability of models of vision and language, and that it is uniquely positioned for fine-grained theory-driven and data-driven analyses of visual patterns of brain activity.
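
The abstract reports noise ceilings for the single-trial betas without specifying the estimator. As a hedged illustration only, a common approach is split-half reliability with Spearman-Brown correction; the sketch below assumes a fixed number of repeats per voxel and image, whereas the actual dataset varies between 4 and 12 repeats:

```python
import numpy as np

def split_half_noise_ceiling(betas):
    """Estimate a voxel's noise ceiling from repeated single-trial betas.

    betas: (n_images, n_repeats) single-trial beta estimates for one
    voxel, with each image presented n_repeats times (simplified to a
    fixed repeat count here).
    """
    n_repeats = betas.shape[1]
    half = n_repeats // 2
    # Average the two halves of the repetitions into independent estimates.
    half1 = betas[:, :half].mean(axis=1)
    half2 = betas[:, half:].mean(axis=1)
    r = np.corrcoef(half1, half2)[0, 1]
    # Spearman-Brown correction: reliability of the full set of repeats.
    return 2 * r / (1 + r)
```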

This work was supported by a Max Planck Research Group Grant (M.TN.A.NEPF0009) and the ERC Starting Grant COREDIM (ERC-StG-2021-101039712) both awarded to MNH.

Talk 2, 5:30 pm, 25.12

Bifurcation of scene perception and memory-related processes in superior parietal cortex

Adam Steel1, Nicole Tang1, Dominika Panek1, Catriona Scrivener2, Edward Silson2; 1University of Illinois, 2University of Edinburgh

Scene perception and memory are closely linked in cortex: each of the three scene-selective brain areas in posterior cerebral cortex (parahippocampal place area, PPA; occipital place area, OPA; and medial place area, MPA) has a paired area that activates when participants recall familiar places. Recently, a fourth scene-selective brain area was identified on the posterior intraparietal gyrus, referred to as PIGS (Kennedy et al., 2024; Yoon et al., 2025). Here, in two independent datasets, we investigated the topography of perception and memory responses for visual scenes in posterior intraparietal gyrus. In Dataset 1, we localized PIGS in 12 participants by comparing fMRI activity while participants viewed static images of scenes versus faces. In Dataset 2, we localized PIGS by comparing fMRI activity while 12 participants viewed 3-second videos of scenes, faces, objects, and bodies. In both datasets, we compared the locations of PIGS with place-memory-selective activity, localized by comparing activation when participants visually recalled personally familiar places versus people. In both datasets, we found a swath of memory-selective activity immediately anterior and adjacent to PIGS, consistent with the posterior-anterior perception-memory arrangement observed for the other scene-selective areas. We then characterized these areas’ position in the larger cortical hierarchy using resting-state functional connectivity. Intriguingly, PIGS sits at the confluence of two distinct cortical pathways: i) a posterior-medial pathway spanning medial parietal cortex, including Area 23a and the parietal-occipital sulcus, and extending to the hippocampus, and ii) a lateral prefrontal pathway comprising caudal and rostral intraparietal cortex, the frontal eye fields, and the inferior frontal junction. Together, these results show that all brain areas that process visual scenes have a paired memory-responsive area. Moreover, PIGS may be uniquely positioned at an intersection of two distinct processing streams that facilitate visually guided navigation.
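
The resting-state connectivity analysis is not detailed in the abstract. As an illustrative sketch only, seed-based functional connectivity of the kind that could reveal such pathways reduces to correlating a seed region's mean time course (here hypothetically PIGS) with every voxel:

```python
import numpy as np

def seed_connectivity(seed_ts, voxel_ts):
    """Pearson correlation of a seed time course with every voxel.

    seed_ts:  (n_timepoints,) mean resting-state time course of the seed
    voxel_ts: (n_timepoints, n_voxels) voxel-wise time courses
    Returns an (n_voxels,) connectivity map.
    """
    # z-score both, then average the products: equivalent to Pearson r.
    seed = (seed_ts - seed_ts.mean()) / seed_ts.std()
    vox = (voxel_ts - voxel_ts.mean(axis=0)) / voxel_ts.std(axis=0)
    return (vox * seed[:, None]).mean(axis=0)
```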

Brain and Behavior Research Foundation

Talk 3, 5:45 pm, 25.13

Intracranial Recordings in Human Posterior Parietal Cortex Reveal a Node in the Intuitive Physics Network

Vasiliki Bougou1, RT Pramod2, Jorge Gamez1, Emily Rosario3, Charles Liu4, Kelsie Pejsa1, Ausaf Bari5, Richard Andersen1, Nancy Kanwisher2; 1California Institute of Technology, 2Massachusetts Institute of Technology, 3Casa Colina Hospital and Center for Healthcare, 4University of Southern California, 5University of California Los Angeles

fMRI studies have identified parietal and frontal regions in humans that respond preferentially during intuitive physics judgments compared to difficulty-matched control tasks (Fischer et al., 2016). Further, analysis of fMRI patterns has shown that these regions encode object mass, stability, contact, and imminent collisions (Pramod et al., 2021, 2025). However, clarifying the underlying neural computations requires finer-grained sampling of neural responses in space and time. Here, we recorded single-unit activity (SUA) and local field potentials (LFPs) from two individuals with spinal cord injury. Participant RD had one Utah array in the superior parietal lobule (SPL) and one in the supramarginal gyrus (SMG), whereas participant JJ had one array in SPL and contributed only LFPs due to the long implantation duration. Across sessions in RD, we recorded from more than 500 units per array and directly replicated the fMRI intuitive physics experiments. In the tower-falling task, firing rates were markedly higher during physical than color judgments (52 physics-selective units in SPL, 37 in SMG, and fewer than 10 units selective for the color task in either region). The same SPL array responded strongly to short videos showing physical events but minimally to videos showing social events (95 physics-selective units versus 2 social-selective units). Yet, neither region showed increased responses or above-chance decoding for difficult versus easy spatial working memory tasks, indicating that our findings cannot be attributed to these arrays being part of the multiple-demand network. High-gamma activity in both participants mirrored the spiking results from RD. Our results provide direct electrophysiological evidence that the posterior parietal cortex is engaged in intuitive physical reasoning, consistent with prior work showing that these areas encode internal models of the body and environment and form movement intentions, making intuitive physics a natural component of their function (Chivukula et al., 2025).
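
The abstract does not state the statistical criterion behind the selective-unit counts. A minimal sketch of one conventional approach, a one-sided t-test on trial-wise firing rates per unit (requires SciPy >= 1.6 for the one-sided alternative), might look like this:

```python
import numpy as np
from scipy import stats

def count_selective_units(rates_a, rates_b, alpha=0.05):
    """Count units firing significantly more in condition A than B.

    rates_a, rates_b: (n_trials, n_units) trial-wise firing rates for
    two conditions (e.g., physics versus color judgments).
    """
    # Independent-samples t-test per unit, one-sided (A > B).
    t, p = stats.ttest_ind(rates_a, rates_b, axis=0, alternative='greater')
    return int(np.sum(p < alpha))
```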

National Eye Institute grant UG1EY032039, Tianqiao and Chrissy Chen Brain-Machine Interface Center at Caltech, Swartz Foundation, Boswell Foundation, Fonds Wetenschappelijk Onderzoek (FWO) grant 1264926N

Talk 4, 6:00 pm, 25.14

Contributions of Foveation, Scene Ambiguity, and Visual Complexity to Human Time to Comprehend Scenes

Ziqi Wen1, Jonathan Skaza2, Sharvan Murlidaran3, Miguel P. Eckstein1,2,3; 1Department of Computer Science, UC Santa Barbara, 2Graduate Program in Dynamical Neuroscience, UC Santa Barbara, 3Department of Psychological and Brain Sciences, UC Santa Barbara

Introduction: Scene comprehension is central to everyday visual cognition. We understand the factors (e.g., target detectability/discriminability, crowding, expectations) influencing response times (RTs) for basic perceptual tasks such as pattern discrimination, identification, search, and scene classification. Furthermore, there are image-computable models to predict RTs for such tasks (Spoerer et al., 2020; Goetschalckx et al., 2023; Rafiei et al., 2024). However, less is known about the factors that contribute to RTs for scene comprehension, nor is there an image-computable model for them. Here, we investigate how the interactions between image content and foveation, visual complexity, and scene ambiguity contribute to human scene comprehension RTs. We also propose a multi-factor image-computable model to predict RTs and scene description accuracy under limited viewing time. Methods: We used image-computable scores to quantify the costs of foveation (foveated scene understanding, F-SUM; Wen et al., 2025), scene ambiguity (language entropy of Multi-Modal Large Language Models; LE, Malinin & Gales, 2020), and perceived image complexity (IC, using trained deep learning models; Skaza et al., 2025; Feng et al., 2023) for 277 scenes. We measured human (N = 100) RTs for comprehending the scenes, and the description accuracy for another observer group (N = 20) allowed only two saccades. Results: Foveation and scene ambiguity correlated the most with RTs (F-SUM: r = 0.561; LE: r = 0.528; IC: r = 0.422; all p < .05). A linear combination of F-SUM, LE, and IC scores resulted in a higher correlation (F-SUM+LE+IC: r = 0.676, p < .0001) than any single metric alone. When observers were limited to two saccades, foveation (F-SUM) became the better predictor of description accuracy (all p < .0001: F-SUM, r = -0.54; LE, r = -0.405; IC, r = -0.195). Conclusion: Foveation and scene ambiguity contribute more than image complexity to scene comprehension RTs, but all three are complementary image-computable factors that together best predict human RTs.
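
A minimal sketch of the linear combination reported in the Results, assuming the three per-scene scores and mean RTs are available as arrays (variable names are hypothetical; the authors' exact fitting procedure is not given in the abstract):

```python
import numpy as np
from scipy import stats

def combined_rt_correlation(f_sum, le, ic, rt):
    """Fit RT ~ F-SUM + LE + IC by least squares and return the
    in-sample correlation between predicted and observed RTs."""
    # Design matrix with an intercept column.
    X = np.column_stack([f_sum, le, ic, np.ones_like(f_sum)])
    coefs, *_ = np.linalg.lstsq(X, rt, rcond=None)
    pred = X @ coefs
    return stats.pearsonr(pred, rt)[0]
```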

Talk 5, 6:15 pm, 25.15

Seeing Without Seeing: Implicit Peripheral Color Processing

Shao-Min (Sean) Hung1,2, Junhao Jiang2, Sotaro Taniguchi2, Katsumi Watanabe2; 1Tohoku University, Sendai, Japan, 2Waseda University, Tokyo, Japan

Human visual experience often appears subjectively rich and coherent across the visual field. However, recent work using virtual reality demonstrated that observers frequently fail to notice peripheral color removal even when explicitly instructed to detect it (Cohen et al., 2020), suggesting that peripheral color phenomenology may be impoverished. Critically, this does not preclude the possibility that unseen peripheral color is still processed and influences behavior. Here, we examined whether colors in the “undetectable periphery” contribute to visual performance. In Experiment 1 (n = 30), participants navigated a panoramic VR scene in which peripheral colors were gradually desaturated. At random intervals, a fixation cross appeared, followed by a target dot presented in one of three regions: saturated/near, boundary, or desaturated/peripheral. Participants responded faster and localized targets more accurately in saturated/near regions. In Experiment 2 (n = 20), we controlled for eccentricity by including both saturated and desaturated conditions in the periphery. To ensure that color differences remained undetectable, each participant completed a pre-experiment calibration to estimate their individual chance-level color detection radius. The results replicated those of Experiment 1: even when participants could not phenomenally distinguish the saturated from the desaturated periphery, targets appearing in physically saturated regions yielded better performance. These findings show that peripheral colors—despite our limited capacity to reliably detect them—can facilitate visual detection and localization. More broadly, they reveal a dissociation between peripheral color phenomenology and the functional use of peripheral color information in guiding behavior.
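
The calibration procedure is not described in detail. One plausible sketch, assuming detection accuracy falls from near-ceiling to a 0.5 chance level with eccentricity, is to fit a psychometric function and take its midpoint as the individual radius (least-squares fitting of binary responses is a simplification of a full maximum-likelihood fit):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k, lapse):
    # Detection probability: ~(1 - lapse) near fixation, 0.5 far out.
    return 0.5 + (0.5 - lapse) / (1 + np.exp(k * (x - x0)))

def detection_radius(eccentricities, correct):
    """Estimate the eccentricity where color-change detection falls
    toward chance, from per-trial eccentricities and 0/1 responses."""
    p0 = [np.median(eccentricities), 1.0, 0.02]  # initial guesses
    params, _ = curve_fit(logistic, eccentricities, correct, p0=p0)
    return params[0]  # x0: midpoint of the fall-off
```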

SMH was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI “Early-Career Scientists” (24K16877).

Talk 6, 6:30 pm, 25.16

How Gaze Behavior Explains Gender Differences in Spatial Learning in Mixed Reality Navigation

Yu Zhao1, Holly Gagnon, Jeanine Stefanucci2, Sarah Creem-Regehr2, Bobby Bodenheimer3; 1Kennesaw State University, 2University of Utah, 3Vanderbilt University

Gender differences are commonly observed in the adoption of strategies for spatial cognition as well as for certain immersive technologies. Understanding such gender-specific behaviors is crucial for designing inclusive technologies to support spatial tasks. We conducted a gender-balanced experiment with 90 participants who performed a navigation task with three mixed reality navigation interfaces to investigate how gaze behaviors relate to spatial learning performance, differ by gender, and potentially mediate the gap in spatial learning that is often observed between genders. The study consisted of a spatial learning phase (memorizing landmarks while following a route) and a retrieval phase (estimating directions with a pointing task). Gaze metrics, including gaze dispersion, saccadic amplitude, fixation distance, and gaze entropy, were calculated. Using multilevel mediation analysis, we assessed (1) whether gender predicted gaze behavior; (2) whether gaze predicted pointing error; and (3) the mediation effect of gaze behavior using the Aroian test, controlling for interface condition and trial order. Our results reveal a phase-dependent effect. During the learning phase, female participants adopted distinct visual strategies, including significantly wider horizontal scanning and higher gaze entropy compared to males. However, these strategies did not mediate spatial learning performance. In contrast, during the pointing phase, horizontal gaze dispersion emerged as a significant mediator. Females continued to exhibit wide horizontal scanning during retrieval, a behavior that strongly predicted higher pointing errors. We also distinguish this gender-linked strategy from general states of confusion: while gaze entropy predicted errors for all users, it did not differ by gender. These findings suggest that the gender gap in spatial learning is driven not by inefficient information-gathering gaze behavior during learning, but by a female-specific horizontal-scanning behavior during retrieval that may produce a performance deficit relative to males.
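
The Aroian test used here is a variant of the Sobel test whose standard error retains the product of the two path-coefficient variances. A minimal sketch, assuming the path coefficients and standard errors come from the fitted multilevel models:

```python
import math
from scipy.stats import norm

def aroian_test(a, sa, b, sb):
    """Aroian test for an indirect (mediation) effect a*b.

    a, sa: coefficient (and SE) for predictor -> mediator
    b, sb: coefficient (and SE) for mediator -> outcome
    Returns (z, two-tailed p).
    """
    se = math.sqrt(b**2 * sa**2 + a**2 * sb**2 + sa**2 * sb**2)
    z = (a * b) / se
    return z, 2 * norm.sf(abs(z))
```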

This material is based upon work supported by the Office of Naval Research under grant N0014-21-1-2583

Talk 7, 6:45 pm, 25.17

Idiosyncrasies in Internal Models Predict Individual Differences in Spatiotemporal Neural Processing of Natural Scenes

Micha Engeser1,2, Thea Schmitt1, Daniel Kaiser1,2,3; 1Neural Computation Group, Department of Mathematics and Computer Science, Physics, Geography, Justus Liebig University Giessen, 35392 Giessen, Germany, 2Center for Mind, Brain and Behavior (CMBB), Philipps University Marburg, Justus Liebig University Giessen, and Technical University Darmstadt, 35032 Marburg, Germany, 3Cluster of Excellence “The Adaptive Mind”, Justus Liebig University Giessen, Philipps University Marburg, and Technical University Darmstadt, 35392 Giessen, Germany

Why do humans differ in how they perceive the world around them? Traditionally, this question has received limited attention, with variability between participants often dismissed as noise. Building on predictive processing theories, we propose that idiosyncrasies in internal models—expectations about what the world should look like—are a key source of such perceptual variability. Using an inter-subject representational similarity analysis (IS-RSA), we tested whether inter-individual similarities in internal models for natural scene categories predict similarities in neural fMRI and EEG responses when viewing scenes from these categories. To characterize internal models, participants drew what they considered the most typical version of specific scene categories. We then used deep-learning tools to transform the drawings into photorealistic images and, in turn, quantify inter-individual similarities in the resulting images. Relating the resulting inter-individual similarities in internal models to inter-individual similarities in neural responses yielded two key insights: First, participants with more similar internal models showed greater alignment in fMRI BOLD time courses within lateral occipital and lateral prefrontal cortices. Second, participants with more similar internal models exhibited more similar scene representations in EEG signals, emerging around 400 ms after stimulus onset. Together, these findings demonstrate that individual priors regarding the structure of the world offer a parsimonious explanation for why spatiotemporal processing in the visual system varies across individuals.
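
A minimal sketch of the IS-RSA step described here, correlating the off-diagonal entries of the subject-by-subject similarity matrices (construction of the matrices from the drawings and from the neural data is assumed to have happened upstream):

```python
import numpy as np
from scipy.stats import spearmanr

def is_rsa(model_sim, neural_sim):
    """Inter-subject RSA: relate pairwise subject similarities in
    internal models to pairwise similarities in neural responses.

    model_sim, neural_sim: (n_subjects, n_subjects) symmetric
    similarity matrices.
    """
    # Use only the unique upper-triangle (off-diagonal) pairs.
    iu = np.triu_indices_from(model_sim, k=1)
    rho, p = spearmanr(model_sim[iu], neural_sim[iu])
    return rho, p
```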