VSS, May 13-18

Scene Perception

Talk Session: Tuesday, May 17, 2022, 5:15 – 7:15 pm EDT, Talk Room 2
Moderator: Caroline Robertson, Dartmouth College

Talk 1, 5:15 pm, 55.21

Coarse-to-fine processing drives the efficient coding of natural scenes in mouse visual cortex

Rolf Skyberg1, Seiji Tanabe1, Hui Chen1, JC Cang1; 1Department of Biology and Department of Psychology, University of Virginia, Charlottesville, VA, 22904, USA

The sequential analysis of information in a coarse-to-fine (CtF) manner is a fundamental processing strategy of the visual system. Previous studies have shown that neurons in the primary visual cortex (V1) of anesthetized animals can process spatial information in a CtF fashion, shifting their spatial frequency (SF) preference from low (coarse) to high (fine) throughout their response to static grating stimuli. However, many central questions regarding CtF processing, such as whether it occurs in awake behaving mice and what computational advantages it may provide, remain unexplored. Here, we performed large-scale single-unit recordings to characterize CtF processing in both anesthetized and awake mice, determine its developmental profile, and study its role in encoding ethologically relevant natural scenes. Using high-density multielectrode silicon probes and subspace mapping of receptive fields, we found that the vast majority of V1 neurons from awake adult mice displayed two temporally discrete peaks in their spatiotemporal receptive field, each with a distinct SF preference. The SF shift between these two peaks was large and nearly always from low to high (i.e., CtF). Additionally, we discovered that CtF processing is significantly attenuated in anesthetized mice and develops postnatally via experience-dependent mechanisms. Finally, we show that awake mice process the complex spatial statistics of natural scenes in a CtF manner. Excitingly, we demonstrate that this CtF processing reduces redundancy in the neural representation of natural scenes by shifting the population response away from the high-power, low-SF statistical regularities in these stimuli. This redundancy reduction drove an increase in the representational efficiency of natural images that did not occur in anesthetized or dark-reared mice, both of which showed significantly attenuated CtF processing. Collectively, these findings establish a novel, state-dependent computation of cortical circuitry that develops after vision onset and allows the animal to efficiently encode the complex spatial statistics of natural scenes.
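
For illustration only (this is not the authors' analysis code): a minimal Python sketch of how a coarse-to-fine shift could be quantified from a neuron's spatiotemporal spatial-frequency tuning, assuming a time-by-SF response matrix from subspace receptive-field mapping; all variable names and values are hypothetical placeholders.

```python
# Hedged sketch: quantify a coarse-to-fine (CtF) SF shift from a neuron's
# spatiotemporal SF tuning map. `sf_tuning` is a (time bins x SFs) response
# matrix; `sfs` lists the tested spatial frequencies in cyc/deg.
import numpy as np
from scipy.signal import find_peaks

sfs = np.array([0.02, 0.04, 0.08, 0.16, 0.32])      # hypothetical SF samples (cyc/deg)
rng = np.random.default_rng(0)
sf_tuning = rng.random((30, sfs.size))               # placeholder data, 30 time bins

# 1. Find temporally discrete response peaks in the overall time course.
time_course = sf_tuning.sum(axis=1)
peaks, _ = find_peaks(time_course, distance=5)       # enforce temporal separation
early, late = peaks[:2] if peaks.size >= 2 else (None, None)

# 2. Preferred SF at each peak, and the shift expressed in octaves.
if early is not None:
    sf_early = sfs[np.argmax(sf_tuning[early])]
    sf_late = sfs[np.argmax(sf_tuning[late])]
    shift_octaves = np.log2(sf_late / sf_early)      # > 0 means coarse-to-fine
    print(f"early {sf_early} -> late {sf_late} cyc/deg, shift {shift_octaves:+.2f} octaves")
```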

Acknowledgements: This work was sponsored by NIH grant 1F32EY032360-01A1

Talk 2, 5:30 pm, 55.22

The influence of spatial frequency and luminance on early visual processing: A fixation-related potentials approach

Anna Madison1,2, Jon Touryan1, Michael Nonte3, Anthony Ries1,2; 1DEVCOM Army Research Laboratory, Aberdeen Proving Ground, MD USA, 2Warfighter Effectiveness Research Center, U.S. Air Force Academy, CO USA, 3DCS Corp., Alexandria, VA USA

Scene processing occurs rapidly to create a coherent visual representation of our external environment over time. Historically, electrophysiological correlates of scene processing have been studied with experiments using static stimuli presented for discrete durations while participants maintain a fixed eye position. Gaps remain in generalizing these findings to real-world conditions, where eye movements are made to select new visual information and where changes in our position within an otherwise stable environment drive dynamic visual stimulation. Co-recording of eye movements and electroencephalography (EEG) provides an approach to leverage fixations as time-locking events in the EEG recording under free-viewing conditions. The resulting fixation-related potential (FRP) provides a neural snapshot with which to study visual processing under more naturalistic conditions. The current experiment aimed to explore the influence of scene statistics, specifically spatial frequency and luminance, on the early visual components evoked by fixations in a dynamic, continuous task. We present co-recorded eye movement and EEG data from a virtual navigation and visual search task in which spatial frequency and pixel-wise RGB luminance were calculated within a 5 deg patch centered on fixation. As part of our FRP estimation process, we used Independent Component Analysis to remove ocular artifacts (Dimigen, 2020) and deconvolutional modeling to control for overlapping neural activity and nonlinear covariates (Ehinger & Dimigen, 2021). The results suggest that early visual components of the FRP are sensitive to luminance and spatial frequency scene statistics around fixation, separately from saccade-driven amplitude modulation.
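
As an illustration of the kind of fixation-locked scene statistics described above (not the authors' pipeline), the following Python sketch computes mean luminance and a radially averaged spatial-frequency profile for a patch centered on a fixation; the pixel size corresponding to 5 deg is a placeholder that depends on display geometry.

```python
# Hedged sketch: luminance and spatial-frequency content of a fixation-centered patch.
# `frame` is an RGB image (H x W x 3); `fix_xy` is a fixation location in pixels.
import numpy as np

def fixation_patch_stats(frame, fix_xy, patch_px=128):
    x, y = int(fix_xy[0]), int(fix_xy[1])
    half = patch_px // 2
    patch = frame[max(y - half, 0):y + half, max(x - half, 0):x + half]

    # Pixel-wise luminance from RGB (Rec. 709 weights as one common convention).
    lum = 0.2126 * patch[..., 0] + 0.7152 * patch[..., 1] + 0.0722 * patch[..., 2]

    # Radially averaged amplitude spectrum as a simple spatial-frequency summary.
    spec = np.abs(np.fft.fftshift(np.fft.fft2(lum - lum.mean())))
    yy, xx = np.indices(spec.shape)
    cy, cx = np.array(spec.shape) // 2
    radius = np.hypot(yy - cy, xx - cx).astype(int)
    radial_profile = np.bincount(radius.ravel(), weights=spec.ravel()) / np.bincount(radius.ravel())
    return lum.mean(), radial_profile
```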

Acknowledgements: Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-21-2-0187.

Talk 3, 5:45 pm, 55.23

Relationship between spatial frequency selectivity and receptive field size for scene perception

Charlotte Leferink1, Claudia Damiano2, Dirk Walther1; 1University of Toronto, 2KU Leuven

Scene information is conveyed by a large array of visual features, which are processed in specialized high-level visual areas after passing through early visual cortex. Scene processing is believed to rely mainly on low spatial frequencies, as inferred from the relative size of population receptive fields (pRFs) in scene-processing areas. In apparent contradiction with this account, information about scene content in high-level visual areas relies predominantly on high spatial frequency (HSF) information. Here we attempt to reconcile these accounts by combining pRF mapping of visual cortex with the decoding of scene information from fMRI activity elicited by modified images. Specifically, we re-analyzed the fMRI data from Berman et al.'s (2017) study of spatial frequency-filtered scene images. We computed scene category prediction accuracy within the functionally localized Parahippocampal Place Area (PPA) and primary visual cortex (V1). We then selected subsets of voxels within these ROIs based on their pRF properties, as derived from the Human Connectome Project retinotopy dataset (Benson et al., 2018). In area V1, we found significantly greater decoding accuracy for images filtered for low spatial frequencies (LSF) in peripheral voxels with large pRFs compared to foveal voxels with smaller pRFs, thereby confirming the presumed relationship between receptive field size and sensitivity to spatial frequencies. In the PPA, on the other hand, we found no such difference based on pRF size. Instead, the PPA is organized along the anterior-posterior (AP) axis, with more accurate decoding of HSF scenes in anterior than in posterior PPA. These findings demonstrate that while the commonly held relationship between receptive field size and specialization for spatial frequencies holds in primary visual cortex, it does not extend to the PPA, with its more complex receptive field properties. Instead, an anatomical subdivision of the PPA along the AP axis dominates its specialization for particular spatial frequencies.
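
A minimal Python sketch (not the original analysis code) of the kind of ROI decoding described above: scene-category classification restricted to voxel subsets selected by pRF size; the data variables are hypothetical placeholders.

```python
# Hedged sketch: decode scene category from voxel subsets defined by pRF size.
# `betas` is a (trials x voxels) pattern matrix, `labels` the scene categories,
# `prf_size` one pRF size estimate per voxel (e.g., from a retinotopy template).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def decode_by_prf_size(betas, labels, prf_size, n_voxels=200):
    order = np.argsort(prf_size)
    subsets = {"small_pRF": order[:n_voxels], "large_pRF": order[-n_voxels:]}
    scores = {}
    for name, idx in subsets.items():
        clf = LogisticRegression(max_iter=2000)
        scores[name] = cross_val_score(clf, betas[:, idx], labels, cv=5).mean()
    return scores
```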

Acknowledgements: Natural Sciences and Engineering Research Council of Canada (NSERC)

Talk 4, 6:00 pm, 55.24

Full-field fMRI: a novel approach to study immersive vision

Jeongho Park1, Edward Soucy1, Jennifer Segawa1, Talia Konkle1; 1Harvard University

In everyday vision, we experience a >180-degree view of the world in front of us. However, traditional functional magnetic resonance imaging (fMRI) setups are limited to presenting scenes like postcards in the central 10-15 degrees of the visual field. Here, we develop a method for ultra-wide-angle visual presentation in the scanner and explore how the brain processes visual scene information when presented with immersive first-person views. To accomplish wide-angle projection, we bounced the image off two angled mirrors directly into the scanner bore and onto a custom-built curved screen, creating an unobstructed visual presentation of over 175 degrees. Additionally, we presented images that depicted a compatible wide field of view, rendered from 3D scenes built in Unity software; using standard scene images led to distorted perception of the environment. With this setup, we measured brain responses to a range of stimuli, including scene images presented in the full field and at a typical smaller visual size. We found that all classic scene areas (parahippocampal place area, retrosplenial cortex, and occipital place area) were activated significantly more by the full-field scenes than by the postcard scenes, indicating their preference for the far periphery. Crucially, we found that a large swath of cortex connecting these areas was also more strongly activated by the full-field than by the postcard scenes, forming a ring around the parieto-occipital sulcus. Theoretically, these findings raise the intriguing possibility that representational principles unify what are currently considered separate scene-selective regions within a common large-scale organization. Methodologically, our approach provides a novel avenue to test hypotheses about the foveal-peripheral organization of higher-level visual areas and to measure scene-processing mechanisms with an immersive experience of scale.
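
As a back-of-the-envelope illustration of why conventional projection covers only the central visual field while a curved screen close to the eyes can approach a ~175-degree view, the following sketch uses simple visual-angle geometry; all dimensions are hypothetical and are not taken from the abstract.

```python
# Hedged sketch with made-up dimensions: visual angle for a conventional flat
# screen viewed at a distance vs. a curved screen approximated as a circular
# arc centered near the eyes.
import math

screen_width_cm, viewing_distance_cm = 30.0, 110.0   # hypothetical flat-screen setup
flat_fov = 2 * math.degrees(math.atan((screen_width_cm / 2) / viewing_distance_cm))
print(f"flat-screen field of view ~ {flat_fov:.1f} deg")      # about 15 deg

arc_length_cm, radius_cm = 40.0, 13.0                 # hypothetical curved-screen setup
curved_fov = math.degrees(arc_length_cm / radius_cm)  # arc angle = arc length / radius
print(f"curved-screen field of view ~ {curved_fov:.1f} deg")  # about 176 deg
```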

Acknowledgements: R21EY031867

Talk 5, 6:15 pm, 55.25

A cortical network representing spatial context of visual scenes in posterior cerebral cortex

Brenda Garcia1, Adam Steel1, Anna Mynick1, Kala Goyal1, Caroline Robertson1; 1Dartmouth College

As we navigate through the world, the visual scene in front of us is linked seamlessly to the broader environment. Where in the brain is this contextual information represented? Here, we tested where in the brain activity during perception of a scene view is modulated by the degree of spatial context associated with that view in memory. Participants (N=17) studied 20 real-world scenes in head-mounted VR under three study conditions with varying amounts of spatial context: Image (45° of a panorama), Panorama (270° of a panorama), and Street (a navigable environment of three contiguous panoramas). Using fMRI, we compared neural responses when participants perceived (Exp. 1) or recalled (Exp. 2) discrete fields-of-view from each place. We tested which brain regions are modulated by the degree of spatial context associated with a visual scene, focusing on the Scene Perception Areas (SPAs: PPA, OPA, and MPA) and the Place Memory Areas (PMAs; Steel et al., 2021). As predicted, all SPAs were robustly activated during scene perception (all p<0.001), but their activity was not modulated by the degree of spatial context associated with a scene in memory (all p>0.4). In contrast, the PMAs showed significant modulation by spatial context, with scenes associated with greater spatial context inducing greater PMA activity (all p<0.001). The same pattern of results was present during recall (Exp. 2). Intriguingly, spatial context did not modulate hippocampal activity during recall (all p>0.6), and, importantly, activity in control areas (V1 and FFA) was not impacted by spatial context in either experiment (all p>0.05). Together, these results show that the PMAs are uniquely sensitive to the amount of spatial context associated with a real-world scene, suggesting that they may be involved in providing spatial context to the SPAs to facilitate visually guided behavior.
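
One standard way such a three-level design could be tested (offered only as a hedged sketch, not necessarily the authors' statistical approach) is a per-subject linear contrast across the Image, Panorama, and Street conditions, evaluated at the group level.

```python
# Hedged sketch: does ROI activity increase with the amount of spatial context?
# `roi_betas` is a hypothetical (subjects x 3) array of mean responses in the
# condition order Image, Panorama, Street.
import numpy as np
from scipy import stats

def spatial_context_trend(roi_betas):
    contrast = np.array([-1.0, 0.0, 1.0])             # linear increase with context
    per_subject = roi_betas @ contrast                 # one contrast value per subject
    t, p = stats.ttest_1samp(per_subject, popmean=0.0) # group-level test against zero
    return t, p
```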

Talk 6, 6:30 pm, 55.26

Dynamic neural representations reveal flexible feature use during scene categorization

Michelle Greene1, Bruce Hansen2; 1Bates College, 2Colgate University

A fundamental goal of vision science is to map the representational states that transform ambient light arrays into perceived environments and events imbued with semantic meaning. Previous work has demonstrated that neural representations are associated with low-level visual features early in visual processing and resemble higher-level features later (Greene & Hansen, 2020). The goal of the current study was to assess the flexibility of feature use. Experiment 1 assessed feature preference in scene categorization using a variant of the triplet similarity task (Hebart et al., 2020). Observers were presented with three images and asked to select the least similar image. We created structured image triplets: one pair was similar with respect to scene affordances and dissimilar with respect to objects and texture, a second pair was similar with respect to objects, and the last pair was similar with respect to texture. This allowed us to assess which feature is most critical for scene similarity when observers are forced to choose among competing features. We found that observers were twice as likely to choose affordance-based similarity and less likely to choose texture-based similarity. Do observers then use affordances to accomplish scene categorization? In Experiment 2, observers performed a scene categorization task while 64-channel EEG was recorded. Eight scene categories served as targets, and in different blocks, distractors were chosen to be similar to each target with respect to either affordances or texture. If affordances are used for categorization, observers would need to rely on alternative features in the affordance blocks, as affordances are no longer diagnostic of category there. We found that observers were slower and less accurate at categorizing scenes in the affordance blocks. More strikingly, whole-brain EEG decoding revealed that neural representations of scene categories emerged ~50 ms later in the affordance blocks, suggesting that the brain preferentially uses affordances over texture for categorization.
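
For illustration (not the authors' code), the sketch below shows how such structured triplets could be scored: rejecting one image as least similar implies that similarity was judged along the feature shared by the remaining pair; the trial bookkeeping shown is hypothetical.

```python
# Hedged sketch: tally which feature (affordance, object, or texture) drives
# perceived similarity in structured odd-one-out triplets.
from collections import Counter

# Hypothetical bookkeeping: the feature shared by each image pair, plus the
# image the observer rejected as least similar on that trial.
triplets = [
    {"pairs": {("A", "B"): "affordance", ("A", "C"): "object", ("B", "C"): "texture"},
     "odd_one_out": "C"},
    # ... one entry per trial
]

choices = Counter()
for trial in triplets:
    kept_pair = tuple(sorted(set("ABC") - {trial["odd_one_out"]}))
    choices[trial["pairs"][kept_pair]] += 1   # feature shared by the retained pair

total = sum(choices.values())
for feature, n in choices.items():
    print(f"{feature}: {n / total:.2f} of choices")
```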

Acknowledgements: James S. McDonnell Foundation grant (220020430) to BCH; National Science Foundation grant (1736394) to BCH and MRG.

Talk 7, 6:45 pm, 55.27

The dynamics of scene understanding

Daniel Harari1, Alex Mars1, Hanna Benoni2, Shimon Ullman1; 1Weizmann AI Center, Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, 2Department of Psychology, The College of Management Academic Studies

Visual scene understanding involves processing and integrating information across different levels of visual analysis, including the recognition of objects, actions, and interactions. Here we study the dynamics of scene understanding over time, tracing the trajectory of scene interpretation by controlling exposure time with perceptual masking. A total of 140 MTurk participants were instructed to provide a detailed free-recall description of 14 stimulus images portraying various interactions between animate agents (humans and pets) and other agents and objects. They were instructed to report the types of objects and agents in the image together with their properties and inter-relations. For each image, subjects were assigned to one of seven exposure conditions: 50, 75, 100, 125, 200, 500, or 2000 ms, followed by a mask. A fixation cross at the center of the image frame appeared prior to image display. Participants had 15 minutes to complete the task. The subjects’ responses were evaluated by four scorers, who followed a detailed analysis protocol that minimized subjective judgments. Preliminary results indicate consistent trends in the time evolution of scene perception: (i) human agents are reported earlier than objects and global scene descriptions, even when objects appear at the center of fixation (e.g., ‘two men’ before ‘a park bench’); (ii) actions are reported earlier than the acted-upon objects (e.g., ‘drinking’ before ‘cup’); (iii) for human agents, the number of agents is reported early, followed by age, while gender is on average reported later (e.g., ‘two people’ before ‘two kids’, and then ‘two boys’). These findings are interesting from a modeling perspective because they do not fit the common scene understanding paradigm in computer vision, in which objects are first detected and only then are their inter-relations processed. We will consider scene understanding schemes that are more consistent with the human dynamics of scene perception than current approaches.

Acknowledgements: Robin Chemers Neustein Artificial Intelligence Fellows Program

Talk 8, 7:00 pm, 55.28

Category learning biases in real-world scene perception

Gaeun Son1, Dirk B. Walther1, Michael L. Mack1; 1University of Toronto

In daily life, we experience complex visual environments in which numerous visual properties are tightly woven into holistic dimensions. Our visual system warps and compresses this visual input across its multiple stages of operation to arrive at perceptual insights that link to conceptual knowledge. Compelling demonstrations in object perception suggest that high-level cognitive functions like categorization can impact how visual processing unfolds, for example by biasing or distorting perception along category-relevant stimulus dimensions. However, whether such categorical perception mechanisms similarly impact the perception of real-world scenes remains an important open question. Here, we address this question with a novel learning task in which participants learned to categorize realistic scene images synthesized from an image space defined by continuously varying holistic visual properties. First, participants learned, through feedback-based learning, an arbitrary linear category boundary that divided the scene space. Next, participants completed a visual working memory estimation task in which a target scene was briefly presented and then, after a brief delay, reconstructed from the continuous scene space. Memory reconstruction errors revealed systematic biases that tracked the subjective nature of each participant’s category learning. Specifically, errors were selectively biased along the diagnostic dimensions defined by participants’ acquired category boundaries. In other words, after only a short category learning session, scenes were remembered as more similar to their respective learned categories at the expense of their veridical details. These results suggest that our visual system extracts diagnostic dimensions that optimize top-down task goals and actively leverages them for subsequent perception and memory. The highly complex and realistic nature of our stimulus space highlights the dynamic nature of visual perception and high-level cognition in an ecologically valid setting.
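
A conceptual sketch (not the authors' analysis) of how such category-consistent memory biases could be quantified: project each reconstruction error onto the axis orthogonal to a participant's learned category boundary and test whether, on average, errors point toward the learned category; all inputs are hypothetical placeholders.

```python
# Hedged sketch: bias of memory reconstructions along the category-diagnostic axis.
import numpy as np

def category_bias(targets, reconstructions, boundary_normal, category_sign):
    """targets, reconstructions: (trials x dims) points in the continuous scene space.
    boundary_normal: unit vector perpendicular to the learned category boundary.
    category_sign: +1/-1 per trial, the side of the boundary occupied by the
    target's learned category along boundary_normal."""
    errors = reconstructions - targets
    projection = errors @ boundary_normal              # error along the diagnostic axis
    return np.mean(projection * category_sign)         # > 0: bias toward learned category
```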

Acknowledgements: Natural Sciences and Engineering Research Council (NSERC) Discovery Grants (RGPIN-2017-06753 to MLM and RGPIN-2020-04097 to DBW) and Canada Foundation for Innovation and Ontario Research Fund (36601 to MLM).