Scene Perception

Talk Session: Tuesday, May 23, 2023, 2:30 – 4:15 pm, Talk Room 2
Moderator: Dirk B. Walther, University of Toronto

Talk 1, 2:30 pm, 54.21

Feedback processing shapes the categorical organization of the ventral stream

Yuanfang Zhao1, Simen Hagen1, Marius Peelen1; 1Donders Institute for Brain, Cognition and Behaviour

The human ventral stream shows a categorical organization, with distinct regions responding selectively to faces, houses, tools, etc. Recent neuroimaging and computational studies have shown that this organization partly reflects the feedforward processing of category-specific visual features. However, other work has provided evidence for a similar categorical organization in the absence of visual input, suggesting that it may also be shaped by top-down feedback processing. Here, to reveal such feedback processing, we focus on the selective response to large objects (buildings) in the scene-selective parahippocampal place area (PPA). Specifically, we tested whether the selective response to buildings in the PPA: 1) can be observed when controlling for visual features typical of buildings (e.g., rectilinearity), 2) is delayed relative to the PPA response to scenes, and 3) reflects top-down activation of scene representations. In an fMRI study with high temporal resolution (TR=140 ms), participants (N=30) viewed images of isolated buildings, visually matched boxes, scenes, and chairs. Results showed a selective PPA response to buildings (vs boxes), despite their closely matched visual features. Interestingly, analyses of BOLD peak latency showed that building-selective PPA responses peaked about 200 ms later (4.87 s) than scene-selective PPA responses (4.66 s), consistent with the hypothesized delayed responses reflecting top-down feedback. This delayed PPA response to buildings was corroborated by an EEG study (N=32): multivariate decoding analyses across posterior electrodes revealed that building-selective response patterns emerged relatively late (350 ms after stimulus onset), about 200 ms later than scene-selective response patterns. Finally, building-selective response patterns at 350 ms after stimulus onset generalized to scene-selective response patterns at 200 ms after stimulus onset. Taken together, these results provide information about the nature of large-object selectivity in the PPA and, more generally, indicate that (at least some) category-selective responses in visual cortex can be decoupled from visual feature processing.

Acknowledgements: This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 725970).
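
For readers who want to prototype the kind of time-resolved decoding and temporal-generalization analysis described above, here is a minimal sketch in Python with scikit-learn. It is not the authors' pipeline; the epoch array, channel count, time axis, and category labels are placeholders.

```python
# Minimal sketch (placeholder data, not the authors' code): time-resolved decoding
# of stimulus category from EEG epochs, plus temporal generalization (train at one
# time point, test at another).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_trials, n_channels, n_times = 200, 64, 120               # placeholder epoch dimensions
X = rng.standard_normal((n_trials, n_channels, n_times))   # EEG epochs (placeholder data)
y = rng.integers(0, 2, n_trials)                           # 0 = building, 1 = box (placeholder labels)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Diagonal decoding: train and test the classifier at the same time point
diag_acc = np.array([
    cross_val_score(LinearDiscriminantAnalysis(), X[:, :, t], y, cv=cv).mean()
    for t in range(n_times)
])

# Temporal generalization: train at t_train, test at every t_test
gen_acc = np.zeros((n_times, n_times))
for train_idx, test_idx in cv.split(X[:, :, 0], y):
    for t_train in range(n_times):
        clf = LinearDiscriminantAnalysis().fit(X[train_idx, :, t_train], y[train_idx])
        for t_test in range(n_times):
            gen_acc[t_train, t_test] += clf.score(X[test_idx, :, t_test], y[test_idx])
gen_acc /= cv.get_n_splits()
```

Above-chance generalization from classifiers trained around 350 ms to activity around 200 ms is the kind of off-diagonal structure the abstract reports between building- and scene-selective patterns.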

Talk 2, 2:45 pm, 54.22

The occipital place area (OPA) supports walking in 8-year-olds, not 5-year-olds

Yaelan Jung1, Daniel D. Dilks1; 1Emory University

How does our ability to effortlessly move about the immediately visible environment – without running into the kitchen walls or banging into the table, for example – develop? One prominent, and intuitive, idea argues that “visually-guided navigation” develops early, when infants first begin moving independently through their surroundings via crawling. By contrast, some classic behavioral work as well as recent neuroimaging work has suggested that visually-guided navigation develops surprisingly late, not until children are “adult” walkers, around 8 years old. To directly test these hypotheses, using functional magnetic resonance imaging (fMRI) in children at 5 and 8 years of age, we measured the response in OPA – a brain region known to support visually-guided navigation in adults – to videos depicting the first-person visual experience of the two ways by which we move about the environment over development (i.e., a “crawling” perspective and a “walking” perspective), as well as two control conditions by which humans do not (i.e., a “flying” perspective and a “scrambled” perspective). We found that the OPA in 8-year-olds, like adults, responded more to the walking videos than to the crawling, flying, and scrambled ones, and did not respond any more to the crawling videos than to the flying or scrambled ones, suggesting that the OPA is adultlike by 8 years of age and, interestingly, supports information from a walking perspective only. Surprisingly, the OPA in 5-year-olds showed a very different pattern, responding similarly across all videos, which indicates no “walking sensitivity”. Taken together, these findings i) reveal that the visually-guided navigation system undergoes protracted development, not even supporting walking in early childhood and only emerging around 8 years of age, and ii) raise the intriguing question of whether crawling (and early walking) is a mode of visually-guided navigation at all, or is processed by a different neural system.
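
A minimal sketch, assuming per-subject OPA response estimates for each video condition, of the walking-versus-other contrasts described above; the data are simulated placeholders, not the study's measurements.

```python
# Minimal sketch (placeholder data, not the authors' analysis): paired contrasts of
# mean OPA responses to walking vs. the other video conditions, within each age group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
conditions = ["walking", "crawling", "flying", "scrambled"]

def condition_contrasts(roi_betas):
    """roi_betas: dict mapping condition -> per-subject OPA response estimates."""
    walking = roi_betas["walking"]
    results = {}
    for cond in ("crawling", "flying", "scrambled"):
        t, p = stats.ttest_rel(walking, roi_betas[cond])   # paired test: same subjects
        results[f"walking > {cond}"] = (t, p)
    return results

# Illustrative inputs: 8-year-olds with a walking advantage, 5-year-olds without one
betas_8yo = {c: rng.normal(1.0 + 0.4 * (c == "walking"), 0.3, 20) for c in conditions}
betas_5yo = {c: rng.normal(1.0, 0.3, 20) for c in conditions}
print("8-year-olds:", condition_contrasts(betas_8yo))
print("5-year-olds:", condition_contrasts(betas_5yo))
```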

Talk 3, 3:00 pm, 54.23

Biased population coding of visual orientation in the human brain

William J. Harrison1, Paul M. Bays2, Reuben Rideaux1; 1The University of Queensland, 2University of Cambridge

It is generally accepted that biases in visual orientation perception can be understood in terms of active inference: perception involves combining prior expectations with noisy sensory estimates. Theoretical work has shown that sensory estimates can be instantiated in population code models in which tuning curve preferences are matched to the frequency of orientations in natural environments (i.e. efficient coding). In the present study, we link meso-scale neural responses in the human visual system to such theoretical models of orientation coding. We recorded human observers’ (n = 37) brain activity with EEG while they passively viewed randomly oriented gratings. Using univariate and multivariate decoding analyses, we found that neural responses to orientation were strongly anisotropic, but not in a way predicted from any leading model of neural coding. We therefore developed a novel generative modelling procedure to simulate EEG activity from arbitrarily specified sensory tuning functions. By applying decoding analyses to EEG data generated from population codes with known tuning properties, we were able to determine the coding scheme necessary to reproduce the empirical neural responses. We found that the underlying population code was one in which tuning preferences were redistributed to prioritise cardinal orientations, but, most critically, with a substantial over-representation of horizontal relative to vertical orientations. Moreover, a population code that prioritises horizontal orientations alone was sufficient to produce many (but not all) of the anisotropic neural responses. We relate these findings to prior psychophysical and computational work that foreshadowed the importance of horizontal environmental structures to vision. More generally, our results provide insight into the encoding of environmental statistics in biological systems.

Acknowledgements: This work was supported by Australian Research Council Discovery Early Career Researcher Awards to RR (DE210100790) and WJH (DE190100136).
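
A minimal sketch of the kind of generative population-code model the abstract describes: tuning preferences redistributed toward the cardinals, with horizontal over-represented relative to vertical. The density weights, tuning width, and noise level are illustrative assumptions, not the authors' fitted values.

```python
# Minimal sketch (illustrative parameters, not the authors' model): a population code
# with anisotropic preferred orientations, used to simulate responses to gratings.
import numpy as np

rng = np.random.default_rng(2)
n_neurons = 360

# Density of preferred orientations over the 180-degree orientation circle:
# a baseline plus a large peak at horizontal (taken as 0 here) and a smaller
# peak at vertical (pi/2), so horizontal is over-represented.
theta = np.linspace(0, np.pi, 1000, endpoint=False)
density = (1.0
           + 1.5 * np.exp(4 * np.cos(2 * theta))
           + 0.5 * np.exp(4 * np.cos(2 * (theta - np.pi / 2))))
density /= density.sum()
prefs = rng.choice(theta, size=n_neurons, p=density)    # sampled preferred orientations

def population_response(stim_ori, prefs, kappa=2.0, noise_sd=0.1):
    """Von Mises-like tuning on the doubled orientation angle, plus Gaussian noise."""
    tuning = np.exp(kappa * (np.cos(2 * (stim_ori - prefs)) - 1.0))
    return tuning + rng.normal(0.0, noise_sd, prefs.shape)

# Responses to randomly oriented gratings; in the procedure the abstract describes,
# simulated activity like this is projected to sensors and passed through the same
# decoding pipeline applied to the empirical EEG data.
stim_oris = rng.uniform(0, np.pi, 500)
responses = np.stack([population_response(o, prefs) for o in stim_oris])
```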

Talk 4, 3:15 pm, 54.24

Distinct early and late neural mechanisms regulate feature-specific sensory adaptation in the human visual system

Reuben Rideaux1, Rebecca K. West2, Dragan Rangelov1, Jason B. Mattingley1,2; 1Queensland Brain Institute, University of Queensland, 2School of Psychology, University of Queensland

A canonical feature of sensory systems is that they adapt to prolonged or repeated inputs, suggesting the brain encodes the temporal context in which stimuli are embedded. Sensory adaptation has been observed in the central nervous systems of many animal species, using techniques sensitive to a broad range of spatiotemporal scales of neural activity. Two competing models have been proposed to account for the phenomenon. One assumes that adaptation reflects reduced neuronal sensitivity to sensory inputs over time (the ‘fatigue’ account); the other posits that adaptation arises due to increased neuronal selectivity (the ‘sharpening’ account). To adjudicate between these accounts, we exploited the well-known ‘tilt aftereffect’, which reflects adaptation to orientation information in visual stimuli. We recorded whole-brain activity with millisecond precision from human observers as they viewed oriented gratings before and after adaptation, and used inverted encoding modelling to characterise feature-specific neural responses. We found that both fatigue and sharpening mechanisms contribute to the tilt aftereffect, but that they operate at different points in the sensory processing cascade to produce qualitatively distinct outcomes. Specifically, fatigue operates during the initial stages of processing, consistent with tonic inhibition of feedforward responses, whereas sharpening occurs ~200 ms later, consistent with feedback or local recurrent activity. Our findings reconcile two major accounts of sensory adaptation, and reveal how this canonical process optimises the detection of change in sensory inputs through efficient neural coding.

Acknowledgements: This work was supported by an Australian Research Council Discovery Early Career Researcher Award to RR (DE210100790). DR and JBM were supported by a National Health and Medical Research Council Ideas Grant (APP1186955) and an Investigator Grant (GNT2010141), respectively.
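
A minimal sketch of an inverted encoding model for orientation, assuming a raised-cosine channel basis and a least-squares encoding step; the EEG data here are random placeholders and the basis parameters are not the authors'.

```python
# Minimal sketch (assumptions, not the authors' implementation) of an inverted
# encoding model (IEM): estimate channel-to-sensor weights on training data, then
# invert them to reconstruct feature-specific responses in held-out data.
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_sensors, n_channels = 400, 64, 8

# Hypothetical basis set: half-rectified raised-cosine orientation channels
centers = np.arange(n_channels) * np.pi / n_channels
def channel_responses(oris):
    d = 2 * (oris[:, None] - centers[None, :])          # doubled angle (180-deg periodic)
    return np.clip(np.cos(d), 0, None) ** 6

oris = rng.uniform(0, np.pi, n_trials)
C = channel_responses(oris)                             # trials x channels design matrix
B = rng.standard_normal((n_trials, n_sensors))          # trials x sensors (placeholder EEG)

train, test = np.arange(0, 300), np.arange(300, 400)

# Encoding step: B_train ~= C_train @ W, solved by least squares
W, *_ = np.linalg.lstsq(C[train], B[train], rcond=None)   # channels x sensors

# Inversion step: reconstruct channel responses for held-out trials
C_hat = B[test] @ np.linalg.pinv(W)                     # trials x channels
# Comparing reconstructions before vs. after adaptation, time point by time point,
# is what separates a 'fatigue' account (reduced gain) from a 'sharpening' account
# (narrower tuning) in this kind of analysis.
```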

Talk 5, 3:30 pm, 54.25

Making memorability of scenes better or worse by manipulating their contour properties

Seohee Han1, Morteza Rezanejad1, Dirk B. Walther1; 1University of Toronto

Why are some images more likely to be remembered than others? Past research has explored both low-level image properties, such as colour and spatial frequencies, and high-level properties, such as scene semantics. Recent work from our group suggests that memorability for line drawings and photographs of scenes is correlated with specific contour features, such as contour curvature and orientation, as well as mid-level perceptual grouping features, such as contour junctions. Here, we examine whether this relationship is merely correlational, or if manipulating these features causes images to be remembered better or worse. To this end, we manipulated contour properties as well as grouping properties that describe the spatial relationships between contours in the line drawings of real-world scenes and measured the effect of these manipulations on memorability. We trained a Random Forest model to predict scene memorability from contour and perceptual grouping features computed from the line drawings. Then, we used the trained model to predict the contribution of each contour to the memorability of the scene. Next, each line drawing was split into two half-images, one containing the contours with high predicted memorability scores and the other containing the contours with low predicted memorability scores. Since both versions were derived from the same original drawing, image identity was left intact by this manipulation. In a new memorability experiment, we find that the half-images predicted to be more memorable were indeed remembered better than the half-images predicted to be less memorable. Our findings suggest that specific contour and perceptual grouping cues are causally involved in committing real-world images to memory. We demonstrate that by measuring and manipulating these cues, we can isolate the contributions of image features at different visual processing stages to image memorability, thereby bridging the gap between low-level features and scene semantics in our understanding of memorability.
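
A minimal sketch of the modelling step described above: a random forest trained to predict memorability from contour-derived features, then used to score contours and split a drawing into predicted-high and predicted-low halves. The feature definitions, the contour-to-image aggregation, and all data are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch (placeholder data and features, not the authors' code): random-forest
# memorability prediction and the high/low half-image split it enables.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Hypothetical training set: one feature vector per image (e.g., histograms of
# contour curvature, orientation, length, and junction types) with memorability
# scores measured in a prior experiment. All values here are placeholders.
n_images, n_features = 500, 20
X_images = rng.random((n_images, n_features))
memorability = rng.random(n_images)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_images, memorability)

def split_drawing(contour_features):
    """contour_features: (n_contours, n_features) array for one line drawing.
    Scores each contour with the trained model and returns the indices of the
    predicted-more-memorable and predicted-less-memorable halves."""
    scores = model.predict(contour_features)
    order = np.argsort(scores)
    half = len(order) // 2
    return order[half:], order[:half]

high_idx, low_idx = split_drawing(rng.random((40, n_features)))
```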

Talk 6, 3:45 pm, 54.26

Object-based attention during scene perception elicits boundary contraction in memory

Elizabeth H. Hall1,2, Joy J. Geng1,2; 1University of California, Davis, 2Center for Mind and Brain

Two types of boundary transformation, contraction and extension, are equally likely to occur in memory. In extension, viewers will extrapolate information beyond the edges of the image, whereas in contraction, viewers will forget information near the edges of the image. Recent work suggests that the direction of transformation is dependent on image composition. However, cognitive factors, such as object-based attention, may also influence how scenes are encoded into memory. Here, participants (N=36) searched for target objects in 15 scenes, while a separate group (N=36) was asked only to memorize the images. Both groups drew the scenes from memory after a delay. Search participants foveated significantly less of the scene (4.48% vs. 6.19%), but spent more time fixating the target. Across the 518 scenes drawn from memory, regression analyses found that participants engaged in search were more likely to draw targets, and for both groups, object size was highly predictive of recall. However, only search drawings had a significant tendency to show boundary contraction, with 64.66% of drawings showing contraction and 28.6% showing extension, compared to 41.81% of memorize drawings showing contraction and 47.8% showing extension. These results are especially dramatic given that participants studied the same images in both conditions for roughly the same amount of time. Contraction took the form of a “zoom-in” effect in the search drawings – while targets were drawn in their accurate spatial locations, points of view were zoomed in on the targets so that they were drawn significantly bigger than they originally appeared, and the farther an object was from a target, the less likely it was to be remembered. These results support the recently proposed dynamic-tension model (Park et al., 2021), suggesting that both cognitive factors, like attention, and static properties, like image composition, may influence whether a scene contracts or extends in memory.

Acknowledgements: NDSEG Fellowship to EHH. NIH-R01-MH113855 to JJG.
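
A minimal sketch of how the contraction-versus-extension proportions in the two groups could be compared; the counts below are placeholders to be replaced with the actual per-group tallies, not values from the study.

```python
# Minimal sketch (placeholder counts, not the study's data): chi-square test comparing
# the distribution of contraction vs. extension drawings across the two groups.
import numpy as np
from scipy.stats import chi2_contingency

# rows = search group, memorize group; columns = contraction, extension
counts = np.array([
    [160, 71],    # search drawings (placeholder values)
    [110, 126],   # memorize drawings (placeholder values)
])
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4g}")
```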

Talk 7, 4:00 pm, 54.27

A retinotopic reference frame structures communication between visual and memory systems

Adam Steel1, Brenda Garcia1, Edward Silson2, Caroline Robertson1; 1Dartmouth College, 2University of Edinburgh

We encode the visual world retinotopically, imposing a spatial reference frame on visual information processing. However, models of brain organization generally assume that retinotopic coding is replaced by abstract, amodal codes as information propagates through from visual to memory systems. This raises a puzzle for constructive accounts of visual memory: how can mnemonic and visual information interact if they are represented in different reference frames? To address this question, participants (N=15) underwent population receptive field (pRF) mapping during fMRI. We observed retinotopic coding throughout the visual system, including the scene perception areas (occipital place area (OPA) and parahippocampal place area (PPA)). Critically, consistent with prior work, we observed robust and reliable pRFs just beyond the anterior edge of visually-responsive cortex in high-level areas previously considered amodal. A large proportion of these high-level pRFs were located immediately anterior to scene-selective visual areas OPA and PPA, in memory-responsive cortex (Steel et al, 2022). We characterized these anterior pRFs, by localizing each participant’s OPA, PPA, and place memory areas (lateral (LPMA) and ventral (VPMA)) (Steel et al, 2022). Unlike visual areas OPA and PPA that contained almost exclusively prototypical positive pRFs, we observed a striking inversion of pRF amplitude in LPMA and VPMA, such that they exhibited spatially-selective negative BOLD responses. The visual field representation of negative pRFs in mnemonic areas closely matched their perceptual counterparts’, suggesting a common reference frame between perceptual and mnemonic regions. Finally, during a visual memory task, trial-wise activity of the positive and negative pRFs within the perceptual and memory areas was negatively correlated, suggesting a competitive push-pull dynamic between these neural systems. These results suggest that retinotopic coding, a fundamental organizing principle of visual cortex, persists in high-level, mnemonic cortex previously considered amodal. This shared code may provide a robust communication system aligning these neural systems.