VSS 2022, May 13-18

Object Recognition: Neural mechanisms

Talk Session: Tuesday, May 17, 2022, 2:30 – 4:15 pm EDT, Talk Room 2
Moderator: Martin Hebart, Max Planck Institute for Human Cognitive and Brain Sciences

Talk 1, 2:30 pm, 54.21

Context effects on object recognition in real world environments

Victoria Nicholls1, Kyle Alsbury-Nealy2, Alexandra Krugliak1, Alex Clarke1; 1University of Cambridge, 2University of Toronto

The environment in which objects are located impacts recognition. This occurs through initial coding of global scene context, enabling the generation of predictions about potential objects in the environment (Bar, 2004; Trapp & Bar, 2015). When correct, these predictions facilitate object recognition, but when they are violated, object recognition is impeded, as shown by slower RTs and larger N300/N400 ERPs (Mudrik et al., 2010, 2014; Lauer et al., 2020). The majority of research on object recognition and visual contexts has been done in controlled laboratory settings, where objects and scenes often appear simultaneously. However, in the real world, the environment is relatively stable over time while objects come and go. Research in real-world environments is the ultimate test of how context changes our perceptions, and is fundamental to determining how we understand what we see. In this research, we asked how visual context influences object recognition in real-world settings through a combination of mobile EEG (mEEG) and augmented reality (AR). During the experiment, participants approached AR arrows placed either in an office or an outdoor environment while mEEG was recorded. When a participant reached an arrow, it changed colour, indicating that a button could be pressed, which then revealed an object that was either congruent or incongruent with the environment. We analysed the ERP data (aligned to the appearance of the objects) with hierarchical generalised linear mixed models with a fixed factor of congruency, and object and participant as random factors. As in laboratory experiments, we found that scene-object incongruence impeded object recognition, as shown by larger N300/N400 amplitudes. These findings suggest that visual contexts constrain our predictions of likely objects even in real-world environments, helping to bridge research in laboratory and real-life situations.
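
As a rough illustration of the analysis described above, the sketch below (Python, not the authors' code) fits a linear mixed-effects model of single-trial ERP amplitude with congruency as a fixed factor and crossed random effects for participant and object; the file and column names are hypothetical.

```python
# Minimal sketch, assuming a table of single-trial N300/N400 amplitudes with
# hypothetical columns: participant, object, congruency, amplitude.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("n300_n400_amplitudes.csv")  # placeholder file name

# statsmodels handles crossed random effects via variance components:
# participants are the grouping factor and object enters as a variance
# component, approximating the hierarchical mixed model described above.
model = smf.mixedlm(
    "amplitude ~ C(congruency)",
    data=df,
    groups="participant",
    vc_formula={"object": "0 + C(object)"},
)
result = model.fit(reml=True)
print(result.summary())  # the C(congruency) term tests the congruency effect
```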

Acknowledgements: This work was supported by a Royal Society and Wellcome Trust Sir Henry Dale Fellowship to AC (211200/Z/18/Z)

Talk 2, 2:45 pm, 54.22

Forming 3-dimensional multimodal object representations relies on integrative coding

Aedan Y. Li1, Natalia Ladyka-Wojcik1, Chris B. Martin2, Heba Qazilbash1, Ali Golestani1, Dirk B. Walther1,3, Morgan D. Barense1,3; 1Department of Psychology, University of Toronto, 2Florida State University, 3Rotman Research Institute, Baycrest Health Sciences

How do we combine complex multimodal information to form a coherent representation of “what” an object is? Existing literature has predominantly used visual stimuli to study the neural architecture of well-established object representations. Here, we studied how new multimodal object representations are formed in the first place, using a set of well-characterized 3D-printed shapes embedded with audio speakers. Applying multi-echo fMRI across a four-day learning paradigm, we examined behavioral and neural changes from before to after shape-sound features were paired to form objects. To quantify learning, we developed a within-subject measure of representational geometry based on collected similarity ratings. Before shape and sound features were paired together, representational geometry was driven by modality-specific information, providing direct evidence of feature-based representations. After shape-sound features were paired to form objects, representational geometry was additionally driven by information about the pairing, providing causal evidence for an integrated object representation distinct from its features. Complementing these behavioral results, we observed a robust learning-related change in pattern similarity for shape-sound pairings in the anterior temporal lobes. Intriguingly, we also observed greater pre-learning activity for visual over auditory features in the ventral visual stream extending into perirhinal cortex, with the visual bias in perirhinal cortex attenuated after the shape-sound relationships were learned. Collectively, these results provide causal evidence that forming new multimodal object representations relies on integrative coding in the anterior temporal lobes and perirhinal cortex.
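
A minimal sketch (Python, not the authors' pipeline) of how the fit between a rating-based representational geometry and candidate model geometries could be quantified; the dissimilarity matrices and file names are hypothetical.

```python
# Minimal sketch: correlate a participant's rating-based representational
# dissimilarity matrix (RDM) with model RDMs coding modality-specific feature
# structure vs. learned shape-sound pairings. All inputs are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def upper_tri(rdm):
    """Vectorize the upper triangle of an RDM, excluding the diagonal."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def geometry_fit(empirical_rdm, model_rdm):
    """Spearman correlation between an empirical and a model RDM."""
    rho, _ = spearmanr(upper_tri(empirical_rdm), upper_tri(model_rdm))
    return rho

pre_rdm = np.load("pre_learning_rdm.npy")      # placeholder paths
post_rdm = np.load("post_learning_rdm.npy")
feature_rdm = np.load("model_feature_rdm.npy")
pairing_rdm = np.load("model_pairing_rdm.npy")

for label, rdm in [("pre-learning", pre_rdm), ("post-learning", post_rdm)]:
    print(label, "fit to feature model:", geometry_fit(rdm, feature_rdm))
    print(label, "fit to pairing model:", geometry_fit(rdm, pairing_rdm))
```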

Talk 3, 3:00 pm, 54.23

Functionally distinct sub-regions of the parahippocampal place area revealed by model-based neural control

Apurva Ratan Murty1,2, Alex Abate1,2, Frederik Kamps1,2, James DiCarlo1,2, Nancy Kanwisher1,2; 1McGovern Institute for Brain Research, Massachusetts Institute of Technology, 2Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

Abundant evidence supports a role for the parahippocampal place area (PPA) in visual scene perception, but fundamental questions remain. Here we ask whether the PPA contains distinct sub-regions that encode different aspects of scenes. To address this question, we used data-driven clustering to identify groups of PPA voxels with similar responses to a large set of images in extensively scanned individual brains (185 images, 20 repetitions per image, N = 4). We found that >95% of the variance of PPA voxel responses was explained by just two clusters, mapped approximately along the anterior-posterior axis, consistent with previous findings (Baldassano et al., 2013; Nasr et al., 2013; Cukur et al., 2016; Steel et al., 2021). But what distinct scene features do these sub-regions encode? Response profiles of the two sub-regions were strongly correlated, and visual inspection of stimuli eliciting high and low responses in each sub-region did not reveal any obvious functional differences between them. We therefore built artificial neural network-based encoding models of each PPA sub-region, which were highly accurate at predicting responses to held-out stimuli (each r > 0.70, p < 0.00001), and harnessed these models to find new images predicted to maximally dissociate responses of the two sub-regions. These predictions were then tested in a new fMRI experiment, which produced a clear double dissociation between the two sub-regions in all four PPAs tested (two participants × two hemispheres): the anterior sub-region responded more to images containing relatively bare spatial layouts than to images containing object arrays and textures, while the more posterior sub-region showed the opposite pattern. Taken together, this approach revealed distinct sub-regions of the PPA and produced highly accurate computational models of each, which in turn identified stimuli that could differentially activate the two sub-regions, providing an initial hint about the functional differences between them.
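
A minimal sketch (Python, not the authors' code) of the general workflow: cluster PPA voxels by their response profiles, fit an encoding model per cluster from ANN image features, and rank candidate images by the predicted response difference; all array and file names are hypothetical.

```python
# Minimal sketch: two-cluster solution over voxel response profiles, one
# ridge encoding model per cluster, and selection of dissociating images.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import RidgeCV

# Hypothetical inputs:
#   responses: (n_voxels, n_images) trial-averaged PPA responses
#   features:  (n_images, n_features) ANN activations for the same images
#   candidate_features: (n_candidates, n_features) for new candidate images
responses = np.load("ppa_responses.npy")
features = np.load("ann_features.npy")
candidate_features = np.load("candidate_ann_features.npy")

# Step 1: data-driven clustering of voxel response profiles.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(responses)

# Step 2: encoding model per cluster (mean response across its voxels).
preds = []
for k in (0, 1):
    y = responses[labels == k].mean(axis=0)
    enc = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(features, y)
    preds.append(enc.predict(candidate_features))

# Step 3: candidate images predicted to maximally dissociate the clusters.
diff = preds[0] - preds[1]
print("cluster 0 > cluster 1:", np.argsort(diff)[-10:])
print("cluster 1 > cluster 0:", np.argsort(diff)[:10])
```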

Talk 4, 3:15 pm, 54.24

Recapitulation of cortical visual hierarchy in the human pulvinar

Michael Arcaro1, Daniel Guest2, Emily Allen2, Kendrick Kay2; 1University of Pennsylvania, 2University of Minnesota

The pulvinar is highly interconnected with both low- and high-level visual cortex. While extensive work in non-human primates has demonstrated the presence of retinotopic maps and sensitivity to low-level visual features such as local contrast in the pulvinar, connectivity with high-level visual cortex suggests that the pulvinar may also play a role in high-level vision. To explore this possibility, we investigated subcortical activity in the 7T fMRI Natural Scenes Dataset, consisting of 1.8-mm-resolution responses to 9,000–10,000 unique natural scenes in each of 8 participants. We fit population receptive field models to individual voxels, systematically evaluating different stimulus features (contrast, saliency, faces, bodies, foreground, background, words) that might be encoded in voxel responses. This analysis confirmed that, consistent with prior work, the LGN and inferior-lateral pulvinar respond selectively to local contrast and are retinotopically organized. However, the analysis also revealed an area of the pulvinar, located medial and posterior to the contrast-selective region, that responds selectively to bodies and faces in the contralateral visual hemifield. To further explore these findings, we performed a thalamocortical correlation analysis in which stimulus-evoked responses in the thalamus were correlated with stimulus-evoked responses in cortex. This analysis revealed that pulvino-cortical correlations are largely restricted to visual cortex and have rich structure consistent with prior anatomical data. The contrast-selective portion of the pulvinar correlates most with early visual cortex, while the body- and face-selective portion correlates most with face- and body-selective cortical regions. More generally, there is a gradient of pulvino-cortical correlations such that progression from anterior-lateral to posterior-medial in the pulvinar recapitulates the posterior-to-anterior organization of visual cortex. Our results indicate that the pulvinar also likely plays an important role in high-level vision and illustrate that principles of cortical organization may hold for the thalamus as well.
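
A minimal sketch (Python, not the authors' analysis) of a stimulus-evoked thalamocortical correlation of the kind described above, correlating each thalamic voxel's image-evoked response profile with the mean profile of a few cortical regions; the arrays, ROI names, and file paths are hypothetical.

```python
# Minimal sketch: correlate stimulus-evoked response profiles of thalamic
# voxels with mean profiles of cortical ROIs. All inputs are hypothetical.
import numpy as np

thal = np.load("thalamus_responses.npy")          # (n_voxels, n_images)
cortex_rois = {                                   # ROI -> (n_images,) profile
    "early visual": np.load("evc_mean_responses.npy"),
    "face-selective": np.load("face_roi_mean_responses.npy"),
    "body-selective": np.load("body_roi_mean_responses.npy"),
}

def corr_with_roi(voxels, roi_profile):
    """Pearson correlation of each voxel's image-evoked profile with an ROI."""
    vz = (voxels - voxels.mean(1, keepdims=True)) / voxels.std(1, keepdims=True)
    rz = (roi_profile - roi_profile.mean()) / roi_profile.std()
    return vz @ rz / voxels.shape[1]

# One correlation map over thalamic voxels per cortical ROI; mapping these
# values back into the pulvinar would expose gradients like those reported.
for name, profile in cortex_rois.items():
    cmap = corr_with_roi(thal, profile)
    print(name, "peak voxel:", int(np.argmax(cmap)), "r =", round(float(cmap.max()), 3))
```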

Acknowledgements: Collection of the NSD dataset was supported by NSF IIS-1822683 and NSF IIS-1822929.

Talk 5, 3:30 pm, 54.25

Precise and generalizable cartography of functional topographies in individual brains

Ma Feilong1, Samuel A. Nastase2, Guo Jiahui1, Yaroslav O. Halchenko1, M. Ida Gobbini1,3, James V. Haxby1; 1Dartmouth College, 2Princeton University, 3Università di Bologna

Each brain has unique functional topographies. The same functional region differs in size, shape, and topology across individual brains. The multivariate spatial patterns that encode neural representations also have distinct topographies across individuals. In this work, we present an individualized model of brain function with a fine spatial resolution, which precisely captures these topographic idiosyncrasies of each brain and accurately predicts the brain's response patterns to new stimuli. This model, which we call "warp hyperalignment", first creates a functional brain template based on a group of participants, and the features (e.g., voxels) of this functional template comprise a high-dimensional feature space. The functional profile of each individual brain is modeled as a linear transformation ("warping") of this template feature space. We applied warp hyperalignment to two fMRI datasets that comprised movie-watching, object category localizer, and retinotopic scans. First, we found that the modeled functional profiles based on independent movie data from the same individual were highly similar, and much more similar than those based on different individuals. Second, the model trained on movie data accurately predicted each individual's brain response patterns to object categories and retinotopic maps. The quality of these model-predicted maps sometimes exceeded the quality of maps based on localizer scans of typical duration. Third, the model accurately predicted fine-grained spatial patterns: a model trained on half of the movie data accurately predicted brain responses to the other half. Based on the similarity of measured and predicted response patterns to the movie, we were able to predict which individual time point (TR) the subject was watching with approximately 50% accuracy (chance accuracy < 0.1%). In summary, we present an individualized model of brain function that is precise, specific to the individual, and has a fine spatial resolution.
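
A minimal sketch (Python, not the authors' implementation): since the exact warp-hyperalignment fitting procedure is not given in the abstract, the sketch substitutes a ridge-regression fit of the individual linear transformation and then runs the time-point (TR) identification analysis by correlation matching; all arrays and file names are hypothetical.

```python
# Minimal sketch: fit an individual linear "warp" from a template feature
# space to a subject's voxels (ridge regression as a stand-in), then identify
# held-out movie time points (TRs) by correlation matching.
import numpy as np
from sklearn.linear_model import Ridge

template_train = np.load("template_movie_half1.npy")  # (n_TRs, n_features)
subject_train = np.load("subject_movie_half1.npy")    # (n_TRs, n_voxels)
template_test = np.load("template_movie_half2.npy")
subject_test = np.load("subject_movie_half2.npy")

# Subject-specific linear transformation fit on the first movie half.
warp = Ridge(alpha=1.0).fit(template_train, subject_train)
predicted_test = warp.predict(template_test)           # (n_TRs, n_voxels)

def zscore_rows(x):
    return (x - x.mean(1, keepdims=True)) / x.std(1, keepdims=True)

# Each measured test TR is assigned to the most correlated predicted TR.
corr = zscore_rows(subject_test) @ zscore_rows(predicted_test).T / subject_test.shape[1]
accuracy = np.mean(np.argmax(corr, axis=1) == np.arange(corr.shape[0]))
print(f"TR identification accuracy: {accuracy:.3f}")
```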

Acknowledgements: This work was supported by NSF grants 1835200 (M.I.G.) and 1607845 (J.V.H.).

Talk 6, 3:45 pm, 54.26

Temporal dynamics of shape-invariant real-world object size processing

Simen Hagen1, Yuan-Fang Zhao1, Marius V. Peelen1; 1Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands

Real-world size is a behaviorally relevant object property that is automatically encoded, is reflected in the organization of the human ventral temporal cortex, and can be decoded from neural responses as early as 150 ms after stimulus onset. However, while real-world size is a distinct, conceptual object property, it strongly correlates with at least two other object properties: rectilinearity (large objects typically have more rectilinear features) and fixedness (large objects are more often fixed in the environment). Here, we aimed to dissociate the temporal profile of object size processing from that of covarying shape and fixedness properties. During EEG recording, participants (N=33) viewed isolated objects drawn from a 2 (real-world size: large, small) x 2 (shape: rectilinear, curvilinear) x 2 (fixedness: fixed, transportable) design. This design allowed us to decode each dimension (e.g., size) across the other dimensions (e.g., shape, fixedness). For example, we tested whether (and when) a classifier trained to distinguish large from small fixed and/or rectilinear objects (e.g., bed vs. mailbox) successfully generalized to distinguish large from small transportable, curvilinear objects (e.g., air balloon vs. balloon). Across posterior electrodes, cross-decoding of real-world size was significant from 350 ms after stimulus onset for all cross-decoding splits. Similar cross-decoding analyses of the other two object properties revealed cross-decoding of shape from 170 ms and no significant cross-decoding of fixedness at any time point. These results indicate that higher-level (shape-invariant) representations of real-world object size emerge relatively late during visual processing.
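
A minimal sketch (Python, not the authors' pipeline) of one cross-decoding split at each time point: a linear classifier is trained to separate large from small objects among, say, rectilinear-fixed items and tested on curvilinear-transportable items; the arrays and label files are hypothetical.

```python
# Minimal sketch: cross-decoding of real-world size across shape and
# fixedness, one classifier per EEG time point. All inputs are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.load("posterior_epochs.npy")      # (n_trials, n_channels, n_times)
size = np.load("size_labels.npy")        # "large" / "small"
shape = np.load("shape_labels.npy")      # "rectilinear" / "curvilinear"
fixed = np.load("fixedness_labels.npy")  # "fixed" / "transportable"

train = (shape == "rectilinear") & (fixed == "fixed")
test = (shape == "curvilinear") & (fixed == "transportable")

acc = np.zeros(X.shape[2])
for t in range(X.shape[2]):
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X[train, :, t], size[train])
    acc[t] = clf.score(X[test, :, t], size[test])

# Statistical assessment (e.g., cluster-based permutation tests) would be
# needed to decide when decoding first exceeds chance; here we just print it.
print("peak cross-decoding accuracy:", acc.max(), "at sample", int(acc.argmax()))
```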

Acknowledgements: This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 725970).

Talk 7, 4:00 pm, 54.27

Detectability of optogenetic stimulation of inferior temporal cortex depends significantly on visibility of visual input

Rosa Lafer-Sousa1, Karen Wang1, Arash Afraz1; 1NIMH

Stimulation of visually sensitive regions of ventral temporal cortex in humans alters visual perception, yet the precise nature of these effects remains unclear. Clarifying the perceptual nature of these perturbations is essential for bridging the causal gap between neuronal activity and vision as a behavior, and for the development of effective visual prosthetics for patients with severe visual impairments. To test whether stimulation of inferior temporal (IT) cortex causes additive ("hallucinatory") versus distortive effects, we carried out optogenetic stimulation in macaque monkeys performing a stimulation-detection task while viewing images of objects that varied in visibility. Visibility was degraded by reducing the contrast, saturation, and spatial frequency of the object images to gray in five steps. Hypothetically, if stimulation causes an additive effect, varying the visibility of the visual input should not affect the detectability of stimulation. If anything, stimulation should be easier to detect when the visual input is less visible, since there would be no underlying image to parse from the hallucinatory percept. If, on the other hand, cortical stimulation has a distortive effect, it should be easier to detect when the visual input is more visible, as the effect would necessarily be a function of the visual input. Two macaque monkeys were implanted with Opto-Arrays over a region of their IT cortex transduced with the depolarizing opsin C1V1. In each trial, following fixation, an image was displayed for 1 s. In half of the trials, randomly selected, a 200-ms illumination impulse was delivered halfway through image presentation, and the animal was rewarded for correctly identifying whether the trial did or did not contain cortical stimulation. Visibility of the visual input significantly affected stimulation detectability (ANOVA, p < 0.003 for both animals). Consistent with the distortion model, the animals were more accurate at detecting stimulation when the visual input was more visible.
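
A minimal sketch (Python, not the authors' analysis) of the statistical test reported above: a one-way ANOVA, per animal, of stimulation-detection accuracy across the five visibility levels; the data table and column names are hypothetical.

```python
# Minimal sketch: one-way ANOVA of stimulation-detection accuracy across
# visibility levels, run separately for each animal. Inputs are hypothetical.
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical table: one row per session x visibility level, with columns
# animal, session, visibility (1-5), accuracy (proportion correct).
df = pd.read_csv("detection_accuracy.csv")

for animal, d in df.groupby("animal"):
    groups = [g["accuracy"].to_numpy() for _, g in d.groupby("visibility")]
    F, p = f_oneway(*groups)
    print(f"{animal}: F = {F:.2f}, p = {p:.4f}")
```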

Acknowledgements: NIMH Intramural Research Training Award (IRTA) Fellowship Program; NIMH Grant ZIAMH002958