Object Recognition: Categories, neural mechanisms

Talk Session: Tuesday, May 23, 2023, 10:45 am – 12:30 pm, Talk Room 2
Moderator: Arash Afraz, NIMH/NIH

Talk 1, 10:45 am, 52.21

Preserved visual categorical coding in the ventral occipito-temporal cortex despite transient early blindness and permanent alteration in the functional response of early visual regions

Olivier Collignon1, Mohamed Rezk2, Xiaoqing Gao3, Junghyun Nam4, Zhong-Xu Liu5, Terri Lewis6, Daphne Maurer7, Stefania Mattioni8; 1UCLouvain, 2HES-SO Valais-Wallis, The Sense Innovation and Research Center, 3Zhejiang University, China, 4University of Toronto, 5McMaster University, 6Ghent University

It has been suggested that a transient period of postnatal visual deprivation affects the development of object categorization in the visual system. Here we overturn this assumption by demonstrating typical categorical coding in the ventral occipito-temporal cortex (VOTC) despite early visual deprivation and pervasive alteration in the functional response in the early visual cortex (EVC). We used fMRI to characterize the brain response to five visual categories (faces, bodies, objects, buildings and words) in a group of cataract-reversal individuals who experienced a short and transient period of blindness early in life, and in a group of control participants with typical visual development. Using a combination of uni- and multi-variate analyses, we show that the encoding of low-level visual properties of our stimuli is impaired in EVC in cataract-reversal participants, while the categorical response in the VOTC is preserved. When altering the visual properties of our stimuli to mimic in controls the deficit of EVC response seen in cataract-reversal participants, we observe a cascading alteration of the categorical coding from EVC to VOTC that is not observed in the cataract-reversal group. Our results suggest that visual experience early in life is not needed to develop the typical visual categorical organization in VOTC, even in the presence of impaired low-level visual processing in EVC. These results challenge the classical view of a feedforward development of categorical selectivity in VOTC, according to which the categorical organization of high-level regions depends on low-level visual protomaps.

Talk 2, 11:00 am, 52.22

Perceptography: Revealing the causal contribution of the inferior temporal cortex to visual perception.

Elia Shahbazi1, Timothy Ma2, Arash Afraz1; 1National Institutes of Health, 2Center for Neural Science, New York University

Cortical stimulation in high-level visual areas causes complex perturbations in visual perception. Understanding the nature of stimulation-induced visual perception is necessary for characterizing visual hallucinations in psychiatric diseases and developing visual prosthetics. Most evidence is derived from anecdotal observations of human patients, but systematic studies have been severely limited due to the lack of language faculty in nonhuman primates. We developed a novel method, perceptography, to “take pictures” of the complex visual percepts induced by optogenetic stimulation of the inferior temporal (IT) cortex in macaque monkeys. Each trial started with a fixation on a computer-generated image. Halfway through the image presentation (1 s), we perturbed the image features for 200 ms. At the same time, IT cortex was optogenetically stimulated via an implanted LED array in half of the trials at random. The animals were rewarded for detecting cortical stimulation by looking at one of the two subsequently presented targets. Under the hood, two deep learning systems, DaVinci (GAN) and Ahab (deep-learning feature extraction pipeline), controlled image alterations and tracked the animals’ behavioral responses, respectively. We hypothesized that false alarms (FA) are more likely to happen when an image alteration shares common features with the percept induced by cortical stimulation. In a functional closed loop with the animal, Ahab guided DaVinci to make image alterations that reduce the discriminability between stimulated and non-stimulated trials and increase the chances of FA. This closed-loop paradigm increased the FA rate from 3-4% to up to 85%. We call these images perceptograms because the animal has difficulty distinguishing the experience of seeing them from the state of being cortically stimulated. We discovered that the structure of stimulation-induced percepts depends more on the concurrent visual input than on the choice of cortical position, although perceptograms obtained from anterior IT follow the natural image manifold more closely than those obtained from posterior IT.
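As a reading aid for the false-alarm measure above, the following Python sketch computes an FA rate as the fraction of non-stimulated trials on which stimulation was nevertheless reported. The function name, the boolean trial encoding, and the toy data are all hypothetical; this illustrates the standard definition and is not the authors' analysis code.

```python
from typing import Sequence


def false_alarm_rate(stimulated: Sequence[bool], reported: Sequence[bool]) -> float:
    """Fraction of non-stimulated trials on which stimulation was reported."""
    if len(stimulated) != len(reported):
        raise ValueError("stimulated and reported must have the same length")
    # Keep only the reports from trials without cortical stimulation.
    non_stim_reports = [r for s, r in zip(stimulated, reported) if not s]
    if not non_stim_reports:
        raise ValueError("no non-stimulated trials to compute a rate from")
    return sum(non_stim_reports) / len(non_stim_reports)


# Hypothetical toy session: half of the trials are stimulated.
stimulated = [True, False, True, False, True, False, True, False]
reported = [True, False, True, True, False, False, True, True]

fa = false_alarm_rate(stimulated, reported)
print(f"FA rate: {fa:.0%}")
```

In this toy session two of the four non-stimulated trials draw a "stimulated" report, so the computed rate is one half.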

Talk 3, 11:15 am, 52.23

Both mOTS-words and pOTS-words prefer emoji stimuli over text stimuli during a reading task

Alexia Dalski1,2, Holly Kular3, Julia G. Jorgensen3, Kalanit Grill-Spector3,4, Mareike Grotheer1,2; 1Department of Psychology, Philipps-Universität Marburg, Germany, 2Center for Mind, Brain and Behavior – CMBB, Philipps-Universität Marburg and Justus-Liebig-Universität Giessen, Germany, 3Department of Psychology, Stanford University, USA, 4Wu Tsai Neurosciences Institute, Stanford University, USA

The visual word form area in the occipitotemporal sulcus, here referred to as OTS-words, responds more strongly to text than to other visual stimuli and plays a critical role in reading. Here we hypothesized that this region’s preference for text may be driven by a preference for reading tasks, as in most prior fMRI studies only the text stimuli were readable. To test this, we performed three fMRI experiments (N=15) and systematically varied the participant’s task and the visual stimulus, investigating the mOTS-words and pOTS-words subregions. In experiment 1, we contrasted text stimuli with non-readable visual stimuli (faces, limbs, houses, and objects). In experiment 2, we used an fMRI adaptation paradigm, presenting the same or different compound words in text or emoji formats. In experiment 3, participants performed either a reading or a color task on compound words, presented in text or emoji format. Using experiment 1 data, we identified left mOTS-words and pOTS-words in all participants by contrasting text stimuli with non-readable stimuli. In experiment 2, pOTS-words, but not mOTS-words, showed fMRI adaptation for compound words in both text and emoji formats. In experiment 3, surprisingly, both mOTS-words and pOTS-words showed higher responses to compound words in emoji than in text format. Moreover, mOTS-words, but not pOTS-words, also showed higher responses during the reading task than during the color task, and more so for words in the emoji format. Multivariate analyses of experiment 3 data showed that distributed responses in pOTS-words encode the visual stimulus, whereas distributed responses in mOTS-words encode both the stimulus and the task. Together, our findings suggest that the function of the OTS-words subregions goes beyond the specific visual processing of text and that these regions are flexibly recruited whenever semantic meaning needs to be assigned to visual input.

Talk 4, 11:30 am, 52.24

Objects, faces, and spaces

Heida Maria SIGURDARDOTTIR1,3, Inga María Ólafsdóttir2,3; 1University of Iceland, 2Reykjavik University, 3Icelandic Vision Lab

What are the organizational principles of visual object perception as evidenced by individual differences in behavior? What specific abilities and disabilities in object discrimination go together? In this preregistered study (https://osf.io/q5ne8), we collected data from a large (N=511) heterogeneous sample to amplify individual differences in visual discrimination abilities. We primarily targeted people with self-declared face recognition abilities on opposite sides of the spectrum, ranging from poor to excellent face recognizers. We then administered a visual foraging task where people had to discriminate between various faces, other familiar objects, and novel objects. Each image had a known location in both face space and object space, both of which were defined based on activation patterns in a convolutional neural network trained on object classification. Face space captures the main dimensions on which faces visually differ from one another, while object space captures the main diagnostic dimensions across various objects. The distance between two images in face/object space can be calculated, where greater distance indicates that the images are visually different from one another on dimensions that are diagnostic for telling apart different faces/objects. Our results suggest that there simply are not any measurable stable individual differences in the usage of face space. However, we also show that people who struggle with telling apart different faces also have some difficulties with visual processing of objects that share visual qualities with faces, as measured by their location in object space. Face discrimination may therefore not rely on completely domain-specific abilities but may tap into mechanisms that support other object discrimination. We discuss how these results may or may not provide support for the existence of an object space in human high-level vision.
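The distance computation described in this abstract can be illustrated with a small Python sketch. The abstract does not state which distance metric was used, so Euclidean distance is an assumption here, and the two-dimensional coordinates and the function name are hypothetical placeholders rather than values from the study.

```python
import math
from typing import Sequence


def space_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two images' coordinates in a feature space.

    Assumption: the metric is Euclidean; the abstract does not specify it.
    """
    if len(a) != len(b):
        raise ValueError("both images need coordinates on the same dimensions")
    return math.dist(a, b)


# Hypothetical coordinates of two images on two object-space dimensions.
image_a = [0.0, 0.0]
image_b = [3.0, 4.0]

d = space_distance(image_a, image_b)
print(f"distance in object space: {d}")
```

Under this assumption, a larger value of `d` means the two images differ more on the dimensions that define the space.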

Acknowledgements: This work was supported by The Icelandic Research Fund (Grants No. 228916 and 218092) and the University of Iceland Research Fund.

Talk 5, 11:45 am, 52.25

THINGS-drawings: A large-scale dataset containing human sketches of 1,854 object concepts

Judith E. Fan1, Kushin Mukherjee2, Holly Huey1, Martin N. Hebart3,4, Wilma A. Bainbridge5; 1University of California, San Diego, 2University of Wisconsin-Madison, 3Justus Liebig University, Giessen, Germany, 4Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany, 5University of Chicago

People’s knowledge about objects has traditionally been probed using a combination of feature-listing and rating tasks. However, feature listing fails to capture nuances in what people know about how objects look — their visual knowledge — which cannot easily be described in words. Moreover, rating tasks are limited by the set of attributes that researchers even think to consider. By contrast, freehand sketching provides a way for people to externalize their visual knowledge about objects in an open-ended fashion. As such, sketch behavior provides a versatile substrate for asking a wide range of questions about visual object knowledge that go beyond the scope of a typical study. Here we introduce THINGS-drawings, a new crowdsourced dataset containing multiple freehand sketches of the 1,854 object concepts in the THINGS database (Hebart et al., 2019). THINGS-drawings contains fine-grained information about the stroke-by-stroke dynamics by which participants produced each sketch, as well as a rich set of other metadata, including ratings on various attributes, feature lists, and demographic characteristics of the participants contributing each sketch. As such, THINGS-drawings provides more comprehensive coverage of real-world visual concepts than previous sketch datasets (Eitz et al., 2012; Sangkloy et al., 2016; Jongejan et al., 2016), which contain less richly annotated sketches of a smaller number of concepts (i.e., ~100-300). This broader scope enables stronger tests of the capabilities of current artificial intelligence systems to understand abstract visual inputs, and thus a benchmark for driving the development of systems that display more human-like image understanding across visual modalities. Moreover, we envision THINGS-drawings as a resource to the vision science community for investigating the richer aspects of many perceptual and cognitive phenomena in a unified manner, including visual imagery, memorability, semantic cognition, and visual communication.

Acknowledgements: NSF CAREER Award #2047191

Talk 6, 12:00 pm, 52.26

Uncovering neural-based visual-orthographic representations from mental imagery

Shouyu Ling1,2, Lorna García Pentón3, Blair C. Armstrong1,3, Andy C.H. Lee1,4, Adrian Nestor1; 1Department of Psychology at Scarborough, University of Toronto, Toronto, Ontario, Canada, 2Department of Ophthalmology, University of Pittsburgh, Pittsburgh, PA, US, 3MRC Cognition & Brain Sciences Unit, University of Cambridge, Cambridge, UK, 4BCBL. Basque Center on Cognition, Brain, and Language, San Sebastián, Spain, 5Rotman Research Institute, Baycrest Centre, Toronto, Ontario, Canada

Clarifying the neural and representational basis of mental imagery has elicited significant interest in the study of visual recognition. Recently, numerous attempts have been directed at uncovering the structure and the content of visual imagery. However, these attempts have mostly targeted simple visual features (e.g., orientations, shapes, or single letters), limiting the theoretical and practical implications of this research. To address these limitations, the current study aimed to decode and to reconstruct the appearance of single words from mental imagery with the aid of functional magnetic resonance imaging (fMRI). We collected fMRI data from 13 healthy right-handed adults while they passively viewed or mentally imagined the appearance of three-letter concrete nouns with a consonant-vowel-consonant structure. Consistent with previous findings, multivariate analyses demonstrated that pairs of words can be discriminated from neural patterns when words are viewed and, also, when they are imagined. However, decoding relied more extensively on early visual areas in the former case, for perception, and more extensively on higher-level visual areas, such as the visual word form area (vWFA), in the latter case, for imagery. To assess and to visualize the representational content underlying successful decoding, imagery-based image reconstruction was conducted by mapping the neural patterns of visual words during imagery onto a representational feature space extracted from neural signals during perception. This analysis revealed successful levels of imagery-based image reconstruction for single words in the early visual cortex as well as in the vWFA. Thus, our findings speak to overlapping neural representations between imagery and perception, both in low-level visual areas and higher-order visual cortex. Further, they shed light on the fine-grained neural representations of visual-orthographic information during mental imagery.

Talk 7, 12:15 pm, 52.27

Putative excitatory and inhibitory neurons in the macaque inferior temporal cortex play distinct roles in core object recognition

Sachi Sanghavi1, Kohitij Kar2; 1University of Wisconsin–Madison, 2York University

Distributed neural population activity in the macaque inferior temporal (IT) cortex, which lies at the apex of the ventral visual stream hierarchy, is critical in supporting an array of object recognition behaviors. Previous research, however, has been agnostic to the relevance of specific cell types, inhibitory vs. excitatory, in the formation of "behaviorally sufficient" IT population codes that can accurately predict primate object confusion patterns. Therefore, here, we first compared the strength of behavioral predictions of neural decoding ("readout") models constructed from specific (putative) cell types in the IT cortex. We performed large-scale neural recordings while monkeys (n=3) fixated images (640) presented (100 ms) in their central (8 degrees) field of view. Monkeys (n=3) also performed binary object discrimination tasks (8 objects; 640 images; 28 binary tasks). We performed spike sorting based on PCA and spike shape to categorize the recorded neural signals into two groups: broad-spiking (104; putative excitatory) and narrow-spiking (33; putative inhibitory) neurons. We observed that decoding strategies (205 linking hypotheses tested) derived from excitatory neurons significantly outperform those produced by inhibitory neurons in overall accuracy and image-by-image match to monkey behavioral patterns. Given that current artificial neural network (ANN) models of the ventral stream (as documented in Brain-Score) explain ~50% of macaque IT neural variance and produce human-like accuracies in object recognition tasks, we compared their predictions of putative excitatory (Exc) vs. inhibitory (Inh) IT neurons. Interestingly, we observed that ANNs predict Exc neurons significantly better than Inh neurons (Exc-Inh = 10%; p<0.0001). Taken together, the correlative evidence for cell-type specificity in the linkage between IT population activity and object recognition behavior, along with the novel cell-type specific benchmarks (that disrupt the current Brain-Score ranking of the encoding models for macaque IT), provides valuable guidance for the next generation of more refined brain models.
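The count of 28 binary tasks follows from pairing each of the 8 objects with every other object once. The Python sketch below enumerates those pairs; the integer object labels are hypothetical placeholders, since the abstract does not name the objects.

```python
from itertools import combinations

# Hypothetical placeholder labels for the 8 objects in the task set.
n_objects = 8
object_ids = list(range(n_objects))

# Every unordered pair of distinct objects defines one binary discrimination task.
binary_tasks = list(combinations(object_ids, 2))
n_tasks = len(binary_tasks)

print(f"{n_objects} objects give {n_tasks} binary tasks")
```

This is the binomial coefficient "8 choose 2", that is 8 × 7 / 2 = 28, matching the number reported in the abstract.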

Acknowledgements: KK was supported by the Canada Research Chair Program. This research was undertaken thanks in part to funding from the Canada First Research Excellence Fund. We also thank Jim DiCarlo and Sarah Goulding for their support.