Object Recognition: Categories and features

Talk Session: Tuesday, May 21, 2024, 8:15 – 9:45 am, Talk Room 1

Talk 1, 8:15 am

Thalamocortical pathways underlying unconscious action-related visual information: evidence from the neural representations of binocularly suppressed tool images

Zhiqing Deng1, Fuying Zhu1, Jie Gao1, Zhiqiang Chen2,3,4, Peng Zhang2,3,5, Juan Chen1,6; 1Center for the Study of Applied Psychology, Guangdong Key Laboratory of Mental Health and Cognitive Science, and the School of Psychology, South China Normal University, Guangzhou, Guangdong Province, 510631, China, 2State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China, 3School of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China, 4Sino-Danish College, University of Chinese Academy of Sciences, Beijing 100049, China, 5Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230026, China, 6Key Laboratory of Brain, Cognition and Education Sciences (South China Normal University), Ministry of Education

In exploring the integral role of human subcortical structures, including the thalamus, brainstem, and basal ganglia, our study delves into their contribution to unconscious high-level perceptual processing, notably focusing on visual categorization. While extensive behavioral and fMRI studies have suggested the significance of subcortical pathways for residual vision within blindsight, the specific involvement of the thalamus, superior colliculus, and basal ganglia in unconscious high-level perceptual processing remains elusive in healthy humans. Here, we employed functional magnetic resonance imaging (fMRI) to investigate the representation of toolness (tools vs. non-tools) and shape (elongated vs. stubby) within subcortical structures while images were made invisible using continuous flash suppression (CFS). Both univariate analysis and multivoxel pattern analysis (MVPA) based on blood-oxygenation level-dependent (BOLD) signals revealed a significant toolness representation in the left thalamus, with the left ventral anterior thalamus (VA, motor-related) as the most important thalamic subregion contributing to the toolness representation under CFS. In the basal ganglia, the left striatum (STR) exhibited robust toolness representation in both univariate analysis and MVPA. Among cortical regions, only area 9a and anterior 10p (part of the dorsolateral prefrontal cortex, DLPFC) demonstrated significant toolness representation. Functional connectivity results indicated that elongated tools increased the connectivity between the bilateral VA in the thalamus and left 9a (DLPFC) and between the right VA and left STR in the basal ganglia, compared to elongated non-tools in CFS. Notably, dynamic causal modeling (DCM) results unveiled a thalamocortical pathway from the left VA in the thalamus to the left 9a (DLPFC) contributing to toolness representation when tool images are rendered invisible by CFS. 
These findings shed light on the role of the subcortical structures, particularly the thalamus and basal ganglia, and highlight a thalamocortical pathway in healthy humans engaged in unconscious high-level perceptual processing, especially unconscious visual categorization.

Acknowledgements: This research was supported by two National Natural Science Foundation of China grants (31970981 and 31800908) and by the National Science and Technology Innovation 2030 Major Program (STI2030-Major Projects 2022ZD0204802 to JC).

Talk 2, 8:30 am

A General Ability for Simple and Complex Ensemble Judgments

Isabel Gauthier1, Ting-Yun Chang1, Oakyoon Cha2; 1Vanderbilt University, 2Sungshin Women’s University

People can report summary statistics for various features about a group of objects. One theory is that different abilities support ensemble judgments about low-level features like color vs. high-level features like identity. Existing research mostly evaluates such claims based on evidence of correlations within and between feature domains. However, correlations between two identical tasks that only differ in the type of feature for ensemble judgments can be inflated by method variance. Another concern is that conclusions about high-level features are mostly based on faces. We used latent variable methods on data from 237 participants to investigate the abilities supporting low-level and high-level feature ensemble judgments. Ensemble judgment was measured with six distinct tests, each requiring judgments for a distinct low-level (orientation, lightness, aspect ratio) or high-level feature (bird species, Ziggerin identity, Transformer identity), using different task requirements in each task (mean estimation, mean matching, diversity comparison). We also controlled for other general visual abilities when examining how low-level and high-level ensemble abilities relate to each other. Confirmatory factor analyses showed a perfect correlation between the two factors, suggesting a single ability. A nested model comparison confirmed that using one ensemble perception (EP) factor rather than two did not impair model fit. There was a strong unique relationship (.9) between these two factors, beyond the influence of object recognition and perceptual speed. Additional results from 117 of the same participants also ruled out an important role for working memory in explaining the EP factor. Our results demonstrate that the ability common to a variety of ensemble judgments with low-level features is the same as that common to a variety of ensemble judgments with high-level features.

Acknowledgements: This work was supported by the David K. Wilson Chair Research Fund (Vanderbilt University) and the Taiwanese Overseas Pioneers Grants (TOP Grants) for PhD Candidates from Ministry of Science and Technology, Taiwan.

Talk 3, 8:45 am

Developmental changes in the precision of visual concept knowledge

Bria Long1, Wanjing Anya Ma1, Rebecca Silverman1, Jason Yeatman1, Michael C. Frank1; 1Stanford University

How precise is children’s visual concept knowledge, and how does this change across development? We created a gamified picture-matching task where children heard a word (e.g., “swordfish”) and had to choose the picture “that goes with the word.” Critically, we chose distractor items with high, medium, and low similarity to each target word, allowing us to examine the granularity of visual representations. We derived similarity via cosine embedding similarity of the target and distractor words in CLIP, a language-vision pre-training model (Radford et al., 2021). Photographs were taken from the THINGS+ dataset and combined with age-of-acquisition (AoA) ratings, yielding 108 items with unique targets and three distractors with estimated AoA ratings within 3 years of each other; we created 2AFC trials with high similarity distractors, 3AFC trials with high and medium similarity distractors, and 4AFC trials that included a low similarity distractor. Data were then collected from children in preschools (N=66, 3-5 year-olds), 6 elementary schools, and 9 charter schools across multiple states (N=1369, 6-11 year-olds) and adults online (N=205). We modeled changes in the proportion of children who chose a given image for a certain word over development using linear mixed-effect models. We found gradual developmental changes in children’s ability to identify the correct category. Error analysis from 3- and 4-AFC trials revealed that children were more likely to choose higher-similarity distractors as they grew older; children’s error patterns were increasingly correlated with CLIP target-distractor similarity. Overall, these analyses suggest a transition from coarse to finer-grained visual representations over early and middle childhood. Children’s visual concept knowledge gradually becomes more refined as children learn what distinguishes similar visual concepts from one another.
Broadly, these findings demonstrate the utility of combining gamified experiments and similarity estimates from computational models to probe the content of children’s evolving visual representations.
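The target-distractor similarity measure described in this abstract can be illustrated with a minimal sketch. It assumes the word embeddings have already been extracted from CLIP; the three-dimensional vectors, the candidate words other than "swordfish", and the helper names are hypothetical values chosen for illustration and are not taken from the study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_distractors(target_vec, candidates):
    """Sort candidate words from most to least similar to the target."""
    scored = [(word, cosine_similarity(target_vec, vec))
              for word, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real CLIP text embeddings (hypothetical values).
target = [0.9, 0.1, 0.2]  # stands in for "swordfish"
candidates = {
    "shark": [0.8, 0.2, 0.3],
    "canoe": [0.3, 0.8, 0.2],
    "teapot": [0.0, 0.2, 0.9],
}

ranking = rank_distractors(target, candidates)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

With such a ranking, the most similar candidate would serve as the high-similarity distractor and the least similar one as the low-similarity distractor; how the study set its actual similarity bins is not stated in the abstract.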

Acknowledgements: NIH K99 HD108386

Talk 4, 9:00 am

Two-dimensional attributes in tactile depictions that convey three-dimensionality of objects

Anchal Sharma1, Srinivasan Venkataraman1, PV Madhusudhan Rao1; 1Indian Institute of Technology Delhi

Three-dimensional objects are readily identifiable in standardized representations like perspective/orthographic projections in the visual domain. While readily understandable visually, these representations are not easily interpretable by blind individuals, especially those congenitally blind. This leaves a major gap in their education and points to a need for investigating a representation technique to convey the volumetric attributes of an object. In this study, 5 geometric objects and 20 tactile representations were provided to 20 participants comprising born-blind, late-blind, and blindfolded-sighted individuals (aged 18-44). Each object was presented on tactile sheets in four different representation styles in randomized order: Generator-director, surface development, isometric view, and dual-view. Participants were first familiarized with the original objects via haptic exploration. Subsequently, they were presented with the aforementioned tactile representations, and for each stimulus, were required to indicate whether it related to any of the original objects. If identified, detailed descriptions were recorded for what features of a particular representation style made it relatable to the identified 3D object. Analysis of the data revealed that surface development and generator-director representation styles were better associated with the objects. Participants’ open-ended responses offer deeper insights into why certain representations were preferred over others. Factors included a better indication of surface details and more discernible spatial arrangements. Interestingly, some objects were well identified irrespective of the representation style they were depicted in, pointing to certain unique features present in all of the styles. We discuss tentative constituents of what makes a 2D representation align closely with its 3D counterpart, including local salient features and specific spatial configurations. 
These results offer hints regarding the cues that allow for translation between 3D structures and 2D tactile depictions, pointing to interesting questions regarding features that are informative in the visual versus tactile domains, and have relevance for conveying graphical information to blind students.

Talk 5, 9:15 am

Error consistency between humans and machines as a function of presentation duration

Thomas Klein1,2, Wieland Brendel2, Felix Wichmann1; 1Universität Tübingen, 2Max Planck Institute for Intelligent Systems, Tübingen, Germany

Within the last decade, Artificial Neural Networks (ANNs) have emerged as powerful computer vision systems that match or exceed human performance on some benchmark tasks such as image classification. But whether current ANNs are suitable computational models of the human visual system remains an open question: While ANNs have proven to be capable of predicting neural activations in primate visual cortex, psychophysical experiments show behavioral differences between ANNs and human subjects as quantified by error consistency. Error consistency is typically measured by briefly presenting natural or corrupted images to human subjects and asking them to perform an n-way classification task under time pressure. But for how long should stimuli ideally be presented to guarantee a fair comparison with ANNs? Here we investigate the role of presentation time and find that it strongly affects error consistency. We systematically vary presentation times from 8.3ms to >1000ms, followed by a noise mask, and measure human performance and reaction times on natural, lowpass-filtered and noisy images. Our experiment constitutes a fine-grained analysis of human image classification under both image corruption and time pressure, showing that even drastically time-constrained humans who are exposed to the stimuli for only a single frame, i.e. 8.3ms, can still solve our 8-way classification task with success rates above chance. Importantly, the shift and slope of the psychometric function relating recognition accuracy to presentation time also depend on the type of corruption. In addition, we find that error consistency also depends systematically on presentation time. Together, our findings raise the question of how to properly set presentation time in human-machine comparisons.
Furthermore, the differential benefit of longer presentation times depending on image corruption is consistent with the notion that recurrent processing plays a role in human object recognition, at least for images that are difficult to recognise.
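The abstract does not spell out how error consistency is computed. The sketch below assumes the commonly used definition, Cohen's kappa over the trial-by-trial correctness of two observers (for example one human and one ANN). The function name and the trial data are made up for illustration.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa on two equal-length sequences of booleans (True = correct trial)."""
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # Observed consistency: fraction of trials where both are right or both are wrong.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Consistency expected from two independent observers with these accuracies.
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if c_exp == 1.0:
        # Kappa is undefined when both observers are always right or always wrong.
        return float("nan")
    return (c_obs - c_exp) / (1 - c_exp)

# Hypothetical correctness on eight shared trials.
human = [True, True, False, False, True, False, True, True]
model = [True, True, False, True, True, False, False, True]
kappa = error_consistency(human, model)
print(round(kappa, 3))
```

A kappa of 0 means the two observers agree no more often than their accuracies alone would predict, and 1 means they succeed and fail on exactly the same trials. Under this definition the measure is computed per condition, so it could be evaluated separately at each presentation time.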

Acknowledgements: Funded by the German Research Foundation (DFG) under Emmy Noether grant BR 6382/1-1. Supported by EXC grant 2064/1, project 390727645 and SFB 1233, project 276693517. TK would like to thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for support.

Talk 6, 9:30 am

The neural representation of the fake objects

Xieyi Liu1, Pinglei Bao1,2,3; 1Peking-Tsinghua Center for Life Sciences, Peking Univ., Beijing, China, 2IDG/McGovern Institute for Brain Research, Peking Univ., Beijing, China, 3School of Psychological and Cognitive Sciences, Peking Univ., Beijing, China

Earlier research suggested that the IT cortex's functional structure can be understood through an object space model with DCNN. However, category-specific regions in the IT cortex, such as areas dedicated to faces and bodies, imply that its organization might also be based on semantic categories. To distinguish between these two hypotheses, we used fMRI to measure human subjects' responses to artificial images, referred to as “fake objects”, which were generated with GAN and lacked semantic category information. We projected these generated fake objects onto the PC1-PC2 space, built with the fMRI responses to 500 real objects. We chose 100 fake objects based on their projections onto the space, resulting in a ring-like structure. Subjects were instructed to perform three tasks in separate scans: two image categorization tasks based on the images' projection onto the two orthogonal axes in the object space and a fixation color discrimination task. The study's results show that the IT cortex can be effectively modulated by these fake objects, and the modulation of each voxel can be accurately represented by the object space model as the projection on the preferred axis. This holds true even for voxels located in category-selective regions, such as the FFA and EBA. Furthermore, the preferred axis of each voxel in the IT cortex remained consistent across the three tasks, although the absolute selectivity decreased in the fixation task. Additionally, the modulation of the two different image categorization tasks was more noticeable in the frontal and parietal cortex. Our results demonstrate that the functional organization of the IT cortex can be better explained by the object space model than the semantic model, and the representation of object space is relatively stable across different tasks, whose outputs can be read out by the later stages of the brain.
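The statement that a voxel's modulation follows "the projection on the preferred axis" can be made concrete with a small sketch. It assumes each image already has coordinates in the PC1-PC2 space and that a voxel's response is linear in those coordinates with no intercept. The function names, the least-squares fit, and the idealized ring of 100 points are illustrative assumptions, not the authors' analysis pipeline.

```python
import math

def ring_points(n, radius=1.0):
    """n points evenly spaced on a circle in the PC1-PC2 plane."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def predicted_response(point, axis_angle, gain=1.0):
    """Response modeled as the projection of the image coordinates onto the preferred axis."""
    x, y = point
    return gain * (x * math.cos(axis_angle) + y * math.sin(axis_angle))

def fit_preferred_axis(points, responses):
    """Least-squares fit of r = a*x + b*y; returns the axis angle atan2(b, a) in radians."""
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxr = sum(x * r for (x, _), r in zip(points, responses))
    syr = sum(y * r for (_, y), r in zip(points, responses))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("points do not span the plane")
    # Solve the 2x2 normal equations by Cramer's rule.
    a = (sxr * syy - syr * sxy) / det
    b = (syr * sxx - sxr * sxy) / det
    return math.atan2(b, a)

# Simulate a noiseless voxel whose preferred axis lies at 30 degrees (hypothetical).
points = ring_points(100)
true_angle = math.radians(30)
responses = [predicted_response(p, true_angle) for p in points]
estimated_angle = fit_preferred_axis(points, responses)
print(round(math.degrees(estimated_angle), 2))
```

Comparing the angle fitted separately for each task would be one way to express the reported stability of the preferred axis across tasks, while a change in the fitted gain would correspond to a change in absolute selectivity.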

Acknowledgements: National Science and Technology Innovation 2030-Major Project 2022ZD0204803, Natural Science Foundation of China Grants NSFC 32271081 and 32230043 to P.B.; Natural Science Foundation of China Grant NSFC 32200857, China Postdoctoral Science Foundation Grants 2023M740125, 2022T150021, 2021M700004 to J.Y.