Object Recognition
Talk Session: Sunday, May 17, 2026, 8:15 – 9:45 am, Talk Room 2
Moderator: Zeynep Saygin, The Ohio State University
Talk 1, 8:15 am, 31.21
Characterizing visual feature encoding in Drosophila Visual Projection Neurons LPLC1 and LPLC2
Bryce Hina1, Haley Croke1, Anthony Moreno-Sanchez2, Omika Wadhwa2, Natalie Smolin1, Jessica Ausborn2, Catherine von Reyn1,2; 1Drexel University, School of Biomedical Engineering, Science & Health Systems, 2Drexel University College of Medicine, Neurobiology & Anatomy
Visual systems play a crucial role in quickly detecting important features in our surroundings, such as potential threats. However, understanding how visual systems encode visual features has been challenging, as many feature-encoding cell types have yet to be fully identified. Here we leverage the fruit fly, Drosophila melanogaster, in which all neurons in the visual system have recently been identified and mapped at EM resolution. We focus on a class of neurons called visual projection neurons (VPNs), which have been hypothesized to selectively encode visual features. We provide the first evidence that VPNs encode visual information through spiking activity generated at a well-defined axon initial segment. By recording single-cell activity during visual stimulation, we reveal that two VPN cell types (LPLC1 and LPLC2) exhibit robust spiking responses to both expanding (looming) objects and small, translating objects, suggesting VPNs are less selective for visual features than previously hypothesized. Next, we mapped VPN receptive fields using expanding disks and translating stimuli, and found that receptive fields based on spiking responses are smaller than those reported with Ca2+ imaging. Finally, we developed a pipeline for rapid receptive field mapping to further investigate differences in VPN tuning across a stimulus parameter space and to examine habituation in visual responses. Our work suggests that visual feature encoding enlists spiking activity across multiple VPN cell types, generating a population code more similar to that seen in vertebrate visual processing. It therefore has the potential to uncover these population codes and to establish how they arise through cellular and circuit mechanisms.
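For readers unfamiliar with spiking receptive field estimation, a minimal sketch of one standard approach follows: trial-averaged evoked spike counts over a grid of stimulus positions, thresholded at half-maximum to estimate RF extent. The array names and the half-max criterion are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def spiking_receptive_field(spike_counts, grid_shape, half_max=0.5):
    """Estimate a receptive field from spiking responses.

    spike_counts : array, shape (n_trials, n_positions)
        Evoked spike counts for a probe stimulus (e.g., a small
        translating or expanding disk) shown at each grid position.
    grid_shape : (rows, cols) of the stimulus position grid.
    half_max : fraction of the peak response used to define RF extent.
    """
    # Trial-averaged response map over stimulus positions.
    mean_map = spike_counts.mean(axis=0).reshape(grid_shape)

    # Subtract baseline (weakest position) and threshold at half-max
    # to obtain a binary receptive-field mask.
    rel = mean_map - mean_map.min()
    mask = rel >= half_max * rel.max()

    # RF "size" as the fraction of probed positions above threshold.
    rf_size = mask.mean()
    return mean_map, mask, rf_size

# Example with synthetic data: 20 trials x 64 positions on an 8x8 grid.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(20, 64)).astype(float)
counts[:, 27] += 10  # one strongly driven position
rate_map, rf_mask, rf_size = spiking_receptive_field(counts, (8, 8))
print(f"RF covers {rf_size:.1%} of probed positions")
```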
Funding was provided by the National Science Foundation Grant No. IOS-1921065 (Catherine R. von Reyn) and the National Institutes of Health NINDS R01NS118562 (Catherine R. von Reyn, Jessica Ausborn) and NEI F31EY037197 (Bryce W. Hina).
Talk 2, 8:30 am, 31.22
Characterizing the inputs to infants’ object category representations
Jane Yang1, Tarun Sepuri1, Alvin W.M. Tan2, Khai Loong Aw2, Michael C. Frank2, Bria Long1; 1University of California San Diego, 2Stanford University
Infants acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? Here, we capitalize on innovations in data and in machine learning to characterize the structure of naturalistic infant experience. We analyzed egocentric videos from the BabyView dataset (N=31, 5–36 months, 868 hours), using an object detection model (YOLOE) to detect 205 categories from a vocabulary checklist in 3.68M frames (1 fps). Manual annotations of 480 frames suggested reasonable detection performance for 163 categories (average precision = 0.6, SD = 0.26); false alarms were filtered by comparing vision embeddings of cropped objects to text embeddings of their predicted labels in a multimodal CLIP model, retaining detections with cosine similarity > 0.26. Consistent with prior work, infants' object exposure was highly skewed: a few categories (chair, toy, book) dominated infants' visual experience, while most categories appeared rarely. We compared infants' object exposures to exemplars of categories in a curated image dataset (THINGS) by examining average category embeddings across all exemplars in both datasets in a multimodal encoder (CLIP) and a self-supervised visual encoder (DINOv3). Category-level similarity between BabyView and THINGS varied widely (e.g., zebra, r = 0.75; slipper, r = 0.15), as infants encountered object categories in non-canonical forms—occluded, cluttered, as depictions, and from unusual angles. Nonetheless, representational dissimilarity matrices of the between-category similarity in BabyView vs. THINGS were moderately correlated (CLIP Spearman's rho = 0.37, p<.01; DINOv3 Spearman's rho = 0.31, p<.01). These results suggest that infants experience infrequent exemplars of many categories, with notably different kinds of diversity than are captured in modern curated datasets. Models seeking to emulate human-like category learning must grapple with how children learn so efficiently from these radically different experiences with object categories.
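The CLIP-based false-alarm filter can be sketched as follows, using the reported cosine-similarity threshold of 0.26. The checkpoint (openai/clip-vit-base-patch32) and the prompt template are assumptions, since the abstract does not specify them.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the abstract does not name the CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_detection(crop: Image.Image, label: str, threshold: float = 0.26) -> bool:
    """Return True if the cropped detection's image embedding matches
    the text embedding of its predicted label (cosine similarity)."""
    # Prompt template is an assumption, not the authors' choice.
    inputs = processor(text=[f"a photo of a {label}"], images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize projected embeddings, then take the dot product.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (img * txt).sum(dim=-1).item()
    return cosine > threshold

# Usage: keep only detections whose crop matches its predicted label.
# detections = [(c, l) for c, l in detections if keep_detection(c, l)]
```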
We thank the families who participated in the BabyView Dataset. This work was supported by NIH R00HD108386 (to B.L.), Schmidt Futures, Meta, the Stanford Center for the Study of Language and Information, and the Stanford HAI Hoffman-Yee program.
Talk 3, 8:45 am, 31.23
vOT Specialization for Reading in Congenitally Blind Individuals
Maria Czarnecka1, Florencia Martinez Addiego2, Marcin Szwed3; 1Jagiellonian University, 2Georgia Institute of Technology, 3Jagiellonian University
A central component of reading is linking written symbols to their meanings. The left ventral occipitotemporal cortex (vOT) is considered a key interface for this function. This study examines what shapes the development of the neural correlates of reading—whether they stem from an innate sensitivity to shapes refined through visual experience, or instead from mechanisms that emerge independently of vision. To investigate this, we conducted an fMRI experiment with 21 congenitally blind and 21 sighted participants who read words through touch (Braille) or vision (print), respectively. The stimuli were designed to tap into multiple processing levels: low-level (visual for print, spatial for Braille), orthographic, and semantic. Using representational similarity analysis (RSA), we observed that both orthographic and semantic representations are present in the left vOT of blind readers, suggesting that this region adopts a reading-related role even without visual input. The primary group difference appeared at the low-level stage: in sighted participants, low-level processing was localized to early visual cortex, whereas in blind participants it recruited sensorimotor regions. Overall, these findings demonstrate that despite differences in low-level sensory pathways, the left vOT assumes a similar functional role in reading regardless of visual experience.
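The RSA described above is typically implemented by correlating the upper triangles of a neural representational dissimilarity matrix (RDM) with a model RDM. A minimal sketch, assuming a conditions-by-voxels pattern matrix from left vOT and a hypothesized model RDM (e.g., orthographic or semantic distances between words); all variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rsa_correlation(patterns, model_rdm):
    """Correlate a neural RDM with a model RDM.

    patterns : array, shape (n_conditions, n_voxels)
        One multivoxel response pattern per word/stimulus.
    model_rdm : array, shape (n_conditions, n_conditions)
        Hypothesized dissimilarities (e.g., orthographic edit
        distance or semantic distance between words).
    """
    # Neural RDM: 1 - Pearson correlation between condition patterns.
    neural_rdm = squareform(pdist(patterns, metric="correlation"))

    # Compare only the upper triangles (RDMs are symmetric).
    iu = np.triu_indices_from(neural_rdm, k=1)
    rho, p = spearmanr(neural_rdm[iu], model_rdm[iu])
    return rho, p

# Example with synthetic data: 40 words x 300 voxels.
rng = np.random.default_rng(1)
patterns = rng.normal(size=(40, 300))
model = squareform(pdist(rng.normal(size=(40, 5))))
print(rsa_correlation(patterns, model))
```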
Talk 4, 9:00 am, 31.24
Linking Functional Architecture to Pathology: Neural Overlap and Inter-Category Associations Dictate Stroke-Induced Visual Recognition Deficits
Laura Soen1, Roos Malpart1, Céline Gillebert1, Hans Op de Beeck1; 1KU Leuven
Visual recognition relies on specialized, category-selective functional architecture within the occipitotemporal cortex (OTC). While focal stroke damage can induce classic category-specific deficits (e.g., prosopagnosia, pure alexia, object agnosia), lesions rarely affect a single category. The resulting patterns of co-occurring impairments remain theoretically and clinically unclear. We provide a comprehensive, multi-modal framework to systematically study OTC organization and demonstrate how this structure dictates stroke-induced deficits. We employed a three-way validation approach integrating behavioral, neural, and patient data. We developed and validated the Word, Object, and Face Categorization Test (WOF-CT) to assess recognition across 10 neuropsychologically relevant categories in three cohorts: young adults (age < 30, N = 250), healthy older adults (age > 50, N = 95), and stroke patients with damage limited to the posterior cerebral artery territory (N = 22). In healthy participants (N = 28), we used fMRI with multivoxel pattern analysis (MVPA) to map the fine-grained neural overlap and associations within the OTC. Behavioral analyses established systematic associations that were stronger within the animate and inanimate domains than across them. These behavioral associations were validated by the fMRI findings, which confirmed distinct neural overlap within these domains. The stroke patient data revealed deficit patterns that can be explained by the behavioral data and the neural organization of the healthy brain. Furthermore, our data showed a differential age-related decline: word and face recognition declined with age, but house and animal recognition did not, demonstrating category-specific vulnerability. Our findings demonstrate that stroke-induced visual deficits follow structured patterns dictated by the fundamental organization of the OTC. The observed neural overlap between related categories explains why OTC lesions produce co-occurring deficits in associated categories. This research establishes a neuro-behavioral framework to interpret and anticipate visual recognition impairments. These empirically derived principles can inform and guide the development of more effective and personalized neurorehabilitation protocols in the future.
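One common way to quantify neural overlap with MVPA is cross-validated pairwise decoding of categories from OTC voxel patterns, where lower decoding accuracy between two categories indicates more overlapping representations. The sketch below follows that convention; the authors' exact pipeline is not specified in the abstract.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def pairwise_confusability(patterns, labels, n_folds=5):
    """Cross-validated pairwise decoding of categories from voxel patterns.

    patterns : array, shape (n_trials, n_voxels)
    labels   : array, shape (n_trials,), one category label per trial
    Returns a categories x categories accuracy matrix; lower accuracy
    for a pair implies greater neural overlap between those categories.
    """
    cats = np.unique(labels)
    acc = np.full((len(cats), len(cats)), np.nan)
    for i, a in enumerate(cats):
        for j, b in enumerate(cats):
            if j <= i:
                continue
            # Restrict to trials of the two categories and decode.
            sel = np.isin(labels, [a, b])
            scores = cross_val_score(LinearSVC(), patterns[sel],
                                     labels[sel], cv=n_folds)
            acc[i, j] = acc[j, i] = scores.mean()
    return cats, acc

# Example: 200 trials x 500 voxels across 10 categories.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 500))
y = np.repeat(np.arange(10), 20)
categories, overlap = pairwise_confusability(X, y)
```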
Fonds Wetenschappelijk Onderzoek
Talk 5, 9:15 am, 31.25
Temporal Dynamics of IT Representations Depend on Image-Manifold Scale
Ammar I Marvi1, Jacob S Prince1, George A Alvarez1,2, Talia Konkle1,2; 1Harvard University, 2Kempner Institute
The time course of neural activity offers a window into the mechanisms of high-level visual processing. However, challenges posed by high-resolution brain measurement across space, time, and natural images have left many questions about the nature and time-scale of representational dynamics unanswered. We analyzed single-unit macaque electrophysiology (Triple-N dataset; Li et al., 2025) in 33 patches of inferotemporal (IT) cortex to explore how their representational geometry changed over time. Using time-resolved analyses, we measured representational dynamics at various image ‘scales,’ sampling image sets of increasing size based on overall response magnitude such that smaller sets contained the most preferred images. We reasoned that a larger, less selective set of images would span a larger portion of the natural image manifold to reveal global tuning dynamics, while smaller scales—using a subset of highly activating images—would reveal more granular, local changes in geometry. We find scale-dependent representational dynamics across IT cortex. With global image sets, geometry evolved smoothly and generalization via linear decoding remained relatively stable over time. In contrast, local sets yielded clearer temporal reconfiguration: representational geometry changed over ~100–250 ms post-image onset; e.g., one medial face patch showed a sharp transition between two distinct geometric configurations. More generally, the regional time-time geometry similarity matrices were more structured for local image sets than for the full image set, with effective dimensionality (ED) compressed by up to 50%. Critically, these scale-linked dynamics were greatly reduced when we disrupted structure with scale-matched, rank-randomized controls (median ED change 0.6%). Together, our results suggest that IT geometry appears globally stable when measured over a diverse set of images yet locally dynamic for preferred inputs. This scale-dependent view reconciles findings of temporal stability with recent reports of rapid tuning shifts and points toward dynamic coding mechanisms that locally refine visual representations over time.
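Effective dimensionality here is presumably the participation ratio, ED = (sum_i lambda_i)^2 / sum_i lambda_i^2, computed from the eigenvalues of the response covariance; time-time geometry similarity can be sketched as the correlation between RDM upper triangles at pairs of time bins. A minimal sketch under those assumptions, with synthetic data in place of the Triple-N recordings.

```python
import numpy as np

def effective_dimensionality(responses):
    """Participation ratio: (sum(eig))^2 / sum(eig^2) over the
    eigenvalues of the covariance of responses (n_images, n_units)."""
    cov = np.cov(responses, rowvar=False)
    eig = np.linalg.eigvalsh(cov)
    eig = np.clip(eig, 0, None)  # guard tiny negative numerical noise
    return eig.sum() ** 2 / (eig ** 2).sum()

def time_time_geometry(responses_t):
    """Correlate representational geometry across time bins.

    responses_t : array, shape (n_times, n_images, n_units)
    Returns an (n_times, n_times) matrix of Pearson correlations
    between the RDM upper triangles at each pair of time bins.
    """
    n_t, n_img, _ = responses_t.shape
    iu = np.triu_indices(n_img, k=1)
    rdms = np.empty((n_t, len(iu[0])))
    for t in range(n_t):
        # RDM as 1 - correlation between image response patterns.
        rdms[t] = (1 - np.corrcoef(responses_t[t]))[iu]
    return np.corrcoef(rdms)

# Example: 20 time bins, a "local" set of 50 preferred images, 100 units.
rng = np.random.default_rng(2)
resp = rng.normal(size=(20, 50, 100))
print(effective_dimensionality(resp[0]), time_time_geometry(resp).shape)
```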
This work was supported by funding from the Kempner Institute (to T.K.) and the Harvard Department of Psychology.
Talk 6, 9:30 am, 31.26
Probing the granularity of human-machine alignment
Yash Mehta1, Raj Gauthaman1, Michael Bonner1; 1Johns Hopkins University
Deep neural networks trained on fine-grained object classification (e.g., ImageNet 1000-way) are currently the leading models of the primate ventral visual stream. However, it remains unclear whether dense semantic supervision is a prerequisite for brain-model alignment, or whether coarser categorical distinctions suffice. To address this, we derived category boundaries directly from visual features, avoiding external semantic hierarchies. Using recursive PCA on pre-trained AlexNet and ViT representations, we generated hierarchical label sets ranging from 2 to 64 coarse categories while keeping the training images constant. We then trained networks from scratch on these coarse tasks and evaluated their representations against the Natural Scenes Dataset (fMRI) and the THINGS database (behavioral similarity). We found that fine-grained supervision is not necessary for brain-like representations. Strikingly, models trained to make only broad distinctions (e.g., just a handful of coarse categories) achieved the highest alignment with human behavioral similarity judgments, significantly outperforming standard 1000-class models. For neural alignment, models trained at intermediate granularity (32–64 categories) matched or surpassed fine-grained models in early visual cortex and maintained comparable performance in the ventral stream. Additionally, coarse-trained models demonstrated superior stability when tested on out-of-distribution synthetic stimuli, suggesting that their alignment with human vision is more robust than that of conventional networks. Control analyses confirmed that these representations are structurally distinct from low-dimensional projections of fine-grained models. These findings suggest that networks trained on broader categorical distinctions better capture the most cognitively salient organizing principles of human vision.
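The recursive-PCA labeling scheme can be sketched as follows: at each level, images within a cluster are split by the sign of their projection onto the cluster's first principal component, doubling the number of coarse categories. The sign-split rule and the synthetic feature matrix are assumptions; the abstract does not give the exact splitting criterion.

```python
import numpy as np

def recursive_pca_labels(features, depth):
    """Assign 2**depth coarse labels by recursively splitting on
    the first principal component within each cluster.

    features : array, shape (n_images, n_dims), e.g., pre-trained
    AlexNet or ViT embeddings of the training images.
    """
    labels = np.zeros(len(features), dtype=int)
    for _ in range(depth):
        new_labels = np.zeros_like(labels)
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            x = features[idx] - features[idx].mean(axis=0)
            # First principal component via SVD of the centered block.
            _, _, vt = np.linalg.svd(x, full_matrices=False)
            side = (x @ vt[0]) > 0  # sign of PC1 projection
            new_labels[idx] = 2 * c + side
        labels = new_labels
    return labels

# Example: 2, 4, ..., 64 coarse categories from synthetic features.
rng = np.random.default_rng(3)
feats = rng.normal(size=(1000, 512))
for depth in range(1, 7):
    print(depth, np.bincount(recursive_pca_labels(feats, depth)))
```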