Object Recognition: Neural mechanisms

Talk Session: Monday, May 20, 2024, 8:15 – 9:45 am, Talk Room 2

Talk 1, 8:15 am

Vernier acuity in single neurons of monkey inferior temporal cortex

Sini Simon1, SP Arun1; 1Indian Institute of Science

Humans can discriminate small offsets between nearly collinear lines, a phenomenon known as vernier acuity. However, its neural correlates have not been investigated. To address this gap, we asked whether monkeys, like humans, show vernier acuity and which brain regions form its neural basis. We created stimuli containing a square frame with a disk that could be moved along a horizontal or vertical line, in the presence of a horizontal or vertical bar at the center. If vernier acuity is present in behavioural or neural responses, we should observe greater sensitivity to small changes in the horizontal position of the disk when a nearby bar is oriented vertically rather than horizontally. Conversely, there should be greater sensitivity to changes in vertical position when the nearby bar is oriented horizontally rather than vertically. We tested these predictions in monkeys performing a same-different task, and in neural responses recorded from their inferior temporal cortex while they passively viewed the same stimuli. In Experiment 1, all three monkeys trained on the same-different task showed higher sensitivity to position changes in the vernier conditions than in the non-vernier conditions. In Experiment 2, we used wireless recordings from the inferior temporal cortex, a region critical for object recognition, while monkeys viewed these stimuli in a fixation task. Here too, neural dissimilarity between stimuli was greater in the vernier condition than in the non-vernier condition. Interestingly, this effect arose late in the neural response, suggesting that it emerges through computation rather than being simply inherited from early visual areas. Taken together, our results show that monkeys, like humans, experience vernier acuity, and that this effect is likely driven by single neurons in the inferior temporal cortex.
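The vernier prediction reduces to comparing neural dissimilarities for matched disk displacements when the bar is orthogonal (vernier) versus parallel (non-vernier) to the displacement. A minimal Python sketch, assuming population responses are stored as NumPy arrays (one value per neuron, optionally per time bin) and that dissimilarity is the Euclidean distance between response vectors. The abstract does not specify the metric, and the function names and the normalized `vernier_index` contrast are hypothetical illustrations, not the authors' analysis:

```python
import numpy as np

def neural_dissimilarity(resp_a: np.ndarray, resp_b: np.ndarray) -> float:
    """Euclidean distance between two population response vectors (one entry per neuron)."""
    return float(np.linalg.norm(resp_a - resp_b))

def mean_pair_dissimilarity(responses: np.ndarray, pairs: list[tuple[int, int]]) -> float:
    """Average dissimilarity over stimulus pairs; responses has shape (n_stimuli, n_neurons)."""
    return float(np.mean([neural_dissimilarity(responses[i], responses[j]) for i, j in pairs]))

def dissimilarity_timecourse(responses: np.ndarray, pairs: list[tuple[int, int]]) -> np.ndarray:
    """Mean pair dissimilarity in each time bin; responses has shape
    (n_stimuli, n_neurons, n_timebins)."""
    return np.array([mean_pair_dissimilarity(responses[:, :, t], pairs)
                     for t in range(responses.shape[2])])

def vernier_index(d_vernier: float, d_nonvernier: float) -> float:
    """Hypothetical normalized contrast in [-1, 1]; > 0 means the same disk displacement
    is more discriminable with an orthogonal bar (vernier) than a parallel one."""
    return (d_vernier - d_nonvernier) / (d_vernier + d_nonvernier)
```

Computing `vernier_index` separately for each time bin of the two conditions' timecourses is one way to ask when the vernier advantage first appears in the response.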

Acknowledgements: This work was supported through a Senior Fellowship from the DBT-Wellcome India Alliance to SPA.

Talk 2, 8:30 am

Does Leveraging the Human Ventral Visual Stream Improve Neural Network Robustness?

Zhenan Shao1,2, Linjian Ma3, Bo Li3,4, Diane M. Beck1,2; 1Department of Psychology, University of Illinois Urbana-Champaign, 2Beckman Institute, University of Illinois Urbana-Champaign, 3Department of Computer Science, University of Illinois Urbana-Champaign, 4Department of Computer Science, University of Chicago

Human object recognition is robust to a variety of object transformations, including changes in lighting, rotations, and translations, as well as to other image manipulations such as the addition of various forms of noise. Invariance has been shown to emerge gradually along the ventral visual stream, with later regions showing higher tolerance to object transformations. In contrast, despite their unprecedented performance on numerous visual tasks, Deep Neural Networks (DNNs) fall short of human-level robustness to image perturbations such as adversarial attacks, even when those perturbations are imperceptible to humans. One potential explanation for this difference is that brains, but not DNNs, build increasingly disentangled and therefore robust object representations at each successive stage of the ventral visual stream. Here, we asked whether training DNNs to emulate human neural representations can enhance their robustness and, more importantly, whether successive stages of the ventral visual stream confer progressively greater robustness, reflecting the evolving representations that may be crucial for human perceptual invariance. We extracted neural activity patterns from five hierarchical regions of interest (ROIs) in the ventral visual stream (V1, V2, V4, LO, and TO) in a 7T fMRI dataset (Allen et al., 2022) collected while human participants viewed natural images. DNN models were trained to perform image classification while aligning their penultimate-layer representations with neural activity from each ROI. Our findings reveal not only a significant improvement in DNN robustness but also a hierarchical effect: training with neural representations from later stages of the visual hierarchy yielded greater robustness gains. These results show that ventral visual cortex representations improve DNN robustness, and they support the gradual emergence of robustness along the ventral visual stream.
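The abstract does not state how penultimate-layer features were aligned with voxel patterns. One common option, assumed here purely for illustration, is representational similarity: build a representational dissimilarity matrix (RDM) over a batch of images from the network features and from the fMRI patterns, and penalize disagreement between the two alongside the classification loss. The sketch below uses NumPy only to evaluate this objective; the function names and the weight `lam` are hypothetical, and in practice the same expression would be written in an autodiff framework so it can be minimized.

```python
import numpy as np

def rdm(features: np.ndarray) -> np.ndarray:
    """RDM: 1 - Pearson r between rows (one row per image, columns = units or voxels)."""
    return 1.0 - np.corrcoef(features)

def upper(m: np.ndarray) -> np.ndarray:
    """Off-diagonal upper-triangle entries of a square matrix."""
    i, j = np.triu_indices(m.shape[0], k=1)
    return m[i, j]

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def alignment_loss(penultimate: np.ndarray, voxels: np.ndarray) -> float:
    """0 when the network and ROI RDMs are perfectly correlated (hypothetical alignment term)."""
    r = np.corrcoef(upper(rdm(penultimate)), upper(rdm(voxels)))[0, 1]
    return float(1.0 - r)

def combined_loss(logits, labels, penultimate, voxels, lam: float = 1.0) -> float:
    """Classification loss plus a weighted neural-alignment penalty for one batch of images."""
    return cross_entropy(logits, labels) + lam * alignment_loss(penultimate, voxels)
```

Because RDMs are image-by-image, this comparison works even though the network layer and an ROI have different numbers of units and voxels; swapping in a different ROI's patterns changes only the `voxels` argument.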

Acknowledgements: This work used NCSA Delta GPU through allocation SOC230011 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

Talk 3, 8:45 am

Human EEG and artificial neural networks reveal disentangled representations of object real-world size in natural images

Zitong Lu1, Julie D. Golomb1; 1The Ohio State University

Remarkably, human brains can accurately perceive and process the real-world size of objects despite vast differences in distance and perspective. While previous studies have examined this ability, distinguishing it from other visual properties, such as depth, has been challenging. Using the THINGS EEG2 dataset, which combines high-temporal-resolution human brain recordings with more ecologically valid naturalistic stimuli, our study takes an innovative approach to disentangle neural representations of object real-world size from visual size and perceived real-world depth in a way that was not previously possible. With this dataset, our EEG representational similarity results reveal a pure representation of object real-world size in human brains. We report a representational timeline of visual object processing: pixel-wise differences appeared first, then real-world depth and retinal size, and finally real-world size. Additionally, we input both these naturalistic images and object-only images without natural backgrounds into artificial neural networks. Consistent with the human EEG findings, we also disentangled the representation of object real-world size from visual size and real-world depth in all three types of artificial neural networks (visual-only ResNet, visual-language CLIP, and language-only Word2Vec). Moreover, our multi-modal representational comparison framework across human EEG and artificial neural networks reveals real-world size as a stable, higher-level dimension in object space that incorporates both visual and semantic information. Our research provides a detailed characterization of object processing, which advances our understanding of object space and offers insights for constructing more brain-like visual models.
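Separating real-world size from correlated properties is often done with partial correlation in representational similarity analysis: correlate the EEG RDM with a real-world-size model RDM while regressing out visual-size and depth model RDMs, and repeat this at every time point to obtain a timeline. The abstract does not detail its statistics, so the NumPy sketch below is an assumed illustration with hypothetical function names and a simplified rank transform that does not correct for ties:

```python
import numpy as np

def ranks(x) -> np.ndarray:
    """Ranks 0..n-1 (ties broken by position; a sketch, not a tie-corrected Spearman rank)."""
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def partial_spearman(y, x, covariates) -> float:
    """Spearman-style correlation between y and x after regressing the
    covariates (and an intercept) out of the ranks of both."""
    ry, rx = ranks(y), ranks(x)
    design = np.column_stack([np.ones(len(ry))] + [ranks(c) for c in covariates])

    def resid(v: np.ndarray) -> np.ndarray:
        beta, *_ = np.linalg.lstsq(design, v, rcond=None)
        return v - design @ beta

    return float(np.corrcoef(resid(ry), resid(rx))[0, 1])

def size_timecourse(eeg_rdms, size_rdm, visual_size_rdm, depth_rdm) -> np.ndarray:
    """eeg_rdms: (n_times, n_pairs) upper-triangle EEG RDM entries at each time point.
    Returns the partial correlation with real-world size at each time point."""
    return np.array([partial_spearman(e, size_rdm, [visual_size_rdm, depth_rdm])
                     for e in eeg_rdms])
```

The same function applied to network-layer RDMs instead of EEG RDMs would give the corresponding comparison for an artificial neural network.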

Acknowledgements: NIH R01-EY025648 (JG), NSF 1848939 (JG)

Talk 4, 9:00 am

Visual processing of soft objects automatically activates physics-based representations in the human brain

Wenyan Bi1, Qi Lin2, Kailong Peng1, Aalap Shah1, Ilker Yildirim1; 1Yale University, 2Riken

When we encounter soft objects, such as a garment draping over a surface or a pillow being pressed, we see in their wrinkles and folds not just low-level properties such as edges, contours, or colors, but also seemingly higher-level ones, such as mass, elasticity, and stiffness. What neural and computational mechanisms underlie these percepts? Previously, using psychophysics and modeling, we found that human soft object perception is best explained by a model that incorporates "intuitive physics", as opposed to performance-matched alternatives that only consider pattern recognition (implemented as a CNN). Here, we hypothesize that, in the human brain, these intuitive physics-based representations (i) are computed spontaneously during visual processing, i.e., in the absence of physics-related tasks, (ii) reside in the same regions that support physical reasoning about rigid objects, and (iii) generalize across qualitatively different scene configurations. To test these hypotheses, we used fMRI to scan participants (N=20) as they passively viewed animations of cloths at two stiffness levels (stiff and soft) undergoing naturalistic deformations in four different scene configurations (e.g., blowing in the wind, draping over an uneven surface). We identified each participant's regions of interest ("physics-ROI") using a previously validated localizer of physical inference based on rigid objects. Univariate analysis showed that both physics-ROI and V1 were modulated by the soft vs. stiff contrast, but physics-ROI was modulated to a significantly greater degree than V1. Moreover, multivariate analysis revealed successful cross-scene decoding of stiffness level in physics-ROI (ACC=0.61). Notably, fine-grained rankings of cross-decoding accuracy across scene configurations were well captured by representations inferred by our physics-based computational model (Kendall’s τ=0.64) but not by those of the performance-matched CNN (τ=-0.02). These results help reveal how physics-based representations of soft objects are implemented in the brain.
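Cross-scene decoding trains a stiffness classifier on trials from one scene configuration and tests it on another, so above-chance accuracy implies a stiffness code shared across scenes; ranking those accuracies and comparing the ranking with a model's predictions uses Kendall's τ. The sketch below assumes a correlation-based nearest-centroid classifier (the abstract does not name the classifier) and Kendall's tau-a without tie correction; all function names are hypothetical.

```python
from itertools import permutations

import numpy as np

def nearest_centroid_accuracy(train_X, train_y, test_X, test_y) -> float:
    """Assign each test pattern to the class centroid it correlates with most."""
    classes = np.unique(train_y)
    centroids = [train_X[train_y == k].mean(axis=0) for k in classes]
    scores = [[np.corrcoef(x, cen)[0, 1] for cen in centroids] for x in test_X]
    pred = classes[np.argmax(scores, axis=1)]
    return float(np.mean(pred == test_y))

def cross_scene_accuracies(X, y, scene) -> dict:
    """Accuracy for every ordered (train scene, test scene) pair with train != test.
    X: (n_trials, n_voxels) patterns; y: stiffness labels; scene: scene labels."""
    scenes = np.unique(scene)
    return {(a, b): nearest_centroid_accuracy(X[scene == a], y[scene == a],
                                              X[scene == b], y[scene == b])
            for a, b in permutations(scenes, 2)}

def kendall_tau(a, b) -> float:
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    s = sum(np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
            for i in range(n) for j in range(i + 1, n))
    return float(2 * s / (n * (n - 1)))
```

To compare with a model, one would order the scene pairs consistently, build a matching list of model-derived values for the same pairs, and pass the decoding accuracies and the model values to `kendall_tau`.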

Talk 5, 9:15 am

Asymmetry of neural circuits for word and face recognition in readers of Roman and Arabic script

Zahra Hussain1,2, Sarah Abdou2, Aaliya Sageer1, Julien Besle1,2; 1University of Plymouth, 2American University of Beirut

Words and faces preferentially engage regions in opposite hemispheres of the brain, with corresponding differences in recognition ability between the visual fields. These face and word asymmetries have not been compared simultaneously in readers of different scripts. We compared cortical and behavioural laterality among monolingual readers of Roman or Arabic script and bilingual readers of both scripts, to evaluate whether reading experience and script properties (e.g., reading direction) alter the representation of words and faces. Cortical activation was measured using 3T fMRI in 21 subjects (6-8 per language group; 3 groups) who viewed faces, English and Arabic words, and control stimuli while performing a one-back task. Cortical regions of interest (ROIs) were identified for faces (contrasts: faces > phase-scrambled faces and faces > houses; ROIs: fusiform face area, occipital face area, superior temporal sulcus) and for English and Arabic words (contrast: words > phase-scrambled words; ROI: visual word form area). BOLD activation and number of voxels were measured in each ROI. Behaviour was measured outside the scanner in four tasks involving stimuli presented in the left or right visual field (lexical decision, same-different word discrimination, 10AFC face identification, chimeric face identity), with eye movements monitored throughout. We found effects of group on cortical and behavioural laterality for both words and faces. All groups showed cortical left-hemispheric dominance for words in the habitually read script, but this effect was strongest for English readers, and only English readers showed better word recognition in the right than in the left visual field. Likewise, cortical right-hemispheric dominance for faces was strongest in English readers, intermediate in bilinguals, and virtually absent in Arabic readers. These effects were paralleled in chimeric face judgements. Thus, reading experience or the properties of the habitually read script alter the symmetry of neural representations for words and faces.
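Hemispheric dominance of this kind is commonly summarized with a laterality index, and visual-field differences with a simple right-minus-left score. The abstract does not give its formulas, so the sketch below assumes the conventional (L - R) / (L + R) index applied to non-negative measures such as voxel counts in homologous left and right ROIs; the function names are hypothetical.

```python
def laterality_index(left: float, right: float) -> float:
    """Conventional (L - R) / (L + R): +1 = fully left-lateralized, -1 = fully right.
    Assumes non-negative inputs, e.g. voxel counts in homologous ROIs."""
    if left < 0 or right < 0:
        raise ValueError("inputs must be non-negative")
    total = left + right
    if total == 0:
        raise ValueError("left + right must be non-zero")
    return (left - right) / total

def visual_field_advantage(acc_right_vf: float, acc_left_vf: float) -> float:
    """Right-minus-left visual field accuracy; > 0 indicates a right visual field
    (left hemisphere) advantage, as expected for words in a left-lateralized reader."""
    return acc_right_vf - acc_left_vf
```

For example, 30 voxels in a left-hemisphere word ROI and 10 in its right homologue give an index of 0.5 (left dominance), while the reverse gives -0.5; these numbers are illustrative, not data from the study.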

Acknowledgements: This work was funded by a University Research Board grant from the American University of Beirut

Talk 6, 9:30 am

Optogenetic stimulation in macaque V4 cortex induces robust detectable visual events

Rosa Lafer-Sousa1, Lilly Kelemen1, Reza Azadi1, Elia Shahbazi1, Arash Afraz1; 1NIMH

Understanding the nature of the perceptual events evoked by neural perturbations is essential for bridging the causal gap between neuronal activity and vision as a behavior. Here we assess the behavioral detectability of optogenetic stimulation in monkey V4 cortex. Two macaque monkeys were chronically implanted with LED arrays over a region of V4 cortex transduced with the depolarizing opsin C1V1. The animals were trained to detect stimulation while fixating on different images. In each trial an image was displayed on the screen for 1 s. On a randomly selected half of trials, a 200 ms optical impulse was delivered halfway through image presentation, and the animal was rewarded for correctly reporting whether the trial contained cortical stimulation. The two animals learned to perform the task significantly above chance within 11 and 7 sessions, respectively (Chi-sq, p-values < 0.01), and improved their performance to 90% and 83% after 27 and 13 more training days (Chi-sq, p-values < 0.001). After the training phase, 20 novel images were used to test whether stimulation detection depends on the choice of onscreen image. The choice of image had a significant effect on stimulation detection (permutation test, p-values < 0.001). Further, this effect varied as a function of cortical location. Taken together, the results suggest that the effect of stimulation is visual in nature and that stimulation of different subregions evokes different perceptual events. Next, we asked whether the stimulation-evoked events are additive in nature by varying the visibility of the onscreen images. Unlike in inferotemporal cortex, reducing the visibility of the onscreen images did not systematically reduce stimulation detection. These results suggest that the events evoked by stimulation in V4 are additive. The findings show for the first time that optogenetic stimulation of V4 cortex induces robust, detectable visual events, opening the door to systematic causal studies of V4 with optogenetic methods.
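The image-dependence test can be framed as a permutation test: if detection did not depend on the onscreen image, shuffling image labels across trials should produce per-image accuracies as variable as the observed ones. The abstract does not specify the statistic or the shuffling scheme, so the sketch below assumes the across-image variance of detection accuracy as the statistic and shuffles labels over all trials; the function names are hypothetical.

```python
import numpy as np

def image_effect_statistic(correct: np.ndarray, image_ids: np.ndarray) -> float:
    """Variance across images of per-image detection accuracy (assumed statistic)."""
    images = np.unique(image_ids)
    return float(np.var([correct[image_ids == im].mean() for im in images]))

def image_permutation_test(correct: np.ndarray, image_ids: np.ndarray,
                           n_perm: int = 10000, seed: int = 0) -> tuple[float, float]:
    """Returns (observed statistic, one-sided permutation p-value).
    correct: per-trial 1/0 outcomes; image_ids: onscreen image shown on each trial."""
    rng = np.random.default_rng(seed)
    observed = image_effect_statistic(correct, image_ids)
    null = np.array([image_effect_statistic(correct, rng.permutation(image_ids))
                     for _ in range(n_perm)])
    # Add-one correction keeps the p-value strictly positive.
    p = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, float(p)
```

Running the same test separately for each LED site would be one way to probe whether the image effect varies with cortical location.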

Acknowledgements: NIMH Intramural Research Training Award (IRTA) Fellowship Program; NIMH Grant ZIAMH002958