Artificial neural networks and vision

Talk Session: Saturday, May 20, 2023, 2:30 – 4:15 pm, Talk Room 2
Moderator: Frank Tong, Vanderbilt University

Talk 1, 2:30 pm, 24.21

Harmonizing the visual strategies of image-computable models with humans yields more performant and interpretable models of primate visual system function

Ivan Felipe Rodriguez1, Drew Linsley1,4, Jay Gopal1, Thomas Fel1,2, Michael J. Arcaro3, Saloni Sharma3, Margaret Livingstone3, Thomas Serre1,4; 1Brown University, 2Artificial and Natural Intelligence Toulouse Institute, 3Harvard University, 4Carney Institute for Brain Science

Over the past decade, deep neural networks (DNNs) have been the standard paradigm for modeling biological brains and behavior. While initial reports suggested that the ability of DNNs to model biology correlated with their object classification accuracy (Yamins et al., 2014), this no longer appears to be the case: image-evoked activity in a self-supervised ResNet50, an architecture introduced seven years ago, has the highest correlation with IT recordings. We recently discovered that DNNs are also becoming progressively less aligned with human perception as their object classification accuracy has increased. This problem, however, can be resolved through “neural harmonization”: a drop-in training module for DNNs that forces their learned visual strategies to be consistent with those of humans (Fel et al., 2022). DNNs that are trained for object classification and harmonized with behavioral data describing human visual strategies for the same task are more interpretable, performant, and accurate at predicting human behavior. Here, we investigated whether harmonizing DNNs with human behavioral data could also yield better models of the primate visual system. To test this, we turned to recordings of primate IT while animals viewed complex natural images (Arcaro et al., 2020). These experiments produced spatially resolved activity maps, which illustrate how neurons respond to every part of an image, thus revealing which features drove neural responses. After fitting a variety of state-of-the-art DNNs trained for object classification to these data, ranging from convolutional neural networks to vision transformers, we discovered that harmonizing these models with human visual strategies significantly improved their predictions of IT neural activity and reproduced qualitative features of neurons’ spatial activity maps that unharmonized models did not.
Our findings demonstrate the importance of large-scale human behavioral and psychophysics data for generating more accurate and interpretable models of brain and behavior.
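The harmonization objective described above, a classification loss plus a penalty that pulls the model's feature-importance maps toward human-derived ones, can be pictured in a few lines. The NumPy sketch below is a toy illustration of that general idea, not the Fel et al. (2022) implementation; the function names, the single-example cross-entropy, and the squared-error alignment term are all simplifying assumptions.

```python
import numpy as np

def _normalize(m):
    # Compare only the spatial pattern of an importance map,
    # not its overall magnitude: zero mean, unit norm.
    m = m - m.mean()
    n = np.linalg.norm(m)
    return m / n if n > 0 else m

def harmonization_loss(logits, label, model_saliency, human_saliency, lam=1.0):
    # Toy harmonized objective (hypothetical form): cross-entropy on one
    # example plus a squared-error penalty on the mismatch between the
    # model's and a human-derived feature-importance map.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[label]
    align = np.sum((_normalize(model_saliency) - _normalize(human_saliency)) ** 2)
    return ce + lam * align

rng = np.random.default_rng(0)
sal = rng.random((8, 8))
logits = np.array([2.0, 0.5, -1.0])
aligned = harmonization_loss(logits, 0, sal, sal)           # identical maps: no penalty
misaligned = harmonization_loss(logits, 0, sal, rng.random((8, 8)))
```

When the model's map matches the human map, only the classification term remains; any mismatch adds a penalty, so gradient descent on this objective pushes the model toward human visual strategies.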

Acknowledgements: ONR (N00014-19-1-2029), NSF (IIS-1912280 and EAR-1925481), DARPA (D19AC00015), NIH/NINDS (R21 NS 112743), and the ANR-3IA ANITI (ANR-19-PI3A-0004). Carney Institute for Brain Science and the Center for Computation and Visualization (CCV). Google TFRC program. NIH S10OD025181.

Talk 2, 2:45 pm, 24.22

Canonical Dimensions of Neural Visual Representation

Zirui Chen1, Michael Bonner1; 1Johns Hopkins University

What key factors of deep neural networks (DNNs) account for their representational similarity to visual cortex? Many properties that neuroscientists proposed to be critical, such as architecture or training task, have turned out to have surprisingly little explanatory power. Instead, there appears to be a high degree of “degeneracy,” as many DNNs with distinct designs yield equally good models of visual cortex. Here, we suggest that a more global perspective is needed to understand the relationship between DNNs and the brain. We reasoned that the most essential visual representations are general-purpose and thus naturally emerge from systems with diverse architectures or neuroanatomies. This leads to a specific hypothesis: it should be possible to identify a set of canonical dimensions, extensively learned by many DNNs, that best explain cortical visual representations. To test this hypothesis, we developed a novel metric, called canonical strength, that quantifies the degree to which a representational feature in a DNN can be observed in the latent space of many other DNNs with varied construction. We computed this metric for every principal component (PC) from a large and diverse population of trained DNN layers. Our analysis showed a strong positive association between a dimension’s canonical strength and its representational similarity to both human and macaque visual cortices. Furthermore, we found that the representational similarity between visual cortex and the PCs of a DNN layer, or any set of orthogonal DNN dimensions, is well predicted by the simple summation of their canonical strengths. These results support our theory that canonical visual representations extensively emerge across brains and machines – suggesting that “degeneracy” is, in fact, a signature of broadly useful visual representations.
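One way to picture the proposed metric: a dimension's canonical strength should be high when the feature spaces of many other networks can linearly reconstruct its image-by-image score profile. The sketch below, with synthetic data and a mean-R² definition, is an illustrative assumption about how such a metric could be computed, not the authors' exact formulation.

```python
import numpy as np

def canonical_strength(dim_scores, other_feature_sets):
    # Hypothetical canonical-strength metric: mean R^2 with which a
    # dimension's image-by-image scores can be linearly reconstructed
    # from the (centered) feature spaces of other networks.
    y = dim_scores - dim_scores.mean()
    total = np.sum(y ** 2)
    r2s = []
    for feats in other_feature_sets:              # each: images x features
        X = feats - feats.mean(axis=0)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2s.append(1.0 - np.sum(resid ** 2) / total)
    return float(np.mean(r2s))

rng = np.random.default_rng(1)
shared = rng.standard_normal(200)                 # a dimension many nets learn
nets = []
for _ in range(3):
    feats = rng.standard_normal((200, 6))         # one net's latent features
    feats[:, 0] = shared + 0.1 * rng.standard_normal(200)  # noisy copy of it
    nets.append(feats)

strong = canonical_strength(shared, nets)                      # widely shared
idiosyncratic = canonical_strength(rng.standard_normal(200), nets)
```

A dimension present (up to noise) in every network scores near 1, while an idiosyncratic dimension scores near the chance level set by the regression's degrees of freedom.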

Talk 3, 3:00 pm, 24.23

Net2Brain: A Toolbox to compare artificial vision models with human brain responses

Domenic Bersch1, Kshitij Dwivedi1, Martina Vilas1,2, Radoslaw Martin Cichy3,4,5, Gemma Roig1; 1Johann Wolfgang Goethe-Universität Frankfurt, 2Ernst Struengmann Institute for Neuroscience, 3Department of Education and Psychology, Freie Universität Berlin, 4Berlin School of Mind and Brain, Faculty of Philosophy, 5Bernstein Center for Computational Neuroscience Berlin

Several studies have demonstrated the potential of deep neural networks (DNNs) to serve as state-of-the-art computational models of the primate visual cortex. In the last decade, different implementations of DNNs (varying, for example, their architecture, objective function, or training algorithm) have been compared to uncover the computational principles, algorithms, and neurobiological mechanisms behind visual processing (Cadieu et al. 2014; Khaligh-Razavi and Kriegeskorte 2014; Yamins et al. 2014; Guclu and Gerven 2015; Cichy, Khosla, et al. 2016). To promote this line of research, new benchmarks, datasets, and challenges relevant to cognitive neuroscience experiments have been developed (Cichy, Roig, Alex Andonian, et al. 2019; Cichy, Roig, and Oliva 2019; Cichy, Kshitij Dwivedi, et al. 2021; Schrimpf et al. 2018; Nili et al. 2014). Some existing toolboxes already facilitate the extraction of model activations, but they mainly focus on supervised image classification models (Muttenthaler and Hebart 2021). However, studies have shown that DNNs trained for different tasks could provide new information about the visual cortex (Tang, LeBel, and Huth 2021; Dwivedi et al. 2021). We therefore introduce Net2Brain, a toolbox for mapping model representations to human brain data. Net2Brain allows the extraction of activations over image and video datasets from any inserted custom model or any of the 600+ included DNNs trained for various visual tasks (e.g., semantic segmentation, depth estimation, action recognition), including multimodal models. In contrast to other toolboxes, Net2Brain handles all steps from feature extraction to analysis through a simple pipeline. It computes the representational dissimilarity matrices (RDMs) over the activations and compares them to brain recordings using representational similarity analysis (RSA) and weighted RSA, with both ROI-based and searchlight analyses.
Net2Brain is open source and comes with brain data for immediate testing, and it is also straightforward to use your own recorded data.
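The core RSA pipeline that Net2Brain automates, correlation-distance RDMs compared via Spearman correlation of their upper triangles, can be illustrated with plain NumPy. This is a generic sketch of the analysis, not Net2Brain's API; the function names and the simple untied ranking are assumptions.

```python
import numpy as np

def rdm(activations):
    # Representational dissimilarity matrix: 1 - Pearson correlation
    # between activation patterns (rows = stimuli).
    return 1.0 - np.corrcoef(activations)

def _ranks(x):
    # Simple ranking (no tie handling needed for continuous data).
    return np.argsort(np.argsort(x)).astype(float)

def rsa(rdm_a, rdm_b):
    # RSA score: Spearman correlation between the RDMs' upper triangles.
    iu = np.triu_indices_from(rdm_a, k=1)
    return float(np.corrcoef(_ranks(rdm_a[iu]), _ranks(rdm_b[iu]))[0, 1])

rng = np.random.default_rng(2)
model_acts = rng.standard_normal((10, 50))     # 10 stimuli x 50 model units
brain_acts = model_acts + 0.1 * rng.standard_normal((10, 50))  # noisy "brain"
score_matched = rsa(rdm(model_acts), rdm(brain_acts))
score_random = rsa(rdm(model_acts), rdm(rng.standard_normal((10, 50))))
```

A brain region whose pairwise stimulus geometry tracks the model's yields a high RSA score; unrelated activity yields a score near zero.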

Acknowledgements: This work was funded with support from the Alfons and Gertrud Kassel Foundation (G.R.), by the Hessian Center for AI Germany, by the German Research Foundation (DFG, CI241/1-1, CI241/3-1 to R.M.C.), and by the European Research Council (ERC, 803370 to R.M.C.).

Talk 4, 3:15 pm, 24.24

Unsupervised contrastive learning and supervised classification training have opposite effects on the human-likeness of CNNs during occluded object processing

David Coggan1, Frank Tong1; 1Vanderbilt University

Human observers can readily perceive and recognize visual objects even when occluding stimuli obscure much of the object from view. By contrast, state-of-the-art convolutional neural network (CNN) models of primate vision perform poorly at classifying occluded objects (Coggan and Tong, VSS, 2022). A key difference between biological and artificial visual systems is how they learn from visual examples. CNNs are typically trained using supervised methods to classify images by object category based on labelled data. By contrast, humans learn about objects with a broader range of learning objectives and fewer opportunities for supervised feedback. Here, we asked whether a more naturalistic approach to training CNNs might yield more occlusion-robust models that better predict human neural and behavioural responses to occluded objects. To address this question, we trained an array of CORnet-S model instances with either supervised classification or unsupervised contrastive learning. We also augmented the standard ImageNet dataset by superimposing artificial occluders onto the images. The contrastive learning objective was to produce similar unit activations in the highest layer for differently occluded instances of the same underlying object image. Once training was complete, each model was tested for occlusion robustness and compared to human behavioural and neural responses to occluded objects. For supervised models, we found that training on the occluded dataset led to substantial improvements in classification accuracy for novel occluded objects, relative to the standard dataset. Despite these improvements, occlusion-trained models performed worse at predicting both human behavioural and neural responses to occluded objects, suggesting that these supervised models learned a different type of occlusion-robust mechanism.
By contrast, the layer-wise activity patterns found in the unsupervised, contrastively trained models exhibited stronger occlusion robustness and greater human-likeness than any other model, suggesting that human robustness to occlusion may be attributable in part to a natural, unsupervised visual learning environment.
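The contrastive objective described above, making a network's top-layer embeddings of two differently occluded views of the same image agree, is commonly formalized as an InfoNCE-style loss. The NumPy sketch below illustrates that general form with synthetic embeddings; the temperature value and batch construction are illustrative assumptions, not the authors' training setup.

```python
import numpy as np

def info_nce(emb_a, emb_b, temperature=0.1):
    # InfoNCE-style loss: row i of emb_a should match row i of emb_b
    # (two occluded views of the same image) better than any other row.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(3)
base = rng.standard_normal((8, 16))                   # 8 images x 16-d embeddings
view_a = base + 0.05 * rng.standard_normal((8, 16))   # occluded view 1
view_b = base + 0.05 * rng.standard_normal((8, 16))   # occluded view 2
loss_matched = info_nce(view_a, view_b)
loss_mismatched = info_nce(view_a, rng.standard_normal((8, 16)))
```

Minimizing this loss makes the embedding invariant to which occluder happens to cover the object, without ever using category labels.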

Acknowledgements: This research was supported by grants from the National Institutes of Health R01-EY029278 (to F.T.) and P30-EY008126 to the Vanderbilt Vision Research Center (Director Dr. Calkins).

Talk 5, 3:30 pm, 24.25

Deep learning classifiers match human accuracies but not the quirks

Joseph MacInnes1, Natalia Zhozhikashvili, Kirill Koretaev2, Matteo Feurra3; 1Swansea University, 2Purple Gaze, 3HSE University

Deep learning convolutional neural networks (CNNs) have shown impressive results on many computer vision tasks. They have also performed well at modelling human vision, leading some to suggest that they are inherently good models of human visual processing. Since CNNs are classifiers, they typically transform problems into classification and excel when results are measured as an accuracy score. Less well studied is a CNN’s ability to model the errors, mistakes, and other incongruities that people exhibit when interpreting their visual world. We tested the ability of CNNs to model human data for cognitive and neural phenomena that highlight peculiarities of human vision: specifically, the ability of a CNN trained on upright faces and houses to model results from the face inversion effect (FIE) and the impact of TMS on face and object recognition. We gathered data from 19 participants performing a matching task for faces or houses. Behavioural conditions included upright and inverted stimuli. TMS conditions included rOFA, rOPA, or Sham. Human accuracy scores showed a typical FIE, and our TMS manipulation reduced the FIE by impairing identification accuracy for upright face pairs (although we did not replicate the expected double dissociation reported by Pitcher et al. (2011) and Dilks et al. (2013)). We trained a series of CNNs on upright faces and houses to match human matching accuracy and further tested them on the same inverted stimuli shown to human participants. While we could easily match human performance on upright faces, none of the networks showed the FIE when tested on inverted stimuli. In fact, the only interaction from a CNN solution was a house inversion effect. We further modified our CNN solutions by perturbing the weights of the mid-network layers to simulate the virtual lesioning of the TMS conditions. Again, the CNN lesioning was not able to match the human TMS results.
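The virtual-lesioning manipulation, perturbing mid-layer weights and measuring how the network's choices change, can be sketched with a toy two-layer network. Everything here (the network size, the noise severity, and the agreement measure) is an illustrative assumption rather than the authors' CNN setup.

```python
import numpy as np

rng = np.random.default_rng(4)

def forward(x, w1, w2):
    # Minimal two-layer network: ReLU hidden layer, linear readout.
    return np.maximum(x @ w1, 0.0) @ w2

def lesion(w, severity, rng):
    # "Virtual lesion": Gaussian noise added to a mid-network weight
    # matrix, scaled to a fraction of the weights' own spread.
    return w + severity * w.std() * rng.standard_normal(w.shape)

w1 = rng.standard_normal((10, 20))            # input -> hidden weights
w2 = rng.standard_normal((20, 2))             # hidden -> 2-way readout
x = rng.standard_normal((100, 10))            # 100 test "stimuli"
choices = forward(x, w1, w2).argmax(axis=1)   # intact network's choices

def agreement(w1_variant):
    # Fraction of stimuli on which a variant matches the intact network.
    return float((forward(x, w1_variant, w2).argmax(axis=1) == choices).mean())

acc_intact = agreement(w1)
acc_lesioned = agreement(lesion(w1, 2.0, rng))
```

Comparing the lesioned network's accuracy drop to the TMS-induced drop in humans is then a matter of running the same test items through both.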

Talk 6, 3:45 pm, 24.26

Noise reduction as a unified mechanism of perceptual learning in both artificial and biological visual systems

Yu-Ang Cheng1,2, Ke Jia8,9,10, Takeo Watanabe2, Sheng Li4,5,6,7, Ru-Yuan Zhang1,3; 1Institute of Psychology and Behavioral Science, Antai College of Economics and Behavioral Sciences, Shanghai Jiao Tong University, Shanghai, China, 2Brown University, Department of Cognitive, Linguistic and Psychological Sciences, RI, USA, 3Shanghai Mental Health Center, School of Medicine, Shanghai Jiao Tong University, Shanghai, China, 4School of Psychological and Cognitive Sciences, Peking University, Beijing, China, 5Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China, 6PKU-IDG/McGovern Institute for Brain Research, Peking University, Beijing, China, 7Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China, 8Department of Neurobiology, Affiliated Mental Health Center & Hangzhou Seventh People's Hospital, 9Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China, 10NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou, China

Although signal enhancement and/or noise reduction have been proposed as key computational mechanisms of visual perceptual learning (VPL), their links to behavioral and neural consequences of VPL remain elusive. To better bridge previous theoretical and empirical findings, we built a deep neural network (DNN) model of VPL. The DNN is a Siamese neural network that inherits the first five convolutional layers from the pretrained AlexNet to emulate the early visual system and appends one linear readout layer to make binary perceptual decisions. We trained it on an orientation discrimination task consisting of Gabor stimuli with varying levels of external noise. After training, the DNN model reproduced several key psychophysical, human imaging, and neurophysiological findings in the VPL literature: (1) training uniformly shifts down the behavioral Threshold vs. Noise functions; (2) training improves stimulus decoding accuracy at the population level in the last four layers; (3) training sharpens the orientation tuning curves of individual neurons in the first two layers and reduces Fano factors and inter-neuron noise correlations in all layers. Furthermore, we used an information-theoretic approach to analyze the two high-dimensional distributions of population responses that correspond to the two Gabor stimuli being discriminated. The results showed that VPL improves population codes primarily by reducing the (co)variance of population responses (i.e., noise reduction) rather than enlarging the Euclidean distance between the two response distributions (i.e., signal enhancement). Most importantly, our model generates novel predictions that VPL systematically warps and rotates the two response distributions in high-dimensional representational spaces. These predictions were supported by the results of a human fMRI experiment on perceptual learning of motion direction discrimination.
Taken together, our DNN model can reproduce a broad range of psychophysical, human imaging, and neurophysiological findings reported in the VPL literature. Systematic analyses of population responses strongly support the noise reduction theory of VPL.
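The signal-versus-noise decomposition at the heart of this analysis can be pictured with two Gaussian response clouds: holding the distance between the class means fixed while shrinking response variability raises a covariance-weighted discriminability index. The sketch below is a toy NumPy illustration of that logic, not the information-theoretic analysis used in the study.

```python
import numpy as np

def discriminability(resp_a, resp_b):
    # Linear discriminability (d'^2-like): squared mean difference
    # weighted by the inverse of the average noise covariance.
    mu = resp_a.mean(axis=0) - resp_b.mean(axis=0)
    cov = 0.5 * (np.cov(resp_a.T) + np.cov(resp_b.T))
    return float(mu @ np.linalg.solve(cov, mu))

rng = np.random.default_rng(5)
n, d = 500, 4
mu_a = np.zeros(d)
mu_b = np.r_[1.0, np.zeros(d - 1)]       # stimuli differ along one axis

# Pre-learning: unit-variance responses around each stimulus mean.
pre_a = rng.standard_normal((n, d)) + mu_a
pre_b = rng.standard_normal((n, d)) + mu_b

# Post-learning "noise reduction": same means, half the variability.
post_a = 0.5 * rng.standard_normal((n, d)) + mu_a
post_b = 0.5 * rng.standard_normal((n, d)) + mu_b

d2_pre = discriminability(pre_a, pre_b)
d2_post = discriminability(post_a, post_b)
```

Here the Euclidean distance between the means (the "signal") is unchanged, yet discriminability improves, which is the signature the abstract attributes to noise reduction.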

Talk 7, 4:00 pm, 24.27

Comparing motion and static feature selectivity between the macaque dorsal and ventral temporal visual cortical body patches

Rajani Raman1,2, Anna Bognár1,2, Ghazaleh Ghamkhari Nejad1,2, Nick Taubert3, Beatrice de Gelder4,5, Martin A Giese3, Rufin Vogels1,2; 1Department of Neuroscience, KU Leuven, Leuven, Belgium, 2Leuven Brain Institute, KU Leuven, Leuven, Belgium, 3HIH&CIN, Department of Cognitive Neurology, University Clinic Tübingen, Tübingen, Germany, 4Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, Netherlands, 5Department of Computer Science, University College London, London, United Kingdom

Previous studies identified body patches in the macaque inferotemporal cortex that were activated more strongly by static images of bodies compared to faces and objects (Vogels, 2022). Recently, we mapped 'dynamic body patches' using twenty 1 s videos of dynamic monkey bodies, 20 dynamic monkey faces, and 20 dynamic objects (Bognar et al., SfN 2022). In this fMRI-guided single-unit study, we investigated the contribution of shape and motion to the neural representations underlying the visual processing of moving bodies. We recorded single neurons, using the same videos and monkey subjects as for the fMRI mapping, in the dorsal (upper bank/fundus of the Superior Temporal Sulcus (STS)) and ventral (ventral bank/ventral to the STS) dynamic body patches in the anterior temporal visual cortex. Most neurons responded more to body videos in both ventral and dorsal patches. Many neurons responded equally well to static frames and the original videos, whereas others responded only to videos, requiring motion. Some cells were also selective for the frame order of a video (video reversal). Estimated optical flow and neural inter-video distances were highly correlated across body videos, indicating that neurons captured the dynamics of body movements. Dorsal patches captured more body motion than ventral patches (p < 0.05, bootstrapping). These regions also tended to respond to facial movements, but not to the movements of objects. Deeper-layer (5-7) AlexNet feature and neural inter-video distances were correlated across body videos in the ventral patches, indicating selectivity for static features of bodies. Interestingly, no significant correlation was found in the dorsal patches. These findings suggest that both the dorsal and ventral body patches in the anterior temporal visual cortex are sensitive to body motion. Unlike that of the ventral body patches, the selectivity of the dorsal patches for moving bodies is not accounted for by AlexNet features.
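The distance-correlation logic used here, comparing inter-video distances in one feature space (e.g., estimated optical flow, or AlexNet layer activations) with inter-video distances in neural response space, can be sketched with synthetic data. The construction below (a noisy linear readout of motion descriptors) is purely an illustrative assumption, not the authors' analysis pipeline.

```python
import numpy as np

def inter_video_distances(features):
    # Pairwise Euclidean distances between per-video feature vectors,
    # flattened to the upper triangle.
    diff = features[:, None, :] - features[None, :, :]
    dmat = np.sqrt((diff ** 2).sum(axis=-1))
    return dmat[np.triu_indices(len(features), k=1)]

rng = np.random.default_rng(6)
scales = np.linspace(0.5, 3.0, 20)[:, None]        # videos vary in motion energy
motion = scales * rng.standard_normal((20, 30))    # optical-flow descriptors
# Neural responses modeled as a noisy linear readout of the motion descriptors.
neural = motion @ rng.standard_normal((30, 12)) + 0.3 * rng.standard_normal((20, 12))

# High correlation between the two distance profiles indicates that the
# neural population tracks the motion structure of the videos.
r = float(np.corrcoef(inter_video_distances(motion),
                      inter_video_distances(neural))[0, 1])
```

A population unrelated to motion would show distance profiles uncorrelated with optical flow, which is the pattern the abstract reports for AlexNet features in the dorsal patches.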

Acknowledgements: This work was supported by ERC 2019-SyG-RELEVANCE-856495 and FWO-G0E0220N.