Object Recognition: Artificial neural networks, models

Talk Session: Sunday, May 21, 2023, 2:30 – 4:15 pm, Talk Room 1
Moderator: Marieke Mur, Western University

Talk 1, 2:30 pm, 34.11

A large and rich EEG dataset for modeling human visual object recognition

Alessandro T. Gifford1, Kshitij Dwivedi2, Gemma Roig2, Radoslaw M. Cichy1; 1Freie Universität Berlin, 2Goethe Universität Frankfurt am Main

The human brain achieves visual object recognition through multiple stages of transformations operating at a millisecond scale. To predict and explain these rapid transformations, computational neuroscientists employ machine learning modeling techniques. However, state-of-the-art models require massive amounts of data to train properly, and to date there is a lack of large brain datasets that extensively sample the temporal dynamics of visual object recognition. Here we collected a large, millisecond-resolution electroencephalography (EEG) dataset of human brain responses to images of objects on a natural background from the THINGS database. We used a time-efficient rapid serial visual presentation paradigm to extensively sample 10 participants, each with 16,740 image conditions repeated over 82,160 trials. We then leveraged the unprecedented size and richness of our dataset to train and evaluate deep neural network (DNN) based encoding models. The results showcase the quality of the dataset and its potential for computational modeling in five ways. First, we trained linearizing encoding models that successfully synthesized the EEG responses to arbitrary images. Second, we correctly identified the image conditions of the recorded EEG data in a zero-shot fashion, using synthesized EEG responses to hundreds of thousands of candidate image conditions. Third, we show that both the number of conditions and the number of trial repetitions in the EEG dataset contribute to the trained models’ prediction accuracy. Fourth, we built encoding models whose predictions generalize well to novel participants. Fifth, we demonstrate full end-to-end training of randomly initialized DNNs that output EEG responses for arbitrary input images. We release the dataset as a tool to foster research in computational neuroscience and computer vision. 
We believe it will be of great use to further understanding of visual object recognition through the development of high-temporal resolution computational models of the visual brain, and to optimize artificial intelligence models through biological intelligence data.
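The linearizing encoding and zero-shot identification analyses described above can be sketched as follows. This is an illustrative toy example only: all data are synthetic, the dimensions are arbitrary, and the real analysis maps DNN image features to recorded EEG responses rather than to simulated ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical sizes; the real dataset uses DNN
# features of THINGS images and multi-channel EEG from 10 participants).
n_train, n_test, n_feat, n_chan = 200, 50, 64, 17
W_true = rng.normal(size=(n_feat, n_chan))
X_train = rng.normal(size=(n_train, n_feat))        # DNN image features
X_test = rng.normal(size=(n_test, n_feat))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_chan))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, n_chan))

# Linearizing encoding model: ridge regression from features to EEG.
lam = 1.0
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(n_feat),
                    X_train.T @ Y_train)
Y_pred = X_test @ W                                 # synthesized EEG responses

# Zero-shot identification: each recorded response is assigned to the
# candidate image whose synthesized response correlates with it most.
def corr(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

sim = np.array([[corr(Y_test[i], Y_pred[j]) for j in range(n_test)]
                for i in range(n_test)])
accuracy = float((sim.argmax(axis=1) == np.arange(n_test)).mean())
```

With low noise and many features, identification accuracy on this toy problem is near perfect; the abstract's analysis scales the same logic to hundreds of thousands of candidate images.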

Acknowledgements: A.T.G. is supported by a PhD fellowship of the Einstein Center for Neurosciences. R.M.C. is supported by German Research Council (DFG) grants (CI 241/1-1, CI 241/3-1, CI 241/1-7) and the European Research Council (ERC) starting grant (ERC-StG-2018-803370).

Talk 2, 2:45 pm, 34.12

Torchlens: A Python package for extracting and visualizing all hidden layer activations from arbitrary PyTorch models with minimal code

JohnMark Taylor1, Nikolaus Kriegeskorte1; 1Columbia University

Deep neural networks (DNNs) remain the dominant AI models for many visual tasks, as well as the leading models of biological vision, making it crucial to better understand the internal representations and operations undergirding their successes and failures, and to carefully compare these processing stages to those found in the brain. PyTorch has emerged as the leading framework for building DNN models; it would thus be highly desirable to have a method for easily and exhaustively extracting and characterizing the results of the internal operations of any arbitrary PyTorch model. Here we introduce Torchlens, a new open-source Python package for extracting and characterizing hidden layer activations from PyTorch models. Uniquely among existing approaches to this task, Torchlens has the following features: 1) it exhaustively extracts the results of all intermediate operations, not just those associated with PyTorch module objects, yielding a full record of every step in the model's computational graph; 2) in addition to logging the outputs of each operation, it encodes metadata about each computational step in a model's forward pass, both facilitating further analysis and enabling an automatic, intuitive visualization (in rolled or unrolled format) of the model's complete computational graph; 3) it contains a built-in validation procedure to algorithmically verify the accuracy of all saved hidden layer activations; and 4) its approach can be applied automatically to any arbitrary PyTorch model with no modifications, including models with conditional (if-then) logic in their forward pass, recurrent models, branching models, and models with internally generated tensors (e.g., those that add random noise). 
Furthermore, Torchlens requires minimal user-facing code, making it easy to incorporate into existing pipelines for model development and analysis, use as a pedagogical aid when teaching deep learning concepts, and more broadly, accelerate the process of understanding the internal operating principles of DNNs trained on visual tasks.
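The operation-logging idea described above can be illustrated with a toy tracer: a tracked value records every elementary operation applied to it during a forward pass, yielding a full log of the computational graph with per-step metadata. This is a didactic sketch only, not Torchlens's actual implementation or API.

```python
# Global log of every operation executed during the "forward pass".
log = []

class Tracked:
    """A value that records each operation applied to it (toy sketch)."""
    def __init__(self, value, label="input"):
        self.value = value
        self.label = label

    def _record(self, op, other_label, result):
        step = len(log)
        out = Tracked(result, f"{op}_{step}")
        log.append({"step": step, "op": op,
                    "inputs": [self.label, other_label],
                    "output": out.label, "value": result})
        return out

    def __add__(self, other):
        v = other.value if isinstance(other, Tracked) else other
        lbl = other.label if isinstance(other, Tracked) else repr(other)
        return self._record("add", lbl, self.value + v)

    def __mul__(self, other):
        v = other.value if isinstance(other, Tracked) else other
        lbl = other.label if isinstance(other, Tracked) else repr(other)
        return self._record("mul", lbl, self.value * v)

def forward(x):
    # A tiny "model": y = (x * 2 + 1) * x
    h = x * 2
    h = h + 1
    return h * x

y = forward(Tracked(3.0, "x"))
# log now holds one entry per intermediate operation, with metadata,
# including steps that belong to no named module.
```

Torchlens applies the same principle transparently to real PyTorch tensors, so every intermediate result, not just module outputs, is captured and validated.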

Acknowledgements: This work was supported by the National Eye Institute of the NIH under award number 1F32EY033654.

Talk 3, 3:00 pm, 34.13

Invariant object recognition in deep neural networks: impact of visual diet and learning goals

Haider Al-Tahan1,2, Farzad Shayanfar1, Ehsan Tousi1, Marieke Mur1; 1Western University, 2Meta AI

Invariant object recognition is a hallmark of human vision. Humans recognize objects across a wide range of rotations, positions, and scales. A good model of human object recognition should, like humans, be able to generalize across real-world object transformations. Deep neural networks are currently the most popular computational models of the human ventral visual stream. Prior studies reported that these models show signatures of invariant object recognition but gave mixed results on how closely the models match human performance. Inconsistencies across studies in the ability of deep neural networks to recognize objects across transformations may be due to differences in the tested model architectures or training regimes. Here we test object recognition performance for different families of pretrained feedforward deep neural networks across object rotation, position, and scale. We included 95 models and defined model families based on three dimensions: model architecture, visual diet, and learning objective. Along the model architecture dimension, we tested convolutional neural networks and vision transformers. For each architecture, we tested models trained on relatively poor and relatively rich visual diets, ranging from 1.2 to 14 million training images, and models trained with supervised and unsupervised learning objectives. We created test images using ThreeDWorld, a 3D virtual world simulation platform that includes 583 3D objects from 58 ImageNet categories. We found that all tested model families show a drop in object recognition performance after applying object transformations, with the lowest performance for object rotation and scale. Model architecture did not noticeably affect model performance, but models trained with rich visual diets and unsupervised generative learning objectives outperformed the other model families in our set. 
Our results suggest that, while different models agree on which object transformations are most challenging, visual diet and learning goals affect their ability to match human performance at invariant object recognition.
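The evaluation logic above, measuring the drop in recognition accuracy per transformation relative to the untransformed view, can be sketched as follows. All numbers and the `classify` stand-in are hypothetical; the real study runs 95 pretrained models on rendered ThreeDWorld images.

```python
import numpy as np

rng = np.random.default_rng(1)
transforms = ["none", "rotation", "position", "scale"]

def classify(true_label, transform):
    # Dummy model: correct with a transform-dependent probability
    # (illustrative values only, not the study's results).
    p = {"none": 0.95, "rotation": 0.55,
         "position": 0.80, "scale": 0.60}[transform]
    return true_label if rng.random() < p else (true_label + 1) % 58

# Evaluate accuracy per transformation over 58 object categories.
labels = rng.integers(0, 58, size=2000)
acc = {t: float(np.mean([classify(l, t) == l for l in labels]))
       for t in transforms}

# Invariance summary: accuracy drop relative to the untransformed view.
drop = {t: acc["none"] - acc[t] for t in transforms if t != "none"}
```

Comparing such drop scores across model families (architecture × visual diet × learning objective) yields the family-level comparisons reported in the abstract.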

Acknowledgements: We thank Jeremy Schwartz and Seth Alter for assisting in resolving various issues with regard to ThreeDWorld.

Talk 4, 3:15 pm, 34.14

Local texture manipulation further illuminates the intrinsic difference between CNNs and human vision

Alish Dipani1, Huaizu Jiang2, MiYoung Kwon1; 1Department of Psychology, Northeastern University, Boston, MA, 2Khoury College of Computer Sciences, Northeastern University, Boston, MA

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on a wide range of visual tasks and currently provide the best computational models of visual processing in the primate brain. However, CNNs are strongly biased towards textures rather than shapes, which may harm object recognition. This is rather surprising, as human vision benefits from textures and shapes as complementary and independent cues for recognizing objects (e.g., humans can reliably recognize an apple by either its texture or shape alone). Given this stark contrast, it is important to understand how CNNs use the two cues, object texture and shape, for object recognition. Here we address this very question. To this end, we compare multi-label image-classification accuracy when models are trained on either original (intact), object (local), or scene (global) texture-manipulated datasets. We then evaluate the models’ ability to generalize to unseen datasets. We tested CNNs and Transformers, the latter known to have a strong shape bias. A psychophysical experiment was also conducted to evaluate human performance. We employed images from the COCO dataset containing natural scenes with multiple objects. Local textures were manipulated by replacing each object's texture with a random, artificial texture chosen from the DTD dataset. Global textures were manipulated using image style transfer with a random texture. We found noticeable differences in the models’ ability to generalize to untrained datasets. Specifically, both CNNs and Transformers trained on the original dataset show a sharp decrease in accuracy when tested on texture-manipulated datasets. However, CNNs, but not Transformers, trained on local texture-manipulated datasets perform well on both the original and global texture-manipulated datasets. As expected, human observers show difficulty recognizing local texture-manipulated images. 
Our findings suggest that, unlike humans, CNNs do not use texture and shape independently; instead, they appear to use texture to define object shape itself.
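The local texture manipulation described above can be sketched in a few lines: pixels inside an object's segmentation mask are swapped for a random artificial texture while the rest of the scene is left untouched. The arrays below are synthetic stand-ins (the study uses COCO images and DTD textures).

```python
import numpy as np

rng = np.random.default_rng(2)

image = rng.uniform(size=(32, 32, 3))      # stand-in natural scene (H, W, RGB)
mask = np.zeros((32, 32), dtype=bool)      # stand-in object segmentation mask
mask[8:24, 8:24] = True

texture = rng.uniform(size=(32, 32, 3))    # stand-in random artificial texture
manipulated = image.copy()
manipulated[mask] = texture[mask]          # local (object) texture replacement
# The background is unchanged; only the object's texture is replaced,
# leaving its shape (the mask outline) intact.
```

The global (scene) manipulation instead restyles the entire image via style transfer, so the two conditions dissociate object-level from scene-level texture cues.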

Acknowledgements: This work was supported by NIH/NEI Grant R01EY027857, Northeastern University Tier-1 Seed grant, and Research to Prevent Blindness (RPB)/Lions Clubs International Foundation (LICF) low vision research award.

Talk 5, 3:30 pm, 34.15

TALK CANCELLED: Evaluating the influence of ML models on human judgment of non-physical attributes in images

Shruthi Sukumar1,2, Vijay Veerabadran3, Jascha Sohl-Dickstein1, Michael Mozer1, Gamaleldin Elsayed1; 1Google Research, Brain Team, 2University of Colorado Boulder, 3University of California San Diego

Recent breakthroughs in computer vision have given us generative models that can create visual stimuli from natural language descriptions or prompts (e.g., ‘photorealistic image of a blue alligator riding a small tricycle’). These models are often trained on datasets with multiple modalities, such as vision and language, and learn from human-labeled data that includes not only physical attributes (e.g., object classes) but also human judgments of non-physical traits (such as pleasantness). Here, we investigate whether these models can generate subtleties in images that do not have simple physical manifestations, including affective state, in a way that is detectable by humans. We hypothesized that these models would successfully encode such non-physical attributes and elicit human judgments in agreement with them. To this end, we first used a text-to-image generative model to generate an image corresponding to a particular attribute, such as ‘pleasant’ or ‘calm’, embedded within the natural language text prompt. Next, we asked human participants to rate the degree to which each generated image reflects an attribute, such as pleasantness, on a 5-point Likert scale (very unpleasant to very pleasant). We then assessed the degree of alignment between participants’ ratings and the attribute used to generate the corresponding images. Our findings show significant alignment between the attributes used to generate images and participants’ judgments of the image attributes (p<0.001), indicating that machine learning models can indeed elicit the intended judgments of non-physical attributes in visual stimuli. This work illustrates that machine learning models are capable of representing subtleties in visual stimuli that reliably influence human perception of non-physical attributes. 
Our experimental protocol was granted an Institutional Review Board (IRB) exemption by an external, independent ethics board (Advarra).
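One simple way to test the alignment described above is to compare Likert ratings between images generated from opposing attribute prompts with a permutation test. The ratings below are simulated for illustration only; they are not the study's data, and the study's exact statistical procedure is not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated 5-point Likert ratings for images generated from
# "pleasant" vs. "unpleasant" prompts (hypothetical distributions).
pleasant = np.clip(np.round(rng.normal(4.0, 0.8, size=100)), 1, 5)
unpleasant = np.clip(np.round(rng.normal(2.0, 0.8, size=100)), 1, 5)

diff = pleasant.mean() - unpleasant.mean()

# Permutation test: shuffle group labels and recompute the mean difference.
pooled = np.concatenate([pleasant, unpleasant])
null = []
for _ in range(2000):
    rng.shuffle(pooled)
    null.append(pooled[:100].mean() - pooled[100:].mean())
p_value = float(np.mean(np.abs(null) >= abs(diff)))
```

A large mean difference with a small p-value would indicate that participants' ratings align with the attribute embedded in the generation prompt.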

Acknowledgements: Pieter Kinder, Alex Alemi, Ben Poole, Been Kim, Isabelle Guyon, Paul Nicholas, Patrick Gage Kelley, Bhavna Daryani

Talk 6, 3:45 pm, 34.16

Reconstructing visual experience from a large-scale biologically realistic model of mouse primary visual cortex

Reza Abbasi-Asl1, Yizhou Chi1, Huibo Yang1, Kael Dai2, Anton Arkhipov2; 1UCSF, 2Allen Institute for Brain Sciences

Decoding visual stimuli from large-scale recordings of neurons in the visual cortex is key to understanding visual processing in the brain and could lay the groundwork for a successful brain-computer interface. Data-driven development of a comprehensive decoder requires simultaneous measurements from hundreds of thousands of neurons in response to a large number of image stimuli. Measuring this amount of simultaneous neural data at high temporal frequency is extremely challenging given the current state of neural recording technologies. Here, we leverage a large-scale, biologically realistic model of the visual cortex to investigate neural responses and reconstruct visual experience. We utilized a biophysical model of the mouse primary visual cortex (V1) consisting of 230,000 neurons of 17 different cell types. Using this model, we simulated simultaneous neural responses to 80,000 natural images. We then developed a computational framework to reconstruct the visual stimuli with plausible geometric information and semantic details. Our framework is based on a conditional generative adversarial structure that learns a self-supervised representation of the mouse V1 neuronal responses, with a generative model that reconstructs the stimulus images from the latent space of the model. To build this latent space, we trained a decoder to differentiate whether the representation of the V1 neuronal responses matches the stimulus images; meanwhile, a constantly evolving generator learns to reconstruct geometrically interpretable images. Our framework generates stimulus images with high reconstruction accuracy and could eventually be tested on real neuronal responses from the mouse visual cortex.
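As a much simpler stand-in for the decoding idea above, the mapping from simulated neural responses back to stimulus pixels can be learned with ridge regression. The study itself uses a conditional adversarial framework over a learned latent space; this sketch, with synthetic data and arbitrary dimensions, only shows the baseline "responses to pixels" regression logic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-ins: flattened stimuli and simulated V1 responses.
n_img, n_pix, n_neur = 500, 64, 230
stimuli = rng.normal(size=(n_img, n_pix))
encoding = rng.normal(size=(n_pix, n_neur))            # toy stimulus->response map
responses = stimuli @ encoding + 0.1 * rng.normal(size=(n_img, n_neur))

# Ridge-regression decoder: responses -> reconstructed pixels.
lam = 1.0
D = np.linalg.solve(responses.T @ responses + lam * np.eye(n_neur),
                    responses.T @ stimuli)
recon = responses @ D

# Reconstruction quality as the mean per-image pixel correlation.
def corr(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

score = float(np.mean([corr(recon[i], stimuli[i]) for i in range(n_img)]))
```

The generative adversarial framework in the abstract replaces this linear map with a learned latent representation and a generator, which is what recovers plausible geometry and semantic detail rather than just pixel-level fits.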

Acknowledgements: Reza Abbasi-Asl was supported by the National Institute of Mental Health of the National Institutes of Health under award number RF1MH128672.

Talk 7, 4:00 pm, 34.17

Can deep convolutional networks explain the semantic structure that humans see in photographs?

Siddharth Suresh1,2, Kushin Mukherjee1,2, Timothy T. Rogers1,2; 1University of Wisconsin-Madison, Department of Psychology, 2McPherson Eye Research Institute

Deep convolutional networks (DCNs) have been proposed as useful models of the ventral visual processing stream. This study evaluates whether such models can capture the rich semantic similarities that people discern among photographs of familiar objects. We first created a new dataset that merges representative images of everyday concepts (taken from Ecoset) with the large semantic feature set collected by the Leuven group. The resulting set includes ~300,000 images depicting items in 86 different semantic categories, including 46 animate items (reptiles, insects, and mammals) and 40 inanimate items (vehicles, instruments, tools, and kitchen items). Each category is also associated with values on a set of ~2,000 semantic features generated by human raters in a prior study. We then trained two variants of the AlexNet architecture on these items: one that learned to activate just the corresponding category label, and a second that learned to generate all of an item’s semantic features. Finally, we evaluated how accurately the learned representations in each model could predict human decisions in a triplet-judgment task conducted using photographs from the training set. Both models predicted some human triplet judgments better than chance, but the model trained to output semantic feature vectors performed better and captured more levels of semantic similarity. Neither model, however, performed as well as an embedding computed directly from the semantic feature norms themselves. The results suggest that deep convolutional image classifiers alone do a poor job of capturing the semantic similarity structure that drives human judgments, but that alterations in the training task, in particular training on output vectors that express richer semantic structure, can largely overcome this limitation.
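The triplet-judgment evaluation above can be sketched as an odd-one-out rule over item embeddings: the model "chooses" the item least similar to the other two, and those choices are then compared against human decisions. The embeddings below are toy vectors; the study derives them from learned network representations or from the semantic feature norms.

```python
import numpy as np

def odd_one_out(E):
    """E: (3, d) array of item embeddings; returns the index of the odd item."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = E @ E.T                       # pairwise cosine similarities
    # The most similar pair stays together; the remaining item is odd.
    pairs = [(0, 1), (0, 2), (1, 2)]
    i, j = max(pairs, key=lambda p: sim[p])
    return ({0, 1, 2} - {i, j}).pop()

# Toy example: items 0 and 1 are near-duplicates, item 2 is distinct.
E = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
odd = odd_one_out(E)
```

Agreement between such model choices and human choices, accumulated over many triplets, gives the better-than-chance prediction rates reported in the abstract.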