Object Recognition: Models

Talk Session: Saturday, May 18, 2024, 8:15 – 9:45 am, Talk Room 2
Moderator: Leila Wehbe, Carnegie Mellon University

Talk 1, 8:15 am, 21.21

Building better models of biological vision by searching for more ecological data diets and learning objectives

Drew Linsley1, Akash Nagaraj1, Alekh Ashok1, Francis Lewis1, Peisen Zhou1, Thomas Serre1; 1Brown University

The many successes of deep neural networks (DNNs) over the past decade have been driven by data and computational scale rather than biological insights. However, as DNNs have continued to improve on benchmarks like ImageNet, they have worsened as models of biological brains and behavior. For instance, recent DNNs with human-level object classification accuracy are no better at predicting human perception or image-evoked responses in primate inferotemporal (IT) cortex than DNNs from a decade ago (e.g., Linsley et al., 2023). Here, we build better DNN models of biological vision by finding data diets and objective functions that more closely resemble those that shape biological brains. We began by building a platform for searching through naturalistic data diets and objective functions for training a standardized DNN architecture at scale. Each DNN's data diet was sampled from our rendering engine, which generates life-like videos of objects in real-world scenes. In parallel, each model's objective function was sampled from a parametrized space of image reconstruction objectives, which made it possible to train models to learn combinations of causal and acausal recognition strategies over space, or over space and time. We evaluated the ability of hundreds of DNNs trained on this platform to predict human performance on a novel "Greebles" object recognition task (Ashworth et al., 2008). We found that DNNs trained to capture the causal structure of data were significantly more predictive of human decisions and reaction times than any other DNN tested. Moreover, these causal DNNs learned strong equivariance to out-of-plane variations in pose, recapitulating classical theory on the foundations of object constancy (Sinha & Poggio, 1996) despite no explicit constraints to do so. Our work identifies key limitations in how DNNs are trained today and introduces a better approach for building DNN-based models of human vision that can ultimately advance perceptual science.
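
As a concrete illustration of the kind of parametrized objective space described above, the following Python sketch shows one way a family of reconstruction objectives could vary along a causal/acausal axis and a space/space-time axis. The function names, masking scheme, and model interface are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def reconstruction_loss(model, video, causality="causal", domain="space-time"):
    """video: (B, T, C, H, W) clip; `model` maps a context to one predicted frame.
    (Hypothetical interface chosen for illustration.)"""
    B, T, C, H, W = video.shape
    t = T // 2  # frame to reconstruct
    if domain == "space":
        # Purely spatial objective: inpaint masked pixels of a single frame,
        # with no temporal context at all.
        frame = video[:, t]
        mask = (torch.rand(B, 1, H, W) < 0.5).float()
        pred = model(frame * (1 - mask))
        return F.mse_loss(pred * mask, frame * mask)
    if causality == "causal":
        context = video[:, :t]  # condition on past frames only
    else:
        # Acausal: condition on both past and future frames.
        context = torch.cat([video[:, :t], video[:, t + 1:]], dim=1)
    return F.mse_loss(model(context), video[:, t])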

Acknowledgements: This work was supported by ONR (N00014-19-1-2029), NSF (IIS-1912280 and EAR-1925481), DARPA (D19AC00015), and NIH/NINDS (R21 NS 112743), with Cloud TPU hardware provided through the TensorFlow Research Cloud (TFRC) program and additional computing hardware supported by NIH Office of the Director grant S10OD025181.

Talk 2, 8:30 am, 21.22

Generating objects in peripheral vision using attention-guided diffusion models

Ritik Raina1, Seoyoung Ahn1, Gregory Zelinsky1; 1Stony Brook University

Although the majority of our visual field is blurry in the periphery, with only the central ~2 degrees offering high-resolution input, we have no difficulty perceiving and interacting with the objects around us. We hypothesize that the human perception of a stable visual world is mediated by an active generation of objects from blurred peripheral vision. Furthermore, we hypothesize that this active peripheral generation is task-dependent: it is guided by information extracted from fixations, with the goal of constructing a relevant object and scene context for the current task. We test these hypotheses using latent diffusion models, evaluating the influence of fixated image information on the generation of objects in the blurred periphery. We do so in the context of an object referral task, in which participants hear a spoken description of the search target (e.g., "right white van"). We recorded eye movements from participants (n=220) as they viewed 1,619 images and attempted to localize the referred targets. The model received high-resolution input only from fixated regions, mimicking foveated vision, and generated high-resolution objects in the originally blurred peripheral areas. We found that foveated-image inputs corresponding to observed behavioral fixations led the model to generate target objects in the periphery with greater fidelity than randomly located fixations, as measured by squared pixel difference (human-fixation SSE = 178.27; random-fixation SSE = 212.42; averaged over the first 20 fixations). This fixation-driven advantage applied specifically to the reconstruction of task-relevant objects, such as objects of the same referred category, and did not extend to non-targets or background elements. Our findings support the idea that human perception actively generates relevant objects in the blurry periphery as a means of building a stable object context, guided by goal-directed attention control mechanisms.
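
A minimal sketch of the foveation and scoring steps described above, assuming a simple Gaussian blur for the periphery and a sum-of-squared-error (SSE) metric over a region of interest; the radius, blur sigma, and function names are illustrative assumptions rather than the authors' pipeline.

import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, fixation, fovea_radius=32, blur_sigma=8):
    """image: (H, W, 3) float array; fixation: (row, col) in pixels."""
    # Blur the whole image, then paste back full resolution inside the fovea.
    blurred = gaussian_filter(image, sigma=(blur_sigma, blur_sigma, 0))
    H, W = image.shape[:2]
    rr, cc = np.ogrid[:H, :W]
    fovea = (rr - fixation[0]) ** 2 + (cc - fixation[1]) ** 2 <= fovea_radius ** 2
    out = blurred.copy()
    out[fovea] = image[fovea]  # high resolution only at the fixated region
    return out

def sse(generated, original, region_mask):
    """Squared pixel difference restricted to a region (e.g., the target object)."""
    diff = (generated - original)[region_mask]
    return float((diff ** 2).sum())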

Acknowledgements: This work was supported in part by NSF IIS awards 1763981 and 2123920 to G.Z.

Talk 3, 8:45 am, 21.23

Learning to discriminate by learning to generate: zero-shot generative models increase human object recognition alignment

Robert Geirhos1, Kevin Clark1, Priyank Jaini1; 1Google DeepMind

How does the human visual system recognize objects: through discriminative inference (fast but potentially unreliable) or through a generative model of the world (slow but potentially more robust)? The question of how the brain combines the best of both worlds to achieve fast and robust inference has been termed "the deep mystery of vision" (Kriegeskorte, 2015). Yet most of today's leading computational models of human vision are based purely on discriminative inference, such as convolutional neural networks or vision transformers trained on object recognition. In contrast, here we revisit the concept of vision as generative inference. This idea dates back to the notion of vision as unconscious inference proposed by Helmholtz (1867), who hypothesized that the brain uses a generative model of the world to infer the probable causes of sensory input. To build a generative model capable of recognizing objects, we take some of the world's most powerful generative text-to-image models (Stable Diffusion, Imagen, and Parti) and turn them into zero-shot image classifiers using Bayesian inference. We then compare these generative classifiers against a broad range of discriminative classifiers and against human psychophysical object recognition data from the "model-vs-human" toolbox (Geirhos et al., 2021). We discover four emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level accuracy on challenging distorted images, and state-of-the-art alignment with human classification errors. Last but not least, generative classifiers understand certain perceptual illusions, such as the famous bistable rabbit-duck illusion or Giuseppe Arcimboldo's portrait of a man's face composed entirely of vegetables, speaking to their ability to discern ambiguous input and distinguish local from global information. Taken together, our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data remarkably well.
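
A minimal sketch of how a class-conditioned diffusion model can be turned into a zero-shot classifier via Bayesian inference, in the spirit described above: with a uniform class prior, the class whose conditioning yields the lowest expected denoising error maximizes an ELBO approximation to log p(x | c), so argmin over classes approximates the Bayes-optimal prediction. The `denoiser` interface below is an assumption for illustration, not the authors' code.

import torch

@torch.no_grad()
def generative_classify(denoiser, x, class_prompts, n_samples=64):
    """x: (C, H, W) image (or its latent); returns the index of the best class."""
    errors = []
    for prompt in class_prompts:
        err = 0.0
        for _ in range(n_samples):
            # Sample a random timestep and noise, run one forward-diffusion step,
            # and measure how well the class-conditioned model predicts the noise.
            t = torch.randint(0, denoiser.num_timesteps, (1,))
            eps = torch.randn_like(x)
            x_t = denoiser.add_noise(x, eps, t)   # noised input (assumed method)
            eps_hat = denoiser(x_t, t, prompt)    # class-conditioned denoiser
            err += ((eps - eps_hat) ** 2).mean().item()
        errors.append(err / n_samples)
    # Lowest average denoising error ~ highest log p(x | c) under a uniform prior.
    return int(torch.tensor(errors).argmin())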

Talk 4, 9:00 am, 21.24

Out-of-distribution generalization behavior of DNN-based encoding models for the visual cortex

Spandan Madan1,3, Mingran Cao2, Will Xiao1, Hanspeter Pfister1, Gabriel Kreiman1,3; 1Harvard University, 2The Francis Crick Institute, 3Boston Children's Hospital

Deep Neural Networks (DNNs) trained for object classification develop internal feature representations remarkably similar to neural representations in the primate ventral visual stream. This has led to the widespread use of encoding models of the visual cortex built from linear combinations of pre-trained DNN unit activities. However, DNNs struggle to generalize under distribution shifts, particularly when faced with out-of-distribution (OOD) samples: while DNNs excel at interpolating between training data points, they perform poorly when extrapolating beyond the bounds of the training data (e.g., Hasson et al., 2020). We characterized the generalization capabilities of DNN-based encoding models when predicting neuronal responses from the primate ventral visual stream. Using a large-scale dataset of neuronal responses from the macaque inferior temporal cortex to over 100,000 images, we simulated OOD neural activity prediction by dividing the images into multiple training and test sets, holding out subsets of the data to introduce different OOD domain shifts. These included OOD low-level image features such as contrast, hue, and size; OOD high-level features such as animate vs. inanimate, food vs. non-food, and different semantic object categories; and OOD K-means clusters in the distributed representations of ResNet features and neural data. For each feature, an OOD test set was constructed by defining a parametric value for that feature and withholding a subset of its possible values from training. Overall, models performed much worse when predicting responses to out-of-distribution images than under standard cross-validation: prediction on an IID test set with no distribution shift achieved r^2 = 0.5, whereas OOD prediction ranged from 0.48 (images with OOD contrast shift) down to 0.1 (images with OOD hue). This points to a deep problem in modern models of the visual cortex: the promise of current image-computable models remains limited to the training image distribution.
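
The evaluation protocol lends itself to a short sketch: fit a linear (ridge) encoding model on pre-trained DNN features, then compare r^2 under a random (IID) split against a split that withholds one band of a low-level feature from training (hue is used here). Variable names and the held-out band are illustrative assumptions, not the authors' exact splits.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def evaluate_split(features, responses, train_idx, test_idx):
    # Linear encoding model: pre-trained DNN features -> neuronal responses.
    model = Ridge(alpha=1.0).fit(features[train_idx], responses[train_idx])
    return r2_score(responses[test_idx], model.predict(features[test_idx]))

def iid_vs_ood_r2(features, responses, hue):
    """features: (n_images, n_units); responses: (n_images, n_neurons);
    hue: (n_images,) scalar feature values in [0, 1)."""
    idx = np.arange(len(features))
    # IID baseline: random 80/20 split, as in standard cross-validation.
    tr, te = train_test_split(idx, test_size=0.2, random_state=0)
    r2_iid = evaluate_split(features, responses, tr, te)
    # OOD split: withhold one band of hue values entirely from training.
    ood = hue > 0.8
    r2_ood = evaluate_split(features, responses, idx[~ood], idx[ood])
    return r2_iid, r2_ood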

Acknowledgements: This work has been partially supported by NSF grant IIS-1901030.

Talk 5, 9:15 am, 21.25

Higher visual areas act like domain-general filters with strong selectivity and functional specialization

Meenakshi Khosla1, Leila Wehbe2; 1University of California, San Diego, 2Carnegie Mellon University

Modeling neural responses to naturalistic stimuli has been instrumental in advancing our understanding of the visual system, but dominant computational modeling efforts have been deeply rooted in preconceived hypotheses. Here, we develop a hypothesis-neutral computational methodology that brings neuroscience data directly to bear on the model development process, and we demonstrate its effectiveness both in modeling and in systematically characterizing voxel tuning properties. We leverage the unprecedented scale of the Natural Scenes Dataset to constrain parametrized neural models of higher-order visual systems with brain response measurements, achieving predictive precision that outperforms state-of-the-art models. Next, we ask what kinds of functional properties emerge spontaneously in these response-optimized models. We examine the trained networks through structural and functional analyses, running 'virtual' fMRI experiments on large-scale probe datasets. Strikingly, although the models are optimized from scratch for brain response prediction with no category-level supervision, units in the optimized networks act strongly as detectors for semantic concepts such as 'faces' or 'words', providing some of the strongest evidence for categorical selectivity in these areas. Importantly, this selectivity is maintained when the networks are trained without any images containing the preferred category, strongly suggesting that selectivity reflects not domain-specific machinery but sensitivity to generic patterns that characterize the preferred categories. Beyond characterizing tuning properties, we study how well the representations in response-optimized networks transfer to different perceptual tasks. We find that the sole objective of reproducing neural targets, without any task-specific supervision, grants different networks intriguing functionalities. Finally, our models show selectivity only for a limited number of categories, all previously identified, suggesting that the essential categories are already known. Together, this new class of response-optimized models, combined with novel interpretability techniques, provides a powerful framework for probing the nature of representations and computations in the brain.
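
A hedged sketch of the response-optimization setup: a network is trained from scratch to predict voxel responses, and a 'virtual' experiment then probes category selectivity by contrasting unit responses to preferred vs. non-preferred probe images. The toy architecture, loss, and selectivity index below are illustrative assumptions, far smaller than any model actually fit to the Natural Scenes Dataset.

import torch
import torch.nn as nn

class ResponseModel(nn.Module):
    """Tiny stand-in for a response-optimized network: image -> voxel responses."""
    def __init__(self, n_voxels):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.readout = nn.Linear(64, n_voxels)

    def forward(self, x):
        return self.readout(self.backbone(x))

def train_step(model, images, voxel_responses, opt):
    # Sole objective: reproduce measured brain responses (no category labels).
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(images), voxel_responses)
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def selectivity_index(model, preferred_imgs, other_imgs):
    # Virtual experiment: d'-like contrast between responses to the two probe sets.
    a, b = model(preferred_imgs), model(other_imgs)
    return (a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-8)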

Talk 6, 9:30 am, 21.26

Emergence of illusory contours in robust deep neural networks by accumulation of implicit priors

Tahereh Toosi1, Kenneth Miller1; 1Columbia University

Deep neural networks (DNNs) trained for object recognition exhibit similarities to neural responses in the monkey visual cortex and are currently considered the best models of the primate visual system. It remains unclear, however, whether psychophysical effects such as the illusory contours perceived by humans also emerge in these models. Utilizing the invertibility properties of robustly trained feedforward neural networks, we demonstrate that illusory contours and shapes emerge when the network integrates its learned implicit priors. Our visual system is believed to store perceptual priors, with visual information learned and embedded in neural connections across all visual areas. This stored information is harnessed when required, for instance during occlusion resolution or the generation of visual imagery. While the significance of feedback connections in these processes is well recognized, the precise neural mechanism that aggregates information dispersed throughout the visual cortex remains elusive. In this study, we leverage a ResNet50 neural network, conventionally used in image recognition, to shed light on the neural basis of illusory contour perception through its inherent feedback mechanism during error backpropagation. By iteratively accumulating the gradients of the loss with respect to an input (a Kanizsa square) within an adversarially trained network, we observed the emergence of edge-like patterns in the area of the perceived 'white square'. This process, which unfolds over multiple iterations, echoes the time-dependent emergence of illusory contours in the visual cortices of rodents and primates as seen in experimental studies. Notably, the ResNet50 employed in this study was neither specifically enhanced with feedback capabilities nor optimized to detect or decode illusory contours; it was merely trained for robust object recognition against adversarial examples. These findings highlight a compelling parallel, suggesting that the ability to perceive illusory contours might be an incidental consequence of the network's ability to handle adversarial noise during training.
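
The gradient-accumulation procedure described above can be sketched compactly: backpropagate a recognition loss to the pixels of a Kanizsa image in an adversarially trained network and take repeated small steps, so that edge-like structure accumulates in the input. The step size, iteration count, and target class below are illustrative assumptions, not the authors' exact settings.

import torch
import torch.nn.functional as F

def accumulate_prior(robust_model, kanizsa_image, target_class, steps=50, lr=0.5):
    """kanizsa_image: (1, 3, H, W) tensor; target_class: (1,) long tensor;
    robust_model: an adversarially trained classifier such as a robust ResNet50."""
    x = kanizsa_image.clone().requires_grad_(True)
    for _ in range(steps):
        # Gradient of the recognition loss with respect to the input pixels:
        # the network's feedback pathway (backpropagation) in this account.
        loss = F.cross_entropy(robust_model(x), target_class)
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            # Descend the loss so the image drifts toward the model's prior.
            x -= lr * grad / (grad.norm() + 1e-12)
    # Edge-like structure in (x - kanizsa_image) can then be inspected in the
    # region of the perceived illusory square.
    return x.detach()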

Acknowledgements: T.T. is supported by NIH 1K99EY035357-01. This work was also supported by NIH RF1DA056397, NSF 1707398, and Gatsby Charitable Foundation GAT3708.