VSS 2022, May 13–18

Human Vision and Neural Networks: General considerations

Talk Session: Wednesday, May 18, 2022, 10:45 am – 12:30 pm EDT, Talk Room 2
Moderator: Felix Wichmann, University of Tübingen

Talk 1, 10:45 am, 62.21

Lack of experience with blurry visual input may cause CNNs to deviate from biological visual systems

Hojin Jang1,2, Frank Tong1,2; 1Vanderbilt University, 2Vanderbilt Vision Research Center

Our subjective impression of the visual world is that it appears clear, when in fact large portions of the retinal image often consist of degraded input due to optical defocus and low resolution in the periphery. Convolutional neural networks (CNNs) are believed to provide the best current model of biological vision, yet the typical training regime for CNNs consists predominantly of clear images. We hypothesized that a lack of blurry input may cause CNNs to acquire representations that rely excessively on the high-spatial-frequency content of visual objects (Jang & Tong, Journal of Vision, 2021), causing deviations from biological visual systems. We sought to test this idea by comparing two types of CNNs: those trained with both blurry and clear images and those trained with clear images only. Multiple data sets were employed to compare CNN performance, including human fMRI data (Xu & Vaziri-Pashkam, 2021; Jang et al., 2021), monkey neurophysiological data (Cadena et al., 2019; Schrimpf et al., 2020), and human behavioral data (Geirhos et al., 2019; Hendrycks & Dietterich, 2019). We found that blur-trained CNNs outperformed clear-trained CNNs at approximating the representational structure of objects in the human ventral visual pathway across multiple viewing conditions, in which objects were high-pass filtered, degraded by noise, or presented clearly. Additionally, blurry-image training was found to improve CNN prediction of monkeys’ neuronal responses, particularly in the early visual areas. Furthermore, the blur-trained CNNs demonstrated greater shape bias and greater noise robustness than the clear-trained CNNs, thereby showing better correspondence with human behavior. Taken together, our findings suggest that modern CNN models are heavily biased towards learning high-spatial-frequency representations of objects, while the human visual system may benefit from blurry visual experiences in daily life to attain more robust object processing.

Acknowledgements: Supported by an NIH R01EY029278 grant to FT.
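
To make the training manipulation concrete, here is a minimal sketch of a mixed blurry/clear training regime, assuming PyTorch/torchvision; the blur probability, kernel size, sigma range, and dataset path are illustrative assumptions rather than the authors' actual settings.

    import torch
    from torchvision import datasets, transforms

    # Apply Gaussian blur to roughly half of the training images (hypothetical
    # parameters); the other half pass through as clear images.
    blur_or_clear = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomApply(
            [transforms.GaussianBlur(kernel_size=21, sigma=(1.0, 8.0))], p=0.5),
        transforms.ToTensor(),
    ])

    # Hypothetical ImageNet-style directory; any ImageFolder-compatible dataset works.
    train_set = datasets.ImageFolder("path/to/train", transform=blur_or_clear)
    loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

A clear-trained control model would use the same pipeline with the RandomApply line removed, so that the two models differ only in their exposure to blur.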

Talk 2, 11:00 am, 62.22

A neural network family for systematic analysis of RF size and computational-path-length distribution as determinants of neural predictivity and behavioral performance

Benjamin Peters1, Lucas Stoffl4, Nikolaus Kriegeskorte1,2,3; 1Zuckerman Mind Brain Behavior Institute, Columbia University, 2Department of Psychology, Columbia University, 3Department of Neuroscience, Columbia University, 4Brain Mind Institute, École polytechnique fédérale de Lausanne, Switzerland

Deep feedforward convolutional neural network models (FCNNs) explain aspects of the representational transformations in the visual hierarchy. However, particular models implement idiosyncratic combinations of architectural hyperparameters, which limits theoretical progress. In particular, the size of receptive fields (RFs) and the distribution of computational path lengths (CPL; the number of nonlinearities encountered) leading up to a representational stage are confounded across layers of the same architecture (deeper layers have larger RFs) and depend on idiosyncratic choices (kernel sizes, depth, skip connections) across architectures. Here we introduce HBox, a family of architectures designed to break the confound between RF size and CPL. Like conventional FCNNs, an HBox model contains a feedforward hierarchy of convolutional feature maps. Unlike in FCNNs, each map has a predefined RF size that can result from shorter or longer computational paths, or any combination thereof (through skip connections). We implemented a large sample of HBox models and investigated how RF size and CPL jointly account for neural predictivity and behavioral performance. The model set also provides insights into the joint contribution of deep and broad pathways, which achieve complexity through long or numerous computational paths, respectively. When controlling for the number of parameters, we find that visual tasks with higher complexity (CIFAR-10, ImageNet) and occlusion (Digitclutter; Spoerer et al., 2017) show peak performance in models that trade off breadth to achieve greater depth (average CPL). The opposite holds for a simpler task (MNIST). We further disentangle the contributions of CPL and RF size to the match between brain and model representations by assessing the ability of HBox models to predict visual representations in regions of interest in a large-scale fMRI benchmark (Natural Scenes Dataset; Allen et al., 2021). The HBox architecture family illustrates how high-parametric task-performing vision models can be used systematically to gain theoretical insights into the neural mechanisms of vision.
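
The following toy PyTorch sketch is not the authors' HBox implementation, but it illustrates the property the architecture family is built around: the same receptive-field size can be reached through computational paths of different lengths (here, a 5x5 RF via one 5x5 convolution or via two stacked 3x3 convolutions), so RF size and CPL can be varied independently.

    import torch
    import torch.nn as nn

    class FixedRFBlock(nn.Module):
        """Toy block: a 5x5 RF reached by paths of length 1 and 2."""
        def __init__(self, channels):
            super().__init__()
            # Short path: one nonlinearity, 5x5 kernel -> RF 5, CPL 1.
            self.short = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=5, padding=2), nn.ReLU())
            # Long path: two nonlinearities, stacked 3x3 kernels -> RF 5, CPL 2.
            self.long = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU())

        def forward(self, x):
            # Both paths yield maps with the same RF size; their sum mixes
            # contributions that arrived via different computational path lengths.
            return self.short(x) + self.long(x)

    print(FixedRFBlock(16)(torch.randn(1, 16, 64, 64)).shape)  # [1, 16, 64, 64]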

Talk 3, 11:15 am, 62.23

Shape bias at a glance: Comparing human and machine vision on equal terms

Katherine L. Hermann1, Chaz Firestone2; 1Stanford University, 2Johns Hopkins University

Recent work has highlighted a seemingly sharp divergence between human and machine vision: whereas people exhibit a shape bias, preferring to classify objects according to their shape (Landau et al. 1988), standard ImageNet-trained CNNs prefer to use texture (Geirhos et al. 2018). However, existing studies have tested people under different conditions from those faced by a feedforward CNN, presenting stimuli long enough for feedback and attentive processes to come online, and using tasks which may bias judgments towards shape. Does this divergence remain when testing conditions are more fairly aligned? In six pre-registered experiments (total N=1064) using brief stimulus presentations (50ms), we asked participants whether a stimulus exactly matched a target image (e.g. a feather-textured bear). Stimuli either matched (a) exactly (the same image), (b) in shape but not texture (“shape lure”, e.g. a pineapple-textured bear), (c) in texture but not shape (“texture lure”, e.g. a feather-textured scooter), or did not match in either shape or texture (“filler”). We tested whether false-alarm rates differed for shape lures versus fillers, for texture lures versus fillers, and for texture lures versus shape lures. This paradigm avoids explicit object categorization and naming, allowing us to test whether a shape bias is already present in perception, regardless of how shape is weighted in subsequent cognitive and linguistic processing. We find that people do rely on shape more than texture, false-alarming significantly more often for shape lures than texture lures. However, although shape-biased, participants are still lured by texture information, false-alarming significantly more often for texture lures than for fillers. These findings are robust to stimulus type (including multiple previously studied stimulus sets) and mask type (pink noise, scramble, no mask), and establish a new benchmark for assessing the extent to which feedforward computer vision models are “humanlike” in their shape bias.
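
As an illustration of the analysis logic (not the authors' pipeline), the central comparison reduces to false-alarm rates on non-target trials, split by lure condition; the trial data below are made up.

    import pandas as pd

    # Hypothetical trial-level data: one row per non-target trial, recording the
    # lure condition and whether the observer (incorrectly) responded "match".
    trials = pd.DataFrame({
        "condition": ["shape_lure", "texture_lure", "filler",
                      "shape_lure", "texture_lure", "filler"],
        "said_match": [True, False, False, True, True, False],
    })

    # False-alarm rate per condition; the key contrasts are shape lure vs. filler,
    # texture lure vs. filler, and shape lure vs. texture lure.
    print(trials.groupby("condition")["said_match"].mean())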

Talk 4, 11:30 am, 62.24

The bittersweet lesson: data-rich models narrow the behavioural gap to human vision

Robert Geirhos1,2, Kantharaju Narayanappa1, Benjamin Mitzkus1, Tizian Thieringer1, Matthias Bethge1, Felix A. Wichmann1, Wieland Brendel1; 1University of Tübingen, 2International Max Planck Research School for Intelligent Systems

A major obstacle to understanding human visual object recognition is our lack of behaviourally faithful models. Even the best models based on deep learning classifiers strikingly deviate from human perception in many ways. To study this deviation in more detail, we collected a massive set of human psychophysical classification data under highly controlled conditions (17 datasets, 85K trials across 90 observers). We made this data publicly available as an open-sourced Python toolkit and behavioural benchmark called "model-vs-human", which we use to investigate the very latest generation of models. Generally, in terms of robustness, standard machine vision models make many more errors on distorted images than humans do, and in terms of image-level consistency, they make very different errors from humans. Excitingly, however, a number of recent models make substantial progress towards closing this behavioural gap: "simply" training models on large-scale datasets (between one and three orders of magnitude larger than standard ImageNet) is sufficient to, first, reach or surpass human-level distortion robustness and, second, improve image-level error consistency between models and humans. This is significant given that none of those models is particularly biologically faithful at the implementational level, and in fact, large-scale training appears much more effective than, e.g., biologically motivated self-supervised learning. In the light of these findings, it is hard to avoid drawing parallels to the "bitter lesson" formulated by Rich Sutton, who argued that "building in how we think we think does not work in the long run" and that, ultimately, scale would be all that matters. While human-level distortion robustness and improved behavioural consistency with human decisions through large-scale training are certainly a sweet surprise, this leaves us with a nagging question: Should we, perhaps, worry less about biologically faithful implementations and more about the algorithmic similarities between human and machine vision induced by training on large-scale datasets?

Acknowledgements: This work was supported by the IMPRS-IS, the Collaborative Research Center (276693517), the German Federal Ministry of Education and Research (01IS18039A), the Machine Learning Cluster of Excellence (EXC 2064/1, project number 390727645), and the German Research Foundation (BR 6382/1-1).
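
The image-level error consistency mentioned above can be quantified with the trial-by-trial kappa measure that the same group introduced (Geirhos et al., 2020); the sketch below assumes two aligned boolean vectors marking which trials each system classified correctly, with made-up example data.

    import numpy as np

    def error_consistency(correct_a, correct_b):
        # Cohen's-kappa-style error consistency between two observers/models:
        # agreement beyond what the two accuracies alone would predict.
        correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
        p_a, p_b = correct_a.mean(), correct_b.mean()
        c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)   # chance-level agreement
        c_obs = (correct_a == correct_b).mean()     # observed agreement
        return (c_obs - c_exp) / (1 - c_exp)

    # Made-up correctness vectors over ten shared trials.
    human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    model = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
    print(error_consistency(human, model))  # ~0.47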

Talk 5, 11:45 am, 62.25

Latent dimensionality scales with the performance of deep learning models of visual cortex

Eric Elmoznino1, Michael Bonner1; 1Johns Hopkins University

The ventral visual stream is a complex, nonlinear system whose internal representations are currently best approximated through deep learning in convolutional neural networks (CNNs). Neuroscientists have been in search of the core principles that explain why some CNNs are better than others at predicting the responses in visual cortex. Previous efforts have focused on factors related to architecture, training task, visual diet, and interpretable properties of the learned features. Here, we take a different approach and seek to understand the performance of CNN models of the ventral stream in terms of their latent geometric properties. Specifically, we focus on latent dimensionality, which is the number of dimensions spanned by the activity space of a CNN’s responses to natural images. While low dimensionality can promote invariance to incidental image properties, high dimensionality increases expressivity and can support a wider range of behaviors. Thus, we asked: what level of CNN dimensionality is best for modeling activity in the primate ventral stream? To address this question, we estimated the dimensionality of a large set of CNNs trained on a variety of tasks using multiple datasets, and we assessed how well these CNNs performed as linear encoding models of object-evoked responses in ventral visual cortex using both human fMRI and monkey electrophysiology data. These analyses revealed a striking effect: higher dimensional CNNs were better at predicting cortical responses. Importantly, our results cannot be explained as trivial statistical effects of dimensionality when fitting linear encoding models. There are, in fact, many alternative conditions under which high-dimensional models cannot accurately predict neural data, which we demonstrated both empirically and through simulations. Together, our findings suggest that CNN models of visual cortex may be best understood in terms of the latent geometric properties of their representations, rather than the idiosyncratic details of their architectures or training procedures.
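
The abstract does not spell out the estimator, but one common notion of latent dimensionality is the participation ratio of the eigenvalue spectrum of the activation covariance; a minimal sketch under that assumption, with simulated low-rank features standing in for CNN activations:

    import numpy as np

    def participation_ratio(activations):
        # Effective dimensionality of an (n_images x n_units) activation matrix:
        # PR = (sum of covariance eigenvalues)^2 / (sum of squared eigenvalues).
        centered = activations - activations.mean(axis=0, keepdims=True)
        eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
        eigvals = np.clip(eigvals, 0, None)  # guard against tiny negative values
        return eigvals.sum() ** 2 / (eigvals ** 2).sum()

    # Simulated features: 1000 images x 512 units with latent rank ~20 plus noise.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(1000, 20)) @ rng.normal(size=(20, 512))
    features += 0.1 * rng.normal(size=features.shape)
    print(participation_ratio(features))  # close to the latent rank, far below 512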

Talk 6, 12:00 pm, 62.26

Global information processing in feedforward deep networks

Ben Lonnqvist1, Alban Bornet1, Adrien Doerig2, Michael H. Herzog1; 1Laboratory of Psychophysics, Brain Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, 2Donders Institute for Brain, Cognition & Behaviour, Nijmegen, Netherlands

While deep neural networks are state-of-the-art models of many parts of the human visual system, here we show that they fail to process global information in a humanlike manner. First, using visual crowding as a probe into global visual information processing, we found that, regardless of architecture, feedforward deep networks successfully model an elementary version of crowding but cannot exhibit its global counterpart (“uncrowding”). It is not yet well understood whether this limitation could be ameliorated by substantially larger and more naturalistic training conditions, or by attentional mechanisms. To investigate this, we studied models trained with the CLIP (Contrastive Language-Image Pretraining) procedure, which trains attention-based models for zero-shot classification of images. CLIP models are trained by self-supervised pairing of generated labels with image inputs on a composite dataset of approximately 400 million images. Owing to this training procedure, CLIP models have been shown to exhibit highly abstract representations, to achieve state-of-the-art performance in zero-shot classification, and to make classification errors that are more in line with the errors humans make than those of previous models. Despite these advances, we show, by fitting logistic regression models to the activations of layers in CLIP models, that neither training procedure, architectural differences, nor training dataset size ameliorates feedforward networks’ inability to reproduce humanlike global information processing in an uncrowding task. This highlights an important aspect of visual information processing: feedforward computations alone are not enough to explain how visual information in humans is combined globally.

Acknowledgements: BL was supported by the Swiss National Science Foundation grant no. 176153, "Basics of visual processing: from elements to figures".
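
A rough sketch of the readout analysis described above, probing CLIP image representations with a logistic-regression classifier; it assumes OpenAI's clip package, uses the final image embedding rather than every intermediate layer for brevity, and substitutes random-noise placeholders for the crowding/uncrowding stimuli and labels.

    import clip
    import numpy as np
    import torch
    from PIL import Image
    from sklearn.linear_model import LogisticRegression

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def embed(pil_images):
        # Encode images with the frozen CLIP image tower (no gradient updates).
        batch = torch.stack([preprocess(im) for im in pil_images]).to(device)
        with torch.no_grad():
            return model.encode_image(batch).float().cpu().numpy()

    # Placeholders: noise images stand in for (un)crowding displays, and the labels
    # for the to-be-decoded target property (e.g. vernier offset direction).
    rng = np.random.default_rng(0)
    stimuli = [Image.fromarray(rng.integers(0, 256, (224, 224, 3), dtype=np.uint8))
               for _ in range(80)]
    labels = rng.integers(0, 2, 80)

    X = embed(stimuli)
    probe = LogisticRegression(max_iter=5000).fit(X[:60], labels[:60])
    print("held-out probe accuracy:", probe.score(X[60:], labels[60:]))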

Talk 7, 12:15 pm, 62.27

Brain-optimized neural networks reveal evidence for non-hierarchical representation in human visual cortex

Ghislain St-Yves1, Emily Allen1, Yihan Wu1, Kendrick Kay1, Thomas Naselaris1; 1University of Minnesota

Task-optimized deep neural networks (DNNs) have been shown to yield impressively accurate predictions of brain activity in the primate visual system. For most networks, layer depth generally aligns with the sequence of visual areas V1-V4, with deeper layers best predicting more anterior areas. This result has been construed as evidence that V1-V4 instantiates hierarchical computation. To test this interpretation, we analyzed the Natural Scenes Dataset, a massive dataset consisting of 7T fMRI measurements of human brain activity in response to up to 30,000 natural scene presentations per subject. We used this dataset to directly optimize DNNs to predict responses in V1-V4, flexibly allowing features to distribute across layers in any way that improves prediction of brain activity. Our results challenge three aspects of hierarchical computation. First, we find only a marginal advantage of jointly training on V1-V4 relative to training independent DNNs on each of these brain areas. This suggests that data from different areas offer largely independent constraints on the model. Second, the independent DNNs do not show the typical alignment of network layer depth with visual areas. This suggests that alignment may arise for reasons other than computational depth. Finally, we performed transfer learning between the DNN features learned on each visual area. We show that features learned on anterior areas (e.g. V4) generalize poorly to the representations found in more posterior areas (e.g. V1). Together, these results indicate that the features represented in V1-V4 do not necessarily bear hierarchical relationships to one another. Overall, we suggest that human visual areas V1-V4 do not serve only as a pre-processing stream for generating higher visual representations, but may also operate as a parallel system of representation that can serve multiple independent functions.

Acknowledgements: Collection of the NSD dataset was supported by NSF CRCNS grants IIS-1822683 and IIS-1822929.
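
A schematic sketch of the transfer comparison described above: fit a cross-validated linear (ridge) readout from features produced by a network optimized on one area to voxel responses in another, and compare generalization in both directions. The array names and random placeholder data are illustrative, not the authors' pipeline.

    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import cross_val_score

    # Placeholders: features_v4 would come from a DNN optimized to predict V4,
    # features_v1 from a DNN optimized to predict V1, and y_v1 holds V1 voxel
    # responses to the same held-out images.
    rng = np.random.default_rng(0)
    n_images, n_features, n_voxels = 500, 256, 50
    features_v4 = rng.normal(size=(n_images, n_features))
    features_v1 = rng.normal(size=(n_images, n_features))
    y_v1 = rng.normal(size=(n_images, n_voxels))

    def transfer_score(features, responses):
        # Mean cross-validated R^2 of a ridge readout from frozen features to voxels;
        # higher values mean the features generalize better to that area.
        model = RidgeCV(alphas=np.logspace(-2, 4, 7))
        return float(np.mean([cross_val_score(model, features, responses[:, v], cv=5).mean()
                              for v in range(responses.shape[1])]))

    print("V4-optimized features -> V1:", transfer_score(features_v4, y_v1))
    print("V1-optimized features -> V1:", transfer_score(features_v1, y_v1))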