Using deep networks to re-imagine object-based attention and perception

Symposium: Friday, May 17, 2024, 5:00 – 7:00 pm, Talk Room 2

Organizers: Hossein Adeli1, Seoyoung Ahn2, Gregory Zelinsky2; 1Columbia University, 2Stony Brook University
Presenters: Patrick Cavanagh, Frank Tong, Paolo Papale, Alekh Karkada Ashok, Hossein Adeli, Melissa Le-Hoa Võ

What can Deep Neural Network (DNN) methods tell us about the brain mechanisms that transform visual features into object percepts? Using state-of-the-art models, the speakers in this symposium will reexamine the cognitive and neural mechanisms of object-based attention (OBA) and perception and consider new computational accounts of how the visual system groups visual features into coherent object percepts. Our first speaker, Patrick Cavanagh, helped create the field of OBA and is therefore uniquely suited to give a perspective on how this question, essentially the feature-binding problem, has evolved over the years and has been shaped by the paradigms and methods available. He will conclude by outlining his vision for how DNN architectures create new perspectives on understanding OBA. The next two speakers will review recent behavioral and neural findings on object-based attention and feature grouping. Frank Tong will discuss the neural and behavioral signatures of OBA revealed by fMRI and eye-tracking methods, and will demonstrate how the human visual system represents objects across the hierarchy of visual areas. Paolo Papale will discuss neurophysiological evidence for the role of OBA and grouping in object perception. Using stimuli that systematically increase in complexity, from lines to natural objects against cluttered backgrounds, he will show that OBA and grouping are iterative processes. Both talks will also include discussion of current modeling efforts and of what additional measures may be needed to realize more human-like object perception. The following two talks will provide concrete examples of how DNNs can be used to predict human behavior during different tasks. Lore Goetschalckx will focus on the importance of considering the time course of grouping in object perception and will discuss her recent work on a method for analyzing the dynamics of different models. Using this method, she will show how a deep recurrent model trained on an object grouping task predicts human reaction times. Hossein Adeli will review modeling work on three theories of how OBA binds features into objects: one that implements object files, another that uses generative processes to reconstruct an object percept, and a third in which attention spreads through association fields. In the context of these modeling studies, he will describe how each of these mechanisms was implemented as a DNN architecture. Lastly, Melissa Võ will drive home the importance of object representations and how they collectively create an object context that humans use to control their attention in naturalistic settings. She will show how GANs can be used to study the hidden representations underlying our perception of objects. This symposium is timely because advances in computational methods have made it possible to put old theories to the test and to develop new accounts of OBA mechanisms that clarify the role attention plays in creating object-centric representations.

Talk 1

The Architecture of Object-Based Attention

Patrick Cavanagh1, Gideon P. Caplovitz2, Taissa K. Lytchenko2, Marvin R. Maechler3, Peter U. Tse3, David R. Sheinberg4; 1Glendon College, York University, 2University of Nevada, Reno, 3Dartmouth College, 4Brown University

Evidence for the existence of object-based attention raises several important questions: what are objects, how does attention access them, and what anatomical regions are involved? What are the “objects” that attention can access? Several early studies suggested that items in visual search tasks are only loose collections of features prior to the arrival of attention. However, findings from a wide variety of paradigms, including unconscious priming and cuing, have overturned this view. Instead, the targets of object-based attention appear to be fully developed object representations that have reached the level of identity prior to the arrival of attention. Where do the downward projections of object-based attention originate? Current research indicates that the control of object-based attention must come from ventral visual areas specialized in object analysis that project downward to early visual areas. If so, how can feedback from object areas accurately target the object’s early locations and features when the object areas have only crude location information? Critically, recent work on autoencoders has made this plausible: they are capable of recovering the locations and features of target objects from the high-level, low-dimensional codes in the object areas. I will outline the architecture of object-based attention, the novel predictions it brings, and discuss how it works in parallel with other attention pathways.
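
The claim that feedback can recover an object’s location and features from a compact high-level code can be illustrated with a toy autoencoder. The sketch below is a minimal illustration, not the speaker’s model; all layer sizes are arbitrary assumptions. It compresses an image into a low-dimensional code and decodes that code back into a spatially resolved map, showing how location information can in principle be reconstructed from a crude, low-dimensional representation.

```python
# Minimal sketch (not the authors' model): a convolutional autoencoder whose
# decoder recovers a spatial map from a low-dimensional "object-area" code,
# illustrating how feedback could re-target an object's location and features.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, code_dim=32):
        super().__init__()
        # Encoder: 1x64x64 image -> low-dimensional code
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 16x32x32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32x16x16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, code_dim),
        )
        # Decoder: code -> spatially resolved reconstruction (location + features)
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)       # crude, low-dimensional object code
        recon = self.decoder(code)   # reconstructed spatial map
        return code, recon

model = TinyAutoencoder()
image = torch.rand(1, 1, 64, 64)     # toy input
code, recon = model(image)
print(code.shape, recon.shape)       # torch.Size([1, 32]) torch.Size([1, 1, 64, 64])
```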

Talk 2

Behavioral and neural signatures of object-based attention in the human visual system

Frank Tong1, Sonia Poltoratski1, David Coggan1, Lasyapriya Pidaparthi1, Elias Cohen1; 1Vanderbilt University

How might one demonstrate the existence of an object representation in the visual system? Does objecthood arise preattentively, attentively, or via a confluence of bottom-up and top-down processes? Our fMRI work reveals that orientation-defined figures are represented by enhanced neural activity in the early visual system. We observe enhanced fMRI responses in the lateral geniculate nucleus and V1, even for unattended figures, implying that core aspects of scene segmentation arise from automatic perceptual processes. In related work, we find compelling evidence of object completion in early visual areas. fMRI response patterns to partially occluded object images resemble those evoked by unoccluded objects, with comparable effects of pattern completion found for unattended and attended objects. However, in other instances, we find powerful effects of top-down attention. When participants must attend to one of two overlapping objects (e.g., face vs. house), activity patterns from V1 through inferotemporal cortex are biased in favor of the covertly attended object, with functional coupling of the strength of object-specific modulation found across brain areas. Finally, we have developed a novel eye-tracking paradigm to predict the focus of object-based attention while observers view two dynamically moving objects that mostly overlap. Estimates of the precision of gaze following suggest that observers can entirely filter out the complex motion signals arising from the task-irrelevant object. To conclude, I will discuss whether current AI models can adequately account for these behavioral and neural properties of object-based attention, and what additional measures may be needed to realize more human-like object processing.

Talk 3

The spread of object attention in artificial and cortical neurons

Paolo Papale1, Matthew Self1, Pieter Roelfsema1; 1Netherlands Institute for Neuroscience

A crucial function of our visual system is to group local image fragments into coherent perceptual objects. Behavioral evidence has shown that this process is iterative and time-consuming. A simple theory suggests that visual neurons can solve this challenging task by relying on recurrent processing: attending to an object could produce a gradual spread of enhancement across its representation in the visual cortex. Here, I will present results from a biologically plausible artificial neural network that can solve object segmentation by attention. This model was able to identify and segregate individual objects in cluttered scenes with high accuracy, using only modulatory top-down feedback of the kind observed in visual cortical neurons. Then, I will present comparable results from large-scale electrophysiology recordings in the macaque visual cortex. We tested the effect of object attention with stimuli of increasing complexity, from lines to natural objects against cluttered backgrounds. Consistent with behavioral observations, the iterative model correctly predicted the spread of attentional modulation in visual neurons for simple stimuli. However, for more complex stimuli containing recognizable objects, we observed asynchronous but not iterative modulation. We therefore produced a set of hybrid stimuli, combining local elements of two different objects, which we alternated with presentations of intact objects. By making local information unreliable, these stimuli forced the monkey to solve the task iteratively, and indeed they induced iterative attentional modulations. These results provide the first systematic investigation of object-based attention in both artificial and cortical neurons.
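
The idea of a gradual, object-bounded spread of attentional enhancement can be captured with a very simple iterative procedure. The sketch below is an illustrative toy, not the published network: attention is seeded at a cued location and, at each iteration, grows to neighboring locations that belong to the same object, multiplicatively enhancing their responses while never crossing the object boundary.

```python
# Illustrative sketch (not the published model): attention starts at a cued
# location and iteratively spreads to connected pixels of the same object,
# multiplicatively enhancing their responses.
import numpy as np

def spread_attention(object_mask, seed, n_iters=10, gain=1.5):
    """object_mask: HxW boolean array marking the attended object's pixels.
    seed: (row, col) where attention is first applied."""
    attended = np.zeros_like(object_mask, dtype=bool)
    attended[seed] = object_mask[seed]
    for _ in range(n_iters):
        # Dilate the attended region by one pixel in the 4-neighbourhood...
        grown = attended.copy()
        grown[1:, :] |= attended[:-1, :]
        grown[:-1, :] |= attended[1:, :]
        grown[:, 1:] |= attended[:, :-1]
        grown[:, :-1] |= attended[:, 1:]
        # ...but only within the object: modulation never crosses its boundary.
        attended = grown & object_mask
    return np.where(attended, gain, 1.0)   # multiplicative response modulation

# Toy example: an L-shaped object on a 6x6 grid, cued at its corner.
mask = np.zeros((6, 6), dtype=bool)
mask[1:5, 1] = True
mask[4, 1:5] = True
print(spread_attention(mask, seed=(1, 1), n_iters=3))
```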

Talk 4

Time to consider time: Comparing human reaction times to dynamical signatures from recurrent vision models on a perceptual grouping task

Alekh Karkada Ashok1, Lore Goetschalckx1, Lakshmi Narasimhan Govindarajan1, Aarit Ahuja1, David Sheinberg1, Thomas Serre1; 1Brown University

To make sense of its retinal inputs, our visual system organizes perceptual elements into coherent figural objects. This perceptual grouping process, like many aspects of visual cognition, is believed to be dynamic and at least partially reliant on feedback. Indeed, cognitive scientists have studied its time course through reaction time (RT) measurements and have associated it with a serial spread of object-based attention. Recent progress in biologically inspired machine learning has put forward convolutional recurrent neural networks (cRNNs) capable of exhibiting and mimicking visual cortical dynamics. To understand how the visual routines learned by cRNNs compare to those of humans, we need ways to extract meaningful dynamical signatures from a cRNN and study temporal human-model alignment. We introduce a framework to train, analyze, and interpret cRNN dynamics. Our framework triangulates insights from attractor-based dynamics and evidential learning theory. We derive a stimulus-dependent metric, ξ, and directly compare it to existing human RT data on the same task: a grouping task designed to study object-based attention. The results reveal a “filling-in” strategy learned by the cRNN, reminiscent of the serial spread of object-based attention in humans. We also observe a remarkable alignment between ξ and human RT patterns across diverse stimulus manipulations. This alignment emerged purely as a byproduct of the task constraints (no supervision on RT). Our framework paves the way for testing further hypotheses about the mechanisms supporting perceptual grouping and object-based attention, and for inter-model comparisons aimed at improving temporal alignment with humans on various other cognitive tasks.
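
Because ξ is defined within the authors’ own framework, the sketch below shows only one plausible way to extract a stimulus-dependent temporal signature from a recurrent model: the same input is re-presented for several timesteps and the readout’s uncertainty, summed over time, serves as a candidate correlate of reaction time. The architecture, the uncertainty measure, and the name xi here are all illustrative assumptions, not the talk’s actual method.

```python
# Hedged sketch: one way to read out a stimulus-dependent "time" metric from a
# recurrent model. Here xi is the area under the readout's uncertainty curve
# across timesteps (an assumption for illustration only).
import torch
import torch.nn as nn

class TinyRecurrentClassifier(nn.Module):
    def __init__(self, in_dim=64, hidden=32, n_classes=2, n_steps=8):
        super().__init__()
        self.n_steps = n_steps
        self.cell = nn.GRUCell(in_dim, hidden)
        self.readout = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = torch.zeros(x.size(0), self.cell.hidden_size)
        probs_over_time = []
        for _ in range(self.n_steps):          # same input re-presented each step
            h = self.cell(x, h)
            probs_over_time.append(torch.softmax(self.readout(h), dim=-1))
        return torch.stack(probs_over_time, dim=1)   # (batch, time, classes)

def xi_metric(probs_over_time):
    # Entropy of the readout at each timestep, summed over time: stimuli that
    # stay ambiguous longer accumulate more uncertainty, yielding a larger xi.
    entropy = -(probs_over_time * probs_over_time.clamp_min(1e-8).log()).sum(-1)
    return entropy.sum(dim=1)                  # one scalar per stimulus

model = TinyRecurrentClassifier()
stimuli = torch.rand(4, 64)                    # four toy stimuli
xi = xi_metric(model(stimuli))
print(xi)                                      # candidate correlate of human RT
```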

Talk 5

Three theories of object-based attention implemented in deep neural network models

Hossein Adeli1, Seoyoung Ahn2, Gregory Zelinsky2, Nikolaus Kriegeskorte1; 1Columbia University, 2Stony Brook University

Understanding the computational mechanisms that transform visual features into coherent object percepts requires implementing theories in scalable models. Here we report on implementations, using recent deep neural networks, of three previously proposed theories in which the binding of features is achieved (1) through convergence in a hierarchy of representations, resulting in object files, (2) through a reconstruction or generative process that can target different features of an object, or (3) through the enhancement of activation as attention spreads within an object via association fields. First, we present a model of object-based attention that relies on capsule networks to integrate the features of different objects in a scene. With this grouping mechanism, the model learns to sequentially attend to objects to perform multi-object recognition and visual reasoning. The second modeling study shows how top-down reconstructions of object-centric representations in a sequential autoencoder can target different parts of an object, yielding more robust and human-like object recognition. The last study demonstrates how object perception and attention could be mediated by flexible object-based association fields at multiple levels of the visual processing hierarchy. Transformers provide a key relational and associative computation that may also be present in the primate brain, albeit implemented by a different mechanism. We observe that representations in transformer-based vision models can predict the reaction time behavior of people on an object grouping task, and we show that their feature maps can model the spread of attention within an object.
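
As a rough illustration of the third mechanism, the sketch below (an assumption-laden toy, not the authors’ implementation) treats pairwise feature similarity from any vision backbone as an association field and lets attention, seeded at one spatial location, spread iteratively toward feature-similar locations.

```python
# Minimal sketch (assumptions, not the authors' implementation): feature maps
# from a vision backbone define an association field via pairwise similarity;
# attention seeded at one location then spreads to feature-similar locations.
import torch
import torch.nn.functional as F

def spread_via_association_field(features, seed_idx, n_iters=5, tau=0.1):
    """features: (N, D) tensor of N spatial locations with D-dim features."""
    f = F.normalize(features, dim=-1)
    affinity = torch.softmax(f @ f.t() / tau, dim=-1)  # row-stochastic association field
    attn = torch.zeros(features.size(0))
    attn[seed_idx] = 1.0                                # attention cued at one location
    for _ in range(n_iters):
        attn = affinity.t() @ attn                      # propagate along associations
        attn = attn / attn.sum()                        # keep it a distribution
    return attn

# Toy example: two clusters of similar features; attention seeded in the first
# cluster stays largely within it, mimicking within-object spread.
features = torch.cat([torch.randn(10, 8) + 3.0, torch.randn(10, 8) - 3.0])
print(spread_via_association_field(features, seed_idx=0))
```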

Talk 6

Combining Generative Adversarial Networks (GANs) with behavior and brain recordings to study scene understanding

Melissa Le-Hoa Võ1, Aylin Kallmayer1; 1Goethe University Frankfurt

Our visual world is a complex conglomeration of objects that adhere to semantic and syntactic regularities, a.k.a. scene grammar, according to which scenes can be decomposed into phrases, i.e., smaller clusters of objects forming conceptual units, which in turn contain so-called anchor objects. These usually large and stationary objects anchor predictions regarding the identity and location of most other, smaller objects within the same phrase and play a key role in guiding attention and boosting perception during real-world search. They therefore provide an important organizing principle for structuring real-world scenes. Generative adversarial networks (GANs) trained on images of real-world scenes learn the scenes’ latent grammar and can then synthesize images that mimic real-world scenes increasingly well. GANs can therefore be used to study the hidden representations underlying object-based perception, serving as testbeds for investigating the role that anchor objects play in both the generation and the understanding of scenes. We will present recent work in which we showed participants real and generated images while recording both behavioral and brain responses. Modelling behavioral responses with features from a range of computer vision models, we found that mostly high-level visual features and the strength of anchor information predicted human understanding of generated scenes. Using EEG to investigate the temporal dynamics of these processes revealed initial processing of anchor information that generalized to subsequent processing of the scene’s authenticity. These findings imply that anchors pave the way to scene understanding and that models aiming to predict real-world attention and perception should become more object-centric.
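
The behavioral modelling step described above can be sketched as a simple decoding analysis: features from a computer vision model are used to predict trial-wise human judgements of generated scenes. Everything in the sketch below, including the random stand-in features and responses, is hypothetical; it only illustrates the style of analysis, not the authors’ actual models or data.

```python
# Hedged sketch of the analysis style described above: predict trial-wise human
# scene-understanding judgements from features of a computer vision model.
# Feature extraction is mocked here with random arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_images, n_features = 200, 512
model_features = rng.normal(size=(n_images, n_features))  # stand-in for DNN features
human_correct = rng.integers(0, 2, size=n_images)          # stand-in for behavioral data

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, model_features, human_correct, cv=5)
print(scores.mean())   # how well these features predict human scene understanding
```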