VSS, May 13-18

Attention, Eye Movements and Scanning

Talk Session: Sunday, May 15, 2022, 10:45 am – 12:30 pm EDT, Talk Room 2
Moderator: Freek van Ede, Vrije Univ., Amsterdam



Talk 1, 10:45 am, 32.21

Relating microsaccades and EEG-alpha activity during covert spatial attention in visual working memory

Baiwei Liu1, Anna Nobre2,3, Freek van Ede1,3; 1Institute for Brain and Behavior Amsterdam, Department of Experimental and Applied Psychology, Vrije Universiteit Amsterdam, The Netherlands, 2Department of Experimental Psychology, University of Oxford, United Kingdom, 3Oxford Centre for Human Brain Activity, Wellcome Centre for Integrative Neuroimaging, Department of Psychiatry, University of Oxford, United Kingdom

Covert spatial attention is associated with spatially specific modulation of 8-12 Hz EEG-alpha activity as well as with directional biases in fixational eye-movements known as microsaccades. However, how these two well-established ‘fingerprints’ of covert spatial attention are related remains largely unaddressed. We investigated the link between microsaccades and spatial modulations in alpha activity in humans in a context with no incentive for overt gaze behaviour: when attention is directed internally within the spatial layout of visual working memory. We show that the two signatures are functionally correlated. The spatial modulation of alpha activity is stronger in trials with microsaccades toward vs. away from the to-be-attended visual memory item. Moreover, the alpha modulation occurs earlier in trials with earlier microsaccades toward the memorised location of the cued memory item. At the same time, however, in trials in which we did not detect any attention-driven microsaccade, we nevertheless observed clear spatial modulation of alpha activity. Taken together, these results suggest that directional biases in microsaccades are functionally correlated to alpha signatures of internally directed spatial attention, but they are not necessary for alpha modulations by covert spatial attention to occur.
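The central comparison in this abstract — stronger alpha modulation on trials with toward- vs. away-directed microsaccades — can be sketched as a lateralization index computed separately per trial bin. The function and all numbers below are hypothetical and purely illustrative; they are not the recorded EEG data.

```python
import numpy as np

def alpha_lateralization(contra_power, ipsi_power):
    """Normalized lateralization index of 8-12 Hz alpha power.
    More negative values indicate stronger contralateral alpha
    suppression relative to the memorized item's location."""
    return (contra_power - ipsi_power) / (contra_power + ipsi_power)

# Hypothetical per-trial posterior alpha power (a.u.), binned by whether
# the first microsaccade went toward or away from the cued memory item.
toward = alpha_lateralization(np.array([0.80, 0.70, 0.75]),
                              np.array([1.00, 1.00, 0.95]))
away = alpha_lateralization(np.array([0.95, 0.90, 0.92]),
                            np.array([1.00, 0.95, 0.98]))

# The abstract's key contrast: modulation is stronger (more negative)
# on toward-trials than on away-trials.
print(toward.mean() < away.mean())
```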

Acknowledgements: This research was supported by an ERC Starting Grant from the European Research Council (MEMTICIPATION, 850636) to F.v.E., and a Wellcome Trust Senior Investigator Award (104571/Z/14/Z) and a James S. McDonnell Foundation Understanding Human Cognition Collaborative Award (220020448) to A.C.N.

Talk 2, 11:00 am, 32.22

Distinct frontal cortex circuits for covert attention and saccade planning

Adam Messinger1, Aldo Genovesio2; 1National Eye Institute, National Institutes of Health, 2Sapienza University of Rome, Rome, Italy

To determine the contribution of frontal cortex to both covert spatial attention and motor planning, we recorded from monkeys performing a task that spatially dissociated these processes. On each trial, the monkey covertly attended one of four possible locations to detect a subtle luminance change that indicated it was time to make a saccade to another of these locations. Detection of this go signal was impaired on a subset of trials where the monkey was not informed where to attend. We analyzed the mean firing rates of neurons from three frontal regions in two monkeys. During the 800-ms period preceding the go signal, 213 neurons had significant spatial tuning (main effect in two-way ANOVA, p<0.05, with attended location and saccade target as factors). Neurons in premotor cortex were predominantly tuned for the upcoming saccade only, whereas in prefrontal and pre-arcuate cortex there were also neurons modulated by the allocation of attention. It has been postulated (Premotor Theory of Attention) that directing covert attention is equivalent to planning an eye movement that is never executed. It follows that neurons significantly modulated by spatial attention would necessarily also be involved in saccadic planning. This was not the case. Most spatially tuned neurons recorded in prefrontal and pre-arcuate cortex (135/163, 83%) were modulated either by covert attention or saccade planning – not both. Specifically, 49 neurons (30%) encoded where the animal was attending but were not significantly modulated by the motor plan. Thus, there are frontal neurons that contribute to the allocation of covert spatial attention without being involved in oculomotor planning. Our findings demonstrate that covert attention has stand-alone neuronal substrates, allowing it to be flexibly controlled in ways that are not reliant on the circuitry underlying saccade planning.
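The neuron classification described above amounts to crossing the two ANOVA main effects (attended location, saccade target) at the chosen significance level. A minimal sketch with invented p-values, not the recorded data:

```python
from collections import Counter

def classify_neuron(p_attention, p_saccade, alpha=0.05):
    """Label a spatially tuned neuron by which main effect(s) of the
    two-way ANOVA (attended location x saccade target) reach
    significance during the pre-go-signal period."""
    att, sac = p_attention < alpha, p_saccade < alpha
    if att and sac:
        return "both"
    if att:
        return "attention-only"
    if sac:
        return "saccade-only"
    return "untuned"

# Hypothetical per-neuron main-effect p-values (illustrative only).
pvals = [(0.01, 0.40), (0.30, 0.002), (0.03, 0.01), (0.02, 0.70)]
counts = Counter(classify_neuron(pa, ps) for pa, ps in pvals)
print(counts)
```

The abstract's point is that the "attention-only" bin is non-empty, contrary to the prediction of the Premotor Theory that every attention-modulated neuron should also carry a saccade plan.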

Acknowledgements: NIMH DIRP

Talk 3, 11:15 am, 32.23

Trade-off between uncertainty reduction and reward collection reveals intrinsic cost of gaze switches

Florian Kadner1,2, Tabea A Wilke1,2,3, Thi DK Vo1,2, David Hoppe1,2, Constantin A Rothkopf1,2; 1Center for Cognitive Science, Technical University Darmstadt, 2Institute of Psychology, Technical University Darmstadt, 3Deutscher Wetterdienst, Germany

In a dynamic and uncertain visual environment, the location with the highest task reward is not always the location with the highest information gain. In such situations, the visual system needs to trade off collecting reward against the risk of missing important events. Here we use a gaze-contingent paradigm in a visual detection task to investigate the learning and planning of temporal eye movement strategies by spatially separating the locations where the uncertainty of obtaining a task reward can be reduced and where the reward can be collected. In three different conditions within the experiment we vary the reward rate to measure subjects’ adaptive behavior in response to these altered task demands. In addition to changing locations using their gaze, subjects completed the same three task conditions by bringing the two separate locations into fixation through a button press instead of a gaze switch. This design allowed comparing the strategies in switching between task reward collection and uncertainty reduction either through gaze switches or through button presses. We find significant differences in switching behavior as a function of the three reward rates, but also between the two switching conditions, indicating that humans can adapt their temporal strategies both in response to the obtainable reward rates and depending on the switching modality. In order to quantitatively understand the switching behavior, we develop a probabilistic planning model using Partially Observable Markov Decision Processes, which allows inferring participants’ individual behavioral switching costs and perceptual uncertainties. This model reveals that, contrary to common belief, the subjective internal cost of a gaze switch is quite high, e.g. compared to that of a manual key press. Our model is able to predict key aspects of subjects’ behavioral data, and we conclude that temporal eye movement strategies agree with probabilistic planning under uncertainty.
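The trade-off the POMDP model formalizes can be illustrated with a myopic one-step rule: switch gaze when the expected value of reducing uncertainty exceeds the reward forgone plus an intrinsic switch cost. This is a drastic simplification of the actual planning model, with invented parameter values:

```python
def should_switch(p_event, reward_rate, info_value, switch_cost):
    """Myopic one-step rule: switch to the information location when
    the expected value of checking for the event outweighs the reward
    forgone at the collection location plus the intrinsic switch cost."""
    return p_event * info_value > reward_rate + switch_cost

def time_to_switch(hazard, reward_rate, info_value, switch_cost, dt=0.1):
    """Let the belief that the event has occurred grow with a constant
    hazard rate; return the first time the myopic rule favors a switch."""
    p, t = 0.0, 0.0
    while not should_switch(p, reward_rate, info_value, switch_cost):
        p += hazard * (1 - p) * dt
        t += dt
        if t > 100:
            break
    return t

# A higher intrinsic switch cost delays the switch, qualitatively
# matching the inference that gaze switches carry a high internal cost.
print(time_to_switch(0.2, 0.1, 1.0, 0.05) < time_to_switch(0.2, 0.1, 1.0, 0.3))
```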

Acknowledgements: funded by the German Research Foundation (DFG, grant: RO 4337/3-1)

Talk 4, 11:30 am, 32.24

Scanpath prediction in dynamic real-world scenes based on object-based selection

Nicolas Roth1,3, Martin Rolfs2,3, Klaus Obermayer1,3; 1Technische Universität Berlin, 2Humboldt-Universität zu Berlin, 3Exzellenzcluster Science of Intelligence, Technische Universität Berlin

Humans actively shift their gaze when viewing dynamic real-world scenes. While there is a long-standing interest in understanding this behavior, the complexity of natural scenes makes it difficult to analyze experimentally. During free viewing, it has long been thought that the targets of eye movements are selected based on bottom-up saliency, but evidence is accumulating that objects play an important role in the selection process. Here, we use a computational scanpath prediction framework to systematically compare predictions of models that incorporate combinations of object and saliency information to human eye-tracking data. We model saccades as sequential decision processes between potential targets. To investigate the relevance of object-based selection, we compare an object-based model in which saccades target semantic objects, with a location-based model in which saccades target individual pixel values. Target selection in both models depends on potential targets’ eccentricity, the previous scanpath history, and target relevance. Target relevance is implemented either based on the distance to the center (center bias), on saliency based on low-level features, or on high-level saliency as predicted by a deep neural network. We optimize each model’s parameters with evolutionary algorithms and fit them to reproduce the saccade amplitude and fixation duration distributions of free-viewing eye-tracking data on videos of the VidCom dataset. We assess model performance with respect to spatial and temporal fixation behavior, including the proportion of fixations exploring the background, as well as detecting, inspecting, and revisiting objects. Human data were best predicted by the object-based model with low-level saliency, followed by the location-based model with high-level saliency and the object-based model combined with a center bias. The location-based model with low-level saliency or center bias mainly explores the background.
These results support the view that object-level attentional units play an important role in human exploration behavior, while saliency helps to prioritize between objects.
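The sequential decision process between potential targets can be sketched as a softmax over a priority score combining relevance, eccentricity, and scanpath history. The functional form and all numbers below are illustrative assumptions, not the fitted model:

```python
import numpy as np

def saccade_target_probs(relevance, eccentricity_deg, history_penalty,
                         ecc_slope=0.5, beta=1.0):
    """Softmax over candidate targets (objects or pixels): log-relevance
    (center bias or saliency) discounted by eccentricity and by an
    inhibition term for recently visited targets. Parameter values are
    invented for illustration."""
    priority = (np.log(relevance)
                - ecc_slope * eccentricity_deg
                - history_penalty)
    z = np.exp(beta * (priority - priority.max()))  # numerically stable
    return z / z.sum()

# Three candidate objects: a salient near object, an equally salient far
# object, and a just-visited object (hypothetical numbers).
p = saccade_target_probs(np.array([0.8, 0.8, 0.9]),
                         np.array([2.0, 10.0, 1.0]),
                         np.array([0.0, 0.0, 3.0]))
print(p.argmax())  # the near, not-recently-visited object wins
```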

Acknowledgements: Funded by the German Research Foundation under Germany’s Excellence Strategy – EXC 2002/1 “Science of Intelligence” – project number 390523135.

Talk 5, 11:45 am, 32.25

DeepGaze vs SceneWalk: what can DNNs and biological scan path models teach each other?

Lisa Schwetlick1, Matthias Kümmerer2, Ralf Engbert1, Matthias Bethge2; 1University of Potsdam, 2University of Tübingen

Eye movements on natural scenes are driven by image content as well as by saccade dynamics and sequential dependencies. Recent research has seen a variety of models that aim to predict time-ordered fixation sequences, including statistical, mechanistic, and deep neural network (DNN) models, each with their own advantages and shortcomings. Here we show how a synthesis of different modeling frameworks may offer fresh insights into the underlying processes. Firstly, the explanatory power of biologically inspired models can help develop an understanding of mechanisms learned by DNNs. Secondly, DNN performance can be used to estimate data predictability and thereby help uncover new mechanisms. DeepGaze3 (DG3) is currently the best-performing DNN model for scan path predictions (Kümmerer & Bethge, 2020); SceneWalk (SW) is the best-performing biologically inspired dynamical model (Schwetlick et al., 2021). Both models can be fitted using maximum likelihood estimation and compute per-fixation likelihood predictions. Thus, we can analyze prediction divergence at the level of individual fixations. DG3 generally outperforms SW, indicating that the DNN is accounting for variance by learning mechanisms that are not yet included in the mechanistic SW model. Preliminary results show that SW tends to underestimate the probability of long, explorative saccades. In SW this behavior could be achieved by replacing the Gaussian attention span with a function with heavier tails or by implementing temporal attention span fluctuation. Furthermore, DG3 appears to compress previously unexplored areas, increasing the likelihood of saccades to the region center. Once the region is fixated, DG3 broadens the local probability, consistent with a dualistic exploration-exploitation strategy. Adding corresponding mechanisms to SW may improve model performance and help develop more advanced dynamical models.
Finding the synergies between different modeling approaches, specifically high-performing DNNs and more transparent dynamical models, is a valuable tool for improving our understanding of fixation selection during scene viewing.
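Because both models yield per-fixation likelihoods, their divergence can be inspected fixation by fixation. A toy sketch with made-up probability maps standing in for the fitted DG3 and SW predictions:

```python
import numpy as np

def per_fixation_ll(pred_density, fix_xy):
    """Log-likelihood a model assigns to each observed fixation, given
    its predicted probability map over image pixels."""
    return np.array([np.log(pred_density[y, x]) for x, y in fix_xy])

# Two toy 2x2 probability maps (illustrative only; the real maps come
# from the fitted DG3 and SW models).
dg3 = np.array([[0.4, 0.3], [0.2, 0.1]])
sw = np.array([[0.7, 0.1], [0.1, 0.1]])
fixations = [(0, 0), (1, 0), (0, 1)]  # (x, y) pixel coordinates

ll_dg3 = per_fixation_ll(dg3, fixations)
ll_sw = per_fixation_ll(sw, fixations)

# Positive entries mark fixations where DG3 outperforms SW: the level
# at which mechanistic differences between the models can be localized.
print(ll_dg3 - ll_sw)
```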

Acknowledgements: German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A & Deutsche Forschungsgemeinschaft (DFG):Collaborative Research Center (SFB) 1294, projects B03 and B05 (project no.~318763901)

Talk 6, 12:00 pm, 32.26

Modeling "meaning" and weighing it against other factors in predicting fixations: you can find whatever result you are looking for

Souradeep Chakraborty1, Gregory J. Zelinsky1; 1Stony Brook University

"Meaning" has recently been added to the list of factors shown to attract fixations. We directly compare the predictive success of meaning maps to other factors known to affect fixation locations in visual search and free-viewing tasks, namely: bottom-up saliency, center bias, and target features (for search). We add to this list of factors a new "objectness" feature, and propose an image-computable method for obtaining scene objectness estimates using an image-segmentation model (Mask R-CNN). An obstacle to using the meaning map method is that the dataset for which meaning estimates are available is only 40 images. To more broadly apply the method, we trained a dilated Inception network to predict meaningful regions in scene images (based on meaning labels from 30 images), and found an average Cross Correlation of 0.82 on the 10 withheld images. With this Deep Meaning model, we can obtain meaning maps for different image datasets for which ground truth meaning labels do not exist. We compared predictions (using NSS) from each factor to ground-truth fixations in COCO-Search18 and four free-viewing datasets: OSIE, MIT1003, the meaning-map dataset, and COCO-FreeView, a new dataset paralleling COCO-Search18. We also manipulated whether factor-independent processing (multiplicative center bias, histogram matching) were used in the priority computation and comparison to the ground-truth fixation-density maps. We found the most predictive factor depended on the dataset and factor-unrelated processing used, which is undesirable. For example, objectness was most predictive without a multiplicative center bias, while meaning was most predictive when one was added. We observed similar differences across free-viewing datasets. For search, target features dominated all others in predicting target-present search, and meaning best predicted target-absent search. 
Our findings underscore the importance of reporting modeling results for multiple datasets and the need for transparent discussion of how predictive success depends on factor-unrelated processing.
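The NSS metric used for these comparisons is the z-scored priority map averaged at fixated locations. A minimal sketch with a toy map, which also illustrates the abstract's caveat that a multiplicative center bias changes a factor's score:

```python
import numpy as np

def nss(priority_map, fixation_mask):
    """Normalized Scanpath Saliency: z-score the priority map, then
    average its values at the observed fixation locations."""
    s = (priority_map - priority_map.mean()) / priority_map.std()
    return s[fixation_mask.astype(bool)].mean()

# Toy 3x3 priority map and two fixations (illustrative numbers).
prio = np.array([[0., 1., 0.], [1., 4., 1.], [0., 1., 0.]])
fix = np.zeros((3, 3))
fix[1, 1] = 1
fix[0, 1] = 1
print(nss(prio, fix))

# The same factor scored after multiplying in a center bias yields a
# different NSS, so the ranking of factors can shift.
center_bias = np.array([[.5, .7, .5], [.7, 1., .7], [.5, .7, .5]])
print(nss(prio * center_bias, fix))
```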

Talk 7, 12:15 pm, 32.27

“Attentional Fingerprints”: Real-world scene semantics capture individuating signatures in gaze behavior

Amanda J Haskins1, Caroline Robertson1; 1Dartmouth College

As we look around an environment, we actively select which semantic information to attend and which to ignore, and individuals differ in what they select. How systematic are these individual differences? In other words, does an individual’s pattern of semantic attention in one environment reliably and uniquely predict their attention in a new environment? Here, we tested whether “attentional fingerprints” exist in naturalistic visual behavior. Participants’ (n = 16) gaze was monitored while they actively explored real-world photospheres (n = 60) in VR. To model scene semantics, we introduced a novel approach combining human judgments and computational language modeling to capture affordance-based inferences available to first-person viewers. Specifically, we decomposed each photosphere into tiles and obtained a written description of each tile (MTurk participants) containing both label (a door) and affordance-based (could be opened) content. Each description was transformed using a context-sensitive NLP model (BERT) into a sentence-level semantic embedding. For each participant, we used a mixed regression model built on n-1 trials (gaze~semantics) to iteratively predict gaze in the left-out trial. We correlated participants’ predicted and actual gaze and tested whether the within-subject correlation based on a participant’s own semantic model was higher, on average, than predictions made by all other participants’ semantic models (own-other difference, OOD). We find that within-subject models accurately predict gaze on left-out photospheres (r = 0.33, p < 0.001); crucially, within-subject models are also individuating (OOD scores, p < 0.001). Interestingly, our ability to individuate gaze does not simply rely on modeling large numbers of semantic labels. We find that OOD is greater when verbal descriptions contain label + affordance descriptions, relative to label-only descriptions (p = 0.039).
Together, our results reveal “attentional fingerprints” in real-world visual behavior and highlight the potential for inferring individual differences in higher-order cognitive processes (action planning, inferential reasoning) and psychiatric traits (autism, anxiety) from gaze alone.
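The leave-one-trial-out prediction scheme and the own-other comparison can be sketched as follows, with ordinary least squares standing in for the paper's mixed regression model and synthetic features standing in for the BERT sentence embeddings (all data below are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)

def loo_predictions(X_trials, y_trials):
    """Leave-one-trial-out: fit gaze ~ semantic features on n-1 trials
    (plain least squares, a stand-in for the mixed model) and predict
    gaze on the held-out trial."""
    n = len(y_trials)
    out = []
    for i in range(n):
        X = np.vstack([X_trials[j] for j in range(n) if j != i])
        y = np.concatenate([y_trials[j] for j in range(n) if j != i])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        out.append(X_trials[i] @ w)
    return out

# Two synthetic observers with different semantic weightings (e.g. one
# weighs affordances, the other labels; all numbers invented).
X = [rng.normal(size=(50, 4)) for _ in range(6)]   # tiles x features
w_a, w_b = np.array([2., 0., 1., 0.]), np.array([0., 2., 0., 1.])
gaze_a = [x @ w_a + rng.normal(0, .5, 50) for x in X]
gaze_b = [x @ w_b + rng.normal(0, .5, 50) for x in X]

# Observer A's model predicts A's own gaze better than B's gaze:
# the own-other difference (OOD), i.e. the "attentional fingerprint".
pred_a = loo_predictions(X, gaze_a)
own = np.mean([np.corrcoef(p, g)[0, 1] for p, g in zip(pred_a, gaze_a)])
other = np.mean([np.corrcoef(p, g)[0, 1] for p, g in zip(pred_a, gaze_b)])
print(own > other)
```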