Task-Conditioned Gaze Prediction from Language-Based Scene Semantics
Poster Presentation 36.414: Sunday, May 17, 2026, 2:45 – 6:45 pm, Pavilion
Session: Eye Movements: Models, remapping
Michelle Greene1, Bruce Hansen2; 1Barnard College, Columbia University, 2Colgate University
Human gaze in real-world environments is profoundly task-driven (Yarbus, 1967), yet most computational gaze models assume stimulus-driven attention, whether through bottom-up saliency (Itti & Koch, 2001), target-template matching in visual search (Zelinsky, 2008), crowdsourced localized meaning (Henderson & Hayes, 2017), or deep neural networks trained directly on fixation data (Kümmerer et al., 2017). Because task demands are not directly observable in the image, we turned to language as a structured means of generating interpretable, top-down priority signals. We modeled three forms of semantic guidance from human image descriptions: object-based descriptions, capturing semantic relevance similar to meaning maps; navigation descriptions, capturing action-oriented guidance; and aesthetic descriptions, capturing value-based priorities motivated by foraging accounts of attention. Each description was encoded with MPNet, and a dedicated convolutional neural network (CNN) was trained to predict the resulting embedding from the image. Using deconvolution, we obtained three interpretable semantic priority maps for each image. These maps, together with a bottom-up saliency map and a center-bias term, served as predictors in a spatial Poisson GLM over fixation counts, fit with five-fold cross-validation. We tested these models on two datasets: 135 sessions from the Visual Experience Dataset (Greene et al., 2024), in which participants freely navigated around a campus pond, and the COCO-Search18 dataset (Chen et al., 2021), in which observers searched for objects. The navigation semantic map was the strongest predictor of gaze in the navigation dataset (mean cross-validated normalized scanpath saliency, NSS = 0.43), outperforming object and aesthetic semantics and low-level saliency. In contrast, in the visual search dataset, the object semantic map was most predictive across both target-present (mean NSS = 0.81) and target-absent (mean NSS = 1.23) trials. These results demonstrate that scene semantics extracted from linguistic descriptions offer a powerful, task-specific account of cognitive guidance in natural images.
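For readers who want a concrete picture of the language-encoding step, the sketch below shows how an image description can be embedded with an MPNet sentence encoder via the sentence-transformers library. The abstract does not specify the MPNet variant or toolkit used; the "all-mpnet-base-v2" checkpoint and the example descriptions here are assumptions for illustration only. The resulting vectors are the kind of regression target a per-task CNN would be trained to predict from the image.

```python
# Sketch: embed free-form image descriptions with MPNet.
# Assumes the sentence-transformers package and the public
# "all-mpnet-base-v2" checkpoint; the abstract does not name
# the specific MPNet variant that was actually used.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")

descriptions = [
    "A gravel path curves around the pond toward a wooden footbridge.",  # navigation-style
    "A pond, ducks, reeds, a bench, and a footbridge.",                  # object-style
    "The calm water and soft light make the scene feel peaceful.",       # aesthetic-style
]

# Each description becomes a 768-dimensional vector; a dedicated CNN
# could then be trained to predict this vector from the image itself.
embeddings = encoder.encode(descriptions)
print(embeddings.shape)  # (3, 768)
```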
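The statistical core of the analysis, a spatial Poisson GLM over binned fixation counts evaluated with NSS, can also be sketched compactly. The snippet below is a minimal illustration, assuming the semantic, saliency, and center-bias predictors are available as 2-D arrays on a common grid; the function names, grid size, and synthetic data are hypothetical and stand in for the real preprocessing, cross-validation folds, and datasets described above.

```python
import numpy as np
import statsmodels.api as sm

def fit_fixation_glm(fixation_counts, predictor_maps):
    """Fit a spatial Poisson GLM: per-cell fixation counts regressed on
    flattened predictor maps (semantic, saliency, center bias) plus an intercept."""
    X = sm.add_constant(np.column_stack([m.ravel() for m in predictor_maps]))
    y = fixation_counts.ravel()
    return sm.GLM(y, X, family=sm.families.Poisson()).fit()

def normalized_scanpath_saliency(pred_map, fixations):
    """NSS: z-score the predicted map, then average its values at the
    (row, col) locations of observed fixations."""
    z = (pred_map - pred_map.mean()) / (pred_map.std() + 1e-8)
    rows, cols = zip(*fixations)
    return float(np.mean(z[list(rows), list(cols)]))

# --- toy usage on synthetic data (shapes only; not the real datasets) ---
rng = np.random.default_rng(0)
h, w = 60, 80
semantic_map = rng.random((h, w))   # e.g., navigation-semantics priority map
saliency_map = rng.random((h, w))   # bottom-up saliency
yy, xx = np.mgrid[0:h, 0:w]
center_bias = np.exp(-(((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * (h / 4) ** 2)))
counts = rng.poisson(lam=1 + 2 * semantic_map)  # synthetic fixation counts

result = fit_fixation_glm(counts, [semantic_map, saliency_map, center_bias])
X_full = sm.add_constant(
    np.column_stack([m.ravel() for m in [semantic_map, saliency_map, center_bias]])
)
pred = result.predict(X_full).reshape(h, w)
fix_locs = list(zip(rng.integers(0, h, 50), rng.integers(0, w, 50)))
print("NSS:", normalized_scanpath_saliency(pred, fix_locs))
```

In the study itself this fit would be repeated within each of the five cross-validation folds, with NSS computed on held-out fixations rather than on the training data as in this toy example.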
Acknowledgements: NSF 2522311/2 to MRG and BCH.