A Task-Optimized Vision–Language Neural Network Produces Human-Like Linguistic Cueing Despite Lacking Built-in Attention Mechanisms

Poster Presentation 53.416: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Attention: Models

Jonathan Skaza1, Sana Shehabi1, Miguel P. Eckstein1; 1University of California, Santa Barbara

Introduction: Feedforward CNNs trained for target detection exploit visual cues and show emergent human-like behavioral signatures of covert attention (Srivastava et al., 2024), along with mechanisms that parallel neurophysiological findings (Srivastava et al., 2025), despite lacking built-in attention mechanisms. Human covert attention is not limited to visual information; linguistic cues can also modulate spatial attention (Barnas et al., 2024). Here, we assess whether linguistic cueing can emerge in neural networks that combine visual and language input (a task-optimized CNN fused with an MLP operating on language embeddings) and compare the resulting cueing effects with those of humans.

Methods: Humans performed a peripheral detection task in which a tilted line target (15° ± 3.5° SD) appeared at one of two locations with 50% probability, alongside a distractor (7° ± 3.5° SD). Target-absent trials contained only distractors. Each trial began with an auditory cue predicting the target location with 80% validity (e.g., “attend left”, “monitor 3 o'clock”). To model this experiment, a task-optimized CNN processed the visual display, and a pre-trained language encoder produced embeddings of the linguistic cue. The cue embeddings were transformed by an MLP and fused with the CNN features before classification, and the combined network was trained with the same detection task and cue-validity structure as the human experiment.

Results: Both humans and the CNN–MLP fusion model showed linguistic cueing, with higher hit rates on valid than on invalid cue trials (Δ_human = 0.102, Δ_model = 0.098). Humans and the model also displayed similar variation in cueing magnitude across language formulations: cues phrased directly (e.g., “focus left”) produced larger effects than negated or less-direct cues (e.g., “ignore 3 o’clock”).
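The fusion architecture described in the Methods can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' implementation: the layer sizes, the cue-embedding dimension, the image size, and the name `CueFusionDetector` are all assumptions; only the overall structure (CNN over the display, MLP over a pre-trained cue embedding, concatenation before a present/absent classifier) follows the abstract.

```python
import torch
import torch.nn as nn

class CueFusionDetector(nn.Module):
    """Illustrative sketch (not the authors' code): a small CNN on the
    visual display fused with an MLP over a linguistic-cue embedding.
    All layer sizes and the embedding dimension are assumptions."""

    def __init__(self, cue_dim=384, fused_dim=128):
        super().__init__()
        # Visual pathway: task-optimized CNN over the stimulus image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, fused_dim),
        )
        # Language pathway: MLP over a pre-trained cue embedding
        # (the embedding itself would come from a frozen language encoder).
        self.mlp = nn.Sequential(
            nn.Linear(cue_dim, fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )
        # Fusion by concatenation, then target present/absent classification.
        self.classifier = nn.Linear(2 * fused_dim, 2)

    def forward(self, image, cue_embedding):
        fused = torch.cat([self.cnn(image), self.mlp(cue_embedding)], dim=-1)
        return self.classifier(fused)  # logits: target absent vs. present

# Forward pass on a dummy batch of 8 grayscale 64x64 displays and cue embeddings.
model = CueFusionDetector()
logits = model(torch.randn(8, 1, 64, 64), torch.randn(8, 384))
print(logits.shape)  # torch.Size([8, 2])
```

Training such a model on trials with 80% cue validity lets the optimization itself decide whether the cue embedding is useful, so any cueing effect that appears is emergent rather than built in.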
Conclusion: A task-optimized CNN–MLP model reproduces human-like linguistic cueing effects, suggesting that linguistic modulation of spatial detection can emerge from multimodal integration under task optimization, without requiring explicit attentional mechanisms.