Feature-Based Language Guidance Facilitates Visual Search in Humans and Foveated Vision-Language Transformer Models

Poster Presentation 36.440: Sunday, May 17, 2026, 2:45 – 6:45 pm, Pavilion
Session: Visual Search: Neural mechanisms, models, eye movements

Parsa Madinei, Sana Shehabi, Miguel P Eckstein; University of California, Santa Barbara

Introduction: In everyday life, language specifying target features facilitates visual search (e.g., Wolfe & Horowitz, 2017). Here, we investigate how including features in language pre-cues improves target detection performance and guides eye movements in humans and in a deep neural network-based foveated language-guided search model (FLGSM; Madinei & Eckstein, 2025).
Methods: We generated 30 images (15 target-present, 15 target-absent) using Gemini 3 Nano Banana Pro, each containing five objects of the same category that varied along two feature dimensions: one with higher peripheral visibility/discriminability (HPV-feature, e.g., color, shape, orientation) and one with lower peripheral visibility/discriminability (LPV-feature, e.g., brand, text label, subtle feature). Language pre-cues either referred only to the target’s LPV-feature (e.g., "clock with time 5:54") or also included its HPV-feature (e.g., "star-shaped clock with time 5:54"). Each human participant (N=6) completed a target detection search task (50% target presence), with one of the two pre-cue types per image. FLGSM combined multi-modal transformers with foveated processing (Freeman & Simoncelli, 2011) and fixation selection based on vision-language cross-attention heatmaps. FLGSM was evaluated on the same images presented to humans.
Results: Human target detection improved significantly when the language cue included target features visible in the periphery (area under the ROC curve, AUC = 0.74 for LPV-feature vs. AUC = 0.89 for HPV-feature, p<0.01), and the distance of the closest fixation to the target decreased (3.0° for LPV-feature vs. 1.9° for HPV-feature, p<0.001). FLGSM showed similar patterns to humans: detection accuracy increased (AUC = 0.55 for LPV-feature vs. AUC = 0.75 for HPV-feature) and closest fixation distance decreased with HPV-features (2.4° vs. 3.1° for LPV-features).
Conclusions: Our findings suggest that feature-based linguistic guidance of human visual search can be accounted for by a model that combines visual-linguistic representations with the interaction between visual features and foveated processing.
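
Illustrative sketch: the two measures reported in the Results (AUC and closest fixation distance) could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' analysis code. It assumes each trial yields a numeric response or confidence rating and that fixation and target positions share one coordinate system (e.g., degrees of visual angle); the abstract specifies neither. The function names `auc_from_ratings` and `closest_fixation_distance` are hypothetical, and the example values are made up for illustration, not taken from the study.

```python
import math

def auc_from_ratings(present, absent):
    """Nonparametric AUC: probability that a target-present rating
    exceeds a target-absent rating, with ties counted as 0.5.
    Assumes higher ratings mean more confidence that the target is present."""
    wins = 0.0
    for p in present:
        for a in absent:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(present) * len(absent))

def closest_fixation_distance(fixations, target):
    """Smallest Euclidean distance from any fixation to the target.
    Inputs are (x, y) pairs; the result is in the same units as the inputs."""
    tx, ty = target
    return min(math.hypot(x - tx, y - ty) for x, y in fixations)

# Made-up example values (not data from the study).
example_auc = auc_from_ratings(present=[3, 4, 2], absent=[1, 2, 1])
example_dist = closest_fixation_distance([(0.0, 0.0), (3.0, 4.0)], target=(3.0, 0.0))
```

Computing both measures separately for each pre-cue type (LPV-feature only vs. HPV-feature included) would give the kind of comparison reported above.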