See What Matters: Clustering Visual Attention Patterns Using Multimodal Embeddings

Poster Presentation: Sunday, May 17, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Eye Movements: Individual differences, visual preference

Shan Zhang1, Christine Wusylko2, Do Hyong Koh1, Pavlo D. Antonenko1, Anthony F. Botelho1; 1University of Florida, 2Kennesaw State University

Social media serves as a major channel through which adolescents learn about current events. Yet evaluating this information is difficult, requiring the integration of visual, textual, and cognitive processes. Traditional visual attention studies of how learners process such multimodal content often rely on Areas of Interest (AOIs) defined by fixed spatial regions. However, these spatially bounded AOIs can oversimplify the semantic relationships between images and text, limiting inferences about understanding. To address this gap, this study introduces a novel, context-driven method that combines fixations with a shared high-dimensional text and image embedding space to capture what adolescents attend to, not just where they look, when evaluating information on social media. Visual attention and cognitive data were collected from 29 middle and high school students evaluating 16 social media posts about climate change. Fixations were transformed into 116×116-pixel subimages (n = 22,143), from which multimodal embeddings were extracted using OpenCLIP, reduced with UMAP, and clustered with K-means. Six visual-semantic clusters emerged, clearly distinguishing text-dominant (n = 4) and image-dominant (n = 2) attention patterns. Regression analyses revealed that participants with higher visuospatial working memory capacity (WMC) focused more on scientifically relevant textual regions rather than images (β = 0.29, p = .01), whereas those with lower WMC devoted greater attention to temporally comparative or heuristic text (e.g., “years,” “likes”) (β = -0.3, p = .01). Importantly, embedding-based context AOIs explained additional variance (R² = 0.37) in visuospatial WMC compared with task-specific visual AOIs, providing richer insight into the interplay between perceptual attention and cognitive capacity.
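The embedding-and-clustering stage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for the OpenCLIP embeddings of fixation subimages, PCA is substituted for UMAP purely to keep the sketch dependency-light (the study itself uses UMAP), and all dimensions and settings other than k = 6 are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Stand-in for OpenCLIP embeddings of fixation subimages:
# the study has 22,143 subimages; here 1,000 random 512-d vectors suffice.
embeddings = rng.normal(size=(1000, 512))

# L2-normalize so Euclidean distance tracks cosine similarity,
# as is conventional for CLIP-style embeddings.
emb = normalize(embeddings)

# Dimensionality reduction before clustering. The study uses UMAP;
# PCA is used here only so the sketch needs nothing beyond scikit-learn.
reduced = PCA(n_components=10, random_state=0).fit_transform(emb)

# Partition fixation embeddings into six visual-semantic clusters.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(reduced)
```

Each fixation then carries a cluster label, so per-participant attention can be summarized as time spent in each visual-semantic cluster rather than in hand-drawn spatial AOIs.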
By connecting visual attention patterns to individual cognitive differences, our study advances methodological and theoretical understanding of adolescent credibility evaluation on social media, contributes to vision science research, and lays the groundwork for attention-aware educational tools that foster more critical and informed evaluation of (mis)information.