Modeling typicality in human action perception with CLIP representations
Poster Presentation 53.447: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Action: Perception, recognition
Filip Durovic1, Marieke Mur2, Angelika Lingnau3, Paul E. Downing1; 1Bangor University, 2Western University, 3University of Regensburg
Large-scale multimodal models such as CLIP show strong alignment with high-level visual cortex, but it remains unclear whether their representational spaces also capture behavioral aspects of human vision. Here we ask whether CLIP embeddings can serve as a model of typicality in human action categorization. We measured human reaction times in an action categorization task: participants viewed an image depicting one of five actions (biking, cooking, dressing, eating, golf) for 300 ms followed by a 200 ms mask, then judged whether a concurrently presented word label (e.g., “biking”) matched the image. We extracted representations from CLIP’s final 1024-dimensional embedding layer (ResNet-50; trained on 400M image–text pairs) and computed distance-to-norm measures in this space, defined as the distance of each image embedding to either the mean image embedding of an action category (image-norm) or the single embedding of a category label (label-norm). Both measures reliably predicted response times: images closer to their category norm yielded faster “same” judgments, and images closer to norms of nonmatching categories yielded slower “different” judgments. Label-norm distances better explained “same” trials, whereas image-norm distances better explained “different” trials, suggesting distinct contributions of semantic and visual similarity. Comparing model variants, we found that a smaller CLIP model and an image-only SimCLR model, which share the same visual backbone and were trained on the same 15M-image subset (with image–text and image-only contrastive learning, respectively), also captured trial-level structure, but with weaker predictive power. These findings indicate that CLIP’s joint vision–language representations provide a robust account of exemplar typicality in human action perception, linking large-scale vision–language models to fine-grained human behavioral variability.
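To make the distance-to-norm measures concrete, the sketch below shows one plausible way to compute them with the open-source open_clip package. The choice of open_clip’s “RN50”/“openai” weights, cosine distance as the metric, and the images_by_action input format are illustrative assumptions, not details reported in the abstract.

```python
# Illustrative sketch of image-norm and label-norm distances in CLIP space.
# Assumptions (not from the abstract): open_clip RN50 with OpenAI weights,
# cosine distance, and images supplied as {action: [image paths]}.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="openai")
tokenizer = open_clip.get_tokenizer("RN50")
model = model.to(device).eval()

ACTIONS = ["biking", "cooking", "dressing", "eating", "golf"]

@torch.no_grad()
def embed_images(paths):
    """L2-normalised 1024-d CLIP image embeddings for a list of image paths."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_labels(labels):
    """L2-normalised CLIP text embeddings for the category labels."""
    tokens = tokenizer(labels).to(device)
    feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def cosine_distance(emb, norm_vec):
    """1 - cosine similarity between one embedding and one norm vector."""
    return 1.0 - (emb @ norm_vec)

def distance_to_norms(images_by_action):
    """Per-image distances to the category's image-norm and label-norm."""
    label_norms = embed_labels(ACTIONS)  # one text embedding per category (label-norm)
    image_embs = {a: embed_images(paths) for a, paths in images_by_action.items()}
    # Mean image embedding per category (image-norm), re-normalised to unit length.
    image_norms = torch.stack([image_embs[a].mean(dim=0) for a in ACTIONS])
    image_norms = image_norms / image_norms.norm(dim=-1, keepdim=True)

    rows = []
    for i, action in enumerate(ACTIONS):
        for j, emb in enumerate(image_embs[action]):
            rows.append({
                "action": action,
                "image_index": j,
                "dist_to_image_norm": cosine_distance(emb, image_norms[i]).item(),
                "dist_to_label_norm": cosine_distance(emb, label_norms[i]).item(),
            })
    return rows
```

In such an analysis, each image’s distance to its own category norm (and to the norms of nonmatching categories) can then be entered as a trial-level predictor of reaction times; the regression step itself is not shown here.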
Acknowledgements: This work was supported by the Economic and Social Research Council (ESRC), UK Research and Innovation (UKRI).