Modeling typicality in human action perception with CLIP representations
Poster Presentation 53.447: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Action: Perception, recognition
Filip Durovic1, Marieke Mur2, Angelika Lingnau3, Paul E. Downing1; 1Bangor University, 2Western University, 3University of Regensburg
Large-scale multimodal models such as CLIP show strong alignment with high-level visual cortex, but it remains unclear whether their representational spaces also capture behavioral aspects of human vision. Here we ask whether CLIP embeddings can serve as a model of typicality in human action categorization. We measured human reaction times in an action categorization task: participants viewed an image depicting one of five actions (biking, cooking, dressing, eating, golf) for 300 ms followed by a 200 ms mask, then judged whether a concurrently presented word label (e.g., “biking”) matched the image. We extracted representations from CLIP’s final 1024-dimensional embedding layer (ResNet-50; trained on 400M image–text pairs) and computed distance-to-norm measures in this space, defined as the distance of each image embedding to either the mean image embedding of an action category (image-norm) or the single embedding of a category label (label-norm). Both measures reliably predicted response times: images closer to their category norm yielded faster “same” judgments, and images closer to norms of nonmatching categories yielded slower “different” judgments. Label-norm distances better explained “same” trials, whereas image-norm distances better explained “different” trials, suggesting distinct contributions of semantic and visual similarity. Comparing model variants, we found that a smaller CLIP model and an image-only SimCLR model, which share the same visual backbone and were trained on the same 15M-image subset (with image–text and image-only contrastive learning, respectively), also captured trial-level structure, but with weaker predictive power. These findings indicate that CLIP’s joint vision–language representations provide a robust account of exemplar typicality in human action perception, linking large-scale vision–language models to fine-grained human behavioral variability.
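To make the distance-to-norm measures concrete, the sketch below shows one plausible way to compute them with the open-source open_clip package. The choice of open_clip’s “RN50”/“openai” weights, cosine distance as the metric, and the images_by_action input format are illustrative assumptions, not details reported in the abstract.

```python
# Illustrative sketch of image-norm and label-norm distances in CLIP space.
# Assumptions (not from the abstract): open_clip RN50 with OpenAI weights,
# cosine distance, and images supplied as {action: [image paths]}.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="openai")
tokenizer = open_clip.get_tokenizer("RN50")
model = model.to(device).eval()

ACTIONS = ["biking", "cooking", "dressing", "eating", "golf"]

@torch.no_grad()
def embed_images(paths):
    """L2-normalised 1024-d CLIP image embeddings for a list of image paths."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_labels(labels):
    """L2-normalised CLIP text embeddings for the category labels."""
    tokens = tokenizer(labels).to(device)
    feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def cosine_distance(emb, norm_vec):
    """1 - cosine similarity between one embedding and one norm vector."""
    return 1.0 - (emb @ norm_vec)

def distance_to_norms(images_by_action):
    """Per-image distances to the category's image-norm and label-norm."""
    label_norms = embed_labels(ACTIONS)  # one text embedding per category (label-norm)
    image_embs = {a: embed_images(paths) for a, paths in images_by_action.items()}
    # Mean image embedding per category (image-norm), re-normalised to unit length.
    image_norms = torch.stack([image_embs[a].mean(dim=0) for a in ACTIONS])
    image_norms = image_norms / image_norms.norm(dim=-1, keepdim=True)

    rows = []
    for i, action in enumerate(ACTIONS):
        for j, emb in enumerate(image_embs[action]):
            rows.append({
                "action": action,
                "image_index": j,
                "dist_to_image_norm": cosine_distance(emb, image_norms[i]).item(),
                "dist_to_label_norm": cosine_distance(emb, label_norms[i]).item(),
            })
    return rows
```

In such an analysis, each image’s distance to its own category norm (and to the norms of nonmatching categories) can then be entered as a trial-level predictor of reaction times; the regression step itself is not shown here.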
Acknowledgements: This work was supported by the Economic and Social Research Council (ESRC), UK Research and Innovation (UKRI).