Behavior-Guided Fine-Tuning Makes Vision Models More Human-Like in Social Perception
Poster Presentation 16.336: Friday, May 15, 2026, 3:45 – 6:00 pm, Banyan Breezeway
Session: Face and Body Perception: Social cognition 1
Kathy Garcia1 (kgarci18@jh.edu), Leyla Isik1; 1Johns Hopkins University
Humans effortlessly perceive rich social information in visual scenes, yet it remains unclear whether current AI vision models organize these scenes the way people do. Matching this structure is important both for using models as scientific accounts of human vision and for building systems that understand social scenes. Here, we introduce a new benchmark that measures the similarity structure of human social perception from over 49,000 odd-one-out judgments on 250 three-second videos of everyday actions, defining which videos people treat as alike. We then ask whether pretrained vision models capture this structure and find that they struggle: their internal embeddings do a poor job of predicting which clips people judge as the odd-one-out. To address this, we design a novel behavior-guided, hybrid-triplet-RSA objective, inspired by representational similarity analysis: videos judged as similar by humans are pulled together in the model’s space and dissimilar ones are pushed apart. We then use this objective to fine-tune vision models (e.g., CLIP, VideoMAE, TimeSformer) directly on human judgments. Notably, this behavior-guided fine-tuning significantly increases model-human alignment, improving explained variance and odd-one-out triplet accuracy on unseen videos. Qualitative analysis of attention rollouts shows that while baseline vision models focus on low-level and background information, our objective encourages attention to participants and actions in social scenes. Finally, probing analyses demonstrate that our fine-tuned models show stronger linear readouts of social-affective attributes (e.g., intimacy, valence, communication) relative to pretrained baselines. Overall, our findings reveal a significant gap in pretrained AI vision models’ ability to match human judgments of videos and provide compelling evidence that human behavioral supervision can close this gap, making models organize social scenes more like humans do.
Our benchmark and behavior-guided objective define a data-efficient, model-agnostic framework for building and rigorously evaluating biologically informed models of social vision.
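As a rough illustration of the ideas above, the following Python sketch shows how an odd-one-out choice can be read off model embeddings, and how a triplet term and an RSA-style term might be combined into one loss. The abstract does not specify the objective, so everything here is an assumption made for illustration: the cosine similarity measure, the hinge form and `margin`, the use of Pearson correlation for the RSA term, the `rsa_weight` mixing parameter, all function names, and the toy data. It is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def predict_odd_one_out(emb, triplet):
    """The odd one out is the item left over from the most similar pair."""
    i, j, k = triplet
    pair_sims = {
        k: cosine(emb[i], emb[j]),  # if (i, j) is the closest pair, k is odd
        j: cosine(emb[i], emb[k]),
        i: cosine(emb[j], emb[k]),
    }
    return max(pair_sims, key=pair_sims.get)

def triplet_accuracy(emb, judgments):
    """Fraction of triplets where the model picks the same odd item as people."""
    hits = sum(predict_odd_one_out(emb, t) == odd for t, odd in judgments)
    return hits / len(judgments)

def triplet_hinge(emb, judgments, margin=0.2):
    """Assumed hinge term: the human-chosen pair should be more similar
    than either pairing with the odd item, by at least `margin`."""
    total = 0.0
    for triplet, odd in judgments:
        a, b = [x for x in triplet if x != odd]
        sim_pair = cosine(emb[a], emb[b])
        for kept in (a, b):
            total += max(0.0, margin - sim_pair + cosine(emb[kept], emb[odd]))
    return total / (2 * len(judgments))

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hybrid_loss(emb, judgments, human_sim, margin=0.2, rsa_weight=1.0):
    """Assumed hybrid: triplet hinge plus (1 - correlation) between model
    and human pairwise similarities, in the spirit of RSA."""
    pairs = sorted(human_sim)
    model = [cosine(emb[i], emb[j]) for i, j in pairs]
    human = [human_sim[p] for p in pairs]
    rsa_term = 1.0 - pearson(model, human)
    return triplet_hinge(emb, judgments, margin) + rsa_weight * rsa_term

# Toy data (hypothetical): three videos with 2-D embeddings.
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
judgments = [((0, 1, 2), 2)]                          # people chose video 2 as odd
human_sim = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2}   # made-up human similarities

accuracy = triplet_accuracy(emb, judgments)
loss = hybrid_loss(emb, judgments, human_sim)
```

In an actual fine-tuning setup, a loss of this kind would be written with a differentiable framework and minimized over the model's weights; the pure-Python version only makes the arithmetic explicit.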
Acknowledgements: This work was funded in part by NSF GRFP DGE-2139757 and NIMH R01MH132826.