Simple 3D Pose Features Support Human and Machine Social Scene Understanding

Poster Presentation 16.334: Friday, May 15, 2026, 3:45 – 6:00 pm, Banyan Breezeway
Session: Face and Body Perception: Social cognition 1

Wenshuo Qin1, Leyla Isik1; 1Johns Hopkins University

Humans can quickly and effortlessly extract a variety of information about others' social interactions from visual input, ranging from visuospatial cues, like whether two people are facing each other, to higher-level information, like whether people are communicating. Yet the computations supporting these abilities remain poorly understood, and social interaction recognition continues to challenge even the most advanced AI vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose information to make social interaction judgments, and that this information is absent from most AI vision models. To test this, we designed a novel pipeline that leverages state-of-the-art pose and depth estimation to automatically extract 3D body joints from short video clips depicting everyday human actions. We compared the performance of these body joints against that of over 350 current AI vision models in predicting human social judgments on the same videos. Strikingly, 3D body joints outperformed most current AI vision models, revealing that key social information is available in explicit and interpretable body joints but not in the high-dimensional learned features of most vision models. We next reduced the 3D body joints to an even more compact set that captured only the 3D positions and facing directions of people in the videos. We found that this minimal set of 3D features (but not their 2D counterparts) was both necessary and sufficient to explain the prediction performance of the full set of 3D body joints. Moreover, this minimal 3D set also predicted the extent to which AI models aligned with human social judgments and significantly improved their performance on these tasks. Together, our findings provide strong evidence that human social scene understanding relies on explicit 3D pose representations and can be supported by simple, structured visuospatial primitives.
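To illustrate the kind of minimal 3D feature set described above, a person's position and facing direction could be reduced from estimated 3D body joints roughly as follows. This is a hypothetical sketch, not the authors' actual pipeline: the joint names, the y-up coordinate convention, and the choice of hip midpoint and shoulder-axis cross product as definitions of position and direction are all assumptions for illustration.

```python
import numpy as np

def person_position_and_direction(joints):
    """Reduce one person's 3D body joints to a position and a facing direction.

    `joints` maps joint names to 3D coordinates; the joint names and
    conventions here are illustrative assumptions, not the abstract's
    actual feature definitions.
    """
    # Position: midpoint of the two hip joints.
    position = (joints["left_hip"] + joints["right_hip"]) / 2.0
    # Facing direction: perpendicular to the shoulder axis in the
    # horizontal plane (shoulder axis crossed with the vertical axis).
    shoulder_axis = joints["right_shoulder"] - joints["left_shoulder"]
    up = np.array([0.0, 1.0, 0.0])  # assumed y-up world frame
    facing = np.cross(shoulder_axis, up)
    facing /= np.linalg.norm(facing)
    return position, facing

# Toy example: a person standing with shoulders along the x-axis.
joints = {
    "left_hip": np.array([-0.15, 1.0, 2.0]),
    "right_hip": np.array([0.15, 1.0, 2.0]),
    "left_shoulder": np.array([-0.2, 1.5, 2.0]),
    "right_shoulder": np.array([0.2, 1.5, 2.0]),
}
pos, direction = person_position_and_direction(joints)
```

Whether two such people face each other could then be read off from the dot products between each facing vector and the displacement between the two positions.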