Mid-level motion statistics as a model for social perception in the lateral stream

Poster Presentation 16.337: Friday, May 15, 2026, 3:45 – 6:00 pm, Banyan Breezeway
Session: Face and Body Perception: Social cognition 1

Ming Zhou1, Leyla Isik1; 1Johns Hopkins University

The lateral visual stream supports social interaction perception, yet few computational models characterize the neural representations underlying this process. Existing deep learning models can predict neural responses in these regions, but they lack interpretability and are computationally intensive. To address these gaps, we present an image-computable model, derived from first principles, that characterizes social motion responses in higher-level regions across the lateral stream. Drawing on the Portilla-Simoncelli (P-S) texture model, we construct mid-level motion statistics from second-order features. Specifically, we take a standard motion energy model, consisting of spatiotemporal Gabors with different motion directions, spatial frequencies, and temporal frequencies, as our first-order features. Analogous to the P-S model, we define three kinds of second-order features based on correlations among distinct first-order features: spatial autocorrelation, cross-direction correlation, and cross-temporal-frequency correlation, computed as correlations across spatial locations, motion directions, and temporal frequencies, respectively. Visual inspection reveals that these features capture meaningful visual patterns in the stimuli, including periodicity, curved contours, and biological motion. With these feature sets, we constructed encoding models to predict human behavioral ratings and neural responses in an fMRI dataset of 200 video clips, each showing two people engaged in everyday actions. Compared to first-order motion energy features, second-order motion features (cross-direction and cross-temporal-frequency) better predicted behavioral judgments of communicative interactions and, critically, explained significantly more variance in mid-level lateral regions (e.g., EBA) and high-level social interaction regions (e.g., pSTS).
These findings suggest that second-order motion features can capture higher-level motion information, analogous to their static image counterparts, and may support social perception along the lateral stream. These hand-engineered mid-level motion statistics are also a promising, interpretable alternative to deep learning for understanding the neural representations of social interaction perception.
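A minimal sketch of the kind of computation described above, not the authors' implementation: a toy complex-Gabor motion-energy front end produces first-order energy maps for different directions and temporal frequencies, and a second-order cross-direction statistic is computed as the correlation between two energy maps across spatial locations. All filter parameters, the stimulus, and function names here are hypothetical illustrations.

```python
# Illustrative sketch (not the authors' code). Filter parameters,
# stimulus, and function names are hypothetical.
import numpy as np

def energy_map(video, direction, tfreq, sfreq=0.15, sigma=3.0):
    """Toy first-order motion-energy map.

    Each frame is filtered with a complex spatial Gabor (via FFT
    convolution), then demodulated with a complex temporal exponential
    at `tfreq`; the magnitude of the time average gives an (H, W)
    energy map tuned to motion along `direction` at `tfreq`.
    """
    T, H, W = video.shape
    y, x = np.mgrid[-H // 2:H // 2, -W // 2:W // 2]
    u = np.cos(direction) * x + np.sin(direction) * y
    gabor = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.exp(2j * np.pi * sfreq * u)
    G = np.fft.fft2(np.fft.ifftshift(gabor))
    spatial = np.fft.ifft2(np.fft.fft2(video, axes=(1, 2)) * G, axes=(1, 2))
    temporal = np.exp(2j * np.pi * tfreq * np.arange(T))[:, None, None]
    return np.abs((spatial * temporal).mean(axis=0))

def cross_corr(map_a, map_b):
    """Second-order feature: Pearson correlation of two first-order
    energy maps across spatial locations."""
    return np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1]

# Toy stimulus: a grating drifting rightward at 0.125 cycles/frame.
T, H, W = 16, 32, 32
t = np.arange(T)[:, None, None]
x = np.arange(W)[None, None, :]
video = np.cos(2 * np.pi * (0.15 * x - 0.125 * t))

# First-order maps for a preferred and an orthogonal direction.
e_right = energy_map(video, direction=0.0, tfreq=0.125)
e_up = energy_map(video, direction=np.pi / 2, tfreq=0.125)

# One cross-direction second-order statistic for this clip.
cd = cross_corr(e_right, e_up)
```

A cross-temporal-frequency statistic would be computed analogously, correlating energy maps that share a direction but differ in `tfreq`, and spatial autocorrelation by correlating a map with spatially shifted copies of itself; stacking such statistics over channel pairs yields a feature vector per clip for an encoding model.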