Influence of form vs. optic flow on the recognition of naturalistic body actions
Poster Presentation 53.444: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Action: Perception, recognition
Prerana Kumar1,2, Martin A. Giese1; 1Hertie Institute for Clinical Brain Research (HIH), University of Tuebingen, 2International Max Planck Research School for Intelligent Systems, Tuebingen
INTRODUCTION: The computational mechanisms underlying robust action recognition are not well understood. Humans are known to recognize actions from videos with minimal shape information in individual frames, but previous work relied on highly simplified stimuli. Recent work in computer vision derived appearance-free action videos from variable, real-world videos by removing shape information from each frame, and showed that humans can categorize these actions after being trained on the transformed videos. It remains unclear how well humans generalize zero-shot to such appearance-free videos, and what kind of model can reproduce this generalization.

METHODS: In an in-lab psychophysical experiment, we investigated zero-shot generalization to appearance-free videos derived from real-world action videos. Participants (N=22) were trained to classify five action categories using naturalistic videos (UCF5 dataset) and were tested on two types of appearance-free transformations of these videos: (i) dense-noise motion videos from an existing appearance-free dataset (AFD5), and (ii) random-dot videos with a flickering background that we generated. We modeled participants' behavior with a 3D convolutional neural network comprising separate form (RGB) and motion (optic-flow) pathways, both implemented with X3D backbones. The motion pathway includes an optic-flow estimator and a coherence gate. The model was trained only on naturalistic UCF5 videos and tested on the appearance-free videos.

RESULTS: Participants recognized actions in both appearance-free conditions well above chance (>78% accuracy; chance level: 20%), although less accurately than in naturalistic videos. The model's test accuracy on both appearance-free stimulus types exceeded 60%, outperforming recent models and narrowing the gap to human performance. Ablation studies showed that the motion pathway was critical for this generalization, whereas the form pathway improved performance on naturalistic videos.

CONCLUSION: Our findings highlight the importance of motion-based representations for explaining the robust generalization to appearance-free videos observed in humans, and support multi-stream architectures as a model of action processing.
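To make the random-dot transformation concrete, the sketch below shows one way such stimuli could be constructed. This is a minimal illustration under stated assumptions, not the stimulus-generation code used in the study: it assumes per-frame silhouette masks and dense optic-flow fields are available from the source videos, and all names and parameters (random_dot_video, n_fg, n_bg) are hypothetical. Foreground dots ride the flow field so that only motion carries shape information, while background dots are resampled on every frame, producing the flicker described in METHODS.

```python
# Hypothetical sketch: random-dot, appearance-free video with a flickering
# background. Not the authors' code; masks/flows are assumed inputs.
import numpy as np

def random_dot_video(masks, flows, n_fg=300, n_bg=700, seed=0):
    """masks: (T, H, W) bool silhouettes; flows: (T-1, H, W, 2) dx/dy fields."""
    rng = np.random.default_rng(seed)
    T, H, W = masks.shape
    # Seed foreground dots inside the first-frame silhouette.
    ys, xs = np.nonzero(masks[0])
    idx = rng.choice(len(ys), size=min(n_fg, len(ys)), replace=False)
    fg = np.stack([ys[idx], xs[idx]], axis=1).astype(float)   # (n_fg, 2)
    frames = np.zeros((T, H, W), dtype=np.uint8)
    for t in range(T):
        frame = np.zeros((H, W), dtype=np.uint8)
        # Background: fresh random dots every frame (the flicker).
        frame[rng.integers(0, H, n_bg), rng.integers(0, W, n_bg)] = 255
        # Foreground: dots carried along by the flow field.
        iy = np.clip(fg[:, 0].astype(int), 0, H - 1)
        ix = np.clip(fg[:, 1].astype(int), 0, W - 1)
        frame[iy, ix] = 255
        frames[t] = frame
        if t < T - 1:                     # advect dots to the next frame
            fg[:, 0] += flows[t, iy, ix, 1]   # dy
            fg[:, 1] += flows[t, iy, ix, 0]   # dx
    return frames
```

In this construction each individual frame is just a dot field, so per-frame shape cues are removed and the action is recoverable only from the coherent motion of the foreground dots.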
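Likewise, the following is a minimal sketch of the two-pathway architecture described in METHODS. The small Conv3d stacks are stand-ins for the X3D backbones, and the frame-difference flow proxy and sigmoid coherence gate are illustrative placeholders rather than the authors' implementation; all class and function names are hypothetical.

```python
# Hypothetical sketch of a two-pathway (form + motion) action classifier
# with a coherence gate on the motion features. PyTorch, illustrative only.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for an X3D backbone: a small 3D conv stack + global pooling."""
    def __init__(self, in_ch, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # pool over time and space
        )

    def forward(self, x):              # x: (B, C, T, H, W)
        return self.net(x).flatten(1)  # (B, feat_dim)

def frame_difference_flow(video):
    """Crude optic-flow proxy: temporal differences of the grayscale clip.
    A real model would use a proper optic-flow estimator here."""
    gray = video.mean(dim=1, keepdim=True)   # (B, 1, T, H, W)
    return gray[:, :, 1:] - gray[:, :, :-1]  # (B, 1, T-1, H, W)

class TwoPathwayActionNet(nn.Module):
    def __init__(self, n_classes=5, feat_dim=64):
        super().__init__()
        self.form = TinyBackbone(in_ch=3, feat_dim=feat_dim)    # RGB pathway
        self.motion = TinyBackbone(in_ch=1, feat_dim=feat_dim)  # flow pathway
        # Coherence gate: scalar in (0, 1) weighting the motion features,
        # driven here by the mean magnitude of the flow proxy.
        self.gate = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())
        self.classifier = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, video):                     # video: (B, 3, T, H, W)
        flow = frame_difference_flow(video)
        coherence = flow.abs().mean(dim=(1, 2, 3, 4))  # (B,)
        g = self.gate(coherence.unsqueeze(1))          # (B, 1)
        form_feat = self.form(video)
        motion_feat = self.motion(flow) * g            # gated motion features
        return self.classifier(torch.cat([form_feat, motion_feat], dim=1))

# Usage: classify a batch of two 16-frame RGB clips into five actions.
clips = torch.randn(2, 3, 16, 32, 32)
logits = TwoPathwayActionNet(n_classes=5)(clips)
print(logits.shape)  # torch.Size([2, 5])
```

Under this kind of design, ablating the motion pathway removes the only cue available in appearance-free clips, which is consistent with the ablation pattern reported in RESULTS.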
Acknowledgements: This work was funded by ERC 2019-SyG-RELEVANCE-856495.