Modelling the neural dynamics of video perception: from increasingly complex static object features to mid-level dynamic action features

Poster Presentation 53.425: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Temporal Processing: Neural mechanisms, models

Christina Sartzetaki1, Anne W. Zonneveld1, Pablo Oyarzo2, Alessandro T. Gifford2, Radoslaw M. Cichy2, Pascal Mettes1, Iris I.A. Groen1; 1Informatics Institute, University of Amsterdam, 2Department of Education and Psychology, Freie Universität Berlin

Human visual perception unfolds in a dynamic world, yet most vision research relies on static images that lack temporal context, a context that shapes both cognitive processing and neural dynamics. Deep neural networks (DNNs) have proven to be effective models of visual cortex processing for static images, but their ability to capture dynamic visual processing in the brain remains less explored. To investigate how well DNNs align with the neural dynamics of natural video perception, we employ a newly collected EEG dataset that uses the same video stimuli as the BOLD Moments Dataset (Lahner et al., 2024). We compare these EEG recordings to representations from over 100 DNNs varying in temporal integration, classification task, architecture, and pretraining, using Cross-Temporal Representational Similarity Analysis (CT-RSA) to identify the model time-points and layers that correlate best with the evolving neural responses. We find that responses in posterior electrodes initially correlate best with low- to high-level layers of static object-trained models, reflecting processing of increasingly complex features across time, similar to image perception. Strikingly, however, this is followed by an extended period in which posterior activity correlates best with mid-level layers of temporally integrating action models, while exhibiting high temporal correspondence to the unfolding video content. In contrast, frontal electrodes show only early correlations with high-level action model layers and lack temporal correspondence to the video content. With respect to architecture and pretraining, state-space models as well as self-supervised pretraining show increased alignment to intermediate posterior activity, pointing to potential benefits of recurrent processing and general pretext tasks. Overall, our findings suggest that during continuous visual perception the brain integrates information beyond the canonical temporal hierarchy of low-to-high-level feature processing seen in static image perception, uncovering a novel stage of posterior processing that relates to mid-level dynamic action features and follows early frontal engagement with high-level action features.
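For readers unfamiliar with CT-RSA, the following is a minimal, hypothetical Python sketch of the core computation: correlating an EEG representational dissimilarity matrix (RDM) at each neural time point with an RDM for each model layer (model time-points can be treated as additional entries in the layer list). All function names, array shapes, and the toy data are illustrative assumptions, not the analysis pipeline used in this work.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def compute_rdm(features):
    # Condition-by-condition dissimilarities (1 - Pearson r), as a condensed vector.
    return pdist(features, metric="correlation")

def ct_rsa(eeg, model_feats):
    # eeg         : (n_videos, n_channels, n_timepoints) preprocessed EEG epochs.
    # model_feats : list over layers of (n_videos, n_features) activations.
    # Returns     : (n_timepoints, n_layers) Spearman correlation matrix; the
    #               peak along the layer axis gives the best-correlating layer
    #               at each neural time point.
    n_time = eeg.shape[-1]
    model_rdms = [compute_rdm(f) for f in model_feats]
    corr = np.zeros((n_time, len(model_rdms)))
    for t in range(n_time):
        eeg_rdm = compute_rdm(eeg[:, :, t])  # channels as the feature dimension
        for l, m_rdm in enumerate(model_rdms):
            rho, _ = spearmanr(eeg_rdm, m_rdm)
            corr[t, l] = rho
    return corr

# Toy usage: 20 videos, 64 channels, 100 time points, 3 model "layers".
rng = np.random.default_rng(0)
eeg = rng.standard_normal((20, 64, 100))
model_feats = [rng.standard_normal((20, 256)) for _ in range(3)]
alignment = ct_rsa(eeg, model_feats)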

Acknowledgements: CS is supported by an ELLIS Amsterdam Unit grant to IIAG. AWZ acknowledges support from the UvA Data Science Centre, as part of the Human Aligned Video AI Lab.