Straightening of natural videos through local temporal integration
Poster Presentation 53.420: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Temporal Processing: Neural mechanisms, models
Anne Zonneveld1, Pascal Mettes1, Iris Groen1; 1University of Amsterdam
Predictions about future states of the world play an important role in guiding human behavior. In primate vision, temporal representational trajectories are straightened in neural and perceptual space relative to input space (Hénaff et al., 2019; 2021), supporting linear extrapolation and temporal predictability. This straightening enables time-aware representations that distinguish fine-grained, temporally opposite actions, such as opening versus closing a door (Bagad & Zisserman, 2025). To characterize the computational principles underlying straightening, we evaluated a range of deep neural networks differing in temporal integration (image- vs. video-based), architecture (convolutional vs. Transformer), and training (trained vs. untrained). Using more than 1,000 natural videos from the Bold Moments Dataset (Lahner et al., 2024), we identified properties that facilitate straightening, quantified as the reduction in curvature from pixel to feature space. We further assessed temporal coherence in model feature space, a prerequisite for straightening, by demonstrating significantly higher curvature for temporally shuffled than for unshuffled features. Consistent with prior work, we show that straightening is absent in standard image models. However, we find that it emerges in the late layers of video convolutional neural networks, facilitated by local operations such as 3D convolutions, as a consequence of training. In contrast, we do not observe any straightening in video Transformers. Notably, global-attention-based video Transformers also lack temporal coherence in feature space. Together, these findings suggest that temporal coherence is a necessary, but not sufficient, condition for straightening and highlight that local temporal integration of continuous visual information is critical to the straightening of visual trajectories performed by the brain.
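For readers unfamiliar with the curvature metric, the following is a minimal sketch of how trajectory curvature is commonly computed in this literature (following the discrete definition of Hénaff et al., 2019); the function name, array shapes, and summary statistic are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def mean_curvature(features):
    # features: array of shape (T, D), one D-dimensional vector per video frame
    # (pixel intensities for input-space curvature, layer activations for feature-space curvature)
    diffs = np.diff(features, axis=0)                      # displacement vectors v_t = x_{t+1} - x_t
    diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)  # normalize each displacement
    cosines = np.sum(diffs[:-1] * diffs[1:], axis=1)       # cosine of the angle between successive displacements
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))        # discrete curvature at each interior time point
    return np.degrees(angles).mean()                       # mean curvature in degrees

# Straightening is then the curvature difference between input and representation,
# e.g. mean_curvature(pixels) - mean_curvature(layer_activations); positive values indicate straightening.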
Acknowledgements: This work was supported by the UvA Data Science Centre, as part of the Human Aligned Video AI Lab.