Straightening of natural videos through local temporal integration
Poster Presentation 53.420: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Temporal Processing: Neural mechanisms, models
Anne Zonneveld1, Pascal Mettes1, Iris Groen1; 1University of Amsterdam
Predictions about future states of the world play an important role in guiding human behavior. In primate vision, temporal representational trajectories are straightened in neural and perceptual space relative to input space (Hénaff et al., 2019; 2021), supporting linear extrapolation and temporal predictability. This straightening enables time-aware representations that distinguish fine-grained, temporally opposite actions, such as opening versus closing a door (Bagad & Zisserman, 2025). To characterize the computational principles underlying straightening, we evaluated a range of deep neural networks differing in temporal integration (image- vs. video-based), architecture (convolutional vs. Transformer), and training (trained vs. untrained). Using more than 1,000 natural videos from the Bold Moments Dataset (Lahner et al., 2024), we identified properties that facilitate straightening, quantified as the reduction in curvature from pixel to feature space. We further assessed temporal coherence in model feature space, a prerequisite for straightening, by demonstrating significantly higher curvature for temporally shuffled than for unshuffled features. Consistent with prior work, we show that straightening is absent in standard image models. However, we find that it emerges in the late layers of video convolutional neural networks, facilitated by local operations such as 3D convolutions, as a consequence of training. In contrast, we do not observe any straightening in video Transformers. Notably, global-attention-based video Transformers also lack temporal coherence in feature space. Together, these findings suggest that temporal coherence is a necessary, but not sufficient, condition for straightening and highlight that local temporal integration of continuous visual information is critical to the straightening of visual trajectories performed by the brain.
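For readers unfamiliar with the curvature metric, the following is a minimal sketch of how trajectory curvature is commonly computed in this literature (following the discrete definition of Hénaff et al., 2019); the function name, array shapes, and summary statistic are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def mean_curvature(features):
    # features: array of shape (T, D), one D-dimensional vector per video frame
    # (pixel intensities for input-space curvature, layer activations for feature-space curvature)
    diffs = np.diff(features, axis=0)                      # displacement vectors v_t = x_{t+1} - x_t
    diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)  # normalize each displacement
    cosines = np.sum(diffs[:-1] * diffs[1:], axis=1)       # cosine of the angle between successive displacements
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))        # discrete curvature at each interior time point
    return np.degrees(angles).mean()                       # mean curvature in degrees

# Straightening is then the curvature difference between input and representation,
# e.g. mean_curvature(pixels) - mean_curvature(layer_activations); positive values indicate straightening.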
Acknowledgements: This work was supported by the UvA Data Science Centre, as part of the Human Aligned Video AI Lab.