Discovering Implicit Block-Recurrent Dynamics in Vision Transformers
Poster Presentation 53.421: Tuesday, May 19, 2026, 8:30 am – 12:30 pm, Pavilion
Session: Temporal Processing: Neural mechanisms, models
Mozes Jacobs1, Richard Hakim1, Thomas Fel1, Alessandra Brondetta2, Demba Ba1, T. Andy Keller1; 1Harvard University, 2University of Osnabrück
Biological visual systems leverage recurrence to process visual information, yet state-of-the-art computer vision models such as deep Vision Transformers (ViTs) typically have a purely feedforward architecture. Here we discover that block structure emerges in the layerwise representations of large vision transformers and advance a Block-Recurrent Hypothesis, which argues that the computation carried out by the original L transformer blocks can be rewritten using only K << L distinct blocks applied repeatedly along depth. We first characterize layer-by-layer representational similarity matrices in visual foundation models (e.g., DINOv2) and use a max-cut segmentation of these matrices to identify contiguous “phases” of computation. To test the functional significance of these representational phases, we construct recurrent surrogates (“Raptors”) in which K transformer blocks are weight-tied across depth. We train these surrogate models to match the intermediate activations of the original model using a hybrid teacher-forcing and autoregressive objective. Remarkably, despite having far fewer distinct blocks, a two-block Raptor recovers 96% of DINOv2 ViT-B linear-probe accuracy on ImageNet-1k, and three blocks recover 98%, while maintaining equivalent computational cost. Finally, we treat the depth-unfolded DINOv2 (ViT-G) as a discrete-time dynamical system and analyze its emergent dynamics. We find: (i) directional convergence of token representations into class-dependent angular basins with self-correcting trajectories under small perturbations; (ii) token-specific dynamics, where the CLS token undergoes sharp late reorientations while patch tokens exhibit strong late-stage mean-field-like coherence; and (iii) a collapse of the update to low rank at late depth, consistent with convergence to low-dimensional attractors.
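The contiguous max-cut phase segmentation described above can be sketched as a brute-force search over cut points in a layerwise similarity matrix. This is a minimal illustration, not the authors' implementation; `segment_phases` and the toy similarity matrix are hypothetical.

```python
# Hypothetical sketch: segment an L x L layerwise representational-similarity
# matrix into K contiguous "phases" by maximizing within-phase similarity.
import itertools
import numpy as np

def segment_phases(sim: np.ndarray, k: int) -> list[tuple[int, int]]:
    """Brute-force over K-1 cut points; returns [start, end) layer spans."""
    n = sim.shape[0]
    best_score, best_bounds = -np.inf, None
    for cuts in itertools.combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        # Score a segmentation by the mean within-block similarity of each block.
        score = sum(sim[a:b, a:b].mean() for a, b in zip(bounds, bounds[1:]))
        if score > best_score:
            best_score, best_bounds = score, bounds
    return list(zip(best_bounds, best_bounds[1:]))

# Toy example: a 6-layer model with two clearly distinct computational phases.
sim = np.block([
    [np.ones((3, 3)), 0.1 * np.ones((3, 3))],
    [0.1 * np.ones((3, 3)), np.ones((3, 3))],
])
print(segment_phases(sim, 2))  # → [(0, 3), (3, 6)]
```

For realistic depths (L on the order of tens of layers) and small K, the same objective can be optimized exactly with dynamic programming instead of exhaustive search.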
These results suggest that ViT depth implements a compact recurrent program, offering a principled framework to compare artificial vision models with the recurrent processing dynamics of biological vision.
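The dynamical-systems view above can be illustrated with a toy weight-tied residual block: iterating one shared update and tracking the cosine alignment of successive unit-normalized states shows directional convergence. Everything here (the symmetric random block, the dimensionality, the iteration count) is an illustrative assumption, not the paper's model.

```python
# Hypothetical sketch: treat a weight-tied residual block as a discrete-time
# dynamical system x_{t+1} = x_t + W x_t and measure directional convergence,
# i.e. how well successive unit-normalized states align.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
W = 0.1 * (A + A.T) / 2  # one small shared block, symmetric so dynamics settle

def step(x: np.ndarray) -> np.ndarray:
    return x + W @ x  # residual update, tied across every depth step

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = rng.standard_normal(16)
prev_dir = x / np.linalg.norm(x)
aligns = []
for _ in range(50):
    x = step(x)
    d = x / np.linalg.norm(x)
    aligns.append(cosine(d, prev_dir))  # alignment with the previous direction
    prev_dir = d

# Directional convergence: late-depth states change direction less and less.
print(aligns[0], aligns[-1])
```

In this linear toy the depth-unfolded iteration is power iteration on I + W, so the state direction converges to the dominant eigenvector and late updates collapse onto a single direction, a cartoon of the low-rank late-depth updates reported in the abstract.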
Acknowledgements: Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University