How Predictive Reconstruction and Fixation Integration Create High-Resolution Vision from Sparse Samples
Poster Presentation 36.415: Sunday, May 17, 2026, 2:45 – 6:45 pm, Pavilion
Session: Eye Movements: Models, remapping
Akihito Maruya1, Hossein Adeli1, Tian Zheng2, Nikolaus Kriegeskorte1, Ning Qian1; 1Zuckerman Institute, Columbia University, 2Department of Statistics, Columbia University
Human vision operates with sparse sampling: only the fovea provides high-resolution input, and each saccade yields a partial glimpse of the scene. Yet we experience an illusion of uniformly high resolution, suggesting that the brain predicts peripheral structure and integrates information across fixations, though the underlying computations remain unclear. We introduce two modeling approaches that test how predictive reconstruction and fixation-to-fixation integration may support this perceptual stability. We trained vision Transformers with self-supervised objectives on CIFAR-10 in two ways: standard masked autoencoding, in which a given ratio of image patches is blanked out (ViT-Blank), and a biologically inspired variant in which the masked patches are instead filled with a blurred version of the image (ViT-Blur). Both models learned to reconstruct full images, and generalization was measured by reconstruction error on rescaled ImageNet-1K images. As expected, across mask ratios and Transformer-block depths, reconstruction MSE was consistently lower for ViT-Blur, with especially large differences at high mask ratios. At mask ratio 0.99, MSE was ~0.06 for ViT-Blank but only ~0.01 for ViT-Blur, indicating substantially better recovery of global structure and color. We then evaluated the learned representations with linear classification on CIFAR-10, which showed the same pattern: mean accuracy was 0.749 for ViT-Blur, 0.707 for ViT-Blank, and 0.626 for controls, with the largest differences at high mask ratios and shallower depths. To examine integration across saccades, we developed a fixation-based model trained on ImageNet-1K. Each fixation yields a view that is progressively blurred and down-sampled from fovea to periphery according to cortical magnification; a ViT processes the current view together with previous ones. A Conv-GRU integrates these glimpses using corollary discharge signals, and a decoder reconstructs the head-centered image. A reinforcement-learned policy selects fixation sequences that maximize reconstruction improvement. Reconstructions improved over four fixations, reaching validation MSE ≈ 0.008. Our framework further suggests that change blindness arises when a modification remains statistically consistent with the brain’s internal reconstruction.
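As a concrete illustration of the two training inputs, the following sketch blanks or blur-fills a random subset of non-overlapping patches. It assumes PyTorch/torchvision; the patch size, mask ratio, and blur parameters are illustrative assumptions, not the reported training settings.

```python
import torch
import torchvision.transforms.functional as TF

def mask_patches(images, mask_ratio=0.75, patch=4, mode="blank", blur_sigma=2.0):
    """Corrupt a random subset of non-overlapping patches.

    mode="blank": zero out the selected patches (ViT-Blank-style input).
    mode="blur":  fill them from a blurred copy (ViT-Blur-style input).
    All hyperparameters here are illustrative, not the paper's settings.
    """
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    n = gh * gw
    k = int(mask_ratio * n)
    # Choose k patches per image at random.
    masked_idx = torch.rand(B, n).argsort(dim=1)[:, :k]
    mask = torch.zeros(B, n, dtype=torch.bool)
    mask.scatter_(1, masked_idx, True)
    # Expand the patch-level mask to the pixel grid.
    mask = mask.view(B, 1, gh, gw).float()
    mask = mask.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    if mode == "blank":
        fill = torch.zeros_like(images)
    else:
        fill = TF.gaussian_blur(images, kernel_size=9, sigma=blur_sigma)
    return images * (1 - mask) + fill * mask

# Example: a CIFAR-10-sized batch at the abstract's highest mask ratio.
x = torch.rand(8, 3, 32, 32)
x_blank = mask_patches(x, mask_ratio=0.99, mode="blank")
x_blur = mask_patches(x, mask_ratio=0.99, mode="blur")
```

Under blur filling, even a 0.99 mask ratio preserves low-frequency structure and color, consistent with the abstract's observation that ViT-Blur recovers global structure far better at high masking.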
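The representation evaluation follows the standard linear-probe recipe: freeze the pretrained encoder and fit only a linear classifier on its features. A minimal sketch, assuming the encoder maps images to a flat feature vector; `encoder`, `feat_dim`, and the optimizer settings are placeholders, not the reported evaluation protocol.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feat_dim, n_classes, loader, epochs=10, lr=1e-3):
    """Fit a linear head on frozen features (the encoder is never updated)."""
    encoder.eval()
    head = nn.Linear(feat_dim, n_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():      # features come from the frozen encoder
                feats = encoder(x)
            loss = loss_fn(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```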
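The fixation model's retina-like input can be approximated by blending progressively blurred copies of the image, with blur growing with eccentricity from the fixation point. The linear eccentricity-to-level mapping and all parameters below are assumptions standing in for the abstract's cortical-magnification scheme.

```python
import torch
import torchvision.transforms.functional as TF

def foveate(image, fix_xy, n_levels=4, sigma_step=2.0):
    """Approximate one retinal glimpse: pixels far from fixation are drawn
    from increasingly blurred copies of the image. The mapping from
    eccentricity to blur level is illustrative, not the paper's exact scheme.
    """
    C, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ecc = ((xs - fix_xy[0]) ** 2 + (ys - fix_xy[1]) ** 2).float().sqrt()
    level = (ecc / ecc.max() * (n_levels - 1)).round().long()
    out = torch.zeros_like(image)
    blurred = image
    for lv in range(n_levels):
        if lv > 0:  # compound the blur so it grows with each level
            blurred = TF.gaussian_blur(blurred, kernel_size=9, sigma=sigma_step * lv)
        out = torch.where((level == lv).unsqueeze(0), blurred, out)
    return out

# Example: fixate the center of a 224x224 image.
glimpse = foveate(torch.rand(3, 224, 224), fix_xy=(112, 112))
```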
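Glimpse integration can be sketched as a convolutional GRU that updates a head-centered memory from each new glimpse, with the planned saccade vector broadcast as extra input channels to stand in for the corollary-discharge signal. The wiring and dimensions are hypothetical, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU accumulating successive glimpse feature maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# Example: integrate four glimpses, appending the saccade vector (dx, dy)
# as two constant channels -- the corollary-discharge stand-in.
feat_ch, hid_ch, H, W = 64, 64, 14, 14
cell = ConvGRUCell(feat_ch + 2, hid_ch)
h = torch.zeros(1, hid_ch, H, W)
for t in range(4):
    feats = torch.rand(1, feat_ch, H, W)                  # ViT features of glimpse t
    saccade = torch.rand(1, 2, 1, 1).expand(1, 2, H, W)   # broadcast (dx, dy)
    h = cell(torch.cat([feats, saccade], dim=1), h)
```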
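Finally, the fixation policy can be trained with a REINFORCE-style objective in which the reward is the drop in reconstruction MSE after each glimpse. The policy network, the 7x7 grid of candidate fixations, and the placeholder reward below are all illustrative assumptions, not the authors' training setup.

```python
import torch
import torch.nn as nn

class FixationPolicy(nn.Module):
    """Scores a discrete grid of candidate fixation locations."""
    def __init__(self, state_dim, n_locations):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_locations))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

policy = FixationPolicy(state_dim=256, n_locations=49)  # assumed 7x7 grid
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.rand(1, 256)        # summary of the current reconstruction (placeholder)
prev_mse = torch.tensor(0.05)
log_probs, rewards = [], []
for t in range(4):                # four fixations, as in the abstract
    dist = policy(state)
    loc = dist.sample()
    log_probs.append(dist.log_prob(loc))
    # ... take the glimpse at `loc`, update the reconstruction, compute MSE ...
    new_mse = prev_mse * 0.7                 # placeholder for the real pipeline
    rewards.append(prev_mse - new_mse)       # reward = reconstruction improvement
    prev_mse = new_mse
returns = torch.cumsum(torch.stack(rewards).flip(0), 0).flip(0)  # reward-to-go
loss = -(torch.stack(log_probs).squeeze() * returns).sum()
opt.zero_grad()
loss.backward()
opt.step()
```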