How Predictive Reconstruction and Fixation Integration Create High-Resolution Vision from Sparse Samples
Poster Presentation 36.415: Sunday, May 17, 2026, 2:45 – 6:45 pm, Pavilion
Session: Eye Movements: Models, remapping
Akihito Maruya1, Hossein Adeli1, Tian Zheng2, Nikolaus Kriegeskorte1, Ning Qian1; 1Zuckerman Institute, Columbia University, 2Department of Statistics, Columbia University
Human vision operates with sparse sampling: only the fovea provides high-resolution input, and each saccade yields a partial glimpse of the scene. Yet we experience an illusion of uniformly high resolution, suggesting that the brain predicts peripheral structure and integrates information across fixations, though the underlying computations remain unclear. We introduce two modeling approaches that test how predictive reconstruction and fixation-to-fixation integration may support this perceptual stability. We trained vision Transformers with self-supervised objectives on CIFAR-10 in two ways: standard masked autoencoding, in which a given ratio of image patches is blanked out (ViT-Blank), and a biologically inspired variant in which the masked patches are instead filled with a blurred version of the image (ViT-Blur). Both models learned to reconstruct full images, and generalization was measured by reconstruction error on rescaled ImageNet-1K images. As expected, across mask ratios and Transformer-block depths, reconstruction MSE was consistently lower for ViT-Blur, with especially large differences at high mask ratios. At mask ratio 0.99, MSE was ~0.06 for ViT-Blank but only ~0.01 for ViT-Blur, indicating substantially better recovery of global structure and color. We then evaluated the learned representations with linear classification on CIFAR-10, which showed the same pattern: mean accuracy was 0.749 for ViT-Blur, 0.707 for ViT-Blank, and 0.626 for controls, with the largest differences at high mask ratios and shallower depths. To examine integration across saccades, we developed a fixation-based model trained on ImageNet-1K. Each fixation yields a view that is progressively blurred and down-sampled from fovea to periphery according to cortical magnification; a ViT processes the current view together with previous ones. A Conv-GRU integrates these glimpses using corollary discharge signals, and a decoder reconstructs the head-centered image. A reinforcement-learned policy selects fixation sequences that maximize reconstruction improvement. Reconstructions improved over four fixations, reaching validation MSE ≈ 0.008. Our framework further suggests that change blindness arises when a modification remains statistically consistent with the brain’s internal reconstruction.
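As a concrete illustration of the two training inputs, the following sketch blanks or blur-fills a random subset of non-overlapping patches. It assumes PyTorch/torchvision; the patch size, mask ratio, and blur parameters are illustrative assumptions, not the reported training settings.

```python
import torch
import torchvision.transforms.functional as TF

def mask_patches(images, mask_ratio=0.75, patch=4, mode="blank", blur_sigma=2.0):
    """Corrupt a random subset of non-overlapping patches.

    mode="blank": zero out the selected patches (ViT-Blank-style input).
    mode="blur":  fill them from a blurred copy (ViT-Blur-style input).
    All hyperparameters here are illustrative, not the paper's settings.
    """
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    n = gh * gw
    k = int(mask_ratio * n)
    # Choose k patches per image at random.
    masked_idx = torch.rand(B, n).argsort(dim=1)[:, :k]
    mask = torch.zeros(B, n, dtype=torch.bool)
    mask.scatter_(1, masked_idx, True)
    # Expand the patch-level mask to the pixel grid.
    mask = mask.view(B, 1, gh, gw).float()
    mask = mask.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    if mode == "blank":
        fill = torch.zeros_like(images)
    else:
        fill = TF.gaussian_blur(images, kernel_size=9, sigma=blur_sigma)
    return images * (1 - mask) + fill * mask

# Example: a CIFAR-10-sized batch at the abstract's highest mask ratio.
x = torch.rand(8, 3, 32, 32)
x_blank = mask_patches(x, mask_ratio=0.99, mode="blank")
x_blur = mask_patches(x, mask_ratio=0.99, mode="blur")
```

Under blur filling, even a 0.99 mask ratio preserves low-frequency structure and color, consistent with the abstract's observation that ViT-Blur recovers global structure far better at high masking.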
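The representation evaluation follows the standard linear-probe recipe: freeze the pretrained encoder and fit only a linear classifier on its features. A minimal sketch, assuming the encoder maps images to a flat feature vector; `encoder`, `feat_dim`, and the optimizer settings are placeholders, not the reported evaluation protocol.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feat_dim, n_classes, loader, epochs=10, lr=1e-3):
    """Fit a linear head on frozen features (the encoder is never updated)."""
    encoder.eval()
    head = nn.Linear(feat_dim, n_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():      # features come from the frozen encoder
                feats = encoder(x)
            loss = loss_fn(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```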
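The fixation model's retina-like input can be approximated by blending progressively blurred copies of the image, with blur growing with eccentricity from the fixation point. The linear eccentricity-to-level mapping and all parameters below are assumptions standing in for the abstract's cortical-magnification scheme.

```python
import torch
import torchvision.transforms.functional as TF

def foveate(image, fix_xy, n_levels=4, sigma_step=2.0):
    """Approximate one retinal glimpse: pixels far from fixation are drawn
    from increasingly blurred copies of the image. The mapping from
    eccentricity to blur level is illustrative, not the paper's exact scheme.
    """
    C, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ecc = ((xs - fix_xy[0]) ** 2 + (ys - fix_xy[1]) ** 2).float().sqrt()
    level = (ecc / ecc.max() * (n_levels - 1)).round().long()
    out = torch.zeros_like(image)
    blurred = image
    for lv in range(n_levels):
        if lv > 0:  # compound the blur so it grows with each level
            blurred = TF.gaussian_blur(blurred, kernel_size=9, sigma=sigma_step * lv)
        out = torch.where((level == lv).unsqueeze(0), blurred, out)
    return out

# Example: fixate the center of a 224x224 image.
glimpse = foveate(torch.rand(3, 224, 224), fix_xy=(112, 112))
```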
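Glimpse integration can be sketched as a convolutional GRU that updates a head-centered memory from each new glimpse, with the planned saccade vector broadcast as extra input channels to stand in for the corollary-discharge signal. The wiring and dimensions are hypothetical, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU accumulating successive glimpse feature maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# Example: integrate four glimpses, appending the saccade vector (dx, dy)
# as two constant channels -- the corollary-discharge stand-in.
feat_ch, hid_ch, H, W = 64, 64, 14, 14
cell = ConvGRUCell(feat_ch + 2, hid_ch)
h = torch.zeros(1, hid_ch, H, W)
for t in range(4):
    feats = torch.rand(1, feat_ch, H, W)                  # ViT features of glimpse t
    saccade = torch.rand(1, 2, 1, 1).expand(1, 2, H, W)   # broadcast (dx, dy)
    h = cell(torch.cat([feats, saccade], dim=1), h)
```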
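Finally, the fixation policy can be trained with a REINFORCE-style objective in which the reward is the drop in reconstruction MSE after each glimpse. The policy network, the 7x7 grid of candidate fixations, and the placeholder reward below are all illustrative assumptions, not the authors' training setup.

```python
import torch
import torch.nn as nn

class FixationPolicy(nn.Module):
    """Scores a discrete grid of candidate fixation locations."""
    def __init__(self, state_dim, n_locations):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_locations))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

policy = FixationPolicy(state_dim=256, n_locations=49)  # assumed 7x7 grid
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.rand(1, 256)        # summary of the current reconstruction (placeholder)
prev_mse = torch.tensor(0.05)
log_probs, rewards = [], []
for t in range(4):                # four fixations, as in the abstract
    dist = policy(state)
    loc = dist.sample()
    log_probs.append(dist.log_prob(loc))
    # ... take the glimpse at `loc`, update the reconstruction, compute MSE ...
    new_mse = prev_mse * 0.7                 # placeholder for the real pipeline
    rewards.append(prev_mse - new_mse)       # reward = reconstruction improvement
    prev_mse = new_mse
returns = torch.cumsum(torch.stack(rewards).flip(0), 0).flip(0)  # reward-to-go
loss = -(torch.stack(log_probs).squeeze() * returns).sum()
opt.zero_grad()
loss.backward()
opt.step()
```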