Functional Organization of Visual Pathways 2
Talk Session: Tuesday, May 19, 2026, 8:15 – 9:45 am, Talk Room 2
Moderator: Niko Kriegeskorte, Columbia University
Talk 1, 8:15 am
Cortex: A unified framework for evaluating the predictive and cognitive alignment of vision models entirely zero-shot
Ruolin Wang1,2, Mayukh Deb1,2, Alex Abate3, Alish Dipani1,2, Sanjana Chillarege1,2, Kruthik Ravikanti1,2, Kushal Dudipala1,2, Yuxuan Li1,2, Haider Al-Tahan1,2, Ranjani Koushik1,2, Herrick Fung1,2, Yung-Ying Chen1,2, Nikolas McNeal1,2, Nancy Kanwisher4,5,6, N. Apurva Ratan Murty1,2; 1Cognition and Brain Science, School of Psychology, Georgia Institute of Technology, 2Computational Cognition, Georgia Institute of Technology, 3Harvard Medical School, 4Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 5McGovern Institute for Brain Research, Massachusetts Institute of Technology, 6The Center for Brains, Minds and Machines, Massachusetts Institute of Technology
Our understanding of the visual system has relied on hypothesis-driven studies and tightly controlled stimuli to adjudicate between theoretical predictions. In contrast, modern NeuroAI benchmarks prioritize aggregate metrics, such as how well artificial neural networks (ANNs) predict responses to large, uncontrolled datasets. But prediction alone offers limited insight into whether models capture the experimental phenomena and foundational theories of the field. Here we introduce Cortex, a framework for evaluating models of functional ROIs (fROIs) entirely zero-shot (without model retraining). We leveraged Cortex to evaluate 126 leading ANN-based encoding models of category-selective cortex (FFA, PPA, and EBA) on their ability to predict brain responses and their capacity to recapitulate influential findings in the field. Our assessments spanned 8 fMRI datasets and 20 experimental studies. On prediction tests we found that the choice of fMRI mapping data was critical: models mapped using rich naturalistic datasets (like NSD) generalized better than models mapped using smaller datasets. However, systematic gaps also emerged, as these models did not predict responses to synthetic or dynamic video stimuli. We also observed remarkable variation between fROIs: EBA was less predictive and harder to model than the FFA or PPA. In addition, vision–language models (like CLIP-ResNet50) tended to generalize better across datasets, as previously reported. Finally, we evaluated how well models replicated prior cognitive neuroscience studies. Here we found that even the best predictive models frequently failed to replicate specific experimental findings. For example, CLIP-ResNet50, which consistently ranked highly on prediction tests, failed to reproduce patterns like retinotopy or any of the EBA-specific experimental results. Collectively, these findings show that prediction is a necessary but incomplete proxy for understanding vision. Cortex helps bridge this gap by providing a rigorous in silico platform for hypothesis testing, exposing key model failures that can serve as a roadmap for future model development.
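To make the zero-shot evaluation logic concrete, the following is a minimal sketch (not the actual Cortex code): an encoding model is mapped from ANN features to an fROI on one dataset, then scored on a second dataset with no refitting. All variable and function names are illustrative assumptions.

```python
# Sketch of zero-shot cross-dataset evaluation of an ANN-based encoding model.
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_encoding_model(train_feats, train_froi):
    """Map ANN features (stimuli x features) to mean fROI responses
    with cross-validated ridge regression."""
    model = RidgeCV(alphas=np.logspace(-3, 5, 9))
    model.fit(train_feats, train_froi)
    return model

def zero_shot_score(model, test_feats, test_froi):
    """Pearson correlation between predicted and observed responses on a
    new dataset, with no retraining (the 'zero-shot' evaluation)."""
    pred = model.predict(test_feats)
    return np.corrcoef(pred, test_froi)[0, 1]

# Hypothetical usage: map on rich naturalistic data, test generalization.
# model = fit_encoding_model(nsd_feats, nsd_ffa)        # e.g., NSD mapping data
# print(zero_shot_score(model, synth_feats, synth_ffa)) # e.g., synthetic stimuli
```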
This work is supported by the NIH Pathway to Independence Award (R00EY032603), the NSF Nexus computation support (Allocation number: SOC250049), and a startup grant from Georgia Tech.
Talk 2, 8:30 am
A resource of neural encoding models for in silico visual neuroscience
Alessandro Gifford1, Domenic Bersch2, Daniel Janini1, Gemma Roig2, Radoslaw Cichy1; 1Freie Universität Berlin, 2Goethe Universität Frankfurt
In silico neural responses to visual stimuli generated by encoding models increasingly resemble in vivo responses recorded from real brains, enabling the novel research paradigm of in silico visual neuroscience. The fast and economical generation of in silico neural responses allows researchers to test more scientific hypotheses and to explore larger solution spaces than is possible in vivo. Crucially, novel findings from large-scale in silico experimentation are then validated through targeted small-scale in vivo data collection, thereby optimizing research resources. To empower this emerging research paradigm, we introduce the Brain Encoding Response Generator (BERG; https://github.com/gifale95/BERG), a resource consisting of diverse pre-trained encoding models of the brain and a Python package to easily generate in silico neural responses to arbitrary visual stimuli. BERG enables researchers to efficiently address a wide range of research questions through in silico visual neuroscience by providing a growing, well-documented library of encoding models trained on different neural recording modalities, species, datasets, subjects, and brain areas. We demonstrate BERG’s potential for neuroscientific discovery in four ways. First, BERG’s encoding models accurately predict neural responses to visual stimuli. Second, these in silico responses reproduce key signatures of visual processing in the brain, such as retinotopy and categorical selectivity (fMRI), or the different dynamics of object exemplar versus concept categorization (EEG). Third, BERG enables data types that are impossible to collect in vivo, such as in silico neural responses with high spatiotemporal resolution derived through linear mapping of EEG onto fMRI. Fourth, in a separate study we used BERG to reveal representational relationships between visual areas, which we successfully validated in vivo (Gifford et al., 2025, NatHumBehav). Together, we envision that BERG will empower in silico visual neuroscience, ultimately accelerating scientific discovery. We warmly welcome models, ideas, and collaboration from the vision science community.
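The paradigm can be pictured as follows. This is a hypothetical sketch of the workflow, not BERG's documented interface; the class and the commented `load_model` call are illustrative stand-ins (see the repository above for the real API).

```python
# Sketch of the in silico paradigm: a pretrained encoding model generates
# responses to arbitrary stimuli without new in vivo data collection.
import numpy as np

class PretrainedEncodingModel:
    """Stand-in for a pretrained model keyed by modality, dataset,
    subject, and brain area."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias  # fit on in vivo recordings

    def generate(self, stimulus_features):
        # In silico responses; a linear readout here for brevity, whereas
        # real models include the full feature-extraction stack.
        return stimulus_features @ self.weights + self.bias

# Hypothetical usage (names are assumptions, not BERG's API):
# model = load_model(modality="fmri", dataset="NSD", subject=1, roi="V1")
# insilico_responses = model.generate(features_for_my_stimuli)
```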
Talk 3, 8:45 am
Diffusion-based stimulus optimization reveals functional organization across higher visual cortex
Margaret M Henderson1, Andrew F Luo2, Sungjoon Park1, Michael J Tarr1, Leila Wehbe1; 1Carnegie Mellon University, 2University of Hong Kong
Characterizing the fine-grained functional organization of human higher visual cortex remains a significant challenge. Traditional neuroimaging experiments are limited in the number of stimuli they can sample, which may bias results toward particular stimulus attributes. In prior work we developed a novel data-driven tool, termed “BrainDiVE” (Luo et al. 2023, NeurIPS), which addresses these challenges by synthesizing images optimized to activate specific brain regions. BrainDiVE leverages pretrained image diffusion models guided by gradients from an image-computable fMRI encoding model. Here, we validated BrainDiVE experimentally by generating images that targeted several functional regions of interest (i.e., images predicted to maximally activate those brain areas), and presenting them to new human participants (n=12) in an fMRI study. We found that the synthesized images elicited robust and spatially specific responses in the predicted target regions, yielding significantly higher measures of category selectivity relative to natural images. This validates BrainDiVE's ability to capture neural selectivity in human ventral visual cortex, characterizing tuning properties that generalize across participants. Furthermore, we demonstrated fine-grained experimental control by differentially activating two face-selective regions, the occipital face area (OFA) and fusiform face area (FFA), providing further evidence that these regions encode distinct aspects of faces. We also identified a functional gradient within the occipital place area (OPA) along a posterior-to-anterior axis, suggesting a functional topology based on scene properties such as distance and indoor-outdoor status. These findings provide new insights into the representational structure of category-selective regions and establish a novel paradigm for targeted exploration of neural selectivity in human visual cortex. More generally, our approach offers a powerful tool for investigating the functional organization of visual cortex at a fine-grained level, exceeding the capabilities of traditional methods across multiple dimensions.
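The core mechanism, diffusion denoising steered by an encoding-model gradient, can be sketched as follows. This is a minimal illustration of the guidance idea, not the published BrainDiVE implementation; `denoiser`, `encoding_model`, and `decode` are assumed callables standing in for a pretrained latent diffusion model and an image-computable fMRI encoder.

```python
# Sketch of gradient-guided image synthesis: each denoising step nudges the
# diffusion latent toward higher predicted activation of a target region.
import torch

def guided_step(latent, t, denoiser, encoding_model, decode, scale=1.0):
    latent = latent.detach().requires_grad_(True)
    # Predicted activation of the target ROI for the current decoded image.
    activation = encoding_model(decode(latent)).mean()
    grad = torch.autograd.grad(activation, latent)[0]
    with torch.no_grad():
        # Standard denoising update plus a push up the activation gradient.
        latent = denoiser(latent, t) + scale * grad
    return latent
```

Iterating this step from noise yields images predicted to maximally drive the chosen region, which is what the fMRI validation above tests in new participants.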
Support was provided by a grant from Apple Inc. to MJT. This work used Bridges-2 at Pittsburgh Supercomputing Center through allocation SOC220017 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by the National Science Foundation.
Talk 4, 9:00 am
Interpretable fMRI-to-Image Decoding Reveals How Brain Representations Guide Generative Visual Reconstructions
Pinyuan Feng1, Hossein Adeli1, Wenxuan Guo1, Fan Cheng1, Ethan Hwang1, Nikolaus Kriegeskorte1; 1Columbia University
Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity. Current approaches first map fMRI signals into intermediate image or text features that then guide a generative model. While effective, this two-stage strategy introduces an information bottleneck and obscures how specific brain regions contribute to the final reconstruction. Here, we introduce NeuroAdapter, an end-to-end visual decoding framework that conditions a latent diffusion model directly on brain representations. Neural activity is transformed into parcel-wise tokens that modulate the generative process through transformer cross-attention, enabling a more transparent link between cortical signals and image generation. We show that our model achieves high-quality visual reconstructions on the Natural Scenes Dataset (NSD), NSD-Imagery, and Deeprecon. We also created a baseline model using a state-of-the-art brain encoder that retrieved images from a large image dataset (ImageNet). On NSD, our model achieves an 11.7% average improvement in low-level image quality metrics and an 18.6% average improvement in high-level semantic metrics over this baseline. Furthermore, the decoded images yield predicted neural responses that closely match the measured fMRI patterns, confirming that the reconstructions preserve brain-relevant visual features. To understand how brain representations shape the unfolding generative trajectory, we further propose the Image–Brain Bi-directional Interpretability (IBBI) framework. IBBI analyzes cross-attention patterns across diffusion steps to show (i) the contribution of individual cortical parcels during the denoising process, and (ii) the spatial influence of category-selective regions (e.g., face-, body-, scene-, and word-selective regions of interest) on evolving image-level features. Visualization of ROI attention maps reveals a systematic temporal pattern: ROI-level influence is broadly distributed across the image during early denoising stages but progressively converges onto semantically relevant pixel regions at later stages of reconstruction. Together, our work establishes an interpretable approach for end-to-end brain-to-image reconstruction, highlighting the potential of decoding to reveal how the brain represents complex scenes.
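The conditioning mechanism can be sketched as a cross-attention block whose keys and values are brain-derived tokens. Dimensions and module names below are illustrative assumptions, not the NeuroAdapter architecture itself; the point is that the returned attention maps are what an IBBI-style analysis would inspect.

```python
# Sketch of parcel-wise brain tokens conditioning image features via
# cross-attention inside a diffusion model.
import torch
import torch.nn as nn

class ParcelCrossAttention(nn.Module):
    def __init__(self, n_parcels=400, parcel_dim=64, model_dim=320, heads=8):
        super().__init__()
        self.tokenize = nn.Linear(parcel_dim, model_dim)  # parcel -> token
        self.attn = nn.MultiheadAttention(model_dim, heads, batch_first=True)

    def forward(self, image_feats, parcel_acts):
        # image_feats: (B, n_pixels, model_dim) latent image features
        # parcel_acts: (B, n_parcels, parcel_dim) activity pooled per parcel
        brain_tokens = self.tokenize(parcel_acts)
        # Queries come from the image, keys/values from the brain, so the
        # attention weights expose which parcels drive which image locations.
        out, attn_maps = self.attn(image_feats, brain_tokens, brain_tokens)
        return image_feats + out, attn_maps
```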
Research reported in this publication was supported in part by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under award numbers 1RF1NS128897 and 4R01NS128897. The content is solely the responsibility of the authors.
Talk 5, 9:15 am
Context-Dependent Dissociation of Shared Input and Directed Flow in the Visual Cortex
Yuxuan Xue1,2, Mitchell Morton3, Anirvan S. Nandy4,5,6,7, Monika P. Jadi2,4,7; 1Department of Electrical Engineering, Yale University, 2Department of Psychiatry, Yale University, 3Interdepartmental Neuroscience Program, Yale University, 4Department of Neuroscience, Yale University, 5Department of Psychology, Yale University, 6Kavli Institute for Neuroscience, Yale University, 7Wu Tsai Institute, Yale University
Cortical population activity reflects a mixture of externally driven inputs and internally generated brain states, yet these sources of co-fluctuation can produce similar patterns of correlated variability. This ambiguity poses a fundamental challenge for interpreting whether population coupling reflects true information flow or shared modulation. To dissociate these mechanisms, we investigated how visual stimulation and wakefulness state differentially shape inter-laminar communication in the macaque primary visual cortex (V1). We recorded laminar spiking activity across input and superficial layers in two macaques during both stimulus-evoked and spontaneous (eyes open vs. closed) conditions, and quantified cross-population interactions using reduced-rank regression (RRR) to identify low-dimensional predictive subspaces linking the two layers. Prediction accuracy, principal angles between subspaces, and temporal delay analyses were used to assess the strength, structure, and directionality of inter-laminar coordination. To mechanistically interpret the empirical signatures, we constructed a two-layer recurrent neural network (RNN) with low-rank feedforward connectivity and manipulated the structure and layer-specificity of external inputs. Visual stimulation markedly modulated the inter-laminar predictive subspace and increased prediction accuracy in a directional manner: input-layer activity predicted superficial-layer responses more strongly than undirected models did. Additionally, optimal delays showed that input activity consistently preceded superficial activity. In contrast, eyes-closed spontaneous activity increased prediction accuracy with no directional enhancement and no consistent temporal lead–lag structure. Simulations reproduced this dissociation: directionality emerged only when stimulus-driven inputs aligned with feedforward connectivity and when the superficial layer exhibited stronger low-dimensional structure relative to the input layer, whereas shared fluctuations produced undirected co-activation regardless of structure. These findings demonstrate that similar correlation patterns can arise from qualitatively distinct mechanisms. Sensory stimulation drives feedforward communication across V1 layers, whereas spontaneous internal dynamics produce shared modulation without directional flow. Our results establish predictive subspace structure as a principled tool for distinguishing communication from common input across cortical circuits.
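For readers unfamiliar with the central analysis tool, below is a minimal sketch of standard reduced-rank regression between two populations; it is generic textbook RRR, not the authors' analysis code, and the variable names are assumptions.

```python
# Sketch of reduced-rank regression (RRR): fit a full-rank linear map from
# input-layer to superficial-layer activity, then constrain it to a
# low-dimensional predictive subspace.
import numpy as np

def reduced_rank_regression(X, Y, rank):
    # X, Y: (trials x neurons) activity in the source and target populations.
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)   # full-rank OLS solution
    Y_hat = X @ B_ols
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    V = Vt[:rank].T                                  # top predictive dimensions
    B_rrr = B_ols @ V @ V.T                          # rank-constrained map
    return B_rrr, V
```

Prediction accuracy as a function of rank indicates how low-dimensional the interaction is, and the subspaces spanned by V are what the principal-angle comparisons across conditions operate on.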
This research was supported by NIH R01 EY034605, NIH R00 EY025026, NIH R21 MH126072 and SFARI 875855 to MPJ, NARSAD Young Investigator Grant, Ziegler Foundation Grant, Yale Orthwein Scholar Funds, NIH R01 EY032555, NIH R21 MH126072 and SFARI 875855 to ASN, and by NEI core grant for vision research P30 EY026878 to Yale University.
Talk 6, 9:30 am
Meta-Learning In-Context Enables Training-Free Prediction & Decoding of Visual Cortex From Novel Subjects
Andrew Luo1, Mu Nan1, Muquan Yu1, Margaret Henderson2, Leila Wehbe2, Michael Tarr2; 1University of Hong Kong, 2Carnegie Mellon University
Developing computational models of human visual cortex that generalize across individuals remains a fundamental challenge in neuroscience, as the substantial variability in functional organization typically necessitates collecting large-scale datasets to train individual models for each subject. To address this challenge, we introduce a unified meta-learning framework that reformulates neural modeling as an in-context inference problem, enabling both the prediction and decoding of visual representations in novel subjects without any gradient-based fine-tuning. First, addressing the forward encoding problem, we present a transformer-based architecture that infers the unique functional tuning of individual voxels by simply conditioning on a small set of image-activation examples. This approach yields high-fidelity predictions of voxelwise responses in higher visual cortex that generalize robustly across diverse scanners, acquisition protocols, and subject populations using only a fraction of the data typically required. We then leverage this learned forward model to solve the inverse problem (decoding visual stimuli from brain activity) via a novel hierarchical in-context learning strategy. In the first stage of the hierarchy, the model estimates per-voxel visual response parameters in-context; in the second stage, we construct a context consisting of these estimated parameters and observed brain responses across multiple voxels to perform a "functional inversion" that directly infers the visual stimulus embedding. We demonstrate that this method achieves state-of-the-art cross-subject image retrieval without requiring anatomical alignment or shared stimuli, effectively bypassing the correspondence problem. Furthermore, attention-based analyses reveal that the model spontaneously learns to localize and differentially weight functionally specialized regions, such as face- and place-selective areas, to maximize decoding accuracy, mirroring known cortical organization. Together, these findings establish a scalable, data-efficient foundation model for non-invasive brain activity prediction and decoding, offering a principled computational lens for investigating population-level neural representations without the constraints of subject-specific training.
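The in-context encoding step can be sketched as a transformer that ingests a few (image embedding, response) pairs from a new subject and predicts that subject's response to a query image with no weight updates. The architecture below is an illustrative assumption, not the authors' model.

```python
# Sketch of in-context voxel response prediction: context examples and a
# query image are packed into one token sequence; inference is a forward pass.
import torch
import torch.nn as nn

class InContextEncoder(nn.Module):
    def __init__(self, img_dim=512, d_model=256, n_layers=4):
        super().__init__()
        self.embed_pair = nn.Linear(img_dim + 1, d_model)   # (image, response)
        self.embed_query = nn.Linear(img_dim, d_model)      # response unknown
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.readout = nn.Linear(d_model, 1)

    def forward(self, ctx_imgs, ctx_resps, query_img):
        # ctx_imgs: (B, K, img_dim); ctx_resps: (B, K, 1); query_img: (B, img_dim)
        ctx = self.embed_pair(torch.cat([ctx_imgs, ctx_resps], dim=-1))
        qry = self.embed_query(query_img).unsqueeze(1)
        tokens = torch.cat([ctx, qry], dim=1)
        out = self.backbone(tokens)
        # Predicted response of this voxel (or population) to the query image,
        # inferred purely from the in-context examples.
        return self.readout(out[:, -1])
```

Meta-training would optimize this network across many subjects and voxels so that, at test time, a handful of examples from a novel subject suffices; the hierarchical decoding stage described above then inverts such a learned forward model.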