Leveraging Vision and Language Generative Models to Understand the Visual Cortex

Poster Presentation 53.316: Tuesday, May 21, 2024, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Scene Perception: Ensembles, natural image statistics

Andrew Luo1, Margaret Henderson1, Leila Wehbe1, Michael Tarr1; 1Carnegie Mellon University

Understanding the functional organization of higher visual cortex is a fundamental goal in neuroscience. Traditional approaches have mapped the visual and semantic selectivity of neural populations using hand-selected, non-naturalistic stimuli, which require a priori hypotheses about visual cortex selectivity. To address these limitations, we introduce two data-driven methods: Brain Diffusion for Visual Exploration ('BrainDiVE') and Semantic Captioning Using Brain Alignments ('BrainSCUBA'). Trained on a dataset of natural images and paired fMRI recordings, BrainDiVE synthesizes images predicted to activate specific brain regions, bypassing the need for hand-crafted visual stimuli. The approach combines large-scale diffusion models with brain-gradient guided image synthesis. We demonstrate the synthesis of preferred images with high semantic specificity for category-selective regions of interest (ROIs), and further use the method to characterize differences and novel functional subdivisions within ROIs, which we validate with behavioral data. BrainSCUBA, in turn, generates natural language descriptions of images predicted to maximally activate individual voxels. Combining a contrastive vision-language model with a pre-trained large language model, BrainSCUBA produces interpretable voxel-wise captions that enable text-conditioned image synthesis; the resulting images are semantically coherent and achieve high predicted activations. In exploratory studies of the distribution of 'person' representations in the brain, we observe fine-grained semantic selectivity in body-selective areas. Together, these two methods offer well-specified constraints for future hypothesis-driven examinations and demonstrate the potential of data-driven approaches for uncovering the organization of visual cortex.
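
To make the brain-gradient guidance concrete, the sketch below shows one way such guided diffusion sampling could look in PyTorch. It is a minimal toy, not the BrainDiVE implementation: ToyDenoiser, ToyBrainEncoder, the one-step clean-image estimate, the update rule, and the guidance scale are all illustrative stand-ins for a pretrained diffusion model and an image-to-fMRI encoder fit on paired recordings.

```python
# Minimal sketch of brain-gradient-guided diffusion sampling, in the spirit
# of BrainDiVE. All modules, shapes, and the update rule are toy placeholders.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pretrained diffusion denoiser (predicts added noise)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x, t):
        return self.net(x)

class ToyBrainEncoder(nn.Module):
    """Stand-in for an image-to-fMRI encoder trained on paired data."""
    def __init__(self, n_voxels=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.readout = nn.Linear(8, n_voxels)

    def forward(self, x):
        return self.readout(self.features(x))

def guided_sample(denoiser, encoder, roi_mask, steps=50, scale=10.0):
    """Reverse diffusion with an added gradient term that pushes the sample
    toward higher predicted activation in the target ROI."""
    x = torch.randn(1, 3, 64, 64)
    for t in reversed(range(steps)):
        x = x.detach().requires_grad_(True)
        eps = denoiser(x, t)
        x0_hat = x - eps                           # crude clean-image estimate
        act = encoder(x0_hat)[0, roi_mask].mean()  # mean predicted ROI response
        grad = torch.autograd.grad(act, x)[0]      # brain gradient w.r.t. x_t
        with torch.no_grad():
            x = x - eps / steps + scale * grad     # denoise + ascend activation
    return x.detach()

# Toy usage: synthesize an image maximizing the first 50 of 1000 voxels.
roi = torch.zeros(1000, dtype=torch.bool)
roi[:50] = True
image = guided_sample(ToyDenoiser(), ToyBrainEncoder(1000), roi)
```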
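
A similarly minimal sketch of the voxel-to-caption idea behind BrainSCUBA follows, assuming a linear voxel encoder whose weights live in a CLIP-style image-embedding space. The voxel's weight vector is projected onto the manifold of natural-image embeddings via a softmax-weighted combination, and the projected embedding is then decoded to text. Here decode_caption is a stub standing in for a frozen CLIP-conditioned captioning model; the temperature and all names are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of BrainSCUBA-style voxel captioning under the assumption
# of a linear voxel encoder over CLIP image embeddings.
import torch
import torch.nn.functional as F

def project_voxel_weights(w, image_embs, temperature=0.01):
    """Project a voxel's weight vector onto the span of real natural-image
    embeddings via softmax-weighted interpolation."""
    w = F.normalize(w, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    sims = image_embs @ w                    # cosine similarity to each image
    attn = F.softmax(sims / temperature, dim=0)
    proj = attn @ image_embs                 # convex combination of embeddings
    return F.normalize(proj, dim=-1)

def decode_caption(clip_embedding):
    """Placeholder: in practice, a frozen CLIP-to-text decoder (e.g., a
    CLIPCap-style prefix model) would map the embedding to a caption."""
    return "<caption conditioned on projected embedding>"

# Toy usage: 10,000 precomputed image embeddings, one voxel's weights.
image_embs = torch.randn(10_000, 512)
voxel_w = torch.randn(512)
caption = decode_caption(project_voxel_weights(voxel_w, image_embs))
```

Projecting onto a weighted combination of real image embeddings keeps the decoded captions grounded in the distribution of natural images, rather than decoding raw regression weights directly.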

Acknowledgements: This work used Bridges-2 at Pittsburgh Supercomputing Center through allocation SOC220017 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.