Spurious reconstruction from brain activity: The thin line between reconstruction, classification, and hallucination

Poster Presentation 26.426: Saturday, May 18, 2024, 2:45 – 6:45 pm, Pavilion
Session: Object Recognition: High-level features

Ken Shirakawa1,2, Yoshihiro Nagano1,2, Misato Tanaka1,2, Shuntaro C. Aoki1,2, Kei Majima3, Yusuke Muraki1, Yukiyasu Kamitani1,2; 1Graduate School of Informatics, Kyoto University, 2ATR Computational Neuroscience Laboratories, 3National Institutes for Quantum Science and Technology

Visual image reconstruction aims to recover arbitrary stimulus/perceived images from brain activity. To achieve reconstruction over diverse images, especially with limited training data, it is crucial that the model leverage a compositional representation that spans the image space, with each feature effectively mapped to brain activity. In light of these considerations, we critically assessed recently reported photorealistic reconstructions based on text-to-image diffusion models applied to a large-scale fMRI/stimulus dataset (Natural Scenes Dataset, NSD). We found a notable decrease in the reconstruction performance of these models on a different dataset (Deeprecon) specifically designed to prevent category overlaps between the training and test sets. UMAP visualization of the target features (CLIP text/semantic features) for NSD images revealed strikingly limited diversity, with only ~40 distinct semantic clusters shared between the training and test sets. Further, CLIP feature decoders trained on NSD struggled to predict novel semantic clusters absent from the training set. Simulations likewise showed that decoders fail to predict new clusters when the training set is restricted to a small number of clusters. Clustered training samples appear to restrict the feature dimensions that can be predicted from brain activity. Conversely, diversifying the training set to ensure a broader distribution across feature dimensions improved the decoders' generalizability beyond the trained clusters. Nonetheless, it is important to note that text/semantic features alone are insufficient for a complete mapping to the visual space, even if they are perfectly predicted from brain activity.
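The cluster-restriction effect described above can be illustrated with a toy simulation. This is a hypothetical sketch, not the authors' code: it assumes a random linear feature-to-brain mapping and an ordinary least-squares decoder, and all dimensionalities, noise levels, and cluster counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_brain, n_train = 50, 200, 1000

# Hypothetical ground-truth linear mapping from features to brain activity.
W = rng.standard_normal((d_brain, d_feat))

def novel_cluster_corr(n_clusters, noise=0.1):
    """Train a least-squares decoder on clustered features; test on novel features."""
    centers = rng.standard_normal((n_clusters, d_feat))
    feats = centers[rng.integers(0, n_clusters, size=n_train)]
    brain = feats @ W.T + noise * rng.standard_normal((n_train, d_brain))
    # Linear decoder: brain activity -> features (ordinary least squares).
    D, *_ = np.linalg.lstsq(brain, feats, rcond=None)
    # Novel test features drawn from outside the trained clusters.
    new_feats = rng.standard_normal((100, d_feat))
    new_brain = new_feats @ W.T + noise * rng.standard_normal((100, d_brain))
    pred = new_brain @ D
    return np.corrcoef(pred.ravel(), new_feats.ravel())[0, 1]

corr_few = novel_cluster_corr(5)        # training confined to a few clusters
corr_diverse = novel_cluster_corr(500)  # training spread over many clusters
print(f"few clusters: r = {corr_few:.2f}; diverse: r = {corr_diverse:.2f}")
```

Under these assumptions, a decoder trained on only a handful of clusters can recover just the feature components spanned by those clusters, so its predictions for novel clusters correlate poorly with the true features, whereas a training set covering many clusters yields a decoder that generalizes.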
Building on these observations, we argue that the recent photorealistic reconstructions may predominantly be a blend of classification into trained semantic categories and the generation of convincing yet inauthentic images (hallucinations) through text-to-image diffusion. To avoid such spurious reconstructions, we offer guidelines for developing generalizable methods and conducting reliable evaluations.

Acknowledgements: This work was supported by JSPS (KAKENHI grants JP20H05954, JP20H05705, JP21K17821, and 22KJ1801), JST (CREST grants JPMJCR18A5 and JPMJCR22P3), and NEDO (commissioned project JPNP20006).