Building compositional memories and imagery from disentangled latent spaces in an autoencoder

Poster Presentation 36.450: Sunday, May 19, 2024, 2:45 – 6:45 pm, Pavilion
Session: Visual Memory: Imagery

Brad Wyble1, Ian Deal1, Vijay Subedi1; 1Penn State University

The visual system has a densely recursive hierarchy that integrates feedforward and feedback processing in the service of functions such as perception, attention, memory and imagery. This hierarchy presumably has the remarkable property that representations, whether perceived or imagined, can be passed forward and backward to reach a given level of abstraction or dimensionality, allowing different kinds of processing. We explore this idea using a generative, neurocomputational model called Memory for Latent Representations (MLR). First, an autoencoder is trained to disentangle features such as shape, color and location into distinct latent spaces. Then, a shared memory resource builds engrams from those latent spaces and binds them to tokens. By selecting on the fly which latent spaces are used to build these memories, an engram can focus on fine-grained visual details, compressed visual details, categorical codes, or any combination of these. Empirical demonstrations in human observers support the model by showing that engrams can be calibrated to task demands, focusing on the appropriate features and level of visual detail. Through its decoder, MLR can reconstruct both recollected memories and arbitrary combinations of features specified by top-down instructions, thereby approximating some aspects of compositional visual imagery. By combining a working memory system with an autoencoder, MLR provides a theoretical framework for understanding how visual memory and imagery work jointly to encode, decompose, modify and recode complex visual representations. In this expanded version of the MLR model, we demonstrate the ability to separate location, color and visual form into disentangled latent spaces and then to modify and recombine those codes. The recombined codes can then be used generatively to create novel compositions of features according to top-down instructions. This work helps us understand the mechanisms of visual imagery in a highly interpretable model context.
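To make the architecture concrete, the sketch below is a minimal, hypothetical illustration of the kind of design described above: an autoencoder with separate latent heads for shape, color and location, a token-based memory that stores only the latent spaces a task demands, and a shared decoder that recombines latents from different tokens into a novel composition. It is not the authors' implementation; all class names, layer sizes and latent dimensions are assumptions made for the example.

```python
# Hypothetical sketch of an MLR-style architecture (not the authors' code).
# Assumes PyTorch; latent sizes, layer widths and class names are illustrative.
import torch
import torch.nn as nn


class DisentangledAutoencoder(nn.Module):
    """Encoder with separate latent heads for shape, color and location."""

    def __init__(self, input_dim=28 * 28, shape_dim=8, color_dim=3, loc_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        # One head per feature, yielding distinct (disentangled) latent spaces.
        self.to_shape = nn.Linear(256, shape_dim)
        self.to_color = nn.Linear(256, color_dim)
        self.to_loc = nn.Linear(256, loc_dim)
        # Shared decoder reconstructs an image from the concatenated latents.
        self.decoder = nn.Sequential(
            nn.Linear(shape_dim + color_dim + loc_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.backbone(x)
        return {"shape": self.to_shape(h),
                "color": self.to_color(h),
                "location": self.to_loc(h)}

    def decode(self, latents):
        z = torch.cat([latents["shape"], latents["color"], latents["location"]],
                      dim=-1)
        return self.decoder(z)


class TokenMemory:
    """Binds selected latent spaces to tokens, so each engram keeps only the
    features that the task demands."""

    def __init__(self):
        self.engrams = {}

    def store(self, token, latents, keep=("shape", "color", "location")):
        # Keep only the requested latent spaces for this engram.
        self.engrams[token] = {k: v.detach() for k, v in latents.items() if k in keep}

    def compose(self, spec):
        # spec maps each feature to the token it should be taken from,
        # e.g. {"shape": "item1", "color": "item2", "location": "item1"}.
        return {feat: self.engrams[tok][feat] for feat, tok in spec.items()}


if __name__ == "__main__":
    model = DisentangledAutoencoder()
    memory = TokenMemory()

    img1, img2 = torch.rand(1, 28 * 28), torch.rand(1, 28 * 28)
    memory.store("item1", model.encode(img1))                   # full detail
    memory.store("item2", model.encode(img2), keep=("color",))  # color only

    # Imagery-like recombination: item1's shape and location with item2's color.
    novel = memory.compose({"shape": "item1",
                            "color": "item2",
                            "location": "item1"})
    reconstruction = model.decode(novel)
    print(reconstruction.shape)  # torch.Size([1, 784])
```

In this toy version, "top-down instructions" are simply the `keep` and `spec` arguments: they determine which latent spaces enter an engram and which token each feature is drawn from when the decoder generates a novel composition.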

Acknowledgements: NSF grant 1734220