GramStatTexNet: Using the Gram Matrix of Multi-Scale Pyramids to Contrastively Learn Texture Model Statistics

Poster Presentation 56.455: Tuesday, May 21, 2024, 2:45 – 6:45 pm, Pavilion
Session: Spatial Vision: Machine learning, neural networks

Vasha DuTell1,2, Christian Kovesdi1,2, Anne Harrington1,2, Mark Hamilton1, William T. Freeman1, Ruth Rosenholtz1,2; 1MIT CSAIL, 2MIT Brain and Cognitive Sciences

Visual processing in area V1 is often modeled with multi-scale, oriented filters, both pre-computed, such as the steerable pyramid, and learned. To understand how these V1 responses are combined in areas V2 and beyond, texture synthesis models are often employed. These models compute pre-defined summary statistics from the correlations of V1 filter outputs (Portilla & Simoncelli, 1999) and, combined with spatial pooling, have been used to understand peripheral visual processing (Balas, 2009; Freeman & Simoncelli, 2011). More recently, learned features in deep neural networks (DNNs) have been used to represent texture, employing the Gram matrix to encode the texture-like representation seen in mid-level peripheral vision (Wallis et al., 2017). While models with hand-picked statistics are well validated, they cannot faithfully represent some texture families (Brown et al., 2023); DNN approaches, on the other hand, rest on biologically implausible and over-parameterized representations. To address this, we propose a new framework for learning texture statistics. We adopt the contrastive learning approach of StatTexNet (Kovesdi et al., 2023), modifying the model to learn elements of the Gram matrix of pyramid images rather than pre-defined statistics. The learned component is a single-layer, fully connected network that reduces the full Gram matrix to a smaller set of meta-statistics. It is trained contrastively to pull similar textures together and push dissimilar textures apart in representation space. As an indicator of successful learning, we show that the network clusters both same- and similar-texture samples. We find that the learned weight matrix is innately sparse, with pyramid-image autocorrelations weighted most highly and low-pass pyramid levels utilized least. This work demonstrates an approach to the texture and peripheral vision models of V2 that is both biologically inspired and learned, giving further insight into the complex transformations of mid-level vision.
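To make the architecture concrete, the following is a minimal PyTorch sketch of the idea described above, not the authors' code. A fixed convolutional filter bank stands in for the steerable pyramid, and a SimCLR-style NT-Xent loss stands in for the contrastive objective; all names (GramStatSketch, nt_xent) and hyperparameters are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def gram_matrix(features: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) -> (B, C, C) channel-by-channel
        # correlations, normalized by the number of spatial positions.
        B, C, H, W = features.shape
        flat = features.reshape(B, C, H * W)
        return flat @ flat.transpose(1, 2) / (H * W)

    class GramStatSketch(nn.Module):
        """Fixed multi-scale, oriented filters feed a Gram matrix; a single
        fully connected layer (the only learned part) reduces the flattened
        Gram matrix to a small set of meta-statistics."""
        def __init__(self, n_filters: int = 16, embed_dim: int = 64):
            super().__init__()
            # Frozen filter bank: a hypothetical stand-in for a
            # pre-computed steerable pyramid decomposition.
            self.filters = nn.Conv2d(1, n_filters, kernel_size=9,
                                     padding=4, bias=False)
            for p in self.filters.parameters():
                p.requires_grad = False
            self.reduce = nn.Linear(n_filters * n_filters, embed_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            G = gram_matrix(self.filters(x))   # (B, C, C) texture statistics
            z = self.reduce(G.flatten(1))      # learned meta-statistics
            return F.normalize(z, dim=1)       # unit-norm embedding

    def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
        """SimCLR-style contrastive loss: pulls two crops of the same
        texture together, pushes crops of different textures apart."""
        z = torch.cat([z1, z2], dim=0)         # (2B, D)
        sim = z @ z.t() / tau                  # cosine sims (z is unit norm)
        sim.fill_diagonal_(float('-inf'))      # never match a crop to itself
        B = z1.shape[0]
        targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
        return F.cross_entropy(sim, targets)

    # Usage: two augmented crops of each texture form the positive pair.
    model = GramStatSketch()
    crops_a = torch.randn(8, 1, 64, 64)        # placeholder texture crops
    crops_b = torch.randn(8, 1, 64, 64)
    loss = nt_xent(model(crops_a), model(crops_b))

Because the filter bank is frozen, the only trainable parameters are the weights of the single linear layer; inspecting that weight matrix after training is what would reveal the sparsity pattern reported above.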

Acknowledgements: CSAIL METEOR Fellowship, US National Science Foundation under grant number 1955219, National Science Foundation Grant BCS-1826757 to PI Rosenholtz, MIT SuperCloud and Lincoln Laboratory Supercomputing Center