Exploring mental representation of visual emoji symbols through human similarity judgments

Poster Presentation 56.450: Tuesday, May 21, 2024, 2:45 – 6:45 pm, Pavilion
Session: Object Recognition: Structure of categories

Yiling Yun1, Bryor Snefjella1, Shuhao Fu1, Hongjing Lu1; 1University of California, Los Angeles

How do people align concepts learned from different modalities, such as visual and linguistic inputs? To address this question, we examined the representations of emojis: pictograms that are commonly used in linguistic contexts yet carry distinctive visual characteristics that make them engaging. The representational similarity structure of emojis was measured using an odd-one-out paradigm. In Experiment 1, human similarity judgments were collected for 48 emojis spanning a wide range of emoji categories (faces, animals, objects, signs, etc.). We compared human similarity judgments with predictions from three models: a language model (fastText) trained for word prediction in sentences, a vision model (Visual Auto-Encoder) trained to reconstruct input images, and a multimodal neural network (CLIP) that learns visual concepts under language supervision. CLIP showed the highest correlation with human similarity judgments (rho = .38), followed by fastText (rho = .36) and the Visual Auto-Encoder (rho = .17). When controlling for linguistic semantics from fastText, CLIP maintained a significant semipartial correlation with human judgments (sr = .34). CLIP's advantage was not simply due to combining multimodal inputs, since concatenating the fastText and Visual Auto-Encoder embeddings yielded a lower correlation (rho = .17). In Experiment 2, we used the 50 most frequently used emojis, which mostly comprise faces with different expressions and hand gestures. All three models correlated with human similarity judgments: CLIP (rho = .68), followed by fastText (rho = .52) and the Visual Auto-Encoder (rho = .46). These results suggest that models trained with aligned visual and linguistic inputs in a multimodal fashion best capture human conceptual representations of visual symbols such as emojis. However, these general-purpose models remain inadequate for capturing fine-grained social attributes of emojis.
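To make the shape of this analysis concrete, the sketch below illustrates one way to compare model embeddings against human similarity judgments: pairwise cosine similarities are computed per model, rank-correlated (Spearman) with the human similarity structure, and a semipartial correlation is obtained by regressing the fastText predictor out of the CLIP predictor. This is a minimal illustration under stated assumptions, with placeholder random data; the function names, embedding dimensions, and arrays are illustrative and are not the authors' code or stimuli.

```python
# Minimal sketch of a model-to-human similarity comparison (assumed, not the authors' code).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr, rankdata


def model_similarity_vector(embeddings):
    """Pairwise cosine similarity between emoji embeddings, as a flat
    vector with one value per emoji pair (upper triangle of the matrix)."""
    return 1.0 - pdist(embeddings, metric="cosine")


def semipartial_spearman(human, model, control):
    """Semipartial (part) rank correlation: correlate human judgments with
    the residuals of `model` after regressing out `control` (e.g., fastText)."""
    m, c = rankdata(model), rankdata(control)
    beta = np.polyfit(c, m, deg=1)        # regress model ranks on control ranks
    residual = m - np.polyval(beta, c)    # keep what the control cannot explain
    return spearmanr(human, residual)[0]


# Placeholder data: 48 emojis -> 48 * 47 / 2 = 1128 pairs.
rng = np.random.default_rng(0)
clip_emb = rng.normal(size=(48, 512))      # hypothetical CLIP embeddings
fasttext_emb = rng.normal(size=(48, 300))  # hypothetical fastText embeddings
human_sim = rng.normal(size=48 * 47 // 2)  # stand-in for odd-one-out-derived similarities

rho_clip, _ = spearmanr(human_sim, model_similarity_vector(clip_emb))
sr_clip = semipartial_spearman(human_sim,
                               model_similarity_vector(clip_emb),
                               model_similarity_vector(fasttext_emb))
print(f"CLIP rho = {rho_clip:.2f}, semipartial sr = {sr_clip:.2f}")
```

With real embeddings and similarities derived from the odd-one-out choices in place of the random placeholders, the same three calls would carry out the kind of comparisons reported above.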

Acknowledgements: We dedicate this abstract to the memory of our co-author Dr. Bryor Snefjella, whose creativity, brilliance, insight, and generosity made this project possible. This work was supported by NSF Grant BCS-2142269 awarded to H.L.