Object and Scene Recognition Abilities Predict the Content and Quality of Image Descriptions

Poster Presentation 56.424: Tuesday, May 19, 2026, 2:45 – 6:45 pm, Pavilion
Session: Object Recognition: Models

Melina O. Mueller1, Ian G. Dobbins2, Isabel Gauthier1; 1Vanderbilt University, 2Washington University in St. Louis

Do non-verbal visual abilities predict how people describe what they see? We measured novel object recognition (o; simultaneous matching to silhouettes, sequential matching, and oddball tasks), scene recognition (sequential matching and oddball tasks with prepared-food and outdoor images), and intelligence (g) in 101 adults, then asked them to describe 10 complex images (e.g., a market, a coffee shop, an abstract sculpture), either from memory or during perception. We examined the quality of descriptions via semantic similarity to a large language model’s descriptions of the same images, and the content of descriptions via feature extraction (e.g., feature descriptors, hedging, spatial language). With hierarchical regression, we controlled for age, gender, vocabulary, description length, and g (and for o when scene recognition was the predictor). People who were better at recognizing scenes provided higher-quality descriptions of images both while viewing them (ΔR² = .093, p = .003) and from memory (ΔR² = .125, p < .001). Visual abilities also predicted how people described images. Those with higher o made more references to shape (ΔR² = .033, p = .048) and spatial relations (ΔR² = .040, p = .014), and used less hedging (ΔR² = .052, p = .016), in memory-based descriptions. Those with higher scene recognition made more references to texture (ΔR² = .062, p = .022) and to spatial relations (ΔR² = .035, p = .046) in memory-based descriptions, and used more adjectives in perception-based descriptions (ΔR² = .062, p = .012). Beyond intelligence and vocabulary, non-verbal visual abilities predict both how we write about images and the quality of these descriptions. One interpretation is that o supports the encoding of richer visual representations that can be accessed during linguistic output. These results have theoretical implications for the study of both vision and language, and practical implications for what we can learn about people using natural language processing.
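The abstract does not specify how semantic similarity to the LLM descriptions was computed. A minimal sketch of one common approach, assuming cosine similarity over sentence embeddings (the sentence-transformers library and the model name below are illustrative choices, not the authors' method):

```python
# Hypothetical sketch: score a participant's description by its semantic
# similarity to an LLM-generated reference description of the same image.
# The embedding model and cosine-similarity metric are assumptions; the
# abstract does not name the actual scoring pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def quality_score(participant_text: str, llm_reference: str) -> float:
    """Cosine similarity between embeddings of the two descriptions."""
    emb = model.encode([participant_text, llm_reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(quality_score(
    "A crowded market with fruit stalls and shoppers.",
    "An outdoor market scene where vendors sell fruit to a crowd of shoppers.",
))
```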
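The reported ΔR² values come from hierarchical regression: a covariates-only model is fit first, the visual-ability predictor is added, and the change in R² is tested. A minimal sketch of that logic, assuming OLS via statsmodels and illustrative column names (quality, scene_ability, etc.) rather than the authors' actual variables:

```python
# Hypothetical sketch of the hierarchical regression logic: fit a baseline
# model with covariates only, add the visual-ability predictor, and take
# the change in R-squared, tested with a nested-model F test.
import pandas as pd
import statsmodels.formula.api as smf

def delta_r2(df: pd.DataFrame):
    # Step 1: covariates only (age, gender, vocabulary, length, g).
    base = smf.ols(
        "quality ~ age + gender + vocabulary + length + g", data=df
    ).fit()
    # Step 2: add the visual-ability predictor of interest.
    full = smf.ols(
        "quality ~ age + gender + vocabulary + length + g + scene_ability",
        data=df,
    ).fit()
    f_stat, p_value, df_diff = full.compare_f_test(base)  # nested-model test
    return full.rsquared - base.rsquared, p_value
```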

Acknowledgements: This work was supported by the David K. Wilson Chair Research Fund and NSF BCS Award 2316474.