Evaluation of Model-Generated Saliency Maps for Affective Scenes

Undergraduate Just-In-Time Abstract

Poster Presentation 56.343: Tuesday, May 19, 2026, 2:45 – 6:45 pm, Banyan Breezeway
Session: Undergraduate Just-In-Time 3

Simona Sartan1 (simona.sartan@ufl.edu), Yujun Chen1, Peng Liu1, Ruogu Fang1, Mingzhou Ding1; 1UNIVERSITY OF FLORIDA

Saliency maps are bottom-up, image-computable priority maps that highlight locations in natural scenes likely to attract visual attention. Deep learning-based saliency models, trained predominantly on emotionally neutral images, have demonstrated strong performance in predicting human gaze patterns. Whether these models capture visual features that convey emotional significance in affective scenes is unclear. We addressed this by combining saliency-guided occlusion with large language model (LLM)-based emotion evaluation to test whether saliency maps from these models encode emotion-relevant information. Saliency maps for affective scenes from the International Affective Picture System (IAPS) were generated using three deep saliency models: DeepGaze IIE (Linardos et al., 2021), SalFBNet (Ding et al., 2022), and TranSalNet (Lou et al., 2022). A human eye-tracking saliency baseline from the EMOd dataset (Fan et al., 2018) was included for comparison. Three occlusion conditions were tested: occlusion guided by the original saliency map, occlusion of randomly relocated saliency patches preserving patch size and shape, and occlusion at random locations. Valence and arousal ratings for each occluded image were obtained from ChatGPT-3.5 using structured prompts. Correlations between LLM ratings and normative IAPS ratings were computed across occlusion levels. The results showed that saliency-guided occlusion produced steeper declines in both valence and arousal correlations than the two control conditions, confirming that model-identified salient regions carry emotionally significant information. This effect was more pronounced for arousal than for valence, suggesting that valence-relevant information is more spatially distributed across a scene. Performance under saliency-guided occlusion converged with the human gaze baseline, indicating that models trained on neutral scenes identify regions that overlap with those prioritized by human visual attention.
These findings demonstrate that (1) saliency models trained on neutral content capture emotionally critical visual locations in affective scenes and (2) LLM-based ratings offer a scalable and efficient alternative to human ratings for affective image analysis.
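
The occlusion-and-correlation procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: an occlusion level is treated as the fraction of image pixels masked, salient regions are selected by taking the highest-valued saliency pixels, occluded pixels are filled with a uniform gray value, and the correlation is Pearson's r. The function names (occlude_by_saliency, occlude_random, rating_correlation) are hypothetical. The random control here masks individual pixels, which is a simplification of the random-location condition, and the relocated-patch control is not shown.

```python
import numpy as np


def occlude_by_saliency(image, saliency, level, fill=0.5):
    """Mask the `level` fraction of pixels with the highest saliency values.

    image: (H, W) or (H, W, C) float array; saliency: (H, W) float array.
    The fill value and the top-k selection rule are illustrative assumptions.
    """
    k = int(round(level * saliency.size))
    out = image.copy()
    if k == 0:
        return out
    # Flattened indices of the k most salient pixels (order within the k is arbitrary).
    top = np.argpartition(saliency.ravel(), -k)[-k:]
    mask = np.zeros(saliency.size, dtype=bool)
    mask[top] = True
    out[mask.reshape(saliency.shape)] = fill
    return out


def occlude_random(image, level, rng, fill=0.5):
    """Control: mask the same fraction of pixels at uniformly random locations."""
    h, w = image.shape[:2]
    k = int(round(level * h * w))
    mask = np.zeros(h * w, dtype=bool)
    mask[rng.choice(h * w, size=k, replace=False)] = True
    out = image.copy()
    out[mask.reshape(h, w)] = fill
    return out


def rating_correlation(llm_ratings, normative_ratings):
    """Pearson correlation between LLM ratings and normative ratings."""
    return float(np.corrcoef(llm_ratings, normative_ratings)[0, 1])
```

In such a setup, each occluded image would be rated for valence and arousal, and rating_correlation would be evaluated once per occlusion level and condition, so that the decline in correlation can be compared between saliency-guided and control occlusion.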