Eye Movements during Free Viewing to Maximize Scene Understanding

Poster Presentation 53.411: Tuesday, May 21, 2024, 8:30 am – 12:30 pm, Pavilion
Session: Eye Movements: Natural world and VR

Shravan Murlidaran1, Miguel P. Eckstein1; 1University of California, Santa Barbara

Introduction: The extent to which eye movements during free viewing of scenes are influenced by low-level saliency (Parkhurst et al., 2002; Harel et al., 2007; Koehler et al., 2014), local semantic meaningfulness (Henderson et al., 2017; Peacock et al., 2019), or other processes is debated. Here, we hypothesize that during free viewing, humans direct their eyes to regions that maximize scene understanding rather than to locally salient or meaningful regions.

Methods: For each image (n = 36), we created a scene understanding map (SUM) that quantifies the contribution of each object to observers' (n = 110) scene descriptions (their global understanding of the scene) by digitally removing the object from the image and having eighteen raters judge the similarity between descriptions of the manipulated and original images. We compared the predictions of the SUM and of other models, including saliency (Graph-Based Visual Saliency, GBVS), DeepGaze, and local meaningfulness, against human fixations (n = 50 per task) during free-viewing (FV) and scene-description (SD) tasks. Images were presented for 2 seconds while eye position was recorded.

Results: In both the scene-description (SD) and free-viewing (FV) tasks, fixations to the region most critical to scene understanding (the top-ranked region in the SUM) were significantly more frequent than fixations to the top predictions of DeepGaze (pSD = 0.0035, pFV = 0.0025; significant from the 6th fixation onward, pSD = 0.013, pFV = 0.044), local meaningfulness (pSD = 0.00001, pFV < 0.00001; significant from the 4th fixation for SD and the 3rd for FV, pSD = 0.003, pFV = 0.019), and GBVS saliency (pSD < 0.00001, pFV < 0.00001; significant from the 4th fixation, pSD = 0.037, pFV = 0.007).

Conclusions: Our findings suggest that during free viewing, humans do not direct their eye movements to low-level salient or locally meaningful regions, but to image regions that maximize global understanding of the scene.
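
The sketch below illustrates one plausible way to implement the two analysis steps described above: (a) scoring each object's contribution to scene understanding from rater similarity judgments of descriptions of original versus object-removed images, and (b) comparing the proportion of observers' fixations that land in the top SUM region versus a competing model's top-predicted region. All variable names, data shapes, the similarity aggregation, and the choice of a paired t-test are illustrative assumptions; the abstract does not specify the authors' exact computations or released code.

```python
# Hypothetical sketch of the SUM scoring and fixation-comparison analyses.
# Data shapes, names, and the statistical test are assumptions, not the authors' code.
import numpy as np
from scipy import stats

def sum_scores(sim_original, sim_removed):
    """SUM score per object: drop in rated description similarity when that
    object is digitally removed, averaged over raters.
    sim_original: (n_raters,) ratings for descriptions of the intact image
    sim_removed:  (n_objects, n_raters) ratings for each object-removed version"""
    return sim_original.mean() - sim_removed.mean(axis=1)

def prop_fixations_in_region(fixations, region_mask):
    """Proportion of an observer's fixations landing inside a binary region mask.
    fixations: (n_fix, 2) array of (x, y) pixel coordinates"""
    xs = fixations[:, 0].astype(int)
    ys = fixations[:, 1].astype(int)
    return region_mask[ys, xs].mean()

# Illustrative comparison across observers using random placeholder data.
rng = np.random.default_rng(0)
n_observers, img_h, img_w, n_fix = 50, 768, 1024, 7

sum_top = np.zeros((img_h, img_w), dtype=bool)
sum_top[300:400, 400:550] = True        # assumed top-ranked SUM region
model_top = np.zeros((img_h, img_w), dtype=bool)
model_top[100:200, 100:250] = True      # assumed top region of a comparison model (e.g., DeepGaze)

per_obs_sum, per_obs_model = [], []
for _ in range(n_observers):
    fix = np.column_stack([rng.integers(0, img_w, n_fix),
                           rng.integers(0, img_h, n_fix)])
    per_obs_sum.append(prop_fixations_in_region(fix, sum_top))
    per_obs_model.append(prop_fixations_in_region(fix, model_top))

# Paired comparison of per-observer fixation proportions; the abstract reports
# significance tests but not the specific test, so a paired t-test is assumed here.
t, p = stats.ttest_rel(per_obs_sum, per_obs_model)
print(f"t = {t:.2f}, p = {p:.4f}")
```

In this sketch the same per-observer comparison would be repeated for each competing model (DeepGaze, local meaningfulness, GBVS) and for cumulative fixation counts (first fixation through the 6th) to trace when the SUM advantage reported in the Results emerges.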