Specifying the relationships between objects, gaze, and descriptions for scene understanding
63.403, Wednesday, May 15, 8:30 am - 12:30 pm, Orchid Ballroom
Kiwon Yun1, Yifan Peng1, Hossein Adeli2, Tamara Berg1, Dimitris Samaras1, Gregory Zelinsky1,2; 1Department of Computer Science, Stony Brook University, 2Department of Psychology, Stony Brook University
The objects that people choose to look at while viewing a scene provide an abundance of information about how a scene is ultimately understood. In Experiment 1, participants viewed a scene for 5 seconds, then described the scene's content, with this description being our estimate of their scene understanding. There were 104 scenes (selected from SUN09), spanning 8 scene types, and analyses were limited to 22 categories of common objects for which bounding box information was available. In Experiment 2, participants viewed 1000 scenes (from PASCAL VOC), each for 3 seconds, in anticipation of a memory test. Analyses were limited to 20 object categories, and descriptions were obtained using Mechanical Turk. In both experiments, we found that fixated objects also tended to be described (95.2% for PASCAL, 72.5% for SUN09) and described objects also tended to be fixated (86.6% for PASCAL, 73.7% for SUN09). Differences between experiments were likely due to the PASCAL images being less cluttered than the SUN09 images, thereby increasing the probability of fixations on selected objects. People also tended to look more often at animate objects (people, animals) or objects that conveyed animacy (televisions, computer monitors) than at inanimate objects (e.g., tables, rugs, cabinets). Furthermore, by analyzing where fixations typically fell within the bounding boxes for different categories of objects (using object-based fixation density maps), we were able to discern distinct category-specific patterns of fixation behavior. For example, fixations on tables and chairs tended to be distributed in the extreme upper halves of bounding boxes, reflecting the fact that things usually sit on these objects, whereas fixations on cats and cows were distributed along the horizontal midline, reflecting a center-of-mass looking bias. Collectively, these findings suggest that embedded in viewing behavior is information about the content of a scene and how a scene is being understood.
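The two overlap measures reported above (fixated objects that were described, described objects that were fixated) and the within-box fixation positions behind the density maps can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything here is an assumption for illustration: a fixation is counted as being on an object if it falls inside that object's bounding box, boxes are `(x_min, y_min, x_max, y_max)` in image pixels, and the function names are hypothetical.

```python
# Illustrative sketch only; not the authors' analysis code.
# Assumption: a fixation is "on" an object if it lies inside the bounding box.

def point_in_box(x, y, box):
    """Return True if point (x, y) lies inside box = (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def fixated_objects(fixations, objects):
    """Return the set of object names whose bounding box contains any fixation."""
    return {
        name
        for name, box in objects.items()
        if any(point_in_box(x, y, box) for x, y in fixations)
    }


def overlap_rates(fixated, described):
    """Return (P(described | fixated), P(fixated | described)).

    A rate is None when its denominator set is empty.
    """
    both = fixated & described
    p_described = len(both) / len(fixated) if fixated else None
    p_fixated = len(both) / len(described) if described else None
    return p_described, p_fixated


def normalized_position(fixation, box):
    """Map a fixation to (u, v) in [0, 1] relative to the bounding box.

    In image coordinates y grows downward, so v = 0 is the top edge of the box.
    """
    x, y = fixation
    x_min, y_min, x_max, y_max = box
    return (x - x_min) / (x_max - x_min), (y - y_min) / (y_max - y_min)


# Made-up example data for one scene and one viewer.
objects = {
    "cat": (10, 10, 50, 50),
    "table": (60, 60, 100, 100),
    "rug": (0, 80, 40, 100),
}
fixations = [(20, 30), (70, 65), (200, 200)]
described = {"cat", "rug"}

fixated = fixated_objects(fixations, objects)    # cat and table
rates = overlap_rates(fixated, described)        # (0.5, 0.5)
cat_uv = normalized_position((20, 30), objects["cat"])  # (0.25, 0.5)
```

Pooling the normalized `(u, v)` positions over all fixations on a category, then binning them in a 2D histogram, would give one plausible form of an object-based fixation density map. Under that convention, fixations in the upper half of a box have `v` below 0.5.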