The role of scene context in object recognition by humans and convolutional neural networks

Poster Presentation 43.310: Monday, May 22, 2023, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Models


Haley G. Frey¹, Hojin Jang¹, Hui-Yuan Miao¹, Frank Tong¹; ¹Vanderbilt University

Humans rarely need to recognize objects without a surrounding context. Previous research has shown that modifying scene information can reduce the speed and accuracy of object recognition in human observers. Although convolutional neural networks (CNNs) can attain near-human performance on simple object recognition tasks, it remains unclear whether these models of biological vision continue to reflect human abilities when objects appear in complex scenes. Here, we investigated the impact of visual clutter and semantic incongruence on object recognition accuracy in humans and CNNs. Eighteen undergraduate students and four CNNs implemented in PyTorch were shown 384 greyscale images, each consisting of a target object superimposed on a background scene. We manipulated the level of visual clutter, defined as the amount of texture, pattern, or excess information in an image, and the semantic congruency, defined as whether the object-scene pairing was realistic. The eight target categories consisted of animals (bear, bison, elephant, owl) and common indoor objects (lamp, teapot, vacuum, vase), which were presented in either outdoor nature scenes or indoor scenes. A separate group of participants rated the scenes on their degree of clutter, and the scenes were sorted into low- and high-clutter sets. We found that human observers performed significantly worse with increased clutter, yet CNN performance was unaffected by clutter. Interestingly, the CNNs showed significantly better classification accuracy for congruent than for incongruent object-scene pairings, whereas the human observers did not. However, human participants did show a congruency bias effect, choosing a congruent category over an incongruent one in a significant proportion of trials where they reported low confidence. Our findings reveal notable deviations between human and CNN object classification performance and indicate that CNN models do not process background scene context in the same way that humans do.
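For readers wishing to approximate the CNN side of this paradigm, the sketch below shows how a pretrained model might be evaluated on a single greyscale object-scene composite. This is a minimal sketch, assuming torchvision's pretrained ResNet-50 stands in for one of the four CNNs (the abstract does not name the architectures used); the image path and the mapping from ImageNet classes to the eight target categories are hypothetical placeholders, not the authors' actual stimuli or pipeline.

import torch
from torchvision import models, transforms
from PIL import Image

# Assumed architecture: ResNet-50 with ImageNet weights (the abstract does not
# specify which four CNNs were tested).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Replicate the single greyscale channel to the 3-channel input the model expects,
# then apply standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical stimulus path: a target object composited onto a background scene.
img = Image.open("stimuli/bear_high_clutter_congruent.png")

with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))

# ImageNet class index; mapping this onto the eight target categories
# (bear, bison, elephant, owl, lamp, teapot, vacuum, vase) is left as a
# study-specific step not described in the abstract.
pred = logits.argmax(dim=1).item()
print(pred)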

Acknowledgements: Supported by NIH grant R01EY029278