Visual Analogy Between Object Parts

Poster Presentation 43.301: Monday, May 22, 2023, 8:30 am – 12:30 pm, Banyan Breezeway
Session: Object Recognition: Models

Hongjing Lu1, Shuhao Fu1; 1University of California, Los Angeles

When asked, “If a tree had a knee, where would it be?” preschoolers can point to a sensible location. This example of visual analogical reasoning illustrates the human ability to find and exploit resemblances based on relations among entities, rather than solely on the entities themselves. But how do perception and reasoning systems work together to accomplish visual analogy from pixel-level inputs? To address this question, we developed a spatial mapping task that measures the consistency of human judgments in finding analogous parts between two images.

In Experiment 1 we used synthetic images of vehicles (cars, buses, motorcycles, bikes) generated from 3D object models. Participants were shown one image with two markers (each indicating a part of the object) and asked to place markers on the corresponding locations in a different image (400 × 300 pixels). Marker placements were highly consistent across participants, with small spatial variability (~15 pixels). Variability was lower when the two images showed the same rather than different object types, and when they showed objects from similar rather than different viewpoints. In Experiment 2 we used the same task with realistic images of vehicles and obtained similar results.

We then developed a computational model for visual analogy, which first decomposes an object into parts using a deep learning model for semantic part segmentation, and then builds a structural representation from both the visual features of parts and the spatial relations between them. These structural representations of objects are encoded as attributed graphs, and mapping is performed by a probabilistic graph matching algorithm. The model achieves close-to-human performance on the mapping task and predicts the influence of object type and viewpoint on mapping variability. These results support the essential role of structural representations of objects, derived from raw images, in performing downstream reasoning tasks.
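The abstract does not publish the model itself, but the attributed-graph matching idea can be illustrated with a minimal sketch. This is not the authors' probabilistic graph-matching algorithm: it is a hypothetical brute-force structural matcher over toy attributed graphs, in which each part node carries a visual feature vector and pairwise centroid displacements serve as the edge attributes (spatial relations). All names and the scoring weight `alpha` are assumptions for illustration only.

```python
import itertools
import numpy as np

def match_parts(feats_a, pos_a, feats_b, pos_b, alpha=0.5):
    """Brute-force structural matching between two small part sets.

    feats_*: (n, d) visual feature vector per part (node attributes)
    pos_*:   (n, 2) part centroids in image coordinates
    Returns the permutation p such that part i of object A maps to
    part p[i] of object B, maximizing node-feature similarity plus
    consistency of pairwise spatial relations (edge attributes).
    """
    n = len(feats_a)
    # Node affinity: negative Euclidean distance between part features.
    node = -np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):
        score = sum(node[i, perm[i]] for i in range(n))
        # Edge consistency: relative displacement between each pair of
        # parts should be preserved by the mapping (translation-invariant).
        for i in range(n):
            for j in range(i + 1, n):
                rel_a = pos_a[j] - pos_a[i]
                rel_b = pos_b[perm[j]] - pos_b[perm[i]]
                score -= alpha * np.linalg.norm(rel_a - rel_b)
        if score > best_score:
            best, best_score = perm, score
    return best

# Toy usage: object B contains the same three parts as A, stored in a
# different order and translated; the matcher recovers the correspondence.
feats_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pos_a = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
order = [2, 0, 1]
feats_b, pos_b = feats_a[order], pos_a[order] + 3.0
mapping = match_parts(feats_a, pos_a, feats_b, pos_b)
```

The brute force over permutations is only feasible for a handful of parts; the actual model uses a probabilistic graph matching algorithm, which scales this idea to realistic part graphs.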