Object Detection from Reference Images

The goal of this project at Diamond Age Technology was to develop a method to detect objects, using a few “reference” images of the same (or similar) objects. The application is an augmented reality tool to assist a person in performing a task. The idea is that a subject matter expert annotates an example object with subpart labels and instructions, and then another user can see these labels mapped to the particular object they are looking at.

This is very useful for many applications! For example, a factory supervisor wearing an AR headset could tag a leaking seal on a machine with a work order for a technician. Later, another technician also wearing an AR headset could then easily locate the leaking seal and understand the tasks to be performed by viewing the annotations that are displayed, registered to the object.

A key obstacle to deploying augmented reality is the effort to train the system to recognize new objects. To avoid the need for training, I use a small number of “reference images”, taken while the user places the annotations. As a result, the method can be quickly and easily applied to new objects. The images below show examples of matching a reference image of an object to a query image of a different instance of the object.

The method uses a convolutional neural network (CNN) to extract dense features from the images. It uses pre-trained features from the DINOv2 foundation model, that work across a wide variety of object types. I then use the RoMa (Robust Matching) network to find matching pixel pairs, even when they differ significantly in viewpoint, illumination, or scale.

I match each reference image against the query image and pick the reference image with the highest number of consistent matches, meaning feature pairs that have each other as their best match. Once I have the “warp” (or flow) field between the images, I can map the annotations from the reference image to the query image.

Video Examples

The videos below show the detection of an object using images from a hand-held phone camera. As shown in the guide in the adjacent figure, the inset image in the upper left of each video frame shows which reference image was matched. The inset image in the upper right of each video frame shows the matching features.

Legend for object detection videos below.


Fire Hose Valve

In this example, the reference object is a fire hose valve on one floor of a building, and the query object is an identical fire hose valve on another floor.

Reference

Query


In this example the type of the query object is the same as the type of the reference object.

Fire Hydrant - Example 1

Reference

Query


Fire Hydrant - Example 2

In this example the type of the query object is different than the type of the reference object.

Reference

Query