Paper Localizing Active Objects From Egocentric Vision - Blog of Kasra Darvish

This paper was published at ACL; you can find it using this link.

It’s super interesting to see how different research projects take advantage of LLMs (GPT) by prompting them for intermediate results and then using those results to solve another problem.


This paper tackles the problem of grounding (localizing) objects in images from text instructions by focusing on the pre- and post-conditions of the target objects; they call this active object grounding.

The authors introduce a novel prompting mechanism to extract pre- and post-conditions for the target objects, as well as additional knowledge about the target objects (based on the text instruction).

Their complete system consists of three parts: prompting GPT to acquire the information mentioned above; a GLIP model for multimodal grounding, with an added subcomponent for frame-type prediction (Pre / point-of-no-return (PNR) / Post); and a per-object knowledge aggregation step used to rank the regressed bounding boxes. The aggregation combines the frame-type prediction and the per-object masks via a dot product.
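To make the aggregation step concrete, here is a minimal sketch of how such a dot-product combination could look. The shapes, function name, and the final multiplication with the grounder's box confidences are my assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def aggregate_scores(frame_type_probs, object_masks, box_confidences):
    """Hypothetical per-object knowledge aggregation (sketch).

    frame_type_probs: (3,) softmax over the frame types [Pre, PNR, Post]
    object_masks:     (num_objects, 3) relevance of each object per frame type
    box_confidences:  (num_objects,) scores from the grounding model (e.g. GLIP)
    Returns a ranking of the candidate boxes and their final scores.
    """
    # Dot product of per-object masks with the frame-type prediction
    knowledge_scores = object_masks @ frame_type_probs  # -> (num_objects,)
    # Assumed combination with the grounder's own confidences
    final_scores = knowledge_scores * box_confidences
    ranking = np.argsort(-final_scores)  # best box first
    return ranking, final_scores

# Toy example: the model believes this is a "Pre" frame,
# and object 0 is the one relevant before the action happens.
frame_probs = np.array([0.7, 0.2, 0.1])
masks = np.array([[1.0, 0.5, 0.0],   # object 0: relevant pre-action
                  [0.1, 0.4, 1.0]])  # object 1: relevant post-action
box_conf = np.array([0.6, 0.9])
ranking, scores = aggregate_scores(frame_probs, masks, box_conf)
```

In this toy run, object 0 wins the ranking even though object 1 has the higher raw box confidence, which is the point of weighting by frame-type knowledge.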

