Paper Localizing Active Objects From Egocentric Vision - Blog of Kasra Darvish

This paper was published at ACL; you can find it using this link.

It’s super interesting to see how different research projects take advantage of LLMs (GPT) by prompting them for intermediate results and then using those results to solve another problem.


This paper tackles the problem of grounding (localizing) objects in images from text instructions by focusing on the pre- and post-conditions of the target objects; they call this active object grounding.

The authors introduce a novel prompting mechanism to extract pre- and post-conditions for the target objects, as well as additional knowledge about the target objects (based on the text instruction).

Their complete system consists of three parts: prompting GPT to acquire the information mentioned above; a GLIP model for multimodal grounding, with an added subcomponent for frame-type prediction (Pre / point-of-no-return (PNR) / Post); and a per-object knowledge aggregation step used to rank the regressed bounding boxes. The aggregation combines the frame-type prediction and the per-object masks via a dot product.
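To make the aggregation step concrete, here is a minimal sketch of how such a dot-product combination could look. The shapes, function name, and the final multiplication with the grounder's box confidences are my assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def aggregate_scores(frame_type_probs, object_masks, box_confidences):
    """Hypothetical per-object knowledge aggregation (sketch).

    frame_type_probs: (3,) softmax over the frame types [Pre, PNR, Post]
    object_masks:     (num_objects, 3) relevance of each object per frame type
    box_confidences:  (num_objects,) scores from the grounding model (e.g. GLIP)
    Returns a ranking of the candidate boxes and their final scores.
    """
    # Dot product of per-object masks with the frame-type prediction
    knowledge_scores = object_masks @ frame_type_probs  # -> (num_objects,)
    # Assumed combination with the grounder's own confidences
    final_scores = knowledge_scores * box_confidences
    ranking = np.argsort(-final_scores)  # best box first
    return ranking, final_scores

# Toy example: the model believes this is a "Pre" frame,
# and object 0 is the one relevant before the action happens.
frame_probs = np.array([0.7, 0.2, 0.1])
masks = np.array([[1.0, 0.5, 0.0],   # object 0: relevant pre-action
                  [0.1, 0.4, 1.0]])  # object 1: relevant post-action
box_conf = np.array([0.6, 0.9])
ranking, scores = aggregate_scores(frame_probs, masks, box_conf)
```

In this toy run, object 0 wins the ranking even though object 1 has the higher raw box confidence, which is the point of weighting by frame-type knowledge.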

