Open xiaobanni opened 10 months ago
The key difference is that we additionally use the CMR module to determine whether the objects detected by Grounding DINO are newly appearing objects in the video.
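The CMR module itself is not shown in this thread; purely as an illustration, a generic IoU-based check for whether a detection is a newly appearing object (i.e., it overlaps no currently tracked box) might look like the sketch below. The function names and the 0.5 threshold are my assumptions, not the project's actual logic.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def new_objects(detections: List[Box], tracked: List[Box],
                thresh: float = 0.5) -> List[Box]:
    # A detection counts as "new" if it overlaps no tracked box above thresh.
    return [d for d in detections
            if all(iou(d, t) < thresh for t in tracked)]
```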
So, if I only need to get the objects based on the text prompt, I just need to invoke Grounding DINO to get the boxes and then send them to SAM, right?
> So, if I only need to get the objects based on the text prompt, I just need to invoke the Grounding DINO to get the boxes and then send them to SAM, right?
Yes, you are right.
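The per-frame pipeline confirmed above can be sketched as follows. Note that `detect_boxes` and `segment_boxes` are hypothetical stubs standing in for the Grounding DINO and SAM calls, not real APIs from either project:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def detect_boxes(frame, prompt: str) -> List[Box]:
    # Stand-in for a Grounding DINO forward pass (hypothetical stub).
    return [(0.0, 0.0, 50.0, 50.0), (60.0, 60.0, 120.0, 120.0)]

def segment_boxes(frame, boxes: List[Box]):
    # Stand-in for SAM's box-prompted segmentation (hypothetical stub).
    return [("mask", box) for box in boxes]

def text_to_masks(frame, prompt: str):
    """Text prompt -> boxes (open-vocabulary detector) -> masks (segmenter)."""
    boxes = detect_boxes(frame, prompt)
    return segment_boxes(frame, boxes)
```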
I am a researcher in a different field who wants to use text-prompted segmentation on long videos. I have found that invoking Grounding DINO on every frame is time-consuming, which is an unacceptable cost for many applications. I wonder whether exploiting the temporal continuity of video could avoid the need to call Grounding DINO on each frame. However, this would require additional processing to detect when a new entity matching the text prompt appears. I hope researchers in this field will explore this kind of demand further, and I welcome any recommendations for suitable projects.
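One common way to realize the idea above is to run the expensive detector only on keyframes and propagate masks by tracking in between (new objects are then only picked up at the next keyframe). The sketch below uses hypothetical stubs (`detect_boxes`, `segment_boxes`, `propagate_masks`) in place of the real Grounding DINO / SAM / tracker calls; the scheduling logic is the point, not the stubs:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def detect_boxes(frame, prompt: str) -> List[Box]:
    # Stand-in for a Grounding DINO call (hypothetical stub).
    return [(0.0, 0.0, 10.0, 10.0)]

def segment_boxes(frame, boxes: List[Box]):
    # Stand-in for SAM box-prompted segmentation (hypothetical stub).
    return [("mask", box) for box in boxes]

def propagate_masks(prev_masks, frame):
    # Stand-in for a mask tracker / video propagation step (hypothetical stub).
    return prev_masks

def run_video(frames, prompt: str, detect_every: int = 10):
    """Call the detector only every `detect_every` frames;
    propagate the previous masks on all other frames."""
    masks, per_frame_masks, detector_calls = [], [], 0
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            boxes = detect_boxes(frame, prompt)
            masks = segment_boxes(frame, boxes)
            detector_calls += 1
        else:
            masks = propagate_masks(masks, frame)
        per_frame_masks.append(masks)
    return per_frame_masks, detector_calls
```

With `detect_every=10`, a 25-frame clip triggers only 3 detector calls instead of 25, at the cost of delayed discovery of new objects between keyframes.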
Thank you for the wonderful project. I noticed that it supports a text-prompted, automatic tracking mode, allowing me to obtain a segmentation mask for each frame based on a pre-provided text prompt. However, I would like to understand the key difference compared with invoking Grounded-SAM/Lang-SAM on each frame with the text prompt.