Open xiaobanni opened 10 months ago
The key difference is that we additionally use the CMR module to determine whether the objects detected by Grounding DINO are newly appearing objects in the video.
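The CMR module itself is not shown in this thread; purely as an illustration, a generic IoU-based check for whether a detection is a newly appearing object (i.e., it overlaps no currently tracked box) might look like the sketch below. The function names and the 0.5 threshold are my assumptions, not the project's actual logic.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def new_objects(detections: List[Box], tracked: List[Box],
                thresh: float = 0.5) -> List[Box]:
    # A detection counts as "new" if it overlaps no tracked box above thresh.
    return [d for d in detections
            if all(iou(d, t) < thresh for t in tracked)]
```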
So, if I only need to get the objects based on the text prompt, I just need to invoke Grounding DINO to get the boxes and then send them to SAM, right?
> So, if I only need to get the objects based on the text prompt, I just need to invoke the Grounding DINO to get the boxes and then send them to SAM, right?
Yes, you are right.
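The per-frame pipeline confirmed above can be sketched as follows. Note that `detect_boxes` and `segment_boxes` are hypothetical stubs standing in for the Grounding DINO and SAM calls, not real APIs from either project:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def detect_boxes(frame, prompt: str) -> List[Box]:
    # Stand-in for a Grounding DINO forward pass (hypothetical stub).
    return [(0.0, 0.0, 50.0, 50.0), (60.0, 60.0, 120.0, 120.0)]

def segment_boxes(frame, boxes: List[Box]):
    # Stand-in for SAM's box-prompted segmentation (hypothetical stub).
    return [("mask", box) for box in boxes]

def text_to_masks(frame, prompt: str):
    """Text prompt -> boxes (open-vocabulary detector) -> masks (segmenter)."""
    boxes = detect_boxes(frame, prompt)
    return segment_boxes(frame, boxes)
```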
I am a researcher in a different field who wants to use text-prompted segmentation on long videos. I have found that invoking Grounding DINO on every frame is time-consuming, which is an unacceptable cost for many applications. I wonder whether exploiting the temporal continuity of video could avoid the need to call Grounding DINO on each frame. However, this would require additional processing to detect when a new entity matching the text prompt appears. I hope researchers in this field will explore this kind of demand further, and I welcome any recommendations for suitable projects.
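One common way to realize the idea above is to run the expensive detector only on keyframes and propagate masks by tracking in between (new objects are then only picked up at the next keyframe). The sketch below uses hypothetical stubs (`detect_boxes`, `segment_boxes`, `propagate_masks`) in place of the real Grounding DINO / SAM / tracker calls; the scheduling logic is the point, not the stubs:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def detect_boxes(frame, prompt: str) -> List[Box]:
    # Stand-in for a Grounding DINO call (hypothetical stub).
    return [(0.0, 0.0, 10.0, 10.0)]

def segment_boxes(frame, boxes: List[Box]):
    # Stand-in for SAM box-prompted segmentation (hypothetical stub).
    return [("mask", box) for box in boxes]

def propagate_masks(prev_masks, frame):
    # Stand-in for a mask tracker / video propagation step (hypothetical stub).
    return prev_masks

def run_video(frames, prompt: str, detect_every: int = 10):
    """Call the detector only every `detect_every` frames;
    propagate the previous masks on all other frames."""
    masks, per_frame_masks, detector_calls = [], [], 0
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            boxes = detect_boxes(frame, prompt)
            masks = segment_boxes(frame, boxes)
            detector_calls += 1
        else:
            masks = propagate_masks(masks, frame)
        per_frame_masks.append(masks)
    return per_frame_masks, detector_calls
```

With `detect_every=10`, a 25-frame clip triggers only 3 detector calls instead of 25, at the cost of delayed discovery of new objects between keyframes.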
Thank you for the wonderful project. I noticed that it supports a text-prompted, automatic tracking mode, allowing me to obtain a segmentation mask for each frame based on a pre-provided text prompt. However, I would like to understand the key difference compared with invoking Grounded-SAM/Lang-SAM on each frame with the text prompt.