z-x-yang / Segment-and-Track-Anything

An open-source project dedicated to tracking and segmenting any objects in videos, either automatically or interactively. The primary algorithms utilized include the Segment Anything Model (SAM) for key-frame segmentation and Associating Objects with Transformers (AOT) for efficient tracking and propagation purposes.
GNU Affero General Public License v3.0

HIL full sequence prompting for drifts and ambiguities #10

Open bhack opened 1 year ago

bhack commented 1 year ago

Do you think that prompted frames (e.g. to fix drift) could be dynamically added to the long-term memory as reference frames? Or could we add this as an explicit option?

yoxu515 commented 1 year ago

Hi, the current implementation of automatic segmentation and tracking does add a long-term memory frame every `sam_gap` frames. However, we only select objects from the background as newly appeared objects, and the new reference masks only include these new objects. As a result, SamTrack is able to find new objects in a video and then track them.
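A minimal sketch of that loop (`sam_segment`, `find_new_objects`, and the `segtracker` methods are hypothetical stand-ins, not the actual SamTrack API):

```python
# Hypothetical sketch of the automatic pipeline described above.
SAM_GAP = 100  # run SAM once every SAM_GAP frames

for frame_idx, frame in enumerate(video_frames):
    if frame_idx % SAM_GAP == 0:
        everything_mask = sam_segment(frame)          # SAM "segment everything" on the key frame
        track_mask = segtracker.track(frame)          # masks propagated by AOT
        # Objects SAM finds inside the tracked background count as newly
        # appeared; only these go into the new reference mask.
        new_objects = find_new_objects(everything_mask, track_mask)
        segtracker.add_reference(frame, new_objects)  # long-term memory update
    else:
        segtracker.track(frame, update_memory=True)   # normal short-term propagation
```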

bhack commented 1 year ago

What is the plan for the interactive version? E.g., in your basketball player demo video there are also frames that need fixing with more prompting. Are these added to the long-term memory, or do you plan to just use the new segmentation mask as the previous/short-term memory?

yoxu515 commented 1 year ago

> What is the plan for the interactive version? E.g., in your basketball player demo video there are also frames that need fixing with more prompting. Are these added to the long-term memory, or do you plan to just use the new segmentation mask as the previous/short-term memory?

Adding prompted frames to the long-term memory may improve the results, so we may make it optional for users, since more long-term memory frames also require more computing resources.
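For example, such an option could surface as a single flag in the tracker configuration (the flag name below is hypothetical, not an existing option):

```python
# Hypothetical configuration: whether human-corrected (prompted) frames
# are stored in long-term memory, trading extra compute for quality.
segtracker_args = {
    "sam_gap": 100,                        # existing automatic-segmentation interval
    "prompted_frames_to_long_term": True,  # proposed optional flag for HIL frames
}
```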

bhack commented 1 year ago

Yes, I am thinking about the best pipeline for the interactive demo. Suppose that at frame X you want to fix segmentation errors / DeAOT drift with extra prompting, like fixing the segmentation between the fingers and the ball in your demo video: what do we then do with this HIL "fixed segmentation"?

[image]

yoxu515 commented 1 year ago

> Yes, I am thinking about the best pipeline for the interactive demo. Suppose that at frame X you want to fix segmentation errors / DeAOT drift with extra prompting, like fixing the segmentation between the fingers and the ball in your demo video: what do we then do with this HIL "fixed segmentation"?

Some extra point prompts may be useful to indicate the foreground and background around the fingers. The fixed segmentation mask can then be propagated to later frames, as either long-term or short-term memory.
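A minimal sketch of this correction step (all function names are hypothetical, not the actual SamTrack API):

```python
# Hypothetical HIL correction flow: re-segment the key frame with the
# user's clicks, then choose where the corrected mask lives in memory.
def fix_and_propagate(segtracker, frame, point_prompts, use_long_term=False):
    # Re-run SAM with positive/negative point prompts around the fingers.
    fixed_mask = sam_point_segment(frame, point_prompts)
    if use_long_term:
        # Keep the corrected frame as a reference for the rest of the video.
        segtracker.add_reference(frame, fixed_mask)
    else:
        # Only replace the short-term (previous-frame) memory.
        segtracker.update_short_term(frame, fixed_mask)
    return fixed_mask
```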

bhack commented 1 year ago

Yes, exactly. I am thinking about how to best use the extra HIL prompting in the sequence within the memory. Here is another frame of your demo video:

[image]

bhack commented 1 year ago

And here is another one:

[image]

So I think HIL prompting and the DeAOT memories will need to cooperate to recover from propagation failures.

yoxu515 commented 1 year ago

In the current implementation, we only use the first frame as long-term memory. For best performance, the fixed segmentation mask, as well as some intermediate frames, should be added to the long-term memory. But as I mentioned, more long-term memory will increase the computational burden. There is a trade-off between segmentation quality and memory capacity. We will work on this quality improvement after the basic framework of the interactive version is finished.
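For example, the trade-off could be expressed as a capped long-term memory; the following is a minimal, self-contained sketch (the class and names are hypothetical, not part of the current codebase):

```python
from collections import deque

# Hypothetical bounded long-term memory: the first frame is always kept,
# and at most `capacity - 1` extra reference frames (HIL fixes, sampled
# intermediate frames) are retained, evicting the oldest when full.
class BoundedLongTermMemory:
    def __init__(self, first_frame_ref, capacity=4):
        self.anchor = first_frame_ref            # first frame, never evicted
        self.extra = deque(maxlen=capacity - 1)  # drops oldest entry when full

    def add(self, frame_ref):
        self.extra.append(frame_ref)

    def references(self):
        return [self.anchor, *self.extra]
```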

bhack commented 1 year ago

Yes, I meant that it is not strictly required to add it to the long-term memory; it could also just go into the short-term one. But I think you need to pay attention to the temporal coherence of the segmentation, since the SAM encoder/decoder is not "propagation aware" the way the DeAOT encoder/decoder fine-tuned on DAVIS + YouTube-VOS is. So a SAM HIL mask will probably need to be encoded/decoded by the DeAOT encoder/decoder whenever we have an HIL frame. If not, after X HIL frames we will have a lot of temporal incoherence in the output segmentation sequence.
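Something like this minimal sketch of re-encoding the SAM HIL mask through the DeAOT path (all names here are hypothetical, not the actual DeAOT API):

```python
# Hypothetical sketch of the idea above: instead of pasting the raw SAM
# mask into the output sequence, pass it through the DeAOT encoder/decoder
# so the mask lives in the same feature space the tracker propagates.
def ingest_hil_frame(deaot, frame, sam_hil_mask):
    feat = deaot.encode_image(frame)                # visual embedding of the frame
    id_emb = deaot.encode_mask(sam_hil_mask, feat)  # mask/identification embedding
    coherent_mask = deaot.decode(feat, id_emb)      # temporally coherent output mask
    deaot.write_memory(feat, id_emb)                # store in short- or long-term memory
    return coherent_mask
```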

yoxu515 commented 1 year ago

> Yes, I meant that it is not strictly required to add it to the long-term memory; it could also just go into the short-term one. But I think you need to pay attention to the temporal coherence of the segmentation, since the SAM encoder/decoder is not "propagation aware" the way the DeAOT encoder/decoder fine-tuned on DAVIS + YouTube-VOS is. So a SAM HIL mask will probably need to be encoded/decoded by the DeAOT encoder/decoder whenever we have an HIL frame. If not, after X HIL frames we will have a lot of temporal incoherence in the output segmentation sequence.

Thank you for your advice; we will consider it while developing the interactive part.

bhack commented 1 year ago

I think this is important, as SAM+XMem has the same propagation errors to handle: https://github.com/gaomingqi/Track-Anything

yoxu515 commented 1 year ago

> I think this is important, as SAM+XMem has the same propagation errors to handle: https://github.com/gaomingqi/Track-Anything

Yes, we have also noticed this work, which has a similar idea to ours. Both are good starts for applying SAM to video segmentation, though we think AOT is better at processing multiple objects. Besides, ours supports automatic segmentation and tracking of all objects in the video.

bhack commented 1 year ago

Yes, this is true, but in any case we will never have a perfect propagation network, so handling HIL prompting over the sequence as well as possible, using the web UI and the tracker's memory, is a required step to recover from drifts and ambiguities.

bhack commented 1 year ago

It is similar to how they approach it in steps 3 and 4 (page 3): https://arxiv.org/abs/2304.11968

It would also be nice if you could compute and publish your SAM+DeAOTL baseline on common datasets, before the interactive refinement, as in that technical report.

bhack commented 1 year ago

I saw you mentioned something in the README with the demo 6 and demo 7 videos. What is the plan?