Thank you for your suggestion. I believe it is an excellent idea, and we will include this feature in our upcoming development plan. Currently, we are working on the development of an interactive editing WebUI. However, as our bandwidth is limited, we kindly ask for your patience as we work towards its implementation.
Thanks for your reply. I have implemented it myself, based on your code.
I am curious which one performs better on tracking tasks, your method or MOTRv2?
"Segment and Track Anything" aims to achieve open-world video object segmentation, requiring the algorithm to have the ability to generalize to untrained object categories. In contrast, MOTR is designed for MOT tasks, which usually only detect and track specific targets (usually people and vehicles) and do not consider generalization capabilities. Furthermore, MOTR uses a BBox-based solution instead of a segmentation-based solution for tracking, which may reduce tracking performance.
In the VOT2022 challenge, our segmentation-based tracking method, based on AOT, significantly outperformed all BBox-based methods in robustness (detailed in the Workshop presentation shown below). This is a new breakthrough for segmentation-based methods and deserves further research and consideration.
Thanks for your clear response!
I noticed that you use the first frame to generate better results. Could you please explain how that works?
Your current code can track every segment generated by SAM, but SAM needs a bbox to generate better results. I hope you can add a detection model to your code.
Thanks again for your patience.
I am a beginner in CV 😀. I wonder if there is a model that combines semantic segmentation with segmentation-based tracking. Or does your proposed MS_AOT run fast enough to serve as a module of such a model (for video annotation)? Thanks for your patience.
The first frame is a reference frame, in which the objects are given. There can be more than one "first frame", since new objects may appear at different times. SAM-Track uses SAM to detect new objects every N frames.
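For illustration, here is a minimal sketch of that loop. The `SAM_GAP` constant, the `tracker` object, and `sam_segment_everything` are hypothetical stand-ins for the real SAM-Track components, not the actual API:

```python
import numpy as np

SAM_GAP = 50  # run SAM every N frames to look for new objects (placeholder value)

def track_video(frames, tracker, sam_segment_everything):
    pred_masks = []
    for i, frame in enumerate(frames):
        if i % SAM_GAP == 0:
            # SAM proposes masks for everything visible in this frame.
            for mask in sam_segment_everything(frame):
                # A proposal that barely overlaps the already-tracked regions
                # is treated as a new object: this frame becomes an additional
                # "first frame" (reference) for it.
                if i == 0 or coverage(mask, pred_masks[-1]) < 0.1:
                    tracker.add_reference(frame, mask)
        # AOT-style propagation of all known objects to the current frame.
        pred_masks.append(tracker.track(frame))
    return pred_masks

def coverage(mask, tracked):
    # Fraction of the proposal already covered by tracked objects.
    inter = np.logical_and(mask, tracked > 0).sum()
    return inter / max(mask.sum(), 1)
```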
Currently, we have released a new version that supports interactive segmentation by click or brush (similar to BBox). We are considering adding support for detection models to provide BBox prompts in a future version.
The task you are talking about is similar to video semantic segmentation or video panoptic segmentation, for which our lab has released two large-scale datasets, VSPW and VIPSeg.
When annotating these datasets, only 1 in every 15 frames needs to be manually annotated, since we can use our AOT framework to propagate the annotations to all the other frames. In such short-term annotation scenarios, our latest DeAOT models can process 10 objects simultaneously in real time and predict results comparable to human annotations.
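As a concrete sketch of that annotation workflow (again assuming a hypothetical tracker with the same `add_reference`/`track` interface as above):

```python
ANNOTATION_STRIDE = 15  # one manual annotation every 15 frames

def propagate_annotations(frames, manual_masks, tracker):
    """manual_masks[k] is the human annotation for frame k * ANNOTATION_STRIDE."""
    results = []
    for i, frame in enumerate(frames):
        if i % ANNOTATION_STRIDE == 0:
            # Reset the reference at each manually annotated frame,
            # so propagation errors cannot accumulate over long spans.
            mask = manual_masks[i // ANNOTATION_STRIDE]
            tracker.add_reference(frame, mask)
            results.append(mask)
        else:
            # The tracker fills in the 14 frames between annotations.
            results.append(tracker.track(frame))
    return results
```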
I was writing code to use bboxes as prompts to refine our dataset annotations when I saw your reply. ORZZZZZ.
Is it convenient to use GroundingDINO to provide bboxes and run the next step with your current code?
Hi, the current version of our WebUI has integrated Grounding-DINO. You can use it according to the tutorial.
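For readers who prefer scripting it, here is a rough standalone sketch of the Grounding-DINO → SAM handoff. The WebUI wires this up for you; the config/checkpoint paths, text prompt, and thresholds below are placeholders:

```python
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("frame0.jpg")  # RGB numpy + preprocessed tensor

# Text prompt -> normalized cxcywh boxes for the matching objects.
boxes, logits, phrases = predict(
    model=dino, image=image, caption="person . car .",
    box_threshold=0.35, text_threshold=0.25,
)

# Convert to absolute xyxy pixels, the format SAM's box prompt expects.
h, w, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]),
                         in_fmt="cxcywh", out_fmt="xyxy").numpy()

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

# One mask per detected box; these masks can then initialize the tracker.
masks = [predictor.predict(box=box, multimask_output=False)[0][0]
         for box in boxes_xyxy]
```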
You guys are great!
I would like to ask how image-segmentation-based tracking methods solve the problem of temporal consistency of segmentation in videos. If the segmentation results for the same object vary greatly between frames, would that affect the tracking results?
For the same object, there is only one segmentation result from SAM. In other words, once an object has a mask, it can be tracked through the whole video. Therefore, temporal consistency in the following frames relies entirely on our tracking method, AOT.
Thanks for your reply.
You don't use the shape or appearance of the object for tracking? So AOT only uses the positions of the segmentations between frames to match objects?
AOT uses a hierarchical long-short-term transformer to propagate the masks between frames to get its results. You can find more details about how AOT tracks objects in the paper.
No, AOT uses both appearance and position to track objects. For the first frame in which an object appears, we use SAM to obtain the segmentation, which is used to initialize AOT. For the following frames, AOT tracks the object and outputs segmentations. As for temporal consistency, AOT has both global attention (searching for the object in the whole image) and local attention (searching for the object in a local window). These attentions are based on appearance matching. The motion of an object is preferred to be local, but this is not compulsory. Therefore, even if the position changes dramatically, it is still possible for AOT to track the object.
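To make the idea concrete, here is a toy PyTorch illustration of such appearance-based attention readout. This is not AOT's actual code; the shapes and names are invented for the example:

```python
import torch
import torch.nn.functional as F

def attention_readout(query_feat, memory_keys, memory_values):
    """
    query_feat:    (HW, C)   appearance features of the current frame
    memory_keys:   (T*HW, C) appearance features from past (reference) frames
    memory_values: (T*HW, C) ID/mask embeddings stored alongside those frames
    """
    # Similarity is computed purely from appearance, over the whole image
    # ("global attention"); restricting memory_keys to a window around each
    # query location would give the "local attention" variant.
    affinity = query_feat @ memory_keys.T / query_feat.shape[-1] ** 0.5
    weights = F.softmax(affinity, dim=-1)
    # Each current-frame location aggregates the ID embeddings of the past
    # locations it resembles; the result is later decoded into a mask.
    return weights @ memory_values
```

Because the matching is by appearance rather than by position, an object that jumps across the image can still be re-associated with its identity embedding from memory.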
Thanks for your reply. I think you missed my point about temporal consistency. You can have a look at this page: https://zhuanlan.zhihu.com/p/554572973.
I have tried your work on my own scenes, and I found that it makes mistakes on big static objects because of the segmentation method's poor temporal consistency. We can discuss this further; if you have time, please contact me through my email and I will send you my video.