z-x-yang / Segment-and-Track-Anything

An open-source project dedicated to tracking and segmenting any objects in videos, either automatically or interactively. The primary algorithms utilized include the Segment Anything Model (SAM) for key-frame segmentation and Associating Objects with Transformers (AOT) for efficient tracking and propagation purposes.
GNU Affero General Public License v3.0
2.84k stars 341 forks

Control inputs resolution #19

Open bhack opened 1 year ago

bhack commented 1 year ago

Can we control the final input resolution of SAM/DeAOT-L from the UX/params?

Can we enable Gradio zooming in the UX? Sometimes it is hard to click on fine details.

yamy-cheng commented 1 year ago

In the next version, we will add a feature that lets users control the resolution. In the meantime, you can scale the Gradio components indirectly by using your browser's zoom function.
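As a rough illustration of what "controlling the input resolution" could mean in practice, here is a minimal sketch (not SAMTrack's code). It assumes the control is a single "long side" parameter and that frames are resized with their aspect ratio preserved, similar in spirit to how SAM's reference preprocessing scales the longest image side to a fixed length. The function name `fit_long_side` and its parameters are hypothetical.

```python
from typing import Tuple


def fit_long_side(height: int, width: int, long_side: int = 1024) -> Tuple[int, int]:
    """Return (new_height, new_width) with the longer side scaled to `long_side`.

    Hypothetical helper for illustration; the aspect ratio is preserved
    and each dimension is rounded to the nearest pixel.
    """
    if height <= 0 or width <= 0 or long_side <= 0:
        raise ValueError("height, width and long_side must be positive")
    scale = long_side / max(height, width)
    # Round to the nearest integer, but never drop below one pixel.
    new_h = max(1, int(height * scale + 0.5))
    new_w = max(1, int(width * scale + 0.5))
    return new_h, new_w


if __name__ == "__main__":
    # Compare the frame size a 1080p video would get at two settings.
    for target in (512, 1024):
        print(target, fit_long_side(1080, 1920, target))
```

A larger `long_side` keeps more pixels on thin structures such as fingers, at the cost of memory and speed, which is the trade-off a resolution parameter would expose.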

bhack commented 1 year ago

[image]

Do you mean that you can click on fine fingers, like in the official tutorial screenshot, just by zooming the browser?

yamy-cheng commented 1 year ago

> [image]
>
> Do you mean that you can click on fine fingers, like in the official tutorial screenshot, just by zooming the browser?

Yes, all the examples in our official tutorial are reproducible.

bhack commented 1 year ago

No, I meant that the segmented fingers seem to be cut off. Are you able to fix the prompt just by zooming the browser? It seems hard to click on these small details.
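One way a UI could make such small details clickable is to show a zoomed crop of the frame and map each click back to full-frame coordinates before using it as a point prompt. The following is a minimal sketch under that assumption; `view_to_image` is a hypothetical helper, not part of SAMTrack or Gradio.

```python
from typing import Tuple


def view_to_image(
    click_xy: Tuple[float, float],
    crop_box: Tuple[int, int, int, int],
    view_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a click inside a zoomed view back to full-frame pixel coordinates.

    click_xy  : (x, y) of the click, in pixels of the displayed view
    crop_box  : (x0, y0, x1, y1) region of the original frame being shown
    view_size : (width, height) of the displayed view in pixels
    """
    cx, cy = click_xy
    x0, y0, x1, y1 = crop_box
    view_w, view_h = view_size
    if view_w <= 0 or view_h <= 0 or x1 <= x0 or y1 <= y0:
        raise ValueError("view_size and crop_box must describe non-empty areas")
    # Scale the click from view pixels to crop pixels, then offset by the crop origin.
    x = x0 + cx * (x1 - x0) / view_w
    y = y0 + cy * (y1 - y0) / view_h
    # Clamp so that clicks on the view border stay inside the crop.
    xi = min(max(int(x), x0), x1 - 1)
    yi = min(max(int(y), y0), y1 - 1)
    return xi, yi


if __name__ == "__main__":
    # A 200x100 region of the frame displayed at 4x magnification (800x400).
    print(view_to_image((400, 200), (100, 50, 300, 150), (800, 400)))
```

With a 4x zoom, a click that is off by four screen pixels lands only one frame pixel away from the intended point, which is what makes thin regions easier to hit.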

yamy-cheng commented 1 year ago

> No, I meant that the segmented fingers seem to be cut off. Are you able to fix the prompt just by zooming the browser? It seems hard to click on these small details.

I'm sorry for misunderstanding you. SAMTrack's ability to segment small details is constrained by SAM. We will add a feature that lets users paint to complete masks to our to-do list. This feature may meet your needs.

bhack commented 1 year ago

> The ability of SAMTrack to segment small details is constrained by SAM.

Yes, that is why we don't know how many clicks/strokes are needed, and we may need to zoom in.

If I remember correctly, the R50 DeAOT-L encoder/decoder config (1024px) has enough resolution to eventually propagate some fine details, right?