Mask size compared to image size affects tracking

yoxu515 / aot-benchmark

An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch

BSD 3-Clause "New" or "Revised" License

Mask size compared to image size affects tracking #46

Open rihns38 opened 1 year ago

rihns38 commented 1 year ago

Hi,

I have images where the masks are quite small and "round-ish" compared to the total size of the image itself (approx. 400 small masks per image and still a decent amount of background). When I try to use demo.py on the original image and masks, the tracking doesn't work too well.

However, when I crop the original image so that the size of the masks is much larger compared to the total size of the image (approx 10 large masks and some background), the tracking works much better.

Is there a way to change that "mask scaling factor" so that it works on my original images?

Thanks, Renaud

yoxu515 commented 1 year ago

Hi, may I know which AOT model you are using? You can change the --max_resolution argument of eval.py, which scales up the resolution of the frames and masks, as well as the feature maps in the AOT engine. However, a large resolution can be extremely heavy to process: it needs a lot of memory and is very slow.

rihns38 commented 1 year ago

Hi,

Thanks for the fast reply.

I am using the "R50_AOTL_PRE_YTB_DAV.pth" model.

Our images are 1040x1408 px and our mask sizes range from 300 to 3000 px.

In demo.py I see (lines 259 and 276):

    parser.add_argument('--max_resolution', type=float, default=480 * 1.3)
    cfg.TEST_MAX_SIZE = args.max_resolution * 800. / 480.
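For readers following along, here is a rough sketch of the arithmetic behind those two lines, assuming the two quoted expressions are products (480 * 1.3 and args.max_resolution * 800. / 480.). The helper names `long_edge_cap` and `resize_scale` are hypothetical, and the resize rule (short edge resized to max_resolution, long edge capped at TEST_MAX_SIZE) is an assumption about how the values are used, not something confirmed in this thread.

```python
def long_edge_cap(max_resolution: float) -> float:
    """Mirror the quoted expression: max_resolution * 800. / 480."""
    return max_resolution * 800.0 / 480.0


def resize_scale(height: int, width: int, max_resolution: float) -> float:
    """Estimate the scale applied to a frame.

    Assumption (not confirmed by the thread): the short edge is resized to
    max_resolution, unless that would push the long edge past the cap.
    """
    short_edge, long_edge = min(height, width), max(height, width)
    scale = max_resolution / short_edge
    cap = long_edge_cap(max_resolution)
    if long_edge * scale > cap:
        scale = cap / long_edge
    return scale


if __name__ == "__main__":
    height, width = 1040, 1408  # image size reported in the thread
    for factor in (1.3, 1.5, 2.0):
        max_resolution = 480 * factor
        scale = resize_scale(height, width, max_resolution)
        # Mask area shrinks with the square of the linear scale.
        print(f"factor={factor}: linear scale={scale:.3f}, "
              f"area scale={scale * scale:.3f}")
```

Under these assumptions, the default factor of 1.3 shrinks a 1040x1408 frame to about 60% of its linear size, so the smallest masks lose roughly two thirds of their pixel area before the model sees them.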

What values would you recommend for the default max resolution in the lines above, given our image and mask sizes?

Thanks, Renaud

yoxu515 commented 1 year ago

I think there are two factors:

1. Your GPU memory. A large size may lead to an out-of-memory error. You may try 1.4, 1.5, ... in place of the 1.3 factor until you reach the limit of your GPU memory.
2. The length of your video. AOT-L keeps adding new memory frames, so the memory will keep growing. You can change TEST_LONG_TERM_MEM_GAP from 5 to a larger value, or try DeAOT-B or smaller models, which only use the first frame and a previous frame as memory. BTW, DeAOT models are better than AOT: they run with less memory and give better performance.
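To give a feel for the second factor, here is a simplified sketch of how the number of stored memory frames could grow with video length. It assumes one reference frame plus one new memory frame every TEST_LONG_TERM_MEM_GAP frames; this is an illustrative model, not the engine's actual bookkeeping, and `long_term_memory_frames` is a hypothetical helper.

```python
def long_term_memory_frames(num_frames: int, mem_gap: int) -> int:
    """Rough count of memory frames kept after processing num_frames frames.

    Assumed model: the first frame is always stored, then one more frame
    is added every mem_gap frames.
    """
    if num_frames < 1 or mem_gap < 1:
        raise ValueError("num_frames and mem_gap must be positive")
    return 1 + (num_frames - 1) // mem_gap


if __name__ == "__main__":
    num_frames = 500  # example video length, chosen for illustration
    for gap in (5, 10, 20):
        count = long_term_memory_frames(num_frames, gap)
        print(f"gap={gap}: about {count} memory frames")
```

In this model the memory footprint scales inversely with the gap, which is why raising the gap from 5 to a larger value helps on long videos.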

rihns38 commented 1 year ago

Which of the DeAOT models would you recommend using? I tried SwinB-DeAOTL and it worked much better than AOT :) Any other tips for videos where objects appear in the center of the field of view from one frame to the next?

Thanks

yoxu515 commented 1 year ago

If you only care about performance, choosing SwinB-DeAOTL should be fine. The models mainly differ in their trade-off between performance and speed. If you are sure that objects only appear in the center, you may crop the frames and resize the cropped region to a larger resolution, which may benefit small objects.
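The crop-and-resize suggestion can be sketched in plain Python on a label mask stored as nested lists. This is a minimal illustration, not code from the repository: `crop` and `upscale_nearest` are hypothetical helpers, and the integer upscaling factor is an assumption made to keep the example short. Nearest-neighbor upscaling is used because it copies label values unchanged, so object IDs in the mask are preserved.

```python
def crop(grid, top, left, height, width):
    """Return the height x width window of grid starting at (top, left)."""
    return [row[left:left + width] for row in grid[top:top + height]]


def upscale_nearest(grid, factor):
    """Enlarge grid by an integer factor, repeating each value.

    Each cell becomes a factor x factor block, so no new label values
    are introduced.
    """
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


if __name__ == "__main__":
    # Tiny label mask: 0 is background, 1..4 are object IDs.
    mask = [
        [0, 0, 0, 0],
        [0, 1, 2, 0],
        [0, 3, 4, 0],
        [0, 0, 0, 0],
    ]
    center = crop(mask, top=1, left=1, height=2, width=2)
    for row in upscale_nearest(center, factor=2):
        print(row)
```

The same crop window would need to be applied to every frame and its mask so that the two stay aligned. In practice an image library would do the resizing, with nearest-neighbor interpolation for masks and a smoother interpolation for the frames.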