yoxu515 / aot-benchmark

An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch
BSD 3-Clause "New" or "Revised" License

0.5px/1px E/SE misalignment #47

Open bhack opened 1 year ago

bhack commented 1 year ago

Have you noticed in your experiment runs a 0.5px/1px misalignment bias in the right/bottom-right direction? I have observed this with both the aligned-corners and non-aligned-corners models you have released (e.g. R50/Swin DeAOTL). As this kind of error is very hard to debug, I want to know whether you have experienced anything like it on your side.

Thanks.

yoxu515 commented 1 year ago

Hi, thank you for pointing out the issue. Could you please share an example image? Where does the 0.5px/1px misalignment bias happen?

bhack commented 1 year ago

I will try to find an example on DAVIS to share. In the meantime, have you already experienced anything like this?

bhack commented 1 year ago

@yoxu515 @z-x-yang Just to give more evidence of this effect, I've replicated the same frame (0) from the DAVIS speed-skating sequence 100 times.

Here is the original annotation (frame 0/GT): [image: 00000]

Frame 0 after 50 propagations: [image: 00050]

Frame 0 after 99 propagations: [image: 00099]

Down/right accumulated drift 0-10: [image: diff10]

Down/right accumulated drift 0-50: [image: diff50]

Down/right accumulated drift 0-99: [image: diff99]
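For reference, a minimal sketch of how I produce these drift diffs, assuming the masks are single-channel palette PNGs as in DAVIS and using the file names listed above (the output name is illustrative):

```python
import numpy as np
from PIL import Image

# Diff the frame-0 ground truth against the same frame after N propagations;
# any systematic down/right band in the diff visualizes the accumulated drift.
gt = np.array(Image.open("00000.png"))     # frame 0 annotation (GT)
pred = np.array(Image.open("00099.png"))   # frame 0 after 99 propagations

drift = gt != pred                         # pixels whose label changed
Image.fromarray(drift.astype(np.uint8) * 255).save("diff99.png")
```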

bhack commented 1 year ago

Any feedback on this?

bhack commented 9 months ago

@yoxu515 There is a rounding error in the eval engine and the related interpolations: it occurs, and is reproducible, when the input W or H is not divisible by 16.
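For illustration, a minimal sketch of the round trip I mean, assuming the eval engine downsamples masks to a stride-16-aligned resolution and upsamples them back with legacy nearest interpolation (the sizes and object here are made up):

```python
import torch
import torch.nn.functional as F

h, w = 480, 910                        # 910 is not divisible by 16
mask = torch.zeros(1, 1, h, w)
mask[:, :, 200:280, 400:480] = 1.0     # a square object

# Round-trip through a stride-16-aligned resolution, as an eval loop
# effectively does for every propagated frame.
ah, aw = (h // 16) * 16, (w // 16) * 16    # 480 x 896
small = F.interpolate(mask, size=(ah, aw), mode="nearest")
back = F.interpolate(small, size=(h, w), mode="nearest")

orig_cols = (mask[0, 0] > 0).any(dim=0).nonzero().squeeze(1)
back_cols = (back[0, 0] > 0).any(dim=0).nonzero().squeeze(1)
print(orig_cols[0].item(), orig_cols[-1].item())   # 400 479
print(back_cols[0].item(), back_cols[-1].item())   # 401 480 -- one pixel to the right
```

Since the height is divisible by 16 the rows come back unchanged, so the whole shift lands on the non-divisible axis, and repeating the round trip accumulates it in one direction.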

s-deeper commented 8 months ago

@z-x-yang Other than some edge-case precision issues around the max_stride alignment, you are also affected by https://github.com/pytorch/pytorch/issues/34808.
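For context, a small sketch of the pixel-mapping difference that issue describes, assuming PyTorch >= 1.11 where mode="nearest-exact" is available:

```python
import torch
import torch.nn.functional as F

x = torch.arange(8, dtype=torch.float32).view(1, 1, 1, 8)

down = F.interpolate(x, size=(1, 4), mode="nearest")              # legacy mapping
down_exact = F.interpolate(x, size=(1, 4), mode="nearest-exact")  # matches PIL/scikit-image

print(down)        # tensor([[[[0., 2., 4., 6.]]]]) -- biased toward the top-left source pixel
print(down_exact)  # tensor([[[[1., 3., 5., 7.]]]]) -- centered sampling
```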

z-x-yang commented 8 months ago

@yoxu515 There is a rounding error in the eval engine and the related interpolations: it occurs, and is reproducible, when the input W or H is not divisible by 16.

Thank you for your ongoing attention to this issue. To be honest, at this point I am also still trying to understand why this misalignment occurs. Perhaps, as @s-deeper commented, the nearest interpolation in PyTorch could introduce misalignment, leading to this situation. However, AOT should be able to learn to eliminate this misalignment during training, unless there is a lack of strict alignment between the training and testing settings.

As far as I remember, the handling of mask interpolation in both the training and testing processes of AOT should be consistent. In any case, I will pay closer attention to this issue. Thank you!

z-x-yang commented 8 months ago

@yoxu515 There is a rounding error in the eval engine and the related interpolations: it occurs, and is reproducible, when the input W or H is not divisible by 16.

Furthermore, has this kind of misalignment caused any difficulties for you in the actual use of DeAOT? If not, I don't believe it's a critical issue.

Indeed, I have also noticed in some early DeAOT experiments that when the video frame rate is very high and the target remains stationary, there is some weird drift in the segmentation mask. However, in the released versions of DeAOT on YouTube-VOS and DAVIS, this issue does not arise (though I am not sure whether it persists in videos with even higher frame rates or smaller object movements).

z-x-yang commented 8 months ago

@z-x-yang Other than some edge-case precision issues around the max_stride alignment, you are also affected by pytorch/pytorch#34808.

Thank you for pointing out the bug in PyTorch that I had not previously noticed! I will review the relevant code and strive to prevent all unexpected misalignments.

bhack commented 8 months ago

You can reproduce it exactly with the current eval code.

It is certainly partially solved by https://github.com/pytorch/pytorch/issues/34808#issuecomment-1007806783

But I think you still have a residual edge case / side effect from using np.around: https://github.com/yoxu515/aot-benchmark/blob/ada8a3cbf0ba6dde563a49e78e56dbbcde01d143/dataloaders/video_transforms.py#L640-L655

Can you increase the precision there?
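To make the point concrete, a hypothetical sketch of how that size computation could be hardened; aligned_size and its exact rounding scheme are my assumptions, not the repo's actual code:

```python
import numpy as np

def aligned_size(h, w, short_edge, max_stride=16):
    # Hypothetical replacement: do the arithmetic in float64 and round once
    # per side, so float32 half-way cases can't flip the result by a pixel.
    scale = np.float64(short_edge) / np.float64(min(h, w))
    new_h = int(np.around(np.float64(h) * scale))
    new_w = int(np.around(np.float64(w) * scale))
    # Snap both sides to the network stride explicitly, instead of letting
    # later interpolations absorb the remainder.
    new_h = max(max_stride, int(np.around(new_h / max_stride)) * max_stride)
    new_w = max(max_stride, int(np.around(new_w / max_stride)) * max_stride)
    return new_h, new_w

print(aligned_size(480, 910, 465))   # (464, 880)
```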

Also, there is another issue in training: https://github.com/pytorch/pytorch/issues/104157

You need to use something like: https://github.com/huggingface/transformers/pull/28504/files#r1455033425
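If it helps, a hedged sketch of the upcast-around-interpolate pattern that review comment appears to point at; the wrapper name and the exact dtype handling are my assumptions:

```python
import torch
import torch.nn.functional as F

def interpolate_fp32(x, **kwargs):
    # Hypothetical workaround: run the interpolation in float32 and cast
    # back, sidestepping reduced-precision interpolate issues during training.
    if x.dtype in (torch.float16, torch.bfloat16):
        return F.interpolate(x.float(), **kwargs).to(x.dtype)
    return F.interpolate(x, **kwargs)
```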