z-x-yang / CFBI

The official implementation of CFBI(+): Collaborative Video Object Segmentation by (Multi-scale) Foreground-Background Integration.
BSD 3-Clause "New" or "Revised" License

Foreground and background prediction confusion #16

Closed · aiwithshekhar closed this issue 4 years ago

aiwithshekhar commented 4 years ago

I have tried the DAVIS-2017 test-dev sequence named horsejump-stick, and there the algorithm's prediction confuses the foreground and background. Below are a few images. Can you please let me know what may have caused this?

[attached prediction images for frames 00025 and 00030]

z-x-yang commented 4 years ago

Did you use the pretrained ResNet101-CFBI-DAVIS? I found that the link for ResNet101-CFBI-DAVIS was the same as the one for ResNet101-CFBI. This was a mistake, and I have updated the link.

ResNet101-CFBI was not fine-tuned on DAVIS and did not learn this case, which leads to limited performance (about 3 points lower than ResNet101-CFBI-DAVIS).
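
If you want to sanity-check which checkpoint you actually downloaded, here is a minimal, generic PyTorch sketch (not part of the CFBI codebase; the file name is an assumption) that inspects a `.pth` file:

```python
# Generic PyTorch checkpoint inspection, not CFBI-specific.
# The file name 'resnet101_cfbi_davis.pth' is an assumption.
import torch

ckpt = torch.load('resnet101_cfbi_davis.pth', map_location='cpu')
# Some checkpoints wrap the weights in a 'state_dict' entry.
state = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
print(len(state), 'tensors; sample keys:', list(state)[:3])
```

Two checkpoints with identical tensor counts and keys but different values are different trainings, so comparing a few printed statistics can quickly reveal whether two download links pointed at the same file.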

z-x-yang commented 4 years ago

CFBI is effective in relieving the background confusion problem, but this doesn't mean that CFBI can totally overcome it. In some cases, background confusion can be an ill-posed problem.

But the case in your example does not seem too hard; I suppose training with more similar scenes should overcome it. There seem to be few shots containing similar horses in either DAVIS or YouTube-VOS.

aiwithshekhar commented 4 years ago

Thanks for replying. Actually, for the above prediction I had used mobilenetv2_cfbi_davis. I tried the above-mentioned YouTube-VOS pretrained weights (resnet101_cfbi) and the results look good. [attached result images for frames 00025 and 00030]

z-x-yang commented 4 years ago

Reasonable.

MobileNet-V2 is much weaker than ResNet-101.

z-x-yang commented 4 years ago

I suppose your question has been solved, and I'll close this issue. THX.

aiwithshekhar commented 4 years ago

@z-x-yang One last question: for YouTube-VOS we have a meta.json file where information about every object's frame numbers is provided. But for any test sequence, it would be a little difficult to know this information beforehand. Can you suggest something for this?

z-x-yang commented 4 years ago

For each test sequence, an object's frame range is the whole sequence, even if the object is fully occluded in some frames.

This is reasonable. Just as you will not know when an object will be occluded in the real world, you have to process all the frames.
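
In practice this just means assigning every annotated object the full frame list of the sequence. A minimal sketch (the helper and directory layout below are hypothetical, not CFBI code):

```python
# Hypothetical helper: treat every object as present in every frame of a
# new test sequence, since occlusions are unknown in advance.
import os

def full_sequence_frames(seq_dir, object_ids):
    """Map each object id to the full frame range of the sequence."""
    frames = sorted(f[:-4] for f in os.listdir(seq_dir) if f.endswith('.jpg'))
    return {obj_id: {'frames': frames} for obj_id in object_ids}

# e.g. meta_objects = full_sequence_frames('my_seq/JPEGImages', ['1', '2'])
```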

aiwithshekhar commented 4 years ago

Thanks for answering! What I meant was: take sequence "00f88c4f0a" in YouTube-VOS. In meta.json we explicitly provide information about when the rider appears (10th frame) and when the skateboard appears (20th frame). But for any new test sequence, knowing this information beforehand will be difficult.
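
For reference, a short sketch of reading that per-object appearance information from meta.json; the field names follow the released YouTube-VOS annotation format, but verify them against your local copy, and the path is an assumption:

```python
# Read per-object first-appearance frames from the YouTube-VOS meta.json.
import json

with open('train/meta.json') as f:  # path is an assumption
    meta = json.load(f)

objects = meta['videos']['00f88c4f0a']['objects']
for obj_id, info in objects.items():
    # 'frames' lists the annotated frame names for this object;
    # its first entry is where the object first appears.
    print(obj_id, info.get('category'), 'first annotated at', info['frames'][0])
```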

z-x-yang commented 4 years ago

It is necessary for semi-supervised video object segmentation to know when an object appears, together with the object's mask at that time. Semi-supervised VOS is only responsible for tracking and segmenting your object(s) after being given such information.

If you don't know when an object appears, you have to detect it yourself, and that task is video instance segmentation.

aiwithshekhar commented 4 years ago

Thanks a lot for your patience throughout.