Closed aiwithshekhar closed 4 years ago
Did you use the pretrained ResNet101-CFBI-DAVIS? I found that the link for ResNet101-CFBI-DAVIS was the same as for ResNet101-CFBI. This was a mistake, and I have updated the link.
ResNet101-CFBI was not finetuned on DAVIS and did not learn this case, which leads to limited performance (about 3 points lower than ResNet101-CFBI-DAVIS).
CFBI is effective in relieving the background-confusion problem, but this doesn't mean CFBI can overcome it entirely. In some cases, background confusion is an ill-posed problem.
However, the case in your example does not seem too hard; I suppose training with more similar scenes should overcome it. There seem to be few shots containing similar horses in either DAVIS or YouTube-VOS.
Thanks for replying. Actually, for the above prediction I had used mobilenetv2_cfbi_davis. I tried the YouTube-pretrained weights mentioned above (resnet101_cfbi), and the results look good.
Reasonable.
MobileNet-V2 is much weaker than ResNet-101.
I suppose your question has been solved, and I'll close this issue. THX.
@z-x-yang One last question: for YouTube-VOS we have a meta.json file that provides the frame at which every object appears. But for an arbitrary test sequence, it would be a little difficult to know this information beforehand. Can you suggest something for this?
For each test sequence, the frame range of every object covers the whole sequence, even if the object is fully occluded in some frames.
This is reasonable: just as in the real world, you don't know when an object will be occluded, so you have to process all the frames.
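To illustrate the point above: when appearance times are unknown for a new test sequence, one simple option is to treat every labeled object as present for the whole sequence. A minimal sketch, where `full_sequence_meta` is a hypothetical helper (not part of CFBI) and the frame names are illustrative:

```python
import json

def full_sequence_meta(frame_names, object_ids):
    """Build a meta.json-style entry in which every object spans all frames."""
    return {
        "objects": {
            # Keys are stringified object IDs, mirroring the YouTube-VOS style.
            str(obj_id): {"frames": list(frame_names)}
            for obj_id in object_ids
        }
    }

# Every object is assumed present in every frame, occluded or not.
frames = ["00000", "00005", "00010"]
meta = full_sequence_meta(frames, [1, 2])
print(json.dumps(meta, indent=2))
```

This simply encodes "process all the frames for every object"; the tracker is then responsible for producing empty masks in frames where an object is occluded.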
Thanks for answering! What I meant was: take sequence "00f88c4f0a" in YouTube-VOS. In meta.json we explicitly provide the frame at which the rider appears (10th frame) and the frame at which the skateboard appears (20th frame). But for a new test sequence, knowing this information beforehand would be difficult.
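For reference, reading each object's first-appearance frame out of a meta.json entry like the one discussed above can be sketched as follows. The JSON layout here is an assumption based on the YouTube-VOS style described in this thread (the exact keys and frame lists may differ in your copy of the dataset):

```python
import json

# Assumed meta.json layout; frame names and lists are illustrative only.
meta_text = """
{
  "videos": {
    "00f88c4f0a": {
      "objects": {
        "1": {"frames": ["00000", "00005", "00010"]},
        "2": {"frames": ["00010", "00015"]}
      }
    }
  }
}
"""

meta = json.loads(meta_text)

# First frame in which each object appears in sequence 00f88c4f0a.
first_seen = {
    obj_id: info["frames"][0]
    for obj_id, info in meta["videos"]["00f88c4f0a"]["objects"].items()
}
print(first_seen)  # {'1': '00000', '2': '00010'}
```

As the next reply explains, semi-supervised VOS assumes this per-object appearance information (plus a mask) is given; only then can it track each object forward.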
It is necessary for semi-supervised video object segmentation to know when an object appears, together with its mask at that time. Semi-supervised VOS is only responsible for tracking and segmenting your object(s) after being given such information.
If you don't know when an object appears, you have to detect it yourself, and that task is video instance segmentation.
Thanks a lot for your patience throughout.
I tried the DAVIS-2017 test-dev sequence named horsejump-stick, where the algorithm's prediction confuses the foreground and background. Below are a few images. Can you please let me know what may have caused this?