zifuwanggg / JDTLosses

Optimization with JDTLoss and Evaluation with Fine-grained mIoUs for Semantic Segmentation

Adding Mask2Former code #1

Closed — antopost closed this 1 month ago

antopost commented 3 months ago

First of all, very interesting and relevant work!

In your review rebuttal you stated that you were planning on adding Mask2Former to your evaluation. Is there a timeline for that?

I would like to do knowledge distillation with Mask2Former, but I am stuck on the Hungarian Matching problem with soft labels.

zifuwanggg commented 3 months ago

Thank you for your interest in our work.

Could you please elaborate on the specific issues you encountered when implementing knowledge distillation for Mask2Former? Mathematically, the Hungarian matching algorithm should work with soft labels. You just compute the loss with soft labels and then perform the Hungarian matching algorithm based on this computed loss.
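
In pseudocode, the matching step does not care whether the target masks are hard or soft. A rough sketch (not the actual Mask2Former matcher; `matching_loss` stands for whatever combination of Dice/cross-entropy costs you use):

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def match(pred_masks, soft_target_masks, matching_loss):
    # pred_masks: (num_queries, H, W); soft_target_masks: (num_targets, H, W) with values in [0, 1].
    # The cost is just the loss evaluated on every (prediction, target) pair;
    # nothing in this step requires the targets to be binary.
    cost = torch.stack([
        torch.stack([matching_loss(p, t) for t in soft_target_masks])
        for p in pred_masks
    ])  # (num_queries, num_targets)
    pred_idx, target_idx = linear_sum_assignment(cost.cpu().numpy())
    return pred_idx, target_idx
```

The losses are then computed on the matched pairs exactly as in the supervised case.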

While the implementation of the Hungarian matching algorithm with soft labels is relatively straightforward, integrating matching-based architectures into the JDTLosses codebase can be time-consuming. That's why I haven't done that yet and I cannot provide a definite timeline for this integration.

antopost commented 3 months ago

Thanks for the quick response.

Yes, mathematically, training Mask2Former with soft labels works. However, in practice I get poor results.

In my version, I simply pass soft labels to the loss function (using the Huggingface implementation) instead of hard labels. The only other difference is that the HF implementation requires a list of the class ids present in the training image to be passed. Because in a label-free setting I don't know which classes are in the image, I pass all class ids. Could this be the issue? Or perhaps my teacher network just isn't well trained enough.
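
Simplified, what I do looks roughly like this (not my exact code; the checkpoint name and shapes are just for illustration):

```python
import torch
from transformers import Mask2FormerForUniversalSegmentation

model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-tiny-ade-semantic"
)

pixel_values = torch.randn(1, 3, 256, 256)   # one training image
teacher_probs = torch.rand(150, 256, 256)    # soft per-class masks from the teacher, in [0, 1]
all_class_ids = torch.arange(150)            # label-free: I pass every class id

outputs = model(
    pixel_values=pixel_values,
    mask_labels=[teacher_probs],             # soft masks instead of hard 0/1 masks
    class_labels=[all_class_ids],            # one class id per mask
)
loss = outputs.loss
```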

I also found this paper that performs knowledge distillation with Mask2Former for instance segmentation. However, they filter out low-confidence masks and make all the masks below the confidence threshold hard, if I understand correctly. It makes me wonder whether there was a reason they didn't just keep the soft labels instead.

If you have any ideas, I'd be happy to hear them. Otherwise, feel free to close this issue :)

zifuwanggg commented 3 months ago

In a classical supervised setting, for each class $c$ you have a network prediction $p$ and a class mask $m$. You calculate the matching loss, e.g. DICE($p$, $m$), and you also pass the class index $c$ so the Hungarian matching can be performed. With soft labels, you calculate the matching loss on the soft mask, e.g. DICE($p$, $0.9m$), but you still pass the same class index $c$.
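
As a toy sketch (made-up shapes, and the class cost here is just minus the predicted class probability, not the exact Mask2Former cost):

```python
import torch

def dice_loss(p, m, eps=1.0):
    # p: predicted probabilities, m: target mask (hard 0/1 or soft in [0, 1]), both (H, W)
    return 1 - (2 * (p * m).sum() + eps) / (p.sum() + m.sum() + eps)

def pair_cost(p, class_logits, m, c):
    # The same formula covers the supervised case (m is 0/1) and the soft-label
    # case (e.g. 0.9 * m); the class index c enters the classification cost unchanged.
    return dice_loss(p, m) - class_logits.softmax(-1)[c]

p = torch.rand(64, 64)                              # one query's mask prediction
class_logits = torch.randn(150)                     # that query's class logits
m = (torch.rand(64, 64) > 0.5).float()              # hard target mask
c = 7                                               # class index of the target
hard_cost = pair_cost(p, class_logits, m, c)        # DICE(p, m) + class cost for c
soft_cost = pair_cost(p, class_logits, 0.9 * m, c)  # DICE(p, 0.9m) + the same class cost
```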

If you weight your KD losses equally, e.g. 0.5DICE(S, T) + 0.5DICE(S, GT), a poorly trained teacher should hurt the result by at most a few percentage points.

Filtering out low-confidence masks could have some effect, because low-confidence masks can drastically increase the number of active classes in a mini-batch. I did something similar in my implementation. You may also find this useful: Which classes should I use to calculate the loss value? At least on classical datasets with a large number of classes, such as ADE20K and COCO-Stuff, this can have a huge effect. However, I don't think you have to make all masks hard, as that could diminish the effect of KD. It would indeed simplify the implementation, and I suspect that might be why they did so.
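
A rough sketch of what I mean (the 0.5/0.5 weights and the confidence threshold are just examples, not a recommendation):

```python
import torch

def dice_loss(p, t, eps=1.0):
    # p: student probabilities, t: target mask (hard GT or soft teacher output), both (H, W)
    return 1 - (2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)

def kd_loss(student_probs, teacher_probs, gt_mask, alpha=0.5):
    # Equal weighting of the teacher and ground-truth terms: a mediocre teacher
    # can only shift the total loss so much.
    return alpha * dice_loss(student_probs, teacher_probs) + \
           (1 - alpha) * dice_loss(student_probs, gt_mask)

def filter_teacher_masks(teacher_masks, teacher_scores, threshold=0.5):
    # teacher_masks: (num_masks, H, W) soft masks; teacher_scores: (num_masks,) confidences.
    # Dropping low-confidence masks keeps the number of active classes in a
    # mini-batch small; the masks that survive can stay soft.
    keep = teacher_scores > threshold
    return teacher_masks[keep], keep.nonzero(as_tuple=True)[0]
```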

In your setting, passing all class indexes could also be an issue. I haven't checked the HF code, but these indexes can be a source of bugs and misuse, and I would be very careful with them. To debug your code, you might try setting the softness of the labels to a very low value: if the implementation is correct, such nearly hard soft labels should not drastically change the results.
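
For example (just a sanity check, not part of the training recipe; the epsilon value is arbitrary):

```python
import torch

def soften(hard_mask, epsilon=0.01):
    # hard_mask: (H, W) with values in {0, 1}. With a tiny epsilon the labels are
    # "almost hard": if the soft-label code path is correct, training on
    # soften(gt_mask) should give nearly the same results as training on gt_mask.
    return hard_mask * (1.0 - epsilon) + (1.0 - hard_mask) * epsilon
```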

zifuwanggg commented 1 month ago

I’m closing this issue due to inactivity. If you still need assistance or have additional information to provide, please feel free to reopen the issue or create a new one.