Hi @297774951
Instead of an iteration-wise update, we compute each teacher's parameters in an epoch-wise manner.
The teacher's parameters rely heavily on the student's, since they are updated via exponential moving average. The SGD optimiser, together with the strong augmentations (including CutMix and colour jittering), encourages the student to learn different parameters in different epochs, which in turn yields different parameters for the two teachers and leads to a relatively higher divergence.
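To make the schedule concrete, here is a minimal sketch (not the repository code) of an epoch-alternating EMA update for dual teachers; the module and variable names (`student`, `teacher_a`, `teacher_b`, `ema_update`) and the exact alternation schedule are illustrative assumptions, not the paper's implementation:

```python
# Sketch: epoch-wise alternation of EMA updates between two teachers.
import copy
import torch


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.99):
    """EMA: teacher <- decay * teacher + (1 - decay) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)


student = torch.nn.Linear(16, 4)        # stand-in for the student network
teacher_a = copy.deepcopy(student)      # dual teachers start as copies of the student
teacher_b = copy.deepcopy(student)

for epoch in range(10):
    # ... train the student for one full epoch with SGD and strong
    # augmentations (e.g. CutMix, colour jittering) ...

    # Only one teacher absorbs the student's parameters in a given epoch,
    # so the two teachers track the student at different optimisation
    # stages and stay more diverse than under an iteration-wise update.
    teacher = teacher_a if epoch % 2 == 0 else teacher_b
    ema_update(teacher, student, decay=0.99)
```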
Cheers, Yuyuan
Why should the dual teachers have a relatively higher divergence?
Can you reply if you have time? Thank you so much
Is it to strengthen the perturbation of the network to improve the generalization of consistency learning?
Hi @wangmingaaaaa
Yes, the varied strong perturbations cause the student network to be optimised differently across epochs, so its EMA updates to the teachers also differ. Please note that the comment about a "relatively higher divergence" is made in comparison with the iteration-wise update method.
I believe the dual teachers will eventually fall into the same local minimum, just as the normal Mean Teacher does; our goal with this architecture is to obtain more reliable pseudo-labels throughout the training process.
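As an illustration of why two diverged teachers can give a more reliable target than one, here is a minimal sketch that simply averages the teachers' class probabilities to form a pseudo-label; this is only an assumption for demonstration, and the paper's actual pseudo-labelling scheme may weight or combine the teachers differently:

```python
# Sketch: ensembling two teachers' predictions into a pseudo-label.
import torch


def pseudo_label(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Average the two teachers' class probabilities and take the argmax."""
    probs = (logits_a.softmax(dim=1) + logits_b.softmax(dim=1)) / 2.0
    return probs.argmax(dim=1)  # hard pseudo-label per pixel


# Example with hypothetical shapes: batch of 2 images, 21 classes, 8x8 resolution.
logits_a = torch.randn(2, 21, 8, 8)
logits_b = torch.randn(2, 21, 8, 8)
labels = pseudo_label(logits_a, logits_b)  # shape: (2, 8, 8)
```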
Cheers, Yuyuan
Thanks for your reply!
@wangmingaaaaa My pleasure!
Hello, I would like to ask about the alternating scheme used to update the two teachers in your paper (updating the first teacher in one iteration and the other teacher in the next). I saw the explanation in your paper that it is simply to increase the diversity between the two teachers. What are the benefits of using this alternating method to update the two teachers?