Hey, we also found some possible issues with the training process and submitted some commits last week to (try to) fix them. Our group found that with the default learning rate, the KD_Loss (knowledge distillation loss) can differ from the CE loss by several orders of magnitude (e.g. CE_Loss of max_subnet = 3.5, while KD_Loss of min_subnet = 0.001). We speculate that this may lead to inappropriate optimization of the smaller subnets during training, so we have replaced it with a "CE loss with soft labels" instead.
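For clarity, here is a minimal sketch of what we mean by "CE loss with soft labels" (function names and the training-step comments are illustrative, not the exact code in our commits):

```python
import torch
import torch.nn.functional as F

def soft_label_ce(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy against the max subnet's softened predictions.

    Unlike the KL-based KD loss with the default scaling, this keeps the
    subnet loss on roughly the same scale as the hard-label CE loss.
    """
    teacher_prob = F.softmax(teacher_logits.detach() / temperature, dim=1)
    student_log_prob = F.log_softmax(student_logits / temperature, dim=1)
    return -(teacher_prob * student_log_prob).sum(dim=1).mean()

# Sandwich-rule style step (illustrative):
# max_logits = max_subnet(x)
# loss = F.cross_entropy(max_logits, target)
# for subnet in sampled_subnets:
#     loss = loss + soft_label_ce(subnet(x), max_logits)
```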
Another member in our group reported that he has ported the training code into TIMM, i.e., using the training method of BigNAS on other network architectures, and the network converges better during training. So we guess that different hyper-parameters may be causing the training instability. He also recommended using the EMA method, which may help achieve better performance.
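For reference, a minimal sketch of the weight-EMA idea he suggested (this is not the repo's code; the class name and decay value are just illustrative, and timm also ships its own EMA helper):

```python
import copy
import torch

class ModelEMA:
    """Keeps an exponential moving average of model weights for evaluation."""

    def __init__(self, model, decay=0.9998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        ema_state = self.ema.state_dict()
        for name, value in model.state_dict().items():
            if value.dtype.is_floating_point:
                # EMA update for weights and floating-point buffers
                ema_state[name].mul_(self.decay).add_(value, alpha=1 - self.decay)
            else:
                # Integer buffers (e.g. BN num_batches_tracked) are copied directly
                ema_state[name].copy_(value)

# Call ema.update(model) after each optimizer.step(),
# and evaluate with ema.ema instead of model.
```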
I use Tensorboard to show the learning curves during training. However, since the training process was restarted after last week's fix, only 48/365 epochs have been completed so far. The learning curves are below.
Hope it helps, and I'll keep testing and modifying the code as well. Thanks for your results. :)
Maybe I was not clear in my statement.
You can try some transfer learning: for example, train on Pascal VOC (not BigNAS, just normal training) initialized with the pretrained weights from BigNAS, and then compare the result against training with pretrained weights that are not from BigNAS.
I just found that the pretrained weights from normal training are better than the ones from BigNAS when doing transfer learning.
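As a concrete illustration of the comparison (the checkpoint path, the "state_dict" key, and the number of classes are assumptions; adapt them to your checkpoint and dataset):

```python
import torch
import torchvision

NUM_CLASSES = 200  # e.g. CUB-200; use your downstream task's class count

def build_finetune_model(bignas_ckpt=None):
    """Build a ResNet-50 for fine-tuning, initialized either from
    torchvision's ImageNet weights or from a BigNAS-trained checkpoint."""
    if bignas_ckpt is None:
        # Normal ImageNet pretraining from torchvision
        model = torchvision.models.resnet50(pretrained=True)
    else:
        model = torchvision.models.resnet50(pretrained=False)
        state = torch.load(bignas_ckpt, map_location="cpu")
        # The "state_dict" key and parameter names are assumptions; remap as needed.
        missing, unexpected = model.load_state_dict(
            state.get("state_dict", state), strict=False
        )
        print("missing keys:", len(missing), "unexpected keys:", len(unexpected))
    # Replace the classification head for the downstream task
    model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)
    return model

# Fine-tune both with the same recipe and compare:
# model_tv     = build_finetune_model()
# model_bignas = build_finetune_model("bignas_r50_supernet.pth")  # hypothetical path
```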
Hi,
I have trained BigNAS on ImageNet, and I use the supernet to fine-tune on a downstream task. Here I use CUB for the experiment, with ResNet-50 (r50) as my supernet.
I found that the transfer performance is not as good as with the pretrained weights from normal training (you can find resnet50 in torchvision); that is, using pretrained weights from BigNAS may not produce a good result.
I also noticed this in Table 2 of the paper: they have to fine-tune carefully, otherwise the performance drops, even though ImageNet itself does need pretrained weights.
Do you have any idea about this?