xfey / pytorch-BigNAS

PyTorch implementation of BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models
MIT License

Discuss the pretrain weight from BigNAS? #2

Closed · twmht closed 2 years ago

twmht commented 2 years ago

Hi,

I have trained a BigNAS supernet on ImageNet and then used it to fine-tune on a downstream task. Here I use CUB for the experiment, with ResNet-50 (r50) as my supernet.

I found that the transfer performance is not as good as with pretrained weights from normal training (e.g., the ResNet-50 weights from torchvision). That is, using pretrained weights from BigNAS may not produce good results.

I also noticed this in Table 2 of the paper:

[screenshot of Table 2 from the paper]

It seems they had to fine-tune carefully, otherwise the performance would drop, even though training on ImageNet itself does not need pretrained weights.

Do you have any idea about this?

xfey commented 2 years ago

Hey, we also found some possible issues with the training process and submitted some commits last week to fix (or try to fix) them. Our group found that the KD loss (knowledge distillation loss) with the default learning rate can differ from the CE loss by orders of magnitude (e.g., CE loss of the max subnet = 3.5, while KD loss of the min subnet = 0.001). We suspect this leads to inappropriate optimization of the smaller subnets during training, and we have replaced it with a "CE loss with soft labels" instead.
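A minimal sketch of what we mean by "CE loss with soft labels" (the function name and the temperature argument are ours, for illustration, not necessarily the exact code in the repo):

```python
import torch.nn.functional as F

# Sketch: cross-entropy against the max subnet's softened outputs
# ("soft labels"), instead of a KL-divergence-based KD loss.
def soft_label_ce(student_logits, teacher_logits, temperature=1.0):
    soft_targets = F.softmax(teacher_logits / temperature, dim=1).detach()
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    # -sum(p_teacher * log p_student): unlike the KL form, this does not
    # collapse toward 0 as the two distributions match (it is bounded
    # below by the teacher's entropy), so its magnitude stays comparable
    # to the hard-label CE loss of the max subnet.
    return -(soft_targets * log_probs).sum(dim=1).mean()
```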

Another member of our group reported that he has ported the training code into TIMM, i.e., applying the BigNAS training method to other network architectures, and those networks converge better during training. So we suspect that different hyper-parameters may be causing the training instability. He also recommended the EMA method (an exponential moving average of the weights), which may help gain better performance; a minimal sketch follows the curve below. His learning curves are the following.

[learning curves from the TIMM-based training run]
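For the EMA part, a minimal sketch (a manual shadow-weights implementation we wrote for illustration, not the exact code we use) looks like this:

```python
import copy
import torch

# Sketch: keep a shadow copy of the supernet weights and blend the live
# weights into it after every optimizer step.
class ModelEMA:
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()  # shadow model used for eval
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # ema_w = decay * ema_w + (1 - decay) * live_w
        for ema_t, t in zip(self.ema.state_dict().values(),
                            model.state_dict().values()):
            if ema_t.dtype.is_floating_point:
                ema_t.mul_(self.decay).add_(t, alpha=1.0 - self.decay)
            else:
                ema_t.copy_(t)  # e.g. BN num_batches_tracked
```

The idea is to call `ema.update(model)` right after `optimizer.step()` and evaluate the subnets through `ema.ema`.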

I used TensorBoard to plot the learning curves during training. However, since the training process restarted after last week's fix, only 48/365 epochs have been completed so far. The learning curves are below.

[TensorBoard learning curves after the fix, 48/365 epochs]

Hope it helps, and I'll keep testing and modifying the code as well. Thanks for your results. :)

twmht commented 2 years ago

Maybe my statement was not clear.

You can try some transfer learning. For example, train on Pascal VOC (not BigNAS, just normal training) initialized with the pretrained weights from BigNAS, and then compare against the result when the pretrained weights are not from BigNAS.

I found that, when doing transfer learning, pretrained weights from normal training are better than those from BigNAS.
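For concreteness, a minimal sketch of the comparison I mean (the checkpoint path and loading details are placeholders, not my actual training script):

```python
import torch
import torchvision

# Sketch: fine-tune the same ResNet-50 twice, once initialized from a
# BigNAS supernet checkpoint and once from torchvision's ImageNet
# weights, keeping everything else identical.
def build_finetune_model(num_classes, bignas_ckpt=None):
    if bignas_ckpt is not None:
        model = torchvision.models.resnet50()
        state = torch.load(bignas_ckpt, map_location="cpu")
        # strict=False since a supernet checkpoint typically carries
        # extra keys (e.g. switchable-BN stats) that plain ResNet-50 lacks.
        model.load_state_dict(state, strict=False)
    else:
        model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    # Replace the classifier head for the downstream dataset (VOC / CUB).
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model
```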