wyharveychen / CloserLookFewShot

Source code for the ICLR'19 paper 'A Closer Look at Few-shot Classification'

Can't reach the accuracy as reported #4

Open KaleidoZhouYN opened 5 years ago

KaleidoZhouYN commented 5 years ago

I have trained your code with loss_type = 'dist' and train_aug (baseline++), but the 5-shot accuracy on mini-ImageNet is only 56%, much lower than your reported result (66%). Could you give a reasonable explanation?

wyharveychen commented 5 years ago

Thanks for your report. Are you using the latest version? I will also check whether I accidentally changed something in the last update...

wyharveychen commented 5 years ago

Hello, I have rerun the code and it reaches 66%. Is your training loss in the last epoch (399) around 1.5?

Also, since 56% looks like the accuracy without data augmentation, can you check whether your commands are exactly the same as the following?

python ./train.py --dataset miniImagenet --model Conv4 --method baseline++ --train_aug
python ./save_features.py --dataset miniImagenet --model Conv4 --method baseline++ --train_aug
python ./test.py --dataset miniImagenet --model Conv4 --method baseline++ --train_aug

Note that the train_aug option is still required for testing since it indicates the correct model path.

KaleidoZhouYN commented 5 years ago

Thanks, I've found the problem: I had changed the initialization of the classification weights.

KaleidoZhouYN commented 5 years ago

But can you explain why you use weight normalization in baseline++?

wyharveychen commented 5 years ago

For efficient updates. Weight normalization reparameterizes the weight into a direction and a length, and for baseline++ only the direction of the weight matters. With weight normalization, we can therefore update the direction of the weights directly. Without weight normalization, the 5-shot accuracy of baseline++ on mini-ImageNet with a ResNet10 backbone would be as low as ~60%; with weight normalization, it is ~76%.

For details on weight normalization, see this paper.
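
For reference, a minimal standalone sketch of the idea: weight normalization splits each class weight into a direction and a length, so the optimizer can update them separately. This uses generic PyTorch (torch.nn.utils.weight_norm) with made-up dimensions, not the repo's exact distLinear code:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

feat_dim, num_classes = 64, 5

# Weight normalization rewrites each class weight as w = g * v / ||v||,
# so direction v and length g are updated as separate parameters.
classifier = weight_norm(nn.Linear(feat_dim, num_classes, bias=False),
                         name='weight', dim=0)

x = torch.randn(8, feat_dim)
x_normalized = x / x.norm(dim=1, keepdim=True)   # unit-length features
scores = classifier(x_normalized)                # forward uses weight_g and weight_v

# Per-class lengths and directions live in separate parameters.
print(classifier.weight_g.shape)                 # torch.Size([5, 1])
print(classifier.weight_v.shape)                 # torch.Size([5, 64])
```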

KaleidoZhouYN commented 5 years ago

But after the weight normalization, the computed 'cos_dist' returned by nn.Linear in the distLinear class is larger than 1.0 or smaller than -1.0, which confuses me.

wyharveychen commented 5 years ago

Sorry for the late reply. This is a good observation I had not noticed. The problem is in the line 'cos_dist = self.L(x_normalized)'. I intended the forward operation to perform a matrix product between x_normalized and self.L.weight.data, but I forgot that when self.L is wrapped by WeightNorm, its forward operation uses self.L.weight_g and self.L.weight_v, not self.L.weight.

Thus, the weight is not normalized and cos_dist goes beyond the range [-1, 1]; it is, however, approximately a scaled cos_dist.

From my observation, self.L.weight_g.data (the norms of the weight vectors) does not differ much across classes. Taking my baseline++ model trained on miniImagenet with a Conv4 backbone as an example, self.L.weight_g.data for the first 64 classes (i.e., the classes with training data) lies between 20 and 35, so the output is roughly cos_dist scaled by about 27.

I will mark this issue in my code. Thanks for finding this problem!
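
To make the observation concrete, here is a standalone sketch (plain PyTorch, not the repo's distLinear class) showing that with WeightNorm the forward pass returns the true cosine scaled per class by weight_g, which is why 'cos_dist' can leave [-1, 1]:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

L = weight_norm(nn.Linear(64, 5, bias=False), name='weight', dim=0)

x = torch.randn(8, 64)
x_normalized = x / x.norm(dim=1, keepdim=True)

scores = L(x_normalized)                       # "cos_dist" in the code; may exceed [-1, 1]

# Recover the true cosine by normalizing the direction part of the weight.
w_dir = L.weight_v / L.weight_v.norm(dim=1, keepdim=True)
true_cos = x_normalized @ w_dir.t()            # always within [-1, 1]

# The forward output equals the true cosine scaled per class by weight_g.
print(torch.allclose(scores, true_cos * L.weight_g.t(), atol=1e-5))  # True
```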

ZhangGongjie commented 5 years ago

So if you use the correct normalization, what performance can the baseline++ method get?

Thank you.

KaleidoZhouYN commented 5 years ago

If you use weight normalization, the weight norm of each class will be different in both training and fine-tuning. This is quite different from the baseline++ method reported in your paper.

I think your original idea was to implement the formulation "output = cos_t * scale_factor", where the scale factor should be the same for all classes; that is why you use scale_factor = 2 for mini-ImageNet and scale_factor = 10 for Omniglot. I tried scale_factor = 2 after the fix for baseline++ and the result was very bad (the 56% for 5-shot testing mentioned above). Setting scale_factor = 30 reaches a similar result for the Conv4 network, but it is still not as good as your reported result for the ResNet18 network.

I think it is interesting that actually learning an adaptive weight norm for each class can reach a better result.
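
For reference, a minimal sketch of the fixed-scale formulation described above, i.e. output = cos_t * scale_factor with a single scale shared by all classes (the class name below is made up for illustration; 2, 10, and 30 are just the values mentioned in this thread):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedScaleCosineClassifier(nn.Module):
    """Cosine classifier with one scale factor shared by all classes."""
    def __init__(self, feat_dim, num_classes, scale_factor=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale_factor = scale_factor          # fixed, not learned per class

    def forward(self, x):
        x_dir = F.normalize(x, dim=1)             # unit-length features
        w_dir = F.normalize(self.weight, dim=1)   # unit-length class weights
        cos_t = x_dir @ w_dir.t()                 # cosine similarity, in [-1, 1]
        return self.scale_factor * cos_t          # logits for cross-entropy
```

In contrast, the WeightNorm variant in the repo effectively learns a separate scale (weight_g) per class, which is the adaptive behavior discussed above.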

ZhangGongjie commented 5 years ago

That is a very good insight! Thanks for sharing!

wyharveychen commented 5 years ago

Thanks for helping to reply to the problem (and sorry for the late reply; I have been busy recently). Yes, it turns out to be different from what I meant to use, and also from what I describe in the paper... I have noted the issue in a comment in the code and will release a revised paper on arXiv. It would be interesting to address this class-wise normalization issue in detail in future work.

icoz69 commented 5 years ago

Hi, if the weight-normalized Linear layer only uses g and v in the forward pass, what is self.L.weight used for? Is it updated? Can it be deleted directly?

wyharveychen commented 5 years ago

Yes, with WeightNorm, self.L.weight is not used. I have commented the code more clearly, thanks!
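
A quick way to check this (generic PyTorch, not tied to the repo's code): after wrapping a layer with weight_norm, the trainable parameters are weight_g and weight_v, and the weight attribute is recomputed from them before each forward pass.

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

L = weight_norm(nn.Linear(64, 5, bias=False), name='weight', dim=0)
print([name for name, _ in L.named_parameters()])   # ['weight_g', 'weight_v']
```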

SongyiGao commented 5 years ago

You mention that without weight normalization, the 5-shot accuracy of baseline++ on mini-ImageNet with a ResNet10 backbone would be as low as ~60%, but in the paper the baseline is about 75%. I have some questions about that.

wyharveychen commented 5 years ago

Yes, as discussed above, the number reported in the paper is actually with weight normalization. However, you can follow the parameters in issue #12 to get 75% accuracy without weight normalization. Thanks!

nicolefinnie commented 5 years ago

@KaleidoZhouYN scale_factor doesn't really matter in my experiments (I've tried 30~60 after reading your suggestion), where I have a large number of classes. If your loss function is cross-entropy, the scaled scores get "normalized" again by the softmax inside cross-entropy. However, my backbone is a ResNet101, and the deeper the backbone, the less sensitive the model is to those "factors" in my experiments.

TheSunWillRise commented 5 years ago

@nicolefinnie I think the reason is that "WeightNorm" is applied in the latest code, so the class-wise learnable norms can actually play the role of those "factors".