wutong16 / DistributionBalancedLoss

[ ECCV 2020 Spotlight ] Pytorch implementation for "Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets"

problem of train set and experiment #2

Closed valencebond closed 4 years ago

valencebond commented 4 years ago

Thanks for your detailed code!

  1. According to your paper and code, the train set only contains 1,909 images for the COCO dataset, following a Pareto distribution with max=1200 and min=1. I am confused about these parameter settings: the train set seems too small, so why not set a larger max parameter to construct a larger train set?
  2. In Table 1, for the ERM setting, are all images sampled uniformly without the class-aware sampler, i.e. the most common simple baseline (just like training ImageNet, except with BCE loss)? I have run some experiments on the original COCO 2014 dataset (also a long-tailed distribution), but I find that using the class-aware sampler gives performance inferior to the baseline, 2~3 points below. Maybe there is something wrong in my implementation. (A sketch of the baseline I mean follows below.)
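For clarity, here is roughly what I mean by the plain baseline: a minimal sketch assuming a standard torchvision ResNet-50 and a `train_set` that yields (image, multi-hot target) pairs; this is my own placeholder code, not from this repo.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models

# Plain multi-label baseline: uniform shuffling, BCE-with-logits, no class-aware sampler.
# `train_set` is assumed to be a Dataset yielding (image, multi_hot_target) pairs.
model = models.resnet50(num_classes=80)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)

loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

model.train()
for images, targets in loader:  # targets: float tensor of shape (B, 80)
    logits = model(images)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```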
wutong16 commented 4 years ago

Thanks for your questions!

  1. We referred to ImageNet-LT [1-2], where the largest class has 1,000 samples, when constructing the dataset. It is indeed rather small because there are only 80 classes, but further increasing the sample count of the head classes would either lead to an extremely imbalanced distribution or prevent us from strictly limiting the size of the tail classes. Note that we use the widely adopted long-tailed class-split manner: head/many-shot: >100 samples, medium-shot: 20-100 samples, and tail/few-shot: <20 samples.

    • Yes, all images are sampled uniformly for the ERM setting.
    • COCO 2014 is indeed imbalanced, but it is not long-tailed according to the class-split manner above. Actually, the least frequent class in the original COCO 2014 has over 100 samples, and 73/80 of the classes have over 1,000 samples (please correct me if there's some error in the statistics). So all the classes are many-shot and have more than enough samples for plain training, which is not the case we mainly focus on.
    • However, you can still adopt class-aware re-sampling if you like, but in a two-stage manner. Specifically, you may pre-train the whole network with uniform sampling first, and then freeze the backbone and fine-tune only the classifier weights. As pointed out in [1][3], class-aware re-sampling hurts the representation learning of the model but benefits the classifier learning, so two-stage training solves the problem. For the original COCO, which has adequate samples, classifier learning without re-sampling is already decent, so adding re-sampling mainly hurts the representation, which may lead to the worse performance you observed. Please refer to [1-3] for training details; a rough sketch of the two-stage procedure follows after this list.
    • We also tried the original COCO 2017: two-stage training alone leads to ~0.2 points below the baseline, while DB-Loss + two-stage leads to ~0.6 points above.
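Here is a rough sketch of the two-stage idea mentioned above. It assumes the classifier is the model's `fc` layer (as in a torchvision ResNet) and that `train_set` and `class_aware_sampler` are whatever dataset and sampler you already use; it is not the exact training script of this repo.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Stage 1: pre-train the whole network with uniform sampling (standard training).
# Stage 2: freeze the backbone and fine-tune only the classifier with class-aware re-sampling.
def finetune_classifier(model, train_set, class_aware_sampler, epochs=10):
    # Freeze everything except the final classifier layer (assumed to be `model.fc`).
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True

    loader = DataLoader(train_set, batch_size=32, sampler=class_aware_sampler)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)

    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            loss = criterion(model(images), targets.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```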

Hope these details will help you~

[1] Kang et al., Decoupling representation and classifier for long-tailed recognition. In ICLR 2020.
[2] Liu et al., Large-scale long-tailed recognition in an open world. In CVPR 2019.
[3] Zhou et al., BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In CVPR 2020.
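P.S. In case it helps, the head/medium/tail split mentioned above is simply a threshold on the per-class training sample counts; a minimal sketch (the counts in the example are made up):

```python
# Split classes by per-class training sample count using the thresholds above:
# head/many-shot > 100, medium-shot 20-100, tail/few-shot < 20.
def split_classes(class_counts):
    head   = [c for c, n in class_counts.items() if n > 100]
    medium = [c for c, n in class_counts.items() if 20 <= n <= 100]
    tail   = [c for c, n in class_counts.items() if n < 20]
    return head, medium, tail

# Example with made-up counts:
head, medium, tail = split_classes({'person': 1200, 'bicycle': 45, 'toaster': 3})
```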

valencebond commented 4 years ago

Hi @wutong16, thanks for your explanation. For long-tailed distributions, my previous view focused more on the relative numbers between categories, i.e. if the count of label A is only one percent of that of label B, then label A is a minority. But your point concentrates more on the absolute number of samples in a category. It truly makes sense.

It is great work. Thanks!

valencebond commented 4 years ago

By the way,

  1. According to class_aware_sample_generator with num_samples_cls = 3, one batch consists of tuples of 3 images with the same target label? With one GPU, a batch size of 32 is made up of 10 labels with 3 images each and 1 label with 2 images, which seems a little weird.

  2. For self.num_sample in ClassAwareSampler, why do we need the reduce parameter and set num_samples_cls=3, reduce=4? Is there some intuitive reason?

wutong16 commented 4 years ago

Hi!

  1. Different settings of num_samples_cls within a proper range do not influence the results much; we tried 1, 2, 3, and 4. But yes, it's better to use an even number to avoid the situation you mentioned.

  2. The parameter reduce controls the total number of samples in an epoch. Since the imbalance and head-dominance (usually the class 'person') is severe, we don't want to take N_max*C samples per epoch, which would be too many and would force us to reduce the total number of epochs. So we take N_max*C/reduce samples instead, which may slightly down-sample one or two head classes. Similarly, a proper (not too big) range of reduce won't influence the results too much in our trials, though it does make a small difference. A simplified sketch of the sampler follows below.
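To make the two parameters concrete, here is a simplified sketch of what the sampler does. It is schematic only, not the exact implementation in this repo; `class_to_img_indices` is assumed to map each class to the indices of images containing it.

```python
import random

# Schematic class-aware sampling: repeatedly pick a class (roughly uniformly over classes),
# then yield `num_samples_cls` image indices from that class in a row. The epoch length is
# capped at about N_max * C / reduce samples in total.
def class_aware_indices(class_to_img_indices, num_samples_cls=3, reduce=4):
    classes = list(class_to_img_indices.keys())
    n_max = max(len(v) for v in class_to_img_indices.values())
    num_samples = n_max * len(classes) // reduce  # epoch length

    produced = 0
    while produced < num_samples:
        c = random.choice(classes)  # class-level draw
        for idx in random.choices(class_to_img_indices[c], k=num_samples_cls):
            if produced >= num_samples:
                break
            yield idx
            produced += 1
```

With num_samples_cls=2 or 4, a batch size of 32 divides evenly into per-class groups, which avoids the leftover group of 2 you mentioned.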