Regarding the datasets - Githubissues

zhmiao / OpenLongTailRecognition-OLTR

Pytorch implementation for "Large-Scale Long-Tailed Recognition in an Open World" (CVPR 2019 ORAL)

BSD 3-Clause "New" or "Revised" License

Regarding the datasets #38

Closed: saurabhsharma1993 closed this issue 4 years ago

saurabhsharma1993 commented 5 years ago

Hi, thank you again for your code release. I am puzzled by the following issues, which I'm hoping you can help me with:

-> Places-LT has 62.5K examples, rather than the 184.5K images reported in the paper. Is the mistake in the paper or in the released dataset?

-> I am unable to reproduce the dataset statistics for ImageNet-LT and Places-LT using Zipf's law (discrete Pareto distribution: https://en.wikipedia.org/wiki/Pareto_distribution, https://en.wikipedia.org/wiki/Zipf%27s_law) with alpha=6 (which seems rather high). Moreover, the log-log plot is not completely linear in my opinion.

zhmiao commented 5 years ago

Hello @ssfootball04, thank you very much for asking. We are very sorry that we made mistakes in the paper. For Places-LT, the alpha should be 1.34 and the actual count is 62K; 184K is the number of samples generated with alpha=6. We switched to a more extremely long-tailed distribution right before submission, which is probably why we forgot to change these numbers.

On the other hand, the distribution is not strictly linear in log-log space for two reasons. First, we use numpy to generate random numbers. Second, during data generation the actual minimum number of samples is 25, and then for each class we take 20 samples to construct the validation set. This process also affects the log-log distribution.

Does that make sense?
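The effect of the floor and the validation split described above can be sketched numerically. This is only an illustration, not the authors' generation script: the minimum of 25 and the 20 validation samples per class come from the comment, while `num_classes`, `max_count`, and `exponent` are hypothetical values chosen for the example, and the random sampling step is replaced by an ideal power law so the result is deterministic.

```python
import numpy as np

def loglog_fit(counts):
    """Fit a line to (log rank, log count); return slope and max abs residual."""
    ranks = np.arange(1, len(counts) + 1, dtype=float)
    x, y = np.log(ranks), np.log(counts)
    slope, intercept = np.polyfit(x, y, 1)
    residual = np.max(np.abs(y - (slope * x + intercept)))
    return slope, residual

# Hypothetical settings for illustration only.
num_classes = 1000
max_count = 1280
exponent = 1.0

# Values taken from the comment above.
min_count = 25       # minimum number of samples per class
val_per_class = 20   # samples moved to the validation set

ranks = np.arange(1, num_classes + 1, dtype=float)

# Ideal power law: exactly linear in log-log space.
ideal = max_count * ranks ** (-exponent)

# Apply the floor, then remove the validation samples from every class.
floored = np.maximum(np.floor(ideal), min_count)
train = floored - val_per_class

ideal_slope, ideal_residual = loglog_fit(ideal)
train_slope, train_residual = loglog_fit(train)

print("ideal:", ideal_slope, ideal_residual)
print("train:", train_slope, train_residual)
```

Under these assumptions the ideal curve is fit almost exactly by a straight line, whereas the floored and split counts leave a large residual, because most tail classes collapse to the same size.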

saurabhsharma1993 commented 5 years ago

Yes, I understand. Thank you for your reply. One further question, just to be sure: for ImageNet, is alpha=6 or alpha=1.34?

zhmiao commented 4 years ago

@ssfootball04 Yes, I think it is true. Sorry for the late reply!

cnyanhao commented 3 years ago

Hi, thanks for your great work. I'm also confused about the Pareto distribution. Are you using the following PDF of the Pareto distribution to decide the number of images for each class? If so, do you mean f(1)=1280 and f(1000)=5 with \alpha=6 and x_m=1? That doesn't seem to make sense. Have you solved this problem? @ssfootball04
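For reference, the arithmetic behind this question can be written out. The formula attached to the comment is not preserved, so the block below assumes it was the standard Pareto density; the symbols n(r) and s are introduced here only for illustration.

```latex
% Standard Pareto density (assumed to be the PDF the comment refers to)
f(x) = \frac{\alpha \, x_m^{\alpha}}{x^{\alpha + 1}}, \qquad x \ge x_m

% With \alpha = 6 and x_m = 1:
f(1) = 6, \qquad f(1000) = 6 \cdot 1000^{-7} = 6 \times 10^{-21},
\qquad \frac{f(1)}{f(1000)} = 1000^{7} = 10^{21}

% The counts quoted in the comment give a far smaller ratio:
\frac{1280}{5} = 256

% A pure power law n(r) = 1280 \, r^{-s} through both endpoints would need
s = \frac{\ln 256}{\ln 1000} \approx 0.80
```

Read this way, evaluating the density directly at the class index cannot yield both 1280 and 5 with \alpha=6, which is consistent with the confusion expressed in the comment.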