xuewyang / Fashion_Captioning

ECCV2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data.

Scores of the 3 released baselines. #3

Closed: LONGRYUU closed this issue 3 years ago

LONGRYUU commented 3 years ago

Thanks for your released code. The code is well-structured; I replaced the dataloader with my own implementation and it still works well. But I still have some issues.

I've trained the SAT and BUTD models for about 15 epochs now. They both achieve high scores, but the gap between them is quite large, especially on CIDEr: about 192.6 versus 144.6. Are these results reasonable? What scores did you get with these models?

Detailed results are as follows: SAT: Bleu_1: 0.495 Bleu_2: 0.348 Bleu_3: 0.267 Bleu_4: 0.219 METEOR: 0.215 ROUGE_L: 0.465 CIDEr: 1.928

BUTD: Bleu_1: 0.462 Bleu_2: 0.302 Bleu_3: 0.213 Bleu_4: 0.161 METEOR: 0.193 ROUGE_L: 0.432 CIDEr: 1.446

Besides, I've also trained CNNC for 4 epochs. I find it quite slow to evaluate, and it achieves really low scores: Bleu_1: 0.158 Bleu_2: 0.057 Bleu_3: 0.020 Bleu_4: 0.009 METEOR: 0.060 ROUGE_L: 0.131 CIDEr: 0.094

tangyuhao2016 commented 3 years ago

I think different data splits give different results, which is very strange to me. I only use SAT. When I use 126,750 images (one item, one image), after 20 epochs, with 116,750 train, 5,000 val, 5,000 test: Bleu_1: 0.452 Bleu_2: 0.309 Bleu_3: 0.211 Bleu_4: 0.153 METEOR: 0.191 ROUGE_L: 0.435 CIDEr: 1.628

When I use the whole dataset (the same data split the author provides), after 1 epoch, with about 770k train, 100k val, 100k test: Bleu_1: 0.014 Bleu_2: 0.001 Bleu_3: 0.000 Bleu_4: 0.000 METEOR: 0.008 ROUGE_L: 0.013 CIDEr: 0.010

How do you split the data?

LONGRYUU commented 3 years ago

How do you split the data?

I've used the whole dataset, I think: about 793k images for training and 99k images for validation.

LONGRYUU commented 3 years ago

I think I've found some potential reasons.

Firstly, the hyper-parameters of BUTD should be different from SAT's. Specifically, the decoder_dim and emb_dim of BUTD should be 1000 rather than 512. Secondly, fine-tuning BUTD is harder than fine-tuning SAT. That's probably why BUTD performs worse than SAT.
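
For example, a minimal sketch of the settings I mean, assuming the usual SAT-style argument names (emb_dim, decoder_dim, attention_dim); the exact names in this repo may differ:

```python
# Hypothetical hyper-parameter settings, not copied from this repo.
# Per the discussion above, BUTD's embedding and decoder sizes should be
# 1000 rather than the 512 used for SAT.
sat_config = dict(
    emb_dim=512,       # word embedding size for SAT
    decoder_dim=512,   # LSTM hidden size for SAT
    attention_dim=512,
)

butd_config = dict(
    emb_dim=1000,      # BUTD embedding size
    decoder_dim=1000,  # BUTD LSTM hidden size
    attention_dim=512,
)
```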

Besides, I found that the code for val and test in this repo is quite different from ruotian's, which is probably why ruotian's code achieves lower scores. In my opinion, ruotian's implementation is more widely acknowledged, and the author could update the reported scores based on ruotian's evaluation code, even though the scores would be lower. I think that's OK, since this is a much harder challenge than traditional image captioning, and this dataset is really valuable for linking image captioning with practice.

tangyuhao2016 commented 3 years ago

BUTD: Bleu_1: 0.431 Bleu_2: 0.267 Bleu_3: 0.175 Bleu_4: 0.122 METEOR: 0.177 ROUGE_L: 0.406 CIDEr: 1.264

CNNC: Bleu_1: 0.389 Bleu_2: 0.213 Bleu_3: 0.126 Bleu_4: 0.080 METEOR: 0.151 ROUGE_L: 0.354 CIDEr: 0.916

The evaluation code is quite different from ruotian's. When I use ruotian's code, the results are quite low, like your CNNC results.

xuewyang commented 3 years ago

Thanks for your released code. The code is well-structured; I replaced the dataloader with my own implementation and it still works well. But I still have some issues.

I've trained the SAT and BUTD models for about 15 epochs now. They both achieve high scores, but the gap between them is quite large, especially on CIDEr: about 192.6 versus 144.6. Are these results reasonable? What scores did you get with these models?

Detailed results are as follows: SAT: Bleu_1: 0.495 Bleu_2: 0.348 Bleu_3: 0.267 Bleu_4: 0.219 METEOR: 0.215 ROUGE_L: 0.465 CIDEr: 1.928

BUTD: Bleu_1: 0.462 Bleu_2: 0.302 Bleu_3: 0.213 Bleu_4: 0.161 METEOR: 0.193 ROUGE_L: 0.432 CIDEr: 1.446

Besides, I've also trained CNNC for 4 epochs. I find it quite slow to evaluate, and it achieves really low scores: Bleu_1: 0.158 Bleu_2: 0.057 Bleu_3: 0.020 Bleu_4: 0.009 METEOR: 0.060 ROUGE_L: 0.131 CIDEr: 0.094

I kind of agree with your comments that BUTD uses a different hidden dimension and is harder to train; I found the same problem. BUTD needs more epochs to converge. I also have bad results for CNNC. I had much better results (similar to SAT) a long time ago when I wrote the paper, but now the results seem much worse. I don't know if I mistakenly changed something.

xuewyang commented 3 years ago

BUTD: Bleu_1: 0.431 Bleu_2: 0.267 Bleu_3: 0.175 Bleu_4: 0.122 METEOR: 0.177 ROUGE_L: 0.406 CIDEr: 1.264

CNNC: Bleu_1: 0.389 Bleu_2: 0.213 Bleu_3: 0.126 Bleu_4: 0.080 METEOR: 0.151 ROUGE_L: 0.354 CIDEr: 0.916

The evaluation code is quite different from ruotian's. When I use ruotian's code, the results are quite low, like your CNNC results.

Yes, the CNNC result is bad for me too. The evaluation code is different, but I don't think that is a problem, because I first obtain the generated sentence and the reference sentence and then compute the metrics; the process is the same. I am adopting code from ruotian's repo too.
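
Roughly, the calculation follows the usual pycocoevalcap flow. A minimal sketch, assuming gts and res are dicts mapping an image id to a list of caption strings (the actual script may also run a tokenizer first):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def compute_metrics(gts, res):
    # gts / res: {image_id: [caption string, ...]} for references / hypotheses
    scorers = [
        (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE_L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, names in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(names, list):
            results.update(dict(zip(names, score)))
        else:
            results[names] = score
    return results
```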

xuewyang commented 3 years ago

How do you split the data?

I've used the whole dataset, I think: about 793k images for training and 99k images for validation.

My original image data is on Google Drive, and I have run out of SSD space for such a big dataset. It might take me some time to figure out a way to re-process it and check whether there are any bugs in the data processing.

xuewyang commented 3 years ago

Thanks for your released code. The code is well-structured; I replaced the dataloader with my own implementation and it still works well. But I still have some issues.

I've trained the SAT and BUTD models for about 15 epochs now. They both achieve high scores, but the gap between them is quite large, especially on CIDEr: about 192.6 versus 144.6. Are these results reasonable? What scores did you get with these models?

Detailed results are as follows: SAT: Bleu_1: 0.495 Bleu_2: 0.348 Bleu_3: 0.267 Bleu_4: 0.219 METEOR: 0.215 ROUGE_L: 0.465 CIDEr: 1.928

BUTD: Bleu_1: 0.462 Bleu_2: 0.302 Bleu_3: 0.213 Bleu_4: 0.161 METEOR: 0.193 ROUGE_L: 0.432 CIDEr: 1.446

Besides, I've also trained CNNC for 4 epochs. I find it quite slow to evaluate, and it achieves really low scores: Bleu_1: 0.158 Bleu_2: 0.057 Bleu_3: 0.020 Bleu_4: 0.009 METEOR: 0.060 ROUGE_L: 0.131 CIDEr: 0.094

CNNC is extremely slow in evaluation, and its performance is bad.

LONGRYUU commented 3 years ago

I kind of agree with your comments that BUTD uses a different hidden dimension and is harder to train; I found the same problem. BUTD needs more epochs to converge. I also have bad results for CNNC. I had much better results (similar to SAT) a long time ago when I wrote the paper, but now the results seem much worse. I don't know if I mistakenly changed something.

I think the problem lies in the code that generates captions for evaluation. More specifically, the teacher-forcing strategy should only be used for training, not for evaluation.

When evaluating, the input at time step t should be the output of time step t-1, instead of the ground-truth word at time step t-1. In this repo, the generation process in val and test is exactly the same as in training, which means caption labels are used during val and test and naturally leads to higher scores. However, the image should be the only input fed to the captioning model. Besides, the generation process in val and test should terminate as soon as an EOS token is produced.

In most implementations, there are functions named 'sample' or 'beam_search_sample' that are responsible for generating sentences for evaluation, instead of using 'forward' directly.

Maybe you could check this link as an example for better understanding. By comparing the functions 'forward' and 'sample', you'll see what I mean.
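
As a rough illustration of what such a 'sample' function usually looks like, here is a minimal greedy-decoding sketch; the method names (init_hidden_state, step, embedding) are made up for the example and are not this repo's actual API:

```python
import torch

@torch.no_grad()
def sample(decoder, encoder_out, word_map, max_len=50):
    """Greedy decoding: feed the model its own previous prediction
    and stop at <end>, instead of teacher forcing with the labels."""
    device = encoder_out.device
    h, c = decoder.init_hidden_state(encoder_out)            # hypothetical helper
    prev_word = torch.LongTensor([word_map['<start>']]).to(device)
    caption = []
    for _ in range(max_len):
        emb = decoder.embedding(prev_word)                    # (1, emb_dim)
        scores, h, c = decoder.step(emb, h, c, encoder_out)   # one decoding step
        prev_word = scores.argmax(dim=1)                      # model output, not label
        if prev_word.item() == word_map['<end>']:
            break
        caption.append(prev_word.item())
    return caption
```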

xuewyang commented 3 years ago

I kind of agree with your comments that BUTD uses a different hidden dimension and is harder to train; I found the same problem. BUTD needs more epochs to converge. I also have bad results for CNNC. I had much better results (similar to SAT) a long time ago when I wrote the paper, but now the results seem much worse. I don't know if I mistakenly changed something.

I think the problem lies in the code that generates captions for evaluation. More specifically, the teacher-forcing strategy should only be used for training, not for evaluation.

When evaluating, the input at time step t should be the output of time step t-1, instead of the ground-truth word at time step t-1. In this repo, the generation process in val and test is exactly the same as in training, which means caption labels are used during val and test and naturally leads to higher scores. However, the image should be the only input fed to the captioning model. Besides, the generation process in val and test should terminate as soon as an EOS token is produced.

In most implementations, there are functions named 'sample' or 'beam_search_sample' that are responsible for generating sentences for evaluation, instead of using 'forward' directly.

Maybe you could check this link as an example for better understanding. By comparing the functions 'forward' and 'sample', you'll see what I mean.

Yes, you are right. See here, I didn't use sampling, but here I am using sampling. That is why. So having low scores seems to be normal! Thank you, that really helped me understand the reasons.

xuewyang commented 3 years ago

I kind of agree with your comments that BUTD uses a different hidden dimension and is harder to train; I found the same problem. BUTD needs more epochs to converge. I also have bad results for CNNC. I had much better results (similar to SAT) a long time ago when I wrote the paper, but now the results seem much worse. I don't know if I mistakenly changed something.

I think the problem lies in the code that generates captions for evaluation. More specifically, the teacher-forcing strategy should only be used for training, not for evaluation. When evaluating, the input at time step t should be the output of time step t-1, instead of the ground-truth word at time step t-1. In this repo, the generation process in val and test is exactly the same as in training, which means caption labels are used during val and test and naturally leads to higher scores. However, the image should be the only input fed to the captioning model. Besides, the generation process in val and test should terminate as soon as an EOS token is produced. In most implementations, there are functions named 'sample' or 'beam_search_sample' that are responsible for generating sentences for evaluation, instead of using 'forward' directly. Maybe you could check this link as an example for better understanding. By comparing the functions 'forward' and 'sample', you'll see what I mean.

Yes, you are right. See here, I didn't use sampling, but here I am using sampling. That is why. So having low scores seems to be normal! Thank you, that really helped me understand the reasons.

I will correct the code when I am done with my current work.

LONGRYUU commented 3 years ago

I kind of agree with your comments that BUTD uses a different hidden dimension and is harder to train; I found the same problem. BUTD needs more epochs to converge. I also have bad results for CNNC. I had much better results (similar to SAT) a long time ago when I wrote the paper, but now the results seem much worse. I don't know if I mistakenly changed something.

I think the problem lies in the code that generates captions for evaluation. More specifically, the teacher-forcing strategy should only be used for training, not for evaluation. When evaluating, the input at time step t should be the output of time step t-1, instead of the ground-truth word at time step t-1. In this repo, the generation process in val and test is exactly the same as in training, which means caption labels are used during val and test and naturally leads to higher scores. However, the image should be the only input fed to the captioning model. Besides, the generation process in val and test should terminate as soon as an EOS token is produced. In most implementations, there are functions named 'sample' or 'beam_search_sample' that are responsible for generating sentences for evaluation, instead of using 'forward' directly. Maybe you could check this link as an example for better understanding. By comparing the functions 'forward' and 'sample', you'll see what I mean.

Yes, you are right. See here, I didn't use sampling, but here I am using sampling. That is why. So having low scores seems to be normal! Thank you, that really helped me understand the reasons.

I will correct the code when I am done with my current work.

Thanks for your quick reply. Glad we found this bug. I think this means ruotian's repo can be used as a correct backbone now, right?

xuewyang commented 3 years ago

I kind of agree with your comments that BUTD uses a different hidden dimension and is harder to train; I found the same problem. BUTD needs more epochs to converge. I also have bad results for CNNC. I had much better results (similar to SAT) a long time ago when I wrote the paper, but now the results seem much worse. I don't know if I mistakenly changed something.

I think the problem lies in the code that generates captions for evaluation. More specifically, the teacher-forcing strategy should only be used for training, not for evaluation. When evaluating, the input at time step t should be the output of time step t-1, instead of the ground-truth word at time step t-1. In this repo, the generation process in val and test is exactly the same as in training, which means caption labels are used during val and test and naturally leads to higher scores. However, the image should be the only input fed to the captioning model. Besides, the generation process in val and test should terminate as soon as an EOS token is produced. In most implementations, there are functions named 'sample' or 'beam_search_sample' that are responsible for generating sentences for evaluation, instead of using 'forward' directly. Maybe you could check this link as an example for better understanding. By comparing the functions 'forward' and 'sample', you'll see what I mean.

Yes, you are right. See here, I didn't use sampling, but here I am using sampling. That is why. So having low scores seems to be normal! Thank you, that really helped me understand the reasons.

I will correct the code when I am done with my current work.

Thanks for your quick reply. Glad we found this bug. I think this means ruotian's repo can be used as a correct backbone now, right?

I would say yes.

xuewyang commented 3 years ago

@LONGRYUU Be aware of here: it should be self.imgs[i] / 255.

LONGRYUU commented 3 years ago

@LONGRYUU Be aware of here: it should be self.imgs[i] / 255.

Thanks, but I used my own dataloader when I ran this code, so I don't think this problem affected me. In my code, I applied transforms.Normalize to normalize the images into a float tensor.
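
For reference, this is roughly what I do. A minimal sketch, assuming the raw image is a uint8 HxWxC array in [0, 255] (the mean/std values here are the common ImageNet statistics, not necessarily what this repo uses):

```python
import torch
from torchvision import transforms

# Either divide by 255 manually, as suggested above ...
# img = torch.FloatTensor(raw_img / 255.)
# ... or let torchvision handle the scaling and normalization:
preprocess = transforms.Compose([
    transforms.ToTensor(),                            # uint8 HWC -> float CHW in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```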