xuewyang / Fashion_Captioning

ECCV2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data.

About Experiment Settings and Performances #1

Closed LONGRYUU closed 3 years ago

LONGRYUU commented 3 years ago

Thanks for sharing your dataset. It seems to be a really useful and fantastic piece of work! But I'm running into trouble when I try to replicate some results.

I used the code in ruotian's repo to try some baselines. I trained the 'att2in' and 'adaatt' models with the XE loss on FACAD, but got really bad performance on BLEU, METEOR, ROUGE-L, and CIDEr. Even when I evaluate the trained model on the training split, the scores are still much lower than reported in the paper, except for CIDEr.

I also find that the training loss can drop to 1.8 after several epochs, while the loss on the val split stops at about 3.1. It looks like overfitting, but I have no idea why, since I think the amount of data is large enough to avoid it. Note that these models all behave well on the COCO dataset, and I believe I've preprocessed FACAD into the COCO format.
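For reference, this is roughly how I build the COCO-style annotation file before running the preprocessing scripts in ruotian's repo. It's only a minimal sketch; the FACAD field names I read here ('description' and 'images') are my guesses, not necessarily the real keys in meta_all_129927.json:

```python
# Minimal sketch of the FACAD -> COCO-style conversion (field names
# "description" and "images" are hypothetical; adjust to the real keys
# in meta_all_129927.json).
import json

with open('meta_all_129927.json') as f:
    meta = json.load(f)

coco = {'images': [], 'annotations': []}
ann_id = 0
for item in meta:
    caption = item['description']        # FACAD: one caption per item
    for url in item['images']:           # every image of the item shares it
        img_id = len(coco['images'])
        coco['images'].append({'id': img_id, 'file_name': url.split('/')[-1]})
        coco['annotations'].append({'id': ann_id, 'image_id': img_id,
                                    'caption': caption})
        ann_id += 1

with open('facad_coco_format.json', 'w') as f:
    json.dump(coco, f)
```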

The only difference is that in COCO each image is paired with 5 captions, while in FACAD each image is paired with only one caption, and different images sometimes share the same caption. I don't know whether this difference explains the terrible performance.

Do you have any ideas about these problems? Are there any important details for data preprocessing or training?

xuewyang commented 3 years ago

What scores do you have now? What are your BLEU scores, etc.? I actually found the same problems using his repo and am trying to debug them. I found that all the generations in my implementation are identical. Really weird.

LONGRYUU commented 3 years ago

Scores on the val split: BLEU-1 to BLEU-4: 24.4 / 12.3 / 7.7 / 5.7, METEOR: 10.1, ROUGE-L: 20.2, CIDEr: 46.5. Scores on the training split (i.e. evaluating on the same data used for training): BLEU-1 to BLEU-4: 30.9 / 20.1 / 15.6 / 13.4, METEOR: 13.8, ROUGE-L: 27.2, CIDEr: 110.0. It's quite strange that only the CIDEr score seems normal on the training set; the other scores are a bit higher but still far from satisfactory.

Besides, you mentioned that all the generations were the same. Do you mean that the model generated the same captions for different items? This didn't happen in my experiment.

xuewyang commented 3 years ago

Thank you, it's good to know your results. I only get about 0.8 BLEU-4 using the up-down attention model. For now, I don't know the reason; there might be some problems in my original implementation, so I want to use ruotian's repo. I will work on that and let you know if there are any updates.

xuewyang commented 3 years ago

Yes, all the generated captions are the same even for different items.

LONGRYUU commented 3 years ago

That's really weird. Have you preprocessed your data as required? BTW, the model I used is att2in. Maybe up-down is harder to finetune than CNN-based models? I also tried some other CNN + attention models, but not up-down. It seems that att2in performs better than att2in2 and adaatt in my experiments.

ruotian's repo is a little complex, I think; maybe I've missed some significant steps in data preprocessing or experiment settings. So I'm going to try some simpler implementations and see whether I can overcome these problems. I'll let you know if I make any progress as well.

xuewyang commented 3 years ago

Yes, I get the fc and att features first.

LONGRYUU commented 3 years ago

I made some modifications to how the features are obtained. I didn't extract the features before training, since the CNN needs finetuning, so I added a CNN encoder to the AttModel that extracts features as the first step of the _forward function, and finetuned the CNN as described in the paper; a rough sketch is below. BTW, I used beam search for validation with a beam size of 5, and max_length is set to 32.
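Roughly, the encoder part looks like this. It's only an illustration of the idea, not ruotian's actual code, and the class and method names are made up:

```python
# Sketch of computing fc/att features inside the model so the CNN is
# finetuned end-to-end (hypothetical names, not the real repo code).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(pretrained=True)
        # keep everything up to the last conv block -> spatial attention map
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, images):                          # images: (B, 3, H, W)
        att_feats = self.backbone(images)                # (B, 2048, h, w)
        fc_feats = self.pool(att_feats).flatten(1)       # (B, 2048)
        att_feats = att_feats.flatten(2).transpose(1, 2) # (B, h*w, 2048)
        return fc_feats, att_feats

# Inside the captioning model's _forward, the first step then becomes
#   fc_feats, att_feats = self.cnn(images)
# and the CNN parameters are added to the optimizer (with a smaller lr).
```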

But I think it's okay for you to prepare features in advance since you certainly have a finetuned CNN.

LONGRYUU commented 3 years ago

I tried this repo and failed again. I also tried picking only one image per item for training and validation, but got no improvement.

It seems that you're not using the teacher forcing strategy in your paper, are you? But all of these models are trained with teacher forcing on COCO.
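Just to be concrete about what I mean by teacher forcing: during training the decoder is fed the ground-truth previous word at every step instead of its own sample, roughly like this (an illustrative sketch with hypothetical helper names, not the actual repo code):

```python
# Illustrative teacher-forcing XE training step (model.init_hidden and
# model.step are hypothetical helpers, not ruotian's real API).
import torch
import torch.nn.functional as F

def xe_step(model, fc_feats, att_feats, captions):
    # captions: (B, T) ground-truth token ids, captions[:, 0] is <bos>
    logits, state = [], model.init_hidden(fc_feats)
    for t in range(captions.size(1) - 1):
        word = captions[:, t]                       # ground truth, not a sample
        logit, state = model.step(word, fc_feats, att_feats, state)
        logits.append(logit)
    logits = torch.stack(logits, dim=1)             # (B, T-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           captions[:, 1:].reshape(-1))
```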

Besides, have you checked the pipeline of your original implementation? Is it exactly the same as ruotian's, or as other models trained on COCO?

xuewyang commented 3 years ago

Can you tell me your results? I get the following after training with my code for one and two epochs.

After epoch 0: Bleu_1: 0.420, Bleu_2: 0.251, Bleu_3: 0.159, Bleu_4: 0.109, METEOR: 0.165, ROUGE_L: 0.393, CIDEr: 1.064

After epoch 1: Bleu_1: 0.433, Bleu_2: 0.270, Bleu_3: 0.181, Bleu_4: 0.131, METEOR: 0.173, ROUGE_L: 0.404, CIDEr: 1.222

I am training for more epochs to see if I can get better results. I am using teacher forcing, as the other models do; I think it's the same as the others. I will try to integrate ruotian's models into my pipeline to debug where the problem is. Once I think my code is solid, I will publish it.

xuewyang commented 3 years ago

As some of the metrics like CIDEr and METEOR are very close to what I reported, or even better, I will probably update the results on arXiv. I think the reasons might be: 1. I am using ruotian's evaluation code. 2. I am using more examples (the same number as in the ECCV version, but for ECCV I actually reported numbers on only a subset because of the time limit).

LONGRYUU commented 3 years ago

Wow, sounds great! Thanks for your efforts.

How did you solve the problems and improve your results? Are there any necessary changes that need to be made in ruotian's code? With which model did you get these good results: the one proposed in your paper, or one included in his repo? The CIDEr seems quite good with only 2 epochs of training.

My results are as mentioned above. Since I am just trying some models not mentioned in the paper as additional baselines, the attributes are not used in my experiments, nor are the ALS and SLS proposed in the paper. So maybe that's why I'm not achieving good results. I'll try to make use of the attributes to see if I can get any improvement.

xuewyang commented 3 years ago

I am just using one of the baselines I implemented. I am testing more baselines. I will also test with ruotian's repo, to see if some changes are needed to use it on FC.

LONGRYUU commented 3 years ago

Are you using attributes of the items in your baselines? I found that I can get good results by simply leveraging the attributes from the labels for training and validation.

The scores are similar to what you described above. But note that the attributes are simply retrieved from the labels rather than from multi-label predictions, so such improvements are expected.
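Concretely, what I did is roughly the following: embed the ground-truth attribute ids and let the decoder attend over them together with the visual features. This is only an illustrative sketch with made-up names, not the exact code:

```python
# Illustrative sketch of feeding ground-truth attributes to the decoder:
# embed the attribute ids and append them to the visual region features,
# so attention runs over both. Names and shapes are hypothetical.
import torch
import torch.nn as nn

class AttributeFusion(nn.Module):
    def __init__(self, num_attrs, feat_dim=2048):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attrs, feat_dim)

    def forward(self, att_feats, attr_ids):
        # att_feats: (B, N, feat_dim) visual regions
        # attr_ids:  (B, K) ground-truth attribute ids taken from the labels
        attr_feats = self.attr_emb(attr_ids)              # (B, K, feat_dim)
        return torch.cat([att_feats, attr_feats], dim=1)  # (B, N+K, feat_dim)
```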

I'm wondering whether you've made use of attributes in your baselines so that they work well. If so, maybe that's why ruotian's models are not performing well on FACAD.

xuewyang commented 3 years ago

I am not using attributes now. I think that might be an upper bound for the performance.

tangyuhao2016 commented 3 years ago

Wow, those results are quite good. Could you tell me how you processed the data: did you download the original images from meta_all_129927.json, or use the hdf5 files the author provided? I have downloaded the images and found that the dataset is missing about 1200 images that have no link. Also, would it be convenient for you to share the processed data? Downloading data from Google Drive in China is very time-consuming.

LONGRYUU commented 3 years ago

I directly used wget to download all the raw images instead of downloading from Google Drive, and I simply dropped the missing images. I don't really think these are good scores, since they are much lower than reported in the paper ;-). Now that some code has been released, I think we should use the released code to replicate the results.

tangyuhao2016 commented 3 years ago

Could you share the detailed command and data processing code? Thank you very much.

LONGRYUU commented 3 years ago

The command to get a single image is just wget image_url; run it in a loop until you have all the images. The image URLs can first be extracted from meta_all_129927.json for convenience. It's a quite simple shell script but quite time-consuming, so you'd best divide the URLs into several splits and run a process for each split in parallel. Besides, the raw images take up a lot of space, so make sure your disk is big enough; an alternative is to resize the images to a smaller size and then delete the raw files. A rough sketch of the whole loop is below.
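In Python it would look roughly like this (only a sketch; the 'images' field name is my guess for the layout of meta_all_129927.json):

```python
# Sketch of the download loop: collect every image url from the metadata,
# fetch them in parallel, and resize on the fly to save disk space.
import json
import os
from multiprocessing import Pool
from urllib.request import urlretrieve

from PIL import Image

def fetch(url, out_dir='facad_images', size=256):
    path = os.path.join(out_dir, url.split('/')[-1])
    try:
        urlretrieve(url, path)                 # same effect as `wget url`
        img = Image.open(path).convert('RGB')
        img.resize((size, size)).save(path)    # overwrite the full-size file
    except Exception as e:
        print('failed:', url, e)

if __name__ == '__main__':
    os.makedirs('facad_images', exist_ok=True)
    with open('meta_all_129927.json') as f:
        meta = json.load(f)
    urls = [u for item in meta for u in item['images']]   # hypothetical field
    with Pool(16) as p:                                   # parallel workers
        p.map(fetch, urls)
```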

tangyuhao2016 commented 3 years ago

I have uploaded the eval data to Baidu Yun: https://pan.baidu.com/s/1vHj-s_sx6Xf-J5j9GS9FlA (code: ogkw)

LONGRYUU commented 3 years ago

Thanks, but I've already managed to get it from Google Drive. Besides, I'm replicating the results for the SAT model with the released code, and the results seem to be OK. I'll train the remaining models later to see whether they all work well.

tangyuhao2016 commented 3 years ago

Thank you. I have downloaded the data again; it's very time-consuming.

Meanwhile, I am trying to check whether the problem is the data or the code. I used ruotian's code with the base model 'newfc', and processed the data into the same format as the COCO data the code uses, but the result is really bad: BLEU-4 is only 0.05, and many generations are identical. I don't know where the problem comes from. You said your results seem to be OK; could you share the model code? I want to check whether the problem is the data. If not, I will stop downloading the data, since it takes up a lot of my hard disk space. Thank you!

LONGRYUU commented 3 years ago

I've replied in your issue, so I'm going to close this one. Every comment here has turned into a really long list!