Couldnt reproduce results gotten in the paper using the same dataset and code

hemhemoh commented 2 months ago

Hello GFSLT-VLP,

Thank you for sharing your work. I tried reproducing the results as reported in your paper, specifically by using the VLP Pretrain V2 command and the GFSLT-VLP command on a single gpu but I encountered some discrepancies between the evaluation results I obtained and those reported in the paper.

Steps Taken:

Followed the instructions in the README.
Ran the 'VLP Pretrain V2' command on a single GPU: CUDA_VISIBLE_DEVICES=0 python train_vlp_v2.py --batch-size 8 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp_v2
Ran the GFSLT-VLP command CUDA_VISIBLE_DEVICES=0 python train_slt.py --batch-size 2 --epochs 200 --opt sgd --lr 0.01 --output_dir out/Gloss-Free --finetune ./out/vlp/checkpoint.pth
Performed evaluation using CUDA_VISIBLE_DEVICES=0 python train_slt.py --batch-size 2 --epochs 200 --opt sgd --lr 0.01 --output_dir out/Gloss-Free --resume out/Gloss-Free/best_checkpoint.pth --eval

Issue:

The results I obtained were as follows: Rouge-21.1 ,B4- 8.247 for dev and B4-8.2 , Rouge-20.5 for test
These results differ from those reported in the paper: ROUGE - 42.49 on test set and 43.72 on dev set, B4- 21.44 on test set and 22.12 on dev set

Could you please help clarify if I might have missed something, or if there are additional steps or considerations that might be required to reproduce the results accurately?

Thank you in advance for your assistance!

Best regards,
Mardiyyah

zhoubenjia commented 2 months ago

Hi, thank you for your attention to our work. It seems that the batch size is too small. When your GPU is limited, try setting gradient_accumulation_steps and perform gradient accumulation.

hemhemoh commented 2 months ago

Alright, thank you. I will do that and get back to you.

varunlmxd commented 1 month ago

@hemhemoh can you tell the instruction to download the dataset and rerun the experiment please

hemhemoh commented 1 month ago

Hi @varunlmxd I downloaded the dataset off the internet and had to do some restructuring to fit the format they want. I could decipher the format they want by reading the dataset.py file and also loading the pickle files in the data/phoenix2014T folder. As regards running the experiments, I followed the readme to create a venv and also the instructions on how to install nlgeval from the metric folder. I believe once you get your dependencies fixed and dataset to be in the right format, it will be easier to go from there. I also had to make some changes to the code because I am using a single gpu and the authors recommended using gradient accumulation. Let me know if you need more help.

zhoubenjia / GFSLT-VLP

Couldnt reproduce results gotten in the paper using the same dataset and code #19