Zachary-Lau-s opened this issue 10 months ago
Hi Zhou,
I have read your paper and am very interested in the idea. Therefore, I would like to conduct some experiments on this model. However, when I switched the dataset to CSL-Daily, I did not achieve satisfactory results. I would appreciate some advice from you on this matter.
I generated the CSL-Daily data file based on the Phoenix data file format. Subsequently, I used trim_model.py to create pre-trained models on the CSL-Daily dataset, adjusting the src_lang and tgt_lang parameters to "zh_CN".
Could you please provide me with some additional key parameter adjustments and areas I should pay special attention to?
Hi, did you change this code while testing https://github.com/zhoubenjia/GFSLT-VLP/issues/7#issuecomment-1803110284?
Although I have not completed the reproduction, I am looking forward to communicating with you. Alternatively, could you share the weights with me?
Thank you for your response. During testing, I did not modify the code, but I recorded data for "tgt_pres" and "tgt_refs," where there is a space between each character.
I have generated files like "labels.train" myself, and their contents are as follows.
Subsequently, I used "trim_model.py" to obtain the pretrained parameters. In this process, I only changed the tokenizer's "src_lang" and "tgt_lang" parameters to "zh_CN". The remaining parameters were the same as those used for training on Phoenix-2014T, including "--batch-size 2", "--epochs 200", "--opt sgd", and "--lr 0.01". Note, however, that I used only one GPU.
I have trained for 200 epochs, and the BLEU-4 score is only 1.98%. I'm looking forward to receiving your suggestions on how to enhance my score.
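For reference, a minimal sketch of what switching the mBART tokenizer language codes to Chinese might look like; the checkpoint name is illustrative, and trim_model.py in the repository may handle this differently:

from transformers import MBartTokenizer

# Hypothetical starting checkpoint; GFSLT-VLP may trim a different mBART model.
tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25",
    src_lang="zh_CN",  # language code used when encoding source text
    tgt_lang="zh_CN",  # language code used when encoding target text
)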
I'm sorry for the late reply. When you construct the data, the text should be a continuous sentence, e.g. '上海的冬天很冷注意保暖'. When calculating BLEU, please refer to #7 (see the sketch after the steps below). I'm too busy to update the code for CSL-Daily at the moment, but I can provide some instructions.
- Create the data file and trim the tokenizer and model.
- Use the trimmed tokenizer and model for VLP Pretraining.
- Perform GFSLT fine-tuning. However, we found that using the trimmed mBART tokenizer at this stage leads to poor performance, so we replaced it with a char-based tokenizer.
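As a rough illustration of the data construction and char-level BLEU scoring mentioned above (the helper functions and the sacrebleu usage are only a sketch, not code from this repository):

from sacrebleu.metrics import BLEU

def to_continuous(text: str) -> str:
    # Store the CSL-Daily text as one continuous sentence, e.g. '上海的冬天很冷注意保暖'.
    return text.replace(" ", "")

def to_char_level(text: str) -> str:
    # Insert a space between every character before scoring, as suggested in issue #7.
    return " ".join(text.replace(" ", ""))

refs = [to_char_level("上海的冬天很冷注意保暖")]
hyps = [to_char_level("上海的冬天很冷要注意保暖")]  # a hypothetical model output
print(BLEU(tokenize="none").corpus_score(hyps, [refs]).score)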
Awesome! Could you tell me how to implement a char-based tokenizer?
I originally used the torchtext library to build it; you can refer to the build_vocab function in utils.py. But it's not very convenient. You can also try using Hugging Face's tokenizers library. Here is a simple example:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Pass the unk token so that unseen characters do not raise an error at encode time.
tokenizer = Tokenizer(WordLevel(unk_token='unk'))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=['unk'])

# sentences should be the list of all sentences in CSL-Daily
sentences = ['你 好 啊', '吃 饭 了 吗']
tokenizer.train_from_iterator(sentences, trainer)

output = tokenizer.encode("吃 饭 了 吗").ids
print(output)  # token ids, e.g. [3, 7, 4, 1], depending on the trained vocabulary
text = tokenizer.decode(output)
print(text)    # 吃 饭 了 吗
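Since the CSL-Daily sentences in this example already have a space between every character, the Whitespace pre-tokenizer effectively yields a character-level vocabulary, which is what the char-based tokenizer in the steps above refers to.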
Hi Zhou,
Thanks for your amazing code. I have conducted some experiments on the Phoenix dataset and obtained results similar to those in your paper. However, when I switched to CSL-Daily, I did not achieve satisfactory results; the BLEU-4 score is only 5.96.
Following the related issue #7, I used trim_model.py to create pre-trained models on the CSL-Daily dataset, and the resulting vocab size is "vocab_size": 6036.
Here is our config. I would appreciate some advice from you on this matter.
@zhoubenjia Hi, Zhou
We followed the previous discussion to build mBART for the CSL-Daily dataset, and the GFSLT baseline got 8.36 B@4 on the test set. However, GFSLT+VLP got a worse B@4 than the baseline, which is quite weird. We have double-checked and tried different dropouts, learning rates, and so on.
We attach our log files for pre-train, baseline, and GFSLT+VLP.
csl_GFSLT_VLP_lightingPretrained_0301.txt CSL_GFSLT_VLP.txt CSL_GFSLT.txt CSL_VLP.txt
Hi, have you tried using the mbart native decoder as the text decoder?
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=1236 --use_env train_vlp.py --batch-size 4 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp --decoder-type LLMD
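In this command, --decoder-type LLMD appears to select the mBART native decoder mentioned above.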
Hi, I have already built a char-level tokenizer and used LLMD for CSL-Daily. Our best B@4 for the GFSLT baseline is 9.58, but when we apply VLP, the results are around 3 B@4.
At the fine-tuning stage, we load the parameters with --decoder-type LLMD.
Both settings strongly decreased the B@4 to around 3.*.
Here are our log files; can you help me out? Which settings did you use for pretraining and fine-tuning?
LLMD_config_csl_char.json LLMD_config_csl_word.json csl_GFSLT_VLP_lightingPretrained_0301.txt GFSLT_vlpv2WordLlmd_charLLMD.txt
@Zachary-Lau-s @ZechengLi19
Hi, did you reproduce the results for CSL-Daily? Any suggestions?
We have not tried to reproduce it, but when we used the GFSLT code repository, we did find that GFSLT exhibited training instability.