zetang94 / ICSE2022_AST_Trans

This is the official implementation of the paper "AST-Trans: Code Summarization with Efficient Tree-Structured Attention", accepted at ICSE 2022.

Cannot reproduce the result #1

Open zzxn opened 2 years ago

zzxn commented 2 years ago

Hi, I used the same hyperparameters as described in the paper (except that the batch size is 32), and the result on the Python dataset is:

bleu: 33.57576184454644, rouge: 40.749804328727144, meteor: 20.408711947772836

This is lower than the result in the original paper, especially on the rouge metric (47.14 -> 40.74), which is a large drop.

Is this the final version of your code? Or should I use a larger batch size?

Please help me, thanks!

js4720 commented 2 years ago

I also have a reproduction issue. I ran the experiment with the same hyperparameters and got the following result:

bleu: 34.7170977478838, rouge: 41.85927888224562, meteor: 20.773805692174424

I think the gap on the rouge metric is big. Could you tell me which part I should fix to reproduce the result?

Below are the hyperparameters I used (as in the paper):

```python
max_tgt_len = 30
max_src_len = 200
is_split = True
num_heads = 8
pos_type = "p2q_p2k_p2v"
par_heads = 1
max_rel_pos = 10
max_par_rel_pos = 10
max_bro_rel_pos = 5
num_layers = 4
hidden_size = 256
dim_feed_forward = 2048
is_ignore = True
dropout = 0.2
batch_size = 128
num_epochs = 500
learning_rate = 1e-3
warmup = 0.01
criterion = LabelSmoothing(padding_idx=PAD, smoothing=0.1)
```
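For reference, the `LabelSmoothing` criterion named in the config is commonly implemented as a KL-divergence loss against a smoothed target distribution. Below is a minimal sketch of that standard formulation, assuming log-softmax model outputs; it is not necessarily the repo's exact class.

```python
import torch
import torch.nn as nn

class LabelSmoothing(nn.Module):
    # Minimal label-smoothing criterion (a sketch of the standard
    # KL-divergence formulation, not necessarily the repo's exact class).
    def __init__(self, padding_idx, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.padding_idx = padding_idx
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (N, vocab) log-softmax outputs; target: (N,) gold token ids
        vocab_size = log_probs.size(-1)
        # Spread the smoothing mass over all tokens except the gold and padding ids.
        true_dist = torch.full_like(log_probs, self.smoothing / (vocab_size - 2))
        true_dist.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        true_dist[:, self.padding_idx] = 0
        true_dist[target == self.padding_idx] = 0  # zero out padding positions
        return self.criterion(log_probs, true_dist)
```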

zetang94 commented 2 years ago

> Hi, I used the same hyperparameters as described in the paper (except that the batch size is 32), and the result on the Python dataset is lower than the original paper's, especially on the rouge metric (47.14 -> 40.74). Is this the final version of your code? Or should I use a larger batch size?

Sorry for the late reply; I'm doing an internship, so I was a little busy and couldn't get the saved model at the time. I ran the code again with the same configuration as in the paper: the batch size is set to 128 and I used 4 GPUs. The only difference is that I commented out the early-stopping mechanism. The resulting rouge is 52.76, higher than in the paper. The saved model is uploaded at https://box.nju.edu.cn/f/ebdb250c46524268be41/

js4720 commented 2 years ago

I think it would be better if you could just upload the script you used, so that people can reproduce the result by running the code :) (I'm not sure the early-stopping change would increase my result.)

Also, I am not sure what I have missed.

zetang94 commented 2 years ago

I have uploaded the config I used in ast_trans_for_py.py, and the TensorBoard log here.

The early-stopping mechanism should not be used in this experiment, as it will hurt performance. Just comment out lines 195~198 in script/train.py:

```python
# common.add_early_stopping_by_val_score(patience=config.es_patience,
#                                        evaluator=evaluator,
#                                        trainer=trainer,
#                                        metric_name='bleu')
```
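A possible alternative to commenting the lines out is to gate them behind a config flag; the `use_early_stopping` field below is an assumption, not an existing option in the repo, and the other names follow the snippet above.

```python
# Sketch: toggle early stopping via config instead of editing code.
# `use_early_stopping` is an assumed config field, not from the repo.
if getattr(config, 'use_early_stopping', False):
    common.add_early_stopping_by_val_score(patience=config.es_patience,
                                           evaluator=evaluator,
                                           trainer=trainer,
                                           metric_name='bleu')
```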

As shown in the log file, the evaluation bleu keeps growing without the early-stopping mechanism. With early stopping, however, the run stops prematurely due to oscillations in the metric.

[screenshot of the TensorBoard bleu curve]

And I think batch_size = 32 is also fine and should not hurt the result that much. You can try it. Hope this helps :)

js4720 commented 2 years ago

@zetang94 Hi, I thought I had reproduced the result using the processed data you used. However, I found that your processed code contains two dots, which seem to boost the accuracy (as in the screenshot below). Removing the two dots, I got:

BLEU: 34.61 ROUGE: 41.25 METEOR: 20.42

Could you check on this and see if the result is reproducible with the data remade without the dots? Thank you, I appreciate the help. 👍

[screenshot of the processed code showing the two dots]
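If the extra dot does come from the target summaries, a preprocessing pass like the sketch below could remove it; this assumes summaries are stored as token lists, and the function name is made up for illustration.

```python
def strip_duplicate_trailing_dot(tokens):
    # Hypothetical cleanup: collapse a trailing '. .' into a single '.',
    # e.g. ['return', 'the', 'sum', '.', '.'] -> ['return', 'the', 'sum', '.'].
    while len(tokens) >= 2 and tokens[-1] == '.' and tokens[-2] == '.':
        tokens = tokens[:-1]
    return tokens
```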

zetang94 commented 6 months ago


> Hello there, do you still have a copy of the preprocessed data? I was trying to run this repository as well, but it seems that there is not a single trace of the data provided.

Sorry, I accidentally deleted the data that was originally stored on Google Drive, and I can't find it anymore. But I recently found that this repository uses AST-Trans as a baseline, and they publish their processed data. I hope it is helpful to you.