Thanks for open-sourcing the code so quickly. I find that in `commonsense_evaluate.py`, lines 52~58, the `do_sample` parameter of `GenerationConfig` is not set, and its default value is `False`. With `do_sample=False` and `num_beams=4`, the model generates using beam-search decoding. Besides, lines 60~66 may not pass the related `attention_mask`, which can trigger a warning from the transformers library. I don't know whether this behavior is intended, nor which generation hyperparameters are the right ones to reproduce the results in Table 4 of the paper.
By the way, when using `do_sample=False`, it conflicts with (and effectively overrides) other sampling settings such as `temperature`, `top_k`, and `top_p`, which are then ignored.
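For concreteness, here is a minimal sketch of what an explicit greedy-decoding setup could look like with transformers; the checkpoint name, prompt, and `max_new_tokens` are placeholders, not the repository's actual values:

```python
# Minimal sketch (not the repo's actual script): explicit greedy decoding
# with an explicit attention_mask. Checkpoint and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

generation_config = GenerationConfig(
    do_sample=False,   # explicit: deterministic search, sampling params are ignored
    num_beams=1,       # do_sample=False + num_beams=1 -> pure greedy decoding
    max_new_tokens=32, # placeholder budget
)

inputs = tokenizer("Please answer the question: ...", return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # pass it explicitly to avoid the warning
    generation_config=generation_config,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```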
Thanks for your kind reminder.
The hyperparameters were not chosen intentionally; they follow https://github.com/AGI-Edgerunners/LLM-Adapters.
I understand that such a setting may affect the results, but we keep it the same for all baselines for a fair comparison.
Thanks for your reply. However, I still have some questions. An even stranger phenomenon is that when I use the same checkpoint to generate twice, the results are not the same (even though `do_sample=False`).
I have not tried this. I think there may be something deeper in transformers' decoding process. BTW, we must set the batch size to 1 when decoding (refer to https://github.com/huggingface/transformers/issues/25921).
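For reference, a minimal sketch of decoding one example at a time (batch size 1); the `generate_one` helper is hypothetical and not from the repo, and `model`, `tokenizer`, and `generation_config` are assumed to be set up as in the earlier sketch:

```python
# Hypothetical helper: generate a completion for a single prompt (batch size 1),
# avoiding the padding-dependent differences batched generation can introduce
# (see huggingface/transformers#25921).
def generate_one(model, tokenizer, prompt, generation_config):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        generation_config=generation_config,
    )
    # Keep only the newly generated continuation, dropping the prompt tokens.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Usage (model, tokenizer, generation_config as in the earlier sketch):
# predictions = [generate_one(model, tokenizer, p, generation_config) for p in prompts]
```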
If the most appropriate way to evaluate the model is deterministic greedy decoding, I'll change the generation config to that for these methods and see whether I can get similar results.
You can try that.
I tried to reproduce the results under the original setting, but I find the variance is quite large. For the LoRA baseline, I got average ACCs of 83.3/82.9 in two trials; for MoSLoRA, I got average ACCs of 81.7/83.6/85.0 in three trials. In particular, I only got 72.2 on HellaSwag in the first MoSLoRA trial and 62.4 on BoolQ in the second, which seems too low and looks like an outlier.
Yes, the results are not stable. I guess the reasons are two-fold: i) there are some random factors in the training process; ii) the evaluation method. I follow https://github.com/AGI-Edgerunners/LLM-Adapters to evaluate the answers: it first extracts the answer and then compares strings. The answer is extracted by matching the first substring with the format Answer*. However, sometimes the model repeats the options before generating its answer, so the extracted answer becomes answer1 and the response is marked as false if the ground truth is another answer. This problem remains but is hard to solve, since an LLM will not always output the answer first. The most suitable comparison is at the semantic level rather than the string level, which is why recent benchmarks employ GPT to score rather than comparing generated strings.
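To illustrate the failure mode described above, here is a hedged sketch of first-match string extraction (the actual LLM-Adapters code differs in its details):

```python
# Hedged sketch of string-level answer extraction: take the first substring
# that looks like "answerN" and compare it to the ground truth.
import re

def extract_answer(response: str) -> str | None:
    match = re.search(r"answer\d+", response.lower())
    return match.group(0) if match else None

# Failure mode: the model repeats the options before stating its choice.
response = "The options are answer1, answer2 and answer3. The correct one is answer3."
print(extract_answer(response))                # -> "answer1" (first match)
print(extract_answer(response) == "answer3")   # -> False, although the model answered correctly
```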
Thanks, and I will try more experiments.
One related response:
https://github.com/AGI-Edgerunners/LLM-Adapters/issues/64#issuecomment-2408417574