Based on https://github.com/zhuchen03/FreeLB/blob/master/fairseq-RoBERTa/fairseq/tasks/sentence_prediction.py#L103, I implemented FreeLB at the fine-tuning stage for the GLM model. I have four questions.
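To make the questions concrete, this is roughly what my inner ascent loop looks like. It is a simplified PyTorch sketch: `model`, `get_input_embeddings`, the HuggingFace-style `inputs_embeds`/`labels` call, and all hyper-parameter names are placeholders of mine, not the actual GLM code.

```python
import torch

def freelb_step(model, optimizer, input_ids, attention_mask, labels,
                adv_steps=3, adv_lr=1e-1, adv_init_mag=2e-2, adv_max_norm=1.0):
    with torch.no_grad():
        embeds_init = model.get_input_embeddings()(input_ids)   # (B, L, H)

    # 1.0 at real tokens, 0.0 at padding, broadcastable over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(embeds_init.dtype)

    # random init of delta, scaled by the number of perturbed dimensions, padding zeroed
    dims = attention_mask.sum(dim=1).to(embeds_init.dtype) * embeds_init.size(-1)
    delta = torch.zeros_like(embeds_init).uniform_(-1, 1) * mask
    delta = delta * (adv_init_mag / dims.sqrt()).view(-1, 1, 1)
    delta.requires_grad_()

    for step in range(adv_steps):
        embeds = model.get_input_embeddings()(input_ids)        # fresh graph every ascent step
        loss = model(inputs_embeds=embeds + delta,
                     attention_mask=attention_mask,
                     labels=labels).loss / adv_steps
        loss.backward()                                         # parameter grads accumulate over steps
        if step == adv_steps - 1:
            break

        # gradient ascent on delta, normalized by the L2 norm of its gradient
        grad = delta.grad.detach()
        grad_norm = grad.reshape(grad.size(0), -1).norm(p=2, dim=1).view(-1, 1, 1)
        delta = (delta + adv_lr * grad / (grad_norm + 1e-12)).detach()

        # project delta back when a positive bound is given (question 2 below concerns this bound)
        if adv_max_norm > 0:
            delta_norm = delta.reshape(delta.size(0), -1).norm(p=2, dim=1).view(-1, 1, 1)
            delta = delta * (adv_max_norm / (delta_norm + 1e-12)).clamp(max=1.0)

        delta = (delta * mask).requires_grad_()                  # question 1 below concerns this mask

    optimizer.step()
    optimizer.zero_grad()
```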
First, how should it be constructed for the GLM model? Is it right that all positions of padding tokens should be 0? And do any other positions need to be set to 0 as well? This is not discussed in the paper.
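Concretely, this is all I currently do for the mask applied to delta. The helper name and the `extra_zero_positions` tensor are illustrative only (they are not from GLM or from your repo); the second half of my question is whether anything like `extra_zero_positions` is actually needed:

```python
import torch

def build_delta_mask(attention_mask, extra_zero_positions=None):
    """1.0 where delta is allowed to be non-zero, 0.0 where it is forced to zero."""
    mask = attention_mask.unsqueeze(-1).float()            # zero delta at padding tokens
    if extra_zero_positions is not None:
        # hypothetical: also zero positions flagged by some GLM-specific rule
        mask = mask * (1.0 - extra_zero_positions.unsqueeze(-1).float())
    return mask
```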
Second, if I set it to -1, the optimization gets stuck with NaN losses. But if I set it to 20 or a larger number, the NaN issue disappears. Did you encounter the same issue in your experiments, or is there some other way to fix the NaN problem?
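For reference, this is where that value enters my port (I call it `max_norm` here; the name is mine). If I read the reference implementation correctly, the projection only runs when the bound is positive, so with -1 delta is never projected back; my guess is that the unbounded delta eventually overflows into NaN, but I am not sure:

```python
import torch

def project_delta(delta, max_norm):
    """Shrink each example's delta back onto an L2 ball of radius max_norm."""
    if max_norm <= 0:                                      # e.g. -1: delta is never projected
        return delta
    norm = delta.reshape(delta.size(0), -1).norm(p=2, dim=1).view(-1, 1, 1)
    return delta * (max_norm / (norm + 1e-12)).clamp(max=1.0)
```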
Third, I noticed that you don't use it for your BERT model (https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L224). Does this mean BERT-base is more stable than RoBERTa, or does this differ from model to model?
Finally, where can I find the code for the 'when adversarial training meets dropout' part of the paper?
Looking forward to your response. Thanks!