Closed · zjiehang closed this issue 3 years ago
Hey, thanks for trying it out!
We did implement the modified dropout in the code for both fairseq and Hugging Face Transformers. Most of the modifications are in the modeling code. For an easy reference, see the lines involving the variable dp_mask in huggingface-transformers/src/transformers/modeling_albert.py.
For a comparison of the results with and without such a modification, please refer to Table 4 in the paper.
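The idea behind dp_mask is that, instead of resampling dropout noise on every forward pass, the mask is sampled once and then passed into each forward call so that all adversarial steps see identical noise. A minimal NumPy sketch of that pattern (not the repository's actual dp_mask code; the function names and the inverted-dropout scaling convention here are illustrative assumptions):

```python
import numpy as np

def make_dropout_mask(shape, p, rng):
    # Sample the dropout mask once, using inverted dropout:
    # surviving units are scaled by 1/(1-p) so no rescaling is
    # needed at inference time. (Illustrative helper, not FreeLB code.)
    return (rng.random(shape) >= p).astype(np.float32) / (1.0 - p)

def dropout_with_mask(x, mask):
    # Apply a precomputed mask, so repeated forward passes
    # (e.g. the inner adversarial-ascent steps) share the same noise.
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((2, 4), dtype=np.float32)
mask = make_dropout_mask(x.shape, p=0.5, rng=rng)

# Two forward passes of the adversarial inner loop reuse one mask:
y1 = dropout_with_mask(x, mask)         # clean input
y2 = dropout_with_mask(x + 0.1, mask)   # perturbed input, same mask

# The same units are zeroed in both passes.
assert np.array_equal(y1 == 0, y2 == 0)
```

With standard dropout, each call would draw a fresh mask, so the clean and perturbed passes would drop different units; fixing the mask keeps the subnetwork identical across the adversarial steps.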
Wow, thanks for your prompt reply! I get it now. I was using the vanilla Transformers framework and was confused about the "dp_mask" parameter. It does play a role in your revised "ALBERT" version, where it controls the dropout mask used in the forward step. Thanks. Maybe we need to balance performance against the amount of code modification (especially in the vanilla Transformers framework, e.g., applying the same modification to other BERT-variant models as you did for ALBERT) when adopting the same-dropout-mask suggestion.
Thanks for your valuable work on FreeLB, which has indeed improved my models' performance. One question arises about the relationship between dropout and adversarial training: Section 3.3 of the paper (https://arxiv.org/pdf/1909.11764.pdf) suggests using the same dropout mask in each forward-backward step. However, I can't find such an implementation in the code (or maybe I missed it?). Does using the same mask affect the results? In my own experiments I ignored the same-mask suggestion but still saw an improvement. Many thanks!