I have a question about the attention module in "transformer_models.py".
In your code, assume there are 10 tokens in a sentence; I think the 7th token can only attend to tokens 1-6, as written in your paper (see the sketch below).
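To make concrete what I mean by left-only masking, here is a minimal, self-contained sketch in plain PyTorch. This is only my understanding, not the code from transformer_models.py, and the tensor names and sizes are just for illustration:

```python
# Minimal sketch of the left-to-right (causal) masking I am describing.
# Plain PyTorch for illustration only, NOT the code in transformer_models.py.
import torch
import torch.nn.functional as F

seq_len, d = 10, 16          # 10 tokens, toy hidden size
q = torch.randn(seq_len, d)  # query vectors
k = torch.randn(seq_len, d)  # key vectors
v = torch.randn(seq_len, d)  # value vectors

scores = q @ k.t() / d ** 0.5                    # (10, 10) scaled dot-product scores
mask = torch.tril(torch.ones(seq_len, seq_len))  # lower-triangular causal mask
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = F.softmax(scores, dim=-1)                 # weights on future tokens become 0
out = attn @ v                                   # each output mixes only non-future tokens
```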
But when I read the GPT-2 paper and then "Generating Wikipedia by Summarizing Long Sequences", I found that the attention module is not computed in the way your paper describes.
Where can I find more information about GPT-2's attention calculation method? So far I have only found descriptions of attention over both left and right tokens.
Thanks for sharing the code.