I have a question about the attention module in "transformer_models.py".
In your code, assume there are 10 tokens in a sentence; I think the 7th token can only attend to tokens 1-6, as written in your paper (see the sketch below).
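To make concrete what I mean by left-only masking, here is a minimal, self-contained sketch in plain PyTorch. This is only my understanding, not the code from transformer_models.py, and the tensor names and sizes are just for illustration:

```python
# Minimal sketch of the left-to-right (causal) masking I am describing.
# Plain PyTorch for illustration only, NOT the code in transformer_models.py.
import torch
import torch.nn.functional as F

seq_len, d = 10, 16          # 10 tokens, toy hidden size
q = torch.randn(seq_len, d)  # query vectors
k = torch.randn(seq_len, d)  # key vectors
v = torch.randn(seq_len, d)  # value vectors

scores = q @ k.t() / d ** 0.5                    # (10, 10) scaled dot-product scores
mask = torch.tril(torch.ones(seq_len, seq_len))  # lower-triangular causal mask
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = F.softmax(scores, dim=-1)                 # weights on future tokens become 0
out = attn @ v                                   # each output mixes only non-future tokens
```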
But when I read the GPT-2 paper and then "Generating Wikipedia by Summarizing Long Sequences", I found that the attention module is not computed in the way your paper describes.
Where can I find more information about GPT-2's attention calculation method? So far I have only found descriptions of attention over both left and right tokens.
Thanks for sharing the code.