zhuohan123 / terapipe


some question about attention module #53

Open oujieww opened 2 years ago

oujieww commented 2 years ago

Thanks for sharing the code.

I have a question about the attention module in `transformer_models.py`.

In your code, assuming there are 10 tokens in a sentence, I think the 7th token can only attend to the 1st through 6th tokens, as written in your paper.

But when I read the GPT-2 paper and then "Generating Wikipedia by Summarizing Long Sequences", I found that the attention module is not computed in the way your paper describes.

Where can I find more information about GPT-2's attention calculation? I can only find attention over both left and right tokens.
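For reference, here is a minimal sketch of left-to-right (causal) masked self-attention as GPT-2-style decoders typically implement it, where token *i* can only attend to positions up to *i*. This is an illustration of the masking scheme under discussion, not terapipe's actual code; the function and variable names are made up.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: [batch, seq_len, head_dim]
    seq_len, head_dim = q.size(1), q.size(-1)
    scores = q @ k.transpose(-2, -1) / (head_dim ** 0.5)      # [batch, seq, seq]
    # Lower-triangular mask: position i may only see positions <= i (no right context).
    mask = torch.tril(torch.ones(seq_len, seq_len, device=q.device)).bool()
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v
```

With `seq_len = 10`, row 6 of `attn` (the 7th token) has nonzero weights only on columns 0-6, which matches the left-to-right behavior described above.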

Thank you.

oujieww commented 2 years ago

Should I understand the token-level split as local attention within a block, as in "Generating Wikipedia by Summarizing Long Sequences"? ^_^
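For contrast with the causal sketch above, here is a hypothetical sketch of the block-local attention described in that paper: the sequence is cut into fixed-size blocks and attention is computed independently within each block (kept causal here for a decoder-style model). This is only to illustrate the distinction being asked about, not terapipe's implementation, and it assumes a block size that divides the sequence length.

```python
import torch
import torch.nn.functional as F

def block_local_attention(q, k, v, block_size):
    # q, k, v: [batch, seq_len, head_dim]; seq_len assumed divisible by block_size.
    b, n, d = q.shape
    nb = n // block_size
    # Reshape so each block only attends within itself.
    qb = q.reshape(b, nb, block_size, d)
    kb = k.reshape(b, nb, block_size, d)
    vb = v.reshape(b, nb, block_size, d)
    scores = qb @ kb.transpose(-2, -1) / (d ** 0.5)            # [b, nb, bs, bs]
    # Causal mask inside each block.
    mask = torch.tril(torch.ones(block_size, block_size, device=q.device)).bool()
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return (attn @ vb).reshape(b, n, d)
```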