zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

How are the positional encodings derived? #279

Open bnicholl opened 3 years ago

bnicholl commented 3 years ago

After reading the paper, my understanding is that the content stream consists of a token's own word embedding and positional encoding, together with the word embeddings and positional encodings of the tokens preceding it in its permutation order, while the query stream consists of the token's positional encoding and a (randomly initialized) trainable embedding w, together with the word embeddings and positional encodings of those same preceding tokens. My question is: what exactly is the positional encoding? Is it a learnable vector, as in BERT, or the sinusoid function used in other Transformers? I'd like to understand how this encoding is derived. Thanks!
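
For concreteness, the sinusoid option I'm referring to is the Transformer-XL-style encoding over relative distances. A minimal NumPy sketch of what I mean is below; this is illustrative only, not the repo's TensorFlow implementation, and the function name and arguments are my own:

```python
import numpy as np

def sinusoid_relative_encoding(qlen, klen, d_model):
    """Illustrative Transformer-XL-style sinusoid over relative distances.

    Not the repo's code; just the kind of encoding meant by "sinusoid" above.
    """
    # One entry per relative offset, from klen - 1 down to -(qlen - 1).
    pos_seq = np.arange(klen - 1, -qlen, -1.0)
    # Standard inverse-frequency schedule over the even dimensions.
    inv_freq = 1.0 / (10000.0 ** (np.arange(0.0, d_model, 2.0) / d_model))
    sinusoid_inp = np.outer(pos_seq, inv_freq)
    # Concatenate sin and cos halves into a (qlen + klen - 1, d_model) table.
    return np.concatenate([np.sin(sinusoid_inp), np.cos(sinusoid_inp)], axis=-1)

print(sinusoid_relative_encoding(qlen=4, klen=8, d_model=16).shape)  # (11, 16)
```

Is this (or something like it) what XLNet uses, or is the position information learned the way BERT's absolute position embeddings are?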