zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Long Sequence in SQuAD #41

Open ecchochan opened 5 years ago

ecchochan commented 5 years ago

Case: SQuAD task, sequence length > 512

Does your script utilize cached memory/extended context across segments, so that predictions are inferred from sequences longer than 512 tokens?

If yes, where is the code that achieves this?

If not, how do you suggest utilizing cached memory to perform the QA task?

Thank you for such great work!

kimiyoung commented 5 years ago

We are not using cached memory for finetuning yet. Cached memory was used during pretraining to improve the modeling of long sequences. Once pretraining is done, the model is better at long-sequence modeling even with the memory removed.

Including cached memory for finetuning is also an option, but it is not included for now. I would suggest using the same mechanism as in pretraining but backpropagating the gradients across segments.
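A rough sketch of that suggestion, assuming non-overlapping 512-token segments: run the encoder segment by segment, carry the cached memory forward, and keep it attached to the graph so gradients flow across segments. `encode_segment` and `span_loss` below are hypothetical placeholders, not functions from this repo.

```python
# Hypothetical sketch of cross-segment finetuning; encode_segment and
# span_loss stand in for the XLNet encoder and the SQuAD span head.
def finetune_long_example(token_ids, seq_len=512, mem_len=256):
    mems = None                      # cached hidden states from earlier segments
    hiddens = []
    for start in range(0, len(token_ids), seq_len):
        segment = token_ids[start:start + seq_len]
        # In pretraining the new memory is detached from the graph (e.g. with
        # a stop-gradient); for the finetuning variant suggested above you
        # would skip that detachment so the span loss on later segments
        # backpropagates into earlier ones.
        hidden, mems = encode_segment(segment, mems=mems, mem_len=mem_len,
                                      stop_gradient=False)
        hiddens.append(hidden)
    return span_loss(hiddens)        # start/end prediction over all segments
```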

ecchochan commented 5 years ago

Thank you!

I am trying to understand how previous segments are aligned when fed into the model.

Input sentence length: 1024, max_seq_length: 512, mem_len: 256

Am I correct in understanding that there are 3 segments, [0:512], [256:768], [512:1024], and that the data needs to be passed into the model 3 times like sliding windows? But this would make it unable to infer an answer from context that comes after the window.

Can you advise on how long sequences can be processed for inference?

Sorry for asking what might be foundational knowledge; I am relatively new to the architecture. By the way, do you plan on releasing code for using cached memory during finetuning?
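Written out as a quick standalone check (just my interpretation, not the repo's code), the sliding-window split I have in mind would be:

```python
# Quick check of the sliding-window interpretation above:
# 512-token windows with a 256-token stride over a 1024-token input.
def sliding_windows(total_len, window=512, stride=256):
    windows = []
    start = 0
    while True:
        end = min(start + window, total_len)
        windows.append((start, end))
        if end == total_len:
            break
        start += stride
    return windows

print(sliding_windows(1024))  # [(0, 512), (256, 768), (512, 1024)]
```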

SivilTaram commented 5 years ago

@ecchochan The memory cache actually originates from the Transformer-XL paper; you could dig into the details there. If I get it right, there are only two segments, [0:512] and [512:1024], and the cached memory for the second segment is [256:512]. As for the question

But this would make it unable to infer answer from context that is after the window.

The cached memory aims to capture long-range dependency. The problem you describe remains whether or not there is cached memory (just like the BERT architecture, which has a fixed length of 512).
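As a minimal, self-contained sketch of that layout (assuming non-overlapping 512-token segments and a 256-token cache of the previous segment's tail, Transformer-XL style):

```python
# Segment/memory layout for a 1024-token input with seq_len=512, mem_len=256.
def segment_layout(total_len, seq_len=512, mem_len=256):
    layout = []
    for start in range(0, total_len, seq_len):
        end = min(start + seq_len, total_len)
        # The memory visible to this segment is the tail of the previous one.
        mem = (max(start - mem_len, 0), start) if start > 0 else None
        layout.append({"segment": (start, end), "cached_memory": mem})
    return layout

for entry in segment_layout(1024):
    print(entry)
# {'segment': (0, 512), 'cached_memory': None}
# {'segment': (512, 1024), 'cached_memory': (256, 512)}
```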