ecchochan opened 5 years ago
We are not using cached memory for fine-tuning yet. Cached memory was used during pretraining to improve the modeling of long sequences. Once pretraining is done, the model is better at long-sequence modeling even with the memory removed.
Including cached memory for fine-tuning is also an option, but it is not included for now. I would suggest using the same mechanism as in pretraining, but backpropagating the gradients across segments.
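For concreteness, here is a minimal PyTorch-style sketch of that suggestion. It assumes a model whose forward pass accepts and returns Transformer-XL style memories; the `model(segment, mems=mems)` interface, the 512-token segment length, and the loss are illustrative assumptions, not this repository's actual fine-tuning code.

```python
# Minimal sketch (PyTorch), assuming a model that accepts and returns
# Transformer-XL style memories. Names and the segment length are
# illustrative assumptions, not the real XLNet fine-tuning script.
import torch

def finetune_step(model, optimizer, token_ids, labels, seg_len=512):
    """One fine-tuning step over a long sequence, keeping the memory graph
    so that gradients flow back across segment boundaries."""
    mems = None
    total_loss = 0.0
    optimizer.zero_grad()
    for start in range(0, token_ids.size(1), seg_len):
        segment = token_ids[:, start:start + seg_len]
        seg_labels = labels[:, start:start + seg_len]
        # Unlike pretraining, the memories are NOT detached here, so the
        # backward pass reaches earlier segments through the cache.
        logits, mems = model(segment, mems=mems)
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), seg_labels.reshape(-1))
        total_loss = total_loss + loss
    total_loss.backward()   # gradients propagate across all segments
    optimizer.step()
    return total_loss.item()
```

The only real change from the pretraining recurrence is keeping the memories attached to the computation graph, at the cost of holding activations for every segment in memory during the backward pass.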
Thank you!
I am trying to understand how previous segments are aligned when they are fed into the model.
Input sentence length: 1024
max_seq_length: 512
mem_len: 256
Am I correct in understanding that there are 3 segments, [0:512], [256:768], and [512:1024], and that the data needs to be passed into the model 3 times, like sliding windows? But this would make it impossible to infer an answer from context that comes after the window.
Can you advise on how long sequences can be processed for inference?
Sorry for asking what might be foundational knowledge; I am relatively new to the architecture. By the way, do you plan on releasing code for using cached memory for fine-tuning?
@ecchochan The memory cache actually originates from the Transformer-XL paper; I think you can dig into the details there. If I get it right, there are only two segments, [0:512] and [512:1024], and the cached memory for the second segment is [256:512]. As for the question
But this would make it impossible to infer an answer from context that comes after the window.
The cached memory aims to capture long-range dependencies. The problem you describe exists whether or not there is cached memory (just like the BERT architecture, which has a fixed length of 512): within a single forward pass, a token still cannot attend to context that lies after its window.
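To make the two-segment split above concrete, here is a minimal sketch using the numbers from this thread. It is plain Python slicing only; the actual caching of hidden states happens inside the model, and the ranges are my reading of the thread, not code from this repository.

```python
# Minimal sketch of the segment split described above, using the numbers
# from this thread (seq_len=512, mem_len=256, a 1024-token input).
seq_len, mem_len = 512, 256
tokens = list(range(1024))          # stand-in for a 1024-token input

segment_1 = tokens[0:512]           # processed first, no memory available
cached    = tokens[256:512]         # hidden states of the last mem_len
                                    # positions are cached as memory
segment_2 = tokens[512:1024]        # processed second, attends to the cache

# A token in segment_2 can therefore "see" back to position 256 through the
# cached memory, even though tokens [0:512] are never fed to it directly.
```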
Case: SQuAD task, sequence length > 512
Does your script utilize cached memory / an extended context within a segment, such that predictions are inferred from sequences longer than 512 tokens?
If yes, where is the code that achieves this?
If not, how do you suggest utilizing cached memory to perform the QA task?
Thank you for such great work!