We have tried XLNet and compared it with the vanilla BERT-large model. The benchmark we used is the MRQA task, a collection of machine reading comprehension datasets from different domains with varying characteristics. Contrary to our expectations, we found that XLNet performs worse than vanilla BERT on datasets containing long passages (much longer than those in RACE).
These datasets are:
- SearchQA (avg. doc length 744 tokens)
- TriviaQA Web (avg. doc length 782 tokens)
The BERT-large model achieves the following performance on these datasets (per the results shown in the official MRQA GitHub repo):
| Dataset  | F1   |
|----------|------|
| SearchQA | 79.0 |
| TriviaQA | 74.7 |
In comparison, XLNet performs as follows:
| Dataset  | F1    |
|----------|-------|
| SearchQA | 78.45 |
| TriviaQA | 72.79 |
Meanwhile, on the other long-passage datasets in MRQA (DuoRC, TextbookQA, and NewsQA), XLNet is also inferior to the BERT baseline we implemented (which is itself slightly better than the official MRQA baseline).
We would appreciate any suggestions for improving performance on these datasets. Does some part of the code need to be changed (e.g., the sliding-window logic)? Thank you very much!
The XLNet model is trained on 8 V100 GPUs with batch_size=32 and seq_len=512 for 3 epochs; our BERT-large model uses the same configuration.
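For reference, the sliding-window handling we are asking about looks roughly like the sketch below. This is a minimal, self-contained illustration (not the actual code from either repo); the names `max_len` and `stride` are hypothetical, standing in for the usual `max_seq_length` / `doc_stride` parameters in BERT-style QA preprocessing.

```python
def sliding_windows(tokens, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows.

    Illustrative sketch of BERT-style doc_stride chunking; `max_len`
    and `stride` are assumed parameter names, not from either repo.
    """
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end of the passage
        start += stride
    return windows

# Example: a 782-token passage (the TriviaQA Web average) with seq_len=512
toks = list(range(782))
wins = sliding_windows(toks, max_len=512, stride=128)
# Every token is covered, and consecutive windows overlap by max_len - stride.
```

If XLNet's recurrence/memory mechanism interacts differently with these overlapping chunks than BERT does, the stride choice here could plausibly be one place to look.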