We have tried XLNet and compared it with the vanilla BERT-large model. The benchmark we used is the MRQA task, a collection of machine reading comprehension datasets from different domains with varying characteristics. Contrary to our expectations, we found that XLNet performs worse than vanilla BERT on datasets containing long passages (much longer than those in RACE).
These datasets are:
- SearchQA (avg. doc length 744 tokens)
- TriviaQA Web (avg. doc length 782 tokens)
The BERT-large model achieves the following performance on these datasets (per the results shown in the official MRQA GitHub repo):
| Dataset  | F1   |
|----------|------|
| SearchQA | 79.0 |
| TriviaQA | 74.7 |
In comparison, XLNet performs as follows:
| Dataset  | F1    |
|----------|-------|
| SearchQA | 78.45 |
| TriviaQA | 72.79 |
Meanwhile, on the other long-passage datasets in MRQA (DuoRC, TextbookQA, and NewsQA), XLNet is also inferior to the BERT baseline we implemented (which is itself slightly better than the official MRQA baseline).
We would appreciate any suggestions for improving performance on these datasets. Does some part of the code need to be changed (e.g., the sliding-window logic)? Thank you very much!
The XLNet model is trained on 8 V100 GPUs with batch_size=32 and seq_len=512 for 3 epochs; our BERT-large model uses the same configuration.
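For reference, the sliding-window handling we are asking about looks roughly like the sketch below. This is a minimal, self-contained illustration (not the actual code from either repo); the names `max_len` and `stride` are hypothetical, standing in for the usual `max_seq_length` / `doc_stride` parameters in BERT-style QA preprocessing.

```python
def sliding_windows(tokens, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows.

    Illustrative sketch of BERT-style doc_stride chunking; `max_len`
    and `stride` are assumed parameter names, not from either repo.
    """
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end of the passage
        start += stride
    return windows

# Example: a 782-token passage (the TriviaQA Web average) with seq_len=512
toks = list(range(782))
wins = sliding_windows(toks, max_len=512, stride=128)
# Every token is covered, and consecutive windows overlap by max_len - stride.
```

If XLNet's recurrence/memory mechanism interacts differently with these overlapping chunks than BERT does, the stride choice here could plausibly be one place to look.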