zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Worse results compared with vanilla BERT-large model on reading comprehension tasks with longer passages #127

Open yangapku opened 5 years ago

yangapku commented 5 years ago

Hi,

We have tried XLNet and compared it with the vanilla BERT-large model. The benchmark we used is MRQA, a collection of several machine reading comprehension datasets from different domains with various characteristics. Contrary to our expectations, we found that XLNet performs worse than vanilla BERT on the datasets containing long passages (much longer than those in RACE).

These datasets are SearchQA and TriviaQA.

The BERT-large model achieves the following performance on these datasets (according to the results reported in the official MRQA GitHub repo):

| Dataset  | F1   |
|----------|------|
| SearchQA | 79.0 |
| TriviaQA | 74.7 |

In comparison, XLNet performs as follows:

| Dataset  | F1    |
|----------|-------|
| SearchQA | 78.45 |
| TriviaQA | 72.79 |

Meanwhile, on the other MRQA datasets containing longer passages (DuoRC, TextbookQA and NewsQA), XLNet is also inferior to the BERT baseline we implemented (which is itself a little better than the official MRQA baseline).
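For reference, the F1 numbers above are the standard SQuAD/MRQA-style token-overlap F1. A minimal sketch of the metric (whitespace tokenization only, leaving out the usual lowercasing and punctuation normalization):

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    # Token-overlap F1 between a predicted and a gold answer string.
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# token_f1("the answer is 42", "answer is 42") -> ~0.857 (precision 0.75, recall 1.0)
```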

We hope to get some suggestions for improving performance on these datasets. Does something in the code need to be changed (such as the sliding-window handling)? Thank you very much!
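For context, by "sliding windows" we mean the standard preprocessing that splits a passage longer than the model's input limit into overlapping chunks. A minimal sketch of that idea (illustrative names and defaults, not the repo's actual preprocessing code):

```python
def sliding_windows(doc_tokens, query_tokens, max_seq_length=512, doc_stride=128):
    # Yield (offset, chunk) pairs of overlapping passage chunks that fit the input budget.
    # Reserve room for the query plus (typically) three special tokens.
    max_doc_len = max_seq_length - len(query_tokens) - 3
    start = 0
    while start < len(doc_tokens):
        chunk = doc_tokens[start:start + max_doc_len]
        # Keep the offset so predicted answer spans can be mapped back to the full passage.
        yield start, chunk
        if start + max_doc_len >= len(doc_tokens):
            break
        start += doc_stride  # consecutive windows overlap by max_doc_len - doc_stride tokens
```

At prediction time the answer is usually taken as the highest-scoring span across all windows, so both the windowing parameters (max_seq_length, doc_stride) and the span-aggregation step seem worth checking for the long-passage datasets.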

yangapku commented 5 years ago

The XLNet model is trained on 8 V100 GPUs with batch_size=32 and seq_len=512 for 3 epochs. The BERT-large model we implemented follows the same configuration.
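Since many fine-tuning scripts expect a number of training steps rather than epochs, here is a back-of-the-envelope conversion of the configuration above (the feature count below is a placeholder, not MRQA's actual size):

```python
# Rough conversion from the stated configuration to optimizer steps.
num_train_features = 100_000   # placeholder: training features after sliding-window expansion
global_batch_size = 32         # as stated above, across 8 V100 GPUs (4 per GPU)
num_epochs = 3

steps_per_epoch = num_train_features // global_batch_size
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 3125 9375 with these placeholder numbers
```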

ecchochan commented 5 years ago

Could you share your results on SQuAD for comparison too?