yaolu / Multi-XScience

Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
MIT License

The high proportion of novel unigrams #4

Closed StevenLau6 closed 2 years ago

StevenLau6 commented 3 years ago

Thank you for sharing this dataset. According to your statistics, 42.33% of the unigrams in the target summaries are novel. Isn't that too high for a summarization task? I understand that authors tend to use new expressions when introducing others' previous work, so the proportions of novel bigrams, trigrams, and 4-grams in this dataset can be relatively higher than in other datasets. But novel unigrams may not be that common, even in academic papers. I worry that a large proportion of the information in the target summaries is not included in the inputs, which may be beyond the scope of text summarization. The settings of the dataset construction and the quality of the data sources may contribute to the high proportion of novel unigrams. Besides, I found that 3403 reference papers' abstracts in the test set are empty, and some of the abstracts are not the real abstracts. I understand it is difficult to ensure the quality of the data sources, and thank you for your efforts in building this dataset.
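For reference, here is a minimal sketch of how the two figures above could be checked. It is not the dataset authors' script: the function names (`novel_ngram_ratio`, `count_empty`) are hypothetical, tokenization is assumed to be lowercased whitespace splitting, and the ratio is computed over unique n-grams rather than token occurrences, so the numbers it produces may differ from the reported 42.33%.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of unique n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(sources: List[str], summary: str, n: int = 1) -> float:
    """Fraction of unique summary n-grams that appear in none of the sources.

    Assumption: lowercased whitespace tokenization; the dataset's own
    statistics may use a different tokenizer or counting scheme.
    """
    source_ngrams: Set[Tuple[str, ...]] = set()
    for text in sources:
        source_ngrams |= ngrams(text.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - source_ngrams
    return len(novel) / len(summary_ngrams)


def count_empty(abstracts: List[str]) -> int:
    """Count abstracts that are empty or contain only whitespace."""
    return sum(1 for a in abstracts if not a.strip())


if __name__ == "__main__":
    # Toy inputs for illustration only, not taken from the dataset.
    sources = ["the cat sat on the mat", ""]
    summary = "the dog sat on a rug"
    print(novel_ngram_ratio(sources, summary, n=1))
    print(novel_ngram_ratio(sources, summary, n=2))
    print(count_empty(sources))
```

Running the same functions over each test example (the reference abstracts as `sources`, the related-work paragraph as `summary`) would show whether the novelty is concentrated in the examples with empty reference abstracts.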