The high proportion of novel unigrams

Thank you for sharing this dataset. According to your statistics, 42.33% of the unigrams in the target summaries are novel, i.e. they do not appear in the inputs. Isn't that too high for a summarization task? I understand that authors tend to use new expressions when introducing others' previous work, so the proportions of novel bigrams, trigrams, and 4-grams in this dataset can reasonably be higher than in other datasets. But novel unigrams should be much less common, even in academic papers. I worry that a large proportion of the information in the target summaries is not contained in the inputs, which may be beyond the scope of text summarization. The dataset construction settings and the quality of the data sources may both contribute to the high proportion of novel unigrams.

Besides, I found that 3403 reference papers' abstracts in the test set are empty, and some of the abstracts are not the real abstracts. I understand it is difficult to ensure the quality of the data sources, and thank you for your efforts in building this dataset.
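For reference, here is a minimal sketch of how I understand the novel n-gram proportion. It assumes lowercased whitespace tokenization and counts n-gram occurrences (tokens, not unique types); the dataset's actual statistic may use a different tokenizer or definition, and the function names are only illustrative.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source: str, summary: str, n: int = 1) -> float:
    """Fraction of summary n-grams that never occur in the source.

    Assumption: lowercased whitespace tokenization, which is cruder
    than what a real evaluation script would likely use.
    """
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        # No n-grams in the summary (e.g. empty text): report 0.0.
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)
```

Calling `novel_ngram_ratio(inputs, target, n=1)` on each example, with `inputs` being the concatenated source abstracts, and averaging over the test set would show how much of the 42.33% depends on tokenization and on the empty abstracts.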

yaolu / Multi-XScience

The high proportion of novel unigrams #4