yixinL7 / BRIO

ACL 2022: BRIO: Bringing Order to Abstractive Summarization
330 stars 43 forks

There is a bug in the Preprocessed cnndm dataset #18

Closed aidejieceng closed 2 years ago

aidejieceng commented 2 years ago

When I downloaded your preprocessed cnndm data, I found a problem: the text and the summary do not correspond. Here is an example (cnndm_cased val 10002.json):

article: Trying to find a way to explain the birds and the bees to children can be a difficult task for any parent . So luckily for this father , a pair of raccoons took it upon themselves to make his job a little easier by giving a little demonstration in the garden . The hilarious footage captured in Seattle begins innocently enough with some excited children looking out of the window at two raccoons scaling their fence . The children watch on excitedly as the male raccoon chases the female from the fence and into the garden . As the youngsters speculate about whether they will jump from the fence -- before a little muffin man ' song interlude -- one of the raccoons descends into the garden closely pursued by the other . One child , sensing the tension , asks : Can raccoons fight ?

abstract: Carol Woodle is one of the most sought after celebrity look-a-likes in the business . Today Carol , 59 , appears as Oprah at 90th birthday parties , corporate events and women 's shelters . At one appearance she gave away iPads like Oprah 's famous car giveaways saying : ` You get an iPad and you get an iPad and YOU get an iPad . But after her first husband walked out on her and their three young boys she thought her life was over . I was so low I was even hospitalized for 30 days due to malnutrition and depression ' Being Oprah allowed her to put the boys ' through college .

yixinL7 commented 2 years ago

Hi, could you provide more details on your finding (e.g. how did you notice this example? How did it affect your experiments?)? I spot-checked around 10 examples and didn't find any misalignment between the source text and the summary. It's possible that there are several misaligned examples, but that shouldn't affect the experiments.

aidejieceng commented 2 years ago

I trained a model with Hugging Face's run_summarization.py script, using the "article" and "abstract" fields of each JSON file in the cnndm dataset, and found that the ROUGE-2 score is very low. So I compared "article" with "article_untok" and found that a lot of them are different. For example: 100001.json, 100002.json, 100004.json, and 100005.json in the training set.
Using "article_untok" and "abstract_untok" instead gives normal summary results: 21.1 ROUGE-2.
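A quick way to spot-check this kind of misalignment is to measure token overlap between the tokenized and untokenized fields of each example. The sketch below is not from the BRIO repository; the field names ("article", "article_untok") come from this thread, the file path is hypothetical, and a simple unigram-overlap ratio stands in for a full ROUGE computation:

```python
import json

def unigram_overlap(a: str, b: str) -> float:
    """Fraction of unique tokens in `a` that also appear in `b`."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta:
        return 0.0
    return len(ta & tb) / len(ta)

def looks_aligned(path: str, threshold: float = 0.5) -> bool:
    """Flag an example whose tokenized 'article' diverges from 'article_untok'.

    Assumes each field is either a plain string or a list of sentence strings.
    """
    with open(path) as f:
        ex = json.load(f)
    def flatten(v):
        return " ".join(v) if isinstance(v, list) else v
    article = flatten(ex["article"])
    untok = flatten(ex["article_untok"])
    # A well-aligned pair should share most of its vocabulary.
    return unigram_overlap(untok, article) >= threshold
```

For a misaligned pair like the one quoted above (raccoon article vs. Oprah abstract), the overlap ratio would be close to zero, so scanning a directory with `looks_aligned` should surface the bad files quickly.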
thangld201 commented 2 years ago

@yixinL7 I also found that the 'article' and 'abstract' fields in the preprocessed data are misaligned (I checked around ~10 examples, all of which were mismatched). Besides being irrelevant to each other, the 'article' text is much shorter than the 'article_untok' text, which I found quite strange. The 'article_untok' and 'abstract_untok' fields seem to be normal, though.

yixinL7 commented 2 years ago

Hi @aidejieceng, @thangld201, thanks a lot for bringing this to my attention! There is indeed a misalignment between the tokenized input articles and reference summaries for the CNN/DailyMail dataset. The cause is that special characters in some input articles caused the text to be split into multiple lines after tokenization. Luckily, this misalignment has no actual effect, as the tokenized articles are not used for either training or evaluation. Nonetheless, I've fixed the problem and updated the preprocessed datasets here: https://github.com/yixinL7/BRIO/blob/main/README.md#preprocessed-data.
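The failure mode described above can be shown with a toy example (this is a simplified illustration, not the repository's actual preprocessing code). If tokenization injects a newline into one article, a line-aligned file pairs every subsequent article line with the wrong summary:

```python
# Three articles; tokenization left a stray newline in the second one.
articles = ["first article", "text with a special\ncharacter", "third article"]
summaries = ["summary one", "summary two", "summary three"]

# Writing articles one per line, then reading them back line by line,
# turns 3 articles into 4 lines.
tokenized_lines = "\n".join(articles).split("\n")

# From the split point onward, every article line pairs with the wrong summary.
pairs = list(zip(tokenized_lines, summaries))
```

Here `pairs[2]` becomes `("character", "summary three")`: the fragment of the second article is paired with the third summary, which matches the off-by-one mismatch reported in this thread.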

Thank you again!