yixinL7 / SimCLS

Code for our paper "SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization", ACL 2021

Train set and test set ranking distribution difference #16

Open Hannibal046 opened 2 years ago

Hannibal046 commented 2 years ago

Hi, the model used for CNNDM is facebook/bart-large-cnn, which means the model was actually fine-tuned on the CNNDM training set. Considering the neural model's remarkable capacity for memorization, the candidates generated on the training set for the evaluation model should be nearly perfect. Do I understand this correctly? How do you avoid this and still generate useful data for ranking? And was PEGASUS also fine-tuned on CNNDM before generating summary candidates? Thanks.

Hannibal046 commented 2 years ago

I checked the distribution of the provided data, and I found that the training set and test set show the same distribution. How is this achieved with a PLM fine-tuned on CNNDM?

[Screenshot: test.ipynb showing the score distributions of the provided train and test data]
yixinL7 commented 2 years ago

Good questions :)

Considering the neural model's remarkable capacity for memorization, the candidates generated on the training set for the evaluation model should be nearly perfect.

That's not exactly true because for BART and other models the checkpoint is selected based on their performance on the evaluation set, and if the model is overfitting too much on the training set it would not perform well on the evaluation set.

How do you avoid this and still generate useful data for ranking?

We found diverse beam search to be very useful in terms of generating diverse data. Please refer to https://github.com/yixinL7/BRIO/blob/main/gen_candidate.py.
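
For reference, here is a minimal sketch of diverse beam search candidate generation with the Hugging Face `generate` API (a simplified illustration, not the repo's `gen_candidate.py`; the decoding hyperparameters are assumed, not the exact ones used for the paper):

```python
# Minimal sketch: generate 16 diverse candidates with facebook/bart-large-cnn
# using diverse (group) beam search. Hyperparameters here are illustrative.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)

def generate_candidates(article, num_candidates=16, diversity_penalty=1.0):
    """Return `num_candidates` summaries produced by diverse beam search."""
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        num_beams=num_candidates,
        num_beam_groups=num_candidates,   # one beam per group -> maximally diverse
        diversity_penalty=diversity_penalty,
        num_return_sequences=num_candidates,
        max_length=140,
        min_length=55,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```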

And was PEGASUS also fine-tuned on CNNDM before generating summary candidates?

It is only fine-tuned on XSum.

I found that the training set and test set show the same distribution. How is this achieved with a PLM fine-tuned on CNNDM?

First, having similar ROUGE scores doesn't necessarily mean the data distribution is the same. For example, if you calculate the extractive oracle performance on the CNN/DM training set and test set, you will find the score is higher on the test set. Second, as I mentioned, the checkpoint (facebook/bart-large-cnn) is probably not overfitting too much on the training set. Also, sampling 16 outputs using diverse beam search may help to mitigate the effect of overfitting. Consider this: if the model had perfect performance on the training set, it would mean p_{model}(reference_summary) = 1, which may actually make the other candidate summaries much worse.
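
For illustration, a rough sketch of one common greedy extractive-oracle recipe (an assumed recipe using the `rouge-score` and `nltk` packages, not necessarily the exact procedure behind the numbers above):

```python
# Rough sketch of a greedy extractive oracle (assumed recipe; details vary
# across papers): greedily add source sentences while the average of
# ROUGE-1/ROUGE-2 F1 against the reference keeps improving.
from nltk import sent_tokenize            # assumes nltk's punkt data is installed
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def avg_r1_r2(summary, reference):
    s = scorer.score(reference, summary)  # rouge_scorer takes (target, prediction)
    return (s["rouge1"].fmeasure + s["rouge2"].fmeasure) / 2

def extractive_oracle(article, reference, max_sents=3):
    sents = sent_tokenize(article)
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i in range(len(sents)):
            if i in selected:
                continue
            cand = " ".join(sents[j] for j in sorted(selected + [i]))
            gains.append((avg_r1_r2(cand, reference), i))
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:                 # stop once no sentence improves the score
            break
        best, selected = score, selected + [idx]
    return " ".join(sents[j] for j in sorted(selected)), best
```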

Hannibal046 commented 2 years ago

Hi, thanks for the reply. But I am still a little bit confused.

That's not exactly true because for BART and other models the checkpoint is selected based on their performance on the evaluation set, and if the model is overfitting too much on the training set it would not perform well on the evaluation set.

This is true. But when using the model to generate candidates on the training set, the model has already seen the ground-truth summaries during training (p_{model}(reference_summary) = 1, as you mentioned), so how can the average max ROUGE score on the training set be almost equivalent to that on the test set?

Also, diverse beam search may mitigate the problem to some extent, but what I would expect, in terms of ROUGE score, is something like this:

diverse_beam_search_max_train > beam_search_train >>> diverse_beam_search_max_test > beam_search_test

And as you recommended in https://github.com/yixinL7/SimCLS/issues/14, I checked the paper SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization, and it indeed uses some special tricks for this mismatch problem:

[Screenshot from the SummaReranker paper (arXiv:2203.06569)]
yixinL7 commented 2 years ago

I'd like to emphasize my point that if the model is overfitting too much on the training set, it would not perform well on the evaluation set. So it's possible that the selected checkpoint doesn't really overfit the training data. It's really an empirical question in the end. So I'd recommend using the pre-trained model (facebook/bart-large-cnn) with the original generation script/method (beam search) to generate outputs on both the test set and the training set and evaluate whether your assumption is correct :)
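
A minimal sketch of that check (assuming the `datasets`, `transformers`, and `rouge-score` packages; sample sizes and decoding settings are illustrative, not the repo's script):

```python
# Minimal sketch of the suggested check: plain beam search with
# facebook/bart-large-cnn on random train/test samples, then compare ROUGE.
import random
import torch
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def avg_rouge(split, n=200, seed=0):
    data = load_dataset("cnn_dailymail", "3.0.0", split=split)
    idx = random.Random(seed).sample(range(len(data)), n)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for i in idx:
        ex = data[i]
        inputs = tokenizer(ex["article"], max_length=1024, truncation=True,
                           return_tensors="pt").to(device)
        out = model.generate(**inputs, num_beams=4, max_length=140, min_length=55,
                             no_repeat_ngram_size=3, early_stopping=True)
        summary = tokenizer.decode(out[0], skip_special_tokens=True)
        scores = scorer.score(ex["highlights"], summary)
        for k in totals:
            totals[k] += scores[k].fmeasure
    return {k: v / n for k, v in totals.items()}

print("train:", avg_rouge("train"))
print("test: ", avg_rouge("test"))
```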

Hannibal046 commented 2 years ago

Hi, sorry for taking your time, but I think this may not be an overfitting problem so much as a memorization problem. If the training set and validation set give the same results with respect to some metrics, what is the point of the validation set? I think the purpose of the validation set is to test the model's ability on data drawn from the same distribution as the training data, but not exactly the same data, given the model's memorization capacity.

I admit this is an empirical question. And thanks so much for providing the reranking data and the generation scripts. But given the large dataset (280k examples), the large model (bart-large), and the large beam size (16), I can't test it myself in a short time.

So just to be clear, the whole process of SimCLS on CNNDM is as follows (correct me if I'm wrong):

  1. Fine-tune BART-large on the CNNDM training set, picking the best checkpoint according to performance on the validation set (facebook/bart-large-cnn).
  2. Use that checkpoint to generate candidate summaries on the train, validation, and test sets of CNNDM with diverse beam search.
  3. Use the generated data to train a re-ranking model on the training set, picking the best checkpoint according to performance on the validation set.
  4. Use the trained re-ranking model to select the best candidate for each test-set example as the final result (a rough sketch of this step follows below).
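
A rough sketch of what step 4 could look like, assuming a RoBERTa-based scorer that ranks candidates by cosine similarity between the document's and each candidate's first-token representations (the trained re-ranker weights from this repo would be loaded in place of plain `roberta-base`):

```python
# Rough sketch of step 4 (assumed setup, not the repo's actual model code):
# encode the document and each candidate with RoBERTa, score candidates by
# cosine similarity of first-token representations, return the best one.
import torch
import torch.nn.functional as F
from transformers import RobertaModel, RobertaTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base").to(device).eval()
# In practice, load the fine-tuned re-ranker checkpoint here instead.

@torch.no_grad()
def encode(texts, max_length=512):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt").to(device)
    # first-token hidden state as the sequence representation
    return encoder(**batch).last_hidden_state[:, 0, :]

@torch.no_grad()
def select_best(document, candidates):
    doc_emb = encode([document])                     # [1, hidden]
    cand_emb = encode(candidates)                    # [num_candidates, hidden]
    scores = F.cosine_similarity(doc_emb, cand_emb)  # [num_candidates]
    return candidates[int(scores.argmax())], scores.tolist()
```
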
Hannibal046 commented 2 years ago

Hi, I used the bart-large-cnn checkpoint to evaluate the full test set and 2000 random examples from the training set, and it gives almost identical results. This really surprises me. Training on the training set and then also testing on the training set, isn't this 100% label leakage? I am so confused...

[Screenshot: ROUGE results on the full test set vs. 2000 random training examples]

Hannibal046 commented 2 years ago

I have to admit this surprises me a lot, because in my previous experience training a transformer model from scratch on translation or summarization tasks, the BLEU or ROUGE on the training set shows a totally different distribution from that on the test set. This is actually an interesting problem. I guess it may be a phenomenon unique to large PLMs. I am verifying this with a vanilla transformer and bart-base, and I will let you know if there is any progress. Thanks again for your detailed explanation!