Hannibal046 opened this issue 2 years ago
I checked the distribution of the given data, and I found that the train and test sets give the same data distribution. How is this achieved using a PLM fine-tuned on CNNDM?
Good questions :)
Considering the neural model's amazing capacity for memorization, the candidate generation on the training set for the evaluation model should be nearly perfect.
That's not exactly true, because for BART and other models the checkpoint is selected based on its performance on the evaluation set, and if the model is overfitting too much on the training set it would not perform well on the evaluation set.
How do you avoid this to generate useful data for ranking?
We found diverse beam search to be very useful in terms of generating diverse data. Please refer to https://github.com/yixinL7/BRIO/blob/main/gen_candidate.py.
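For reference, the candidate generation looks roughly like the sketch below; the exact diversity penalty and length settings are the ones in the linked script, and the values here are only assumptions:

```python
# Sketch of candidate generation with diverse beam search on facebook/bart-large-cnn.
# The diversity penalty and length settings below are illustrative assumptions;
# see gen_candidate.py for the values actually used.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device).eval()

def generate_candidates(article, num_candidates=16):
    """Return `num_candidates` diverse summaries for a single article."""
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=num_candidates,        # one beam per candidate
            num_beam_groups=num_candidates,  # each beam in its own group -> diverse beam search
            diversity_penalty=1.0,           # assumed value
            num_return_sequences=num_candidates,
            max_length=142,
            min_length=56,
            no_repeat_ngram_size=3,
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```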
And is Pegasus also fine-tuned on CNNDM before generating summary candidates?
It is only fine-tuned on XSum.
and I found that the train and test sets give the same data distribution. How is this achieved using a PLM fine-tuned on CNNDM?
Firstly, having similar ROUGE scores doesn't necessarily mean the data distribution is the same. For example, if you calculate the extractive oracle performance on the training set and test set on CNN/DM, you will find the score is higher on the test set.
Second, as I mentioned, the checkpoint (facebook/bart-large-cnn) is probably not overfitting too much on the training set.
Also, sampling 16 outputs using diverse beam search may help to mitigate the effect of overfitting. Consider this: if the model had perfect performance on the training set, it would mean p_{model}(reference_summary) = 1, which would leave almost no probability mass for any other output and may actually make the other candidate summaries much worse.
Hi, thanks for the reply. But I am still a little bit confused.
That's not exactly true, because for BART and other models the checkpoint is selected based on its performance on the evaluation set, and if the model is overfitting too much on the training set it would not perform well on the evaluation set.
This is true. But when using the model to generate candidates on the training set, the model has already seen the ground-truth summary during training (p_{model}(reference_summary) = 1, as you mentioned), so how could the average max ROUGE score of the training set be almost equivalent to that of the test set?
Also, Diverse Beam Search may mitigate the problem to some extent, but what I would expect, in terms of ROUGE score, is something like this (a sketch of how I would measure these is included below):
diverse_beam_search_max_train > beam_search_train >>> diverse_beam_search_max_test > beam_search_test
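To make this concrete, here is a minimal sketch of how I would compute these four numbers, assuming per-document candidate lists and references are already available (the data layout is a placeholder, not this repo's format):

```python
# Hypothetical helper for the comparison above: given per-document candidate lists
# and references, compute the mean best-candidate ROUGE-1 ("diverse_beam_search_max")
# and the mean ROUGE-1 of a single output ("beam_search").
import evaluate

rouge = evaluate.load("rouge")

def mean_max_rouge1(candidates_per_doc, references):
    """candidates_per_doc: list of lists of summaries; references: list of gold summaries."""
    best = []
    for cands, ref in zip(candidates_per_doc, references):
        scores = [rouge.compute(predictions=[c], references=[ref])["rouge1"] for c in cands]
        best.append(max(scores))
    return sum(best) / len(best)

def mean_rouge1(predictions, references):
    """ROUGE-1 of one prediction per document (e.g. the plain beam search output)."""
    return rouge.compute(predictions=predictions, references=references)["rouge1"]
```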
And as you recommended in https://github.com/yixinL7/SimCLS/issues/14, I checked the paper SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization, and there are indeed some special tricks for this mismatch problem.
I'd like to emphasize my point that if the model is overfitting too much on the training set it would not perform well on the evaluation set. So it's possible that the selected checkpoint doesn't really overfit the training data.
It's really an empirical question in the end. So I'd recommend using the pre-trained model (facebook/bart-large-cnn) with the original generation script/method (beam search) to generate the outputs on both the test set and the training set and evaluate whether your assumption is correct :)
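A rough sketch of what that check could look like (the sample size and generation settings here are my assumptions, not the exact configuration used for the released outputs):

```python
# Rough sketch of the suggested sanity check: plain beam search with
# facebook/bart-large-cnn on small samples of the CNN/DM train and test splits,
# then compare ROUGE. Sample size and generation settings are assumptions.
import evaluate
import torch
from datasets import load_dataset
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device).eval()
rouge = evaluate.load("rouge")

def beam_search_rouge(split, n=200):
    """Generate summaries for the first `n` examples of a CNN/DM split and score them."""
    data = load_dataset("cnn_dailymail", "3.0.0", split=f"{split}[:{n}]")
    predictions = []
    for article in data["article"]:
        inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            output = model.generate(**inputs, num_beams=4, max_length=142,
                                    min_length=56, no_repeat_ngram_size=3)
        predictions.append(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
    return rouge.compute(predictions=predictions, references=data["highlights"])

print("train:", beam_search_rouge("train"))
print("test :", beam_search_rouge("test"))
```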
Hi, sorry for taking your time. But here I think it is maybe not an overfitting problem but a memorization problem. If the train set and validation set give the same results with respect to some metrics, what is the point of the validation set? I think the meaning of the validation set is to test the model's ability on data from the same distribution as the training data, but not exactly the same as the training data, considering the model's memorization capacity.
I admit this is an empirical problem. And thanks so much for providing the reranking data and the generation scripts. But considering the large dataset (280k examples), the large model (bart-large), and the large beam size (16), I can't test it myself in a short time.
So just to be clear, the whole process of SimCLS on CNNDM is as follows (correct me if wrong):
Hi, I used the bart-large-cnn checkpoint to evaluate the full test set and 2000 random examples from the train set, and it gives almost identical results, which really surprises me. Training on the train set and also testing on the train set, isn't this 100% label leakage? I am so confused...
I have to admit this surprises me a lot, because in my previous experience of training a transformer model from scratch on translation or summarization tasks, the BLEU or ROUGE of the training set shows a totally different distribution from that of the test set. This is actually an interesting problem; I guess it may be a phenomenon unique to large PLMs. I am verifying this with a vanilla transformer and bart_base, and I will let you know if there is any progress. Thanks again for your detailed explanation!
Hi, since the model used for CNNDM is facebook/bart-large-cnn, the model actually got fine-tuned on the CNNDM training set. Considering the neural model's amazing capacity for memorization, the candidate generation on the training set for the evaluation model should be nearly perfect. Do I understand this correctly? How do you avoid this to generate useful data for ranking? And is Pegasus also fine-tuned on CNNDM before generating summary candidates? Thanks.