Hello there! First of all, thank you so much for releasing your code as open source so that others like me can learn from it. I saw that the preprocess.py script requires many input files, including the candidate summaries. But those are generated by the model, right? I couldn't find them in the data. Also, in the example json file I noticed that both the untokenized and tokenized versions of the article seem to be sentence-tokenized, so what is the difference between them?
Candidate summaries are generated by a pre-trained abstractive model (in our work we use BART on CNNDM). Our code is for training the evaluation model in our paper. We've provided the preprocessed data along with the generated candidate summaries.
Untokenized text is used as model input, following the requirements of RoBERTa. Tokenized data is used for evaluation (computing ROUGE), following previous work.
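To make the distinction concrete, here is a minimal, hypothetical sketch of how the two versions of an article might relate; the actual field names and tokenizer in the repo may differ. The untokenized sentences are kept as raw strings because RoBERTa applies its own subword tokenizer, while the tokenized sentences are word-tokenized and lowercased to match how previous work computes ROUGE:

```python
import re

# Untokenized: original sentence strings, later fed to RoBERTa's own
# subword tokenizer as model input.
article_untok = [
    "The U.S. economy grew 2.3% last quarter.",
    "Analysts were surprised.",
]

def simple_word_tokenize(s):
    # Toy stand-in for a real word tokenizer (e.g. PTB-style),
    # splitting off punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", s)

# Tokenized: the same sentences, word-tokenized and lowercased,
# the normalized form used when computing ROUGE.
article_tok = [
    " ".join(simple_word_tokenize(s)).lower() for s in article_untok
]

print(article_tok)
```

So both versions are sentence-split, but only the tokenized one has punctuation separated and casing normalized, which changes ROUGE scores if you mix them up.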