Hello there! First of all, thank you so much for releasing your code as open source so that others like me can learn from it. I saw that the preprocess.py script requires many input files, including the candidate summaries. But those are generated by the model, right? I couldn't find them in the data. Also, in the example json file I noticed that both the untokenized and tokenized versions of the article seem to be sentence-tokenized, so what is the difference between them?
Candidate summaries are generated by a pre-trained abstractive model (in our work we use BART on CNNDM). Our code is for training the evaluation model in our paper. We've provided the preprocessed data along with the generated candidate summaries.
Untokenized text is used as model input, following the requirements of RoBERTa. Tokenized data is used for evaluation (computing ROUGE), following previous work.
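To make the distinction concrete, here is a minimal, hypothetical sketch of how the two versions of an article might relate; the actual field names and tokenizer in the repo may differ. The untokenized sentences are kept as raw strings because RoBERTa applies its own subword tokenizer, while the tokenized sentences are word-tokenized and lowercased to match how previous work computes ROUGE:

```python
import re

# Untokenized: original sentence strings, later fed to RoBERTa's own
# subword tokenizer as model input.
article_untok = [
    "The U.S. economy grew 2.3% last quarter.",
    "Analysts were surprised.",
]

def simple_word_tokenize(s):
    # Toy stand-in for a real word tokenizer (e.g. PTB-style),
    # splitting off punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", s)

# Tokenized: the same sentences, word-tokenized and lowercased,
# the normalized form used when computing ROUGE.
article_tok = [
    " ".join(simple_word_tokenize(s)).lower() for s in article_untok
]

print(article_tok)
```

So both versions are sentence-split, but only the tokenized one has punctuation separated and casing normalized, which changes ROUGE scores if you mix them up.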