Open qute012 opened 3 years ago
Hi, thanks for your issue.
What I do now (as opposed to what I did for the JCDL paper) is described below. I "just" compare the titles of the documents. The documents come from ake-datasets; I converted the files to jsonl (getting the title was sometimes not easy).
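A minimal sketch of what such a title comparison could look like, assuming each jsonl line is a JSON object with a `title` field. The function names, the `title` key, and the normalization (lowercasing, collapsing punctuation and whitespace) are assumptions made for illustration, not the exact code used in this repo.

```python
import json
import re


def normalize_title(title):
    """Lowercase and collapse non-alphanumeric runs into single spaces.

    This normalization is an assumption; the original comparison may differ.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def read_jsonl(path):
    """Yield one dict per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def common_documents(train_docs, test_docs, title_key="title"):
    """Return the training documents whose normalized title appears in test_docs."""
    test_titles = {
        normalize_title(d[title_key]) for d in test_docs if d.get(title_key)
    }
    return [
        d for d in train_docs
        if d.get(title_key) and normalize_title(d[title_key]) in test_titles
    ]
```

With hypothetical file names, usage would be `common_documents(read_jsonl("kp20k.train.jsonl"), read_jsonl("inspec.test.jsonl"))`. Documents without a title are skipped rather than matched against each other.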
For the version of KP20k that I shared in this repo, I must come clean: I don't remember exactly what was done, but it was something similar to Ken Chan's method. I did not remove duplicates inside the training set of KP20k, but I did remove the documents that also appear in PubMed (Schutz, 2008), KDD and WWW (Caragea, 2014).
For reference, I include the work by Ken Chan from keyphrase-generation-rl and compare the number of duplicates I found with the number they found.
Table: Common files between the training set of KP20k and the datasets listed.

| Corpus | split | this code | kenchan0226 |
|---|---|---|---|
| Inspec | test | 59 | 61 |
| ACM | test | 1005 | 161 |
| NUS | test | 135 | 137 |
| SemEval-2010 | test | 84 | 85 |
| KP20k | test | 1113 | 1395 |
| KP20k | valid | 1077 | 1353 |
| KP20k | train | 14023 | 17609 |
Thank you for the reply.
I understand. What does the KP20k train row in the table mean? Are those duplicates within the training set itself?
Yes, those are the duplicates within the training set itself (for each cluster of documents that share a title, only one is kept).
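The within-set deduplication described here can be sketched as follows. This is only an illustration under assumptions: documents are dicts with a `title` field, titles are normalized by lowercasing and collapsing punctuation, and the first document of each cluster is the one kept. The function names are hypothetical, and which document the original code keeps is not stated.

```python
import re


def normalize_title(title):
    """Lowercase and collapse non-alphanumeric runs (an assumed normalization)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedupe_by_title(docs, title_key="title"):
    """Keep the first document of each cluster of documents sharing a title.

    Returns (kept_docs, number_of_removed_docs).
    """
    seen = set()
    kept = []
    removed = 0
    for doc in docs:
        key = normalize_title(doc.get(title_key) or "")
        if key and key in seen:
            removed += 1  # a later member of an already-seen title cluster
            continue
        if key:
            seen.add(key)
        kept.append(doc)  # documents with an empty title are always kept
    return kept, removed
```

Keeping documents with empty titles avoids merging unrelated documents that simply lack a title.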
Thank you for the great work!
As you said, duplicate documents need to be removed from KP20k. Could you share the source code or the method used to check for identical documents between KP20k and the test sets?
Thank you.