ygorg / JCDL_2020_KPE_Eval

Repository containing code and results from the Large Scale Evaluation of Keyphrase Extraction Models, published in JCDL 2020.

Remove duplicate documents from KP20k and test datasets. #2

Open qute012 opened 3 years ago

qute012 commented 3 years ago

Thank you for great work!

As you said, duplicate documents need to be removed from KP20k. Could you share the source code or the method you used to check whether a document from KP20k also appears in the test sets?

> We also found out that the training set of KP20k contains a non-negligible number of documents from the test sets of other datasets. We removed those documents prior to training.

Thank you.

ygorg commented 3 years ago

Hi, thanks for your issue.

What I do now (as opposed to what was done for the JCDL paper) is below: I "just" compare the titles of the documents. The documents come from ake-datasets; I transformed the files to jsonl (it was sometimes not easy to get the title).
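
For reference, a line of those jsonl files looks roughly like the made-up example below; only `id` and `title` matter for the matching, and the exact shape of the `keyword` field may differ from dataset to dataset.

```python
import json

# Made-up example of one line in e.g. Inspec.test.jsonl
# (field names follow the ['id', 'title', 'keyword'] comment in the script below)
example = {
    "id": "inspec-1234",
    "title": "Large Scale Evaluation of Keyphrase Extraction Models",
    "keyword": ["keyphrase extraction", "evaluation"],
}
print(json.dumps(example))
```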

Code:

```python
import re
import json
import string
from tqdm import tqdm

# Create a dictionary that maps any punctuation/whitespace to None
punct_tab = str.maketrans(dict.fromkeys(string.punctuation))
for c in '\t\n \r':  # add whitespace characters
    punct_tab[ord(c)] = None


def preproc(title):
    # Preprocess a title to remove any punctuation/whitespace and
    # make it lowercase
    return title.strip().lower().translate(punct_tab)


# `jsonl` files with ['id', 'title', 'keyword']
dataset_files = [
    'KP20k.test.jsonl', 'ACM.test.jsonl', 'SemEval-2010.test.jsonl',
    'Inspec.test.jsonl', 'NUS.test.jsonl', 'KP20k.train.json'
]

# Cluster documents according to their title
# Build a Dict[preprocessed title, List[(id, original title, path)]]
title_mapping = {}
for path in dataset_files:
    with open(path) as g:
        # Load the data
        g = map(json.loads, g)
        # Only keep id, title, keywords
        # And preprocess the titles to increase matching
        g = map(lambda d: (d['id'], d['title'], preproc(d['title'])), g)
        for id_, og_title, pr_title in g:
            # Fill the dict
            if pr_title in title_mapping:
                title_mapping[pr_title].append((id_, og_title, path))
            else:
                title_mapping[pr_title] = [(id_, og_title, path)]


def duplicates(a, b, title_mapping):
    dups = {}
    for cluster in title_mapping.values():
        if len(cluster) <= 1:
            continue
        doc_in_a = [d for d in cluster if a in d[2]]
        doc_in_b = [d for d in cluster if b in d[2]]
        if not doc_in_a or not doc_in_b:
            continue
        if a != b:
            for d in doc_in_a:
                dups[d[0]] = doc_in_b
        else:
            # Keep one document from the cluster
            for d in doc_in_a[1:]:
                dups[d[0]] = doc_in_b
    return dups


duplicates('KP20k.train', 'Inspec.test', title_mapping)
```
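
For completeness, a filtering step on top of `duplicates` could look like the sketch below. This is not the exact script I used for the shared version; the output file name is a placeholder, and it reuses `duplicates` and `title_mapping` from the script above.

```python
import json

# Collect the KP20k.train ids that collide with any of the test sets
test_splits = ['KP20k.test', 'ACM.test', 'SemEval-2010.test',
               'Inspec.test', 'NUS.test']
ids_to_drop = set()
for split in test_splits:
    # keys of the returned dict are the ids of the KP20k.train documents
    ids_to_drop.update(duplicates('KP20k.train', split, title_mapping))

# Rewrite the training file without the flagged documents
# (output file name is a placeholder)
with open('KP20k.train.json') as fin, \
     open('KP20k.train.filtered.jsonl', 'w') as fout:
    for line in fin:
        if json.loads(line)['id'] not in ids_to_drop:
            fout.write(line)
```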

For the version of KP20k that I shared in this repo I must come clean: I don't remember exactly what was done, but it was something similar to Ken Chan's method. I did not remove duplicates inside the training set of KP20k, but I did remove the documents appearing in PubMed (Schutz, 2008), KDD and WWW (Caragea, 2014).

For reference, I include the deduplication work by Ken Chan from keyphrase-generation-rl and compare the number of duplicates I found to what they found.

Table: Common files between the training set of KP20k and the datasets listed.

| Corpus | Split | this code | kenchan0226 |
| --- | --- | --- | --- |
| Inspec | test | 59 | 61 |
| ACM | test | 1005 | 161 |
| NUS | test | 135 | 137 |
| SemEval-2010 | test | 84 | 85 |
| KP20k | test | 1113 | 1395 |
| KP20k | valid | 1077 | 1353 |
| KP20k | train | 14023 | 17609 |

qute012 commented 3 years ago

Thank you for the reply.

I understand. What does the "KP20k train" row in the table mean? Are those duplicates within the training set itself?

ygorg commented 3 years ago

Yes, those are the duplicates within the training set itself (for each cluster of documents that share a title, only one is kept).
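
In terms of the script above, that case corresponds to calling `duplicates` with the same split twice; a rough sketch (not necessarily the exact call that produced the table):

```python
# Within-set duplicates: for every cluster of KP20k.train documents that
# share a preprocessed title, the first one is kept and the ids of the
# remaining ones end up as keys of the returned dict.
self_dups = duplicates('KP20k.train', 'KP20k.train', title_mapping)
print(len(self_dups))  # should roughly match the "KP20k train" row above
```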