wellcometrust / covid19

Covid QA system for Kaggle challenge

Add language filter and don't include duplicates in paper text #19

Closed lizgzil closed 4 years ago

lizgzil commented 4 years ago

Languages found in all the metadata docs were: Counter({'en': 57243, 'fr': 170, 'es': 149, 'it': 12, 'nl': 12, 'de': 11, 'pt': 5, 'ca': 4, 'ro': 2, 'et': 1, 'af': 1})

Applying the language filter and removing duplicates reduces data_text from 3353 to 2757.
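For readers who want a feel for the two steps, here is a minimal sketch, not the code from this PR. It assumes each document is a dict with a "text" field, that "duplicate" means identical text after whitespace and case normalisation, and that a language detector is passed in as a callable. The names filter_english_unique and toy_detect are hypothetical, and toy_detect is only a stand-in for a real language-detection library.

```python
from collections import Counter


def filter_english_unique(docs, detect_language, keep="en"):
    """Keep documents in the `keep` language, dropping repeated texts.

    Returns (kept_docs, language_counts). Assumption: each doc is a dict
    with a "text" key; this layout is illustrative, not taken from the repo.
    """
    counts = Counter()
    seen = set()
    kept = []
    for doc in docs:
        lang = detect_language(doc["text"])
        counts[lang] += 1
        if lang != keep:
            continue  # language filter
        # Normalise whitespace and case so trivially different copies match.
        key = " ".join(doc["text"].split()).lower()
        if key in seen:
            continue  # duplicate text, skip it
        seen.add(key)
        kept.append(doc)
    return kept, counts


def toy_detect(text):
    """Hypothetical stand-in detector: flags text starting with 'le ' as French."""
    return "fr" if text.lower().startswith("le ") else "en"


docs = [
    {"text": "The virus spreads"},
    {"text": "the  virus spreads"},
    {"text": "Le virus se propage"},
]
kept, counts = filter_english_unique(docs, toy_detect)
print(len(kept), dict(counts))
```

Passing the detector in as an argument keeps the sketch independent of any particular library; a real detector can be swapped in without changing the filtering logic.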