wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning

Fix the regex pdf parser uses #152

Closed: lizgzil closed this issue 5 years ago

lizgzil commented 5 years ago

In policytool/policytool/pdf_parser/tools/extraction.py we search for section headings using the regex regex = r''.join([r'(^|[\W]+)', keyword, r's?(?=[\W]+|$)']), where keyword comes from a list given in policytool/policytool/resources/section_keywords.txt (currently the keywords are 'reference' and 'bibliograph').

Whilst this works for finding the sections 'references' and 'reference' (when keyword = 'reference'), it doesn't find sections called 'bibliography' when keyword = 'bibliograph'.
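
A minimal sketch reproducing the miss, assuming the pattern is applied to lower-cased text (the keywords in section_keywords.txt are lower case):

```python
import re

def build_regex(keyword, suffix=r's?(?=[\W]+|$)'):
    # Same construction as extraction.py: optional anchor, keyword, suffix.
    return r''.join([r'(^|[\W]+)', keyword, suffix])

current = build_regex('bibliograph')             # (^|[\W]+)bibliographs?(?=[\W]+|$)
print(bool(re.search(current, 'bibliography')))  # False: 'y' isn't covered by 's?'
print(bool(re.search(build_regex('reference'), 'references')))  # True: plural 's' is
```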

I suggest changing to regex = r''.join([r'(^|[\W]+)', keyword, r'?[a-z]+(?=[\W]+|$)'])

(i.e. changing the s? to ?[a-z]+) so that it matches any keyword with any number of letters appended: 'bibliography', 'bibliographical', even 'bibliographkdfhdskjfhdskfjh'.
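
A quick check of the proposed pattern (note the leading '?' now applies to the keyword's final letter, which becomes relevant below):

```python
import re

proposed = r''.join([r'(^|[\W]+)', 'bibliograph', r'?[a-z]+(?=[\W]+|$)'])
# -> (^|[\W]+)bibliograph?[a-z]+(?=[\W]+|$)
for heading in ['bibliography', 'bibliographical', 'bibliographkdfhdskjfhdskfjh']:
    print(heading, bool(re.search(proposed, heading)))  # all True
```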

This is a question: shall we do a quick fix of this now (and do you have any comments on my new regex idea?), or shall we do something more complicated without regexes?

hblanks commented 5 years ago

Hmm. My two cents would be to make the quick fix. Note that it seems like you're seeking to change s? to [a-z]*, for what's described above would change:

r'(^|[\W]+)references?(?=[\W]+|$)'

to:

r'(^|[\W]+)reference?[a-z]+(?=[\W]+|$)'

which would match references but also referencing because of the e?[a-z]+.
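
To make the over-match concrete, here is a small comparison of the proposed suffix against the [a-z]* variant mentioned above (a sketch, not the shipped fix):

```python
import re

proposed = r'(^|[\W]+)reference?[a-z]+(?=[\W]+|$)'  # optional 'e', then 1+ letters
variant  = r'(^|[\W]+)reference[a-z]*(?=[\W]+|$)'   # full keyword, then 0+ letters
for heading in ['reference', 'references', 'referencing']:
    print(heading,
          bool(re.search(proposed, heading)),
          bool(re.search(variant, heading)))
# reference   True True   ([a-z]+ can consume the final 'e' itself)
# references  True True
# referencing True False  (only the optional 'e' lets '?[a-z]+' match this)
```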

nsorros commented 5 years ago

@lizgzil now that you've evaluated the component, how often does this appear? Or, to put it differently, is this a big problem?

lizgzil commented 5 years ago

No, it isn't a big problem; the results are pretty similar whether you do this fix or not (at least for our metrics). With the new IOBE splitter and the new PDF data (all in https://github.com/wellcometrust/policytool/pull/161), the evaluation 1 and 2 results are:

current regex (with the s?):

-----Information about evaluation 1:-----
F1-score
0.94

Classification report for the endnotes section
F1-score (micro avg): 0.99

Classification report for the bibliograph section
F1-score (micro avg): 0.98

Classification report for the reference section
F1-score (micro avg): 0.85

-----Information about evaluation 2:-----

Mean normalised Levenshtein distance
0.6495045095526315

Strict accuracy (micro)
0.0

Lenient accuracy (micro)
0.2692307692307692

better regex (with ?[a-z]+):

-----Information about evaluation 1:-----
F1-score
0.944

Classification report for the endnotes section
F1-score (micro avg): 0.99

Classification report for the bibliograph section
F1-score (micro avg): 0.99

Classification report for the reference section
F1-score (micro avg): 0.85

-----Information about evaluation 2:-----

Mean normalised Levenshtein distance
0.6115218527857542

Strict accuracy (micro)
0.0

Lenient accuracy (micro)
0.3076923076923077
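
For context, a rough sketch of how a mean normalised Levenshtein distance is typically computed; the authoritative definition is in the evaluation code, so the normalisation by the longer string's length here is an assumption:

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mean_normalised_levenshtein(pairs):
    # pairs: (predicted, actual) reference-section strings, one per document.
    return sum(levenshtein(a, b) / max(len(a), len(b), 1)
               for a, b in pairs) / len(pairs)
```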

nsorros commented 5 years ago

Sounds like an improvement given our metrics, even if it is a small one. I would say submit the change.