Closed: lizgzil closed this issue 5 years ago
Hmm. My two cents would be to make the quick fix. Note that it seems like you're seeking to change `s?` to `[a-z]*`, but what's described above would change:
r'(^|[\W]+)references?(?=[\W]+|$)'
to:
r'(^|[\W]+)reference?[a-z]+(?=[\W]+|$)'
which would match 'references' but also 'referencing', because of the `e?[a-z]+`.
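To make the difference between the three suffixes concrete, here is a minimal sketch for the keyword 'reference'. The `build` helper is hypothetical (it is not a policytool function) and it assumes the pattern is used with `re.search`, no flags, on lowercase text.

```python
import re

# Hypothetical helper, not taken from policytool: wraps a keyword plus a
# suffix in the same boundary groups quoted in this thread.
def build(keyword, suffix):
    return re.compile(
        r''.join([r'(^|[\W]+)', keyword, suffix, r'(?=[\W]+|$)'])
    )

variants = {
    "s?": build("reference", r"s?"),            # current behaviour
    "[a-z]*": build("reference", r"[a-z]*"),    # keyword kept whole
    "?[a-z]+": build("reference", r"?[a-z]+"),  # final 'e' becomes optional
}

words = ["reference", "references", "referencing"]

# For each variant, collect the words it matches.
matches = {
    name: [w for w in words if pattern.search(w)]
    for name, pattern in variants.items()
}

for name, found in matches.items():
    print(name, found)
```

Under these assumptions only the `?[a-z]+` form picks up 'referencing', because the `?` applies to the last letter of the keyword rather than to an `s`.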
@lizgzil now that you have evaluated the component, how often does this appear? Or, to put it differently, is this a big problem?
No, it isn't a big problem; the results are pretty similar whether you do this fix or not (at least for our metrics). With the new IOBE splitter and the new pdf data (all in https://github.com/wellcometrust/policytool/pull/161), the test 1 and 2 results are:
Current regex (with the `s?`):
-----Information about evaluation 1:-----
F1-score
0.94
Classification report for the endnotes section
F1-score (micro avg): 0.99
Classification report for the bibliograph section
F1-score (micro avg): 0.98
Classification report for the reference section
F1-score (micro avg): 0.85
-----Information about evaluation 2:-----
Mean normalised Levenshtein distance
0.6495045095526315
Strict accuracy (micro)
0.0
Lenient accuracy (micro)
0.2692307692307692
Better regex (with `?[a-z]+`):
-----Information about evaluation 1:-----
F1-score
0.944
Classification report for the endnotes section
F1-score (micro avg): 0.99
Classification report for the bibliograph section
F1-score (micro avg): 0.99
Classification report for the reference section
F1-score (micro avg): 0.85
-----Information about evaluation 2:-----
Mean normalised Levenshtein distance
0.6115218527857542
Strict accuracy (micro)
0.0
Lenient accuracy (micro)
0.3076923076923077
Sounds like an improvement given our metrics, even if it is a small one. I would say submit the change.
In policytool/policytool/pdf_parser/tools/extraction.py we search for the regex
regex = r''.join([r'(^|[\W]+)', keyword, r's?(?=[\W]+|$)'])
where keyword comes from a list given in policytool/policytool/resources/section_keywords.txt (currently the keywords are 'reference' and 'bibliograph').

Whilst this works for searching for the sections 'references' and 'reference' (when keyword = 'reference'), it doesn't find sections called 'bibliography' when keyword = 'bibliograph'.

I suggest changing to
regex = r''.join([r'(^|[\W]+)', keyword, r'?[a-z]+(?=[\W]+|$)'])
(i.e. changing the `s?` to `?[a-z]+`) so that it searches for the keyword followed by any run of letters: 'bibliography', 'bibliographical', 'bibliographkdfhdskjfhdskfjh'.

This is a question: shall we do a quick fix of this now (and do you have any comments on my new regex idea?) or shall we do something more complicated without regexes?
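As a rough illustration of the two constructions, the sketch below builds both regexes for each keyword and checks them against a few sample headings. The function names and sample headings are made up for this example, and it assumes the text has already been lowercased and that the pattern is applied with `re.search`; the real code in extraction.py may differ.

```python
import re

# Keywords as listed in the issue (section_keywords.txt).
KEYWORDS = ["reference", "bibliograph"]

def current_regex(keyword):
    # Existing construction: keyword followed by an optional 's'.
    return r''.join([r'(^|[\W]+)', keyword, r's?(?=[\W]+|$)'])

def proposed_regex(keyword):
    # Proposed construction: last keyword letter optional, then 1+ letters.
    return r''.join([r'(^|[\W]+)', keyword, r'?[a-z]+(?=[\W]+|$)'])

def find_keywords(line, make_regex):
    # Return the keywords whose regex matches somewhere in the line.
    return [k for k in KEYWORDS if re.search(make_regex(k), line)]

# Illustrative headings only; these are not from the evaluation data.
headings = ["references", "bibliography", "bibliographical notes", "introduction"]

for heading in headings:
    print(
        heading,
        find_keywords(heading, current_regex),
        find_keywords(heading, proposed_regex),
    )
```

With these sample headings, the current form finds only 'references', while the proposed form also finds the two 'bibliograph' variants and still ignores 'introduction'.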