wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning

Use fewer characters in text to be split #506

Closed · lizgzil closed this 4 years ago

lizgzil commented 4 years ago

Description

This is a temporary fix to the current version of the deep reference parser used by Reach, so that we no longer hit the following error:

Traceback (most recent call last):
  File "./extract_refs_task.py", line 104, in <module>
    extracter.execute()
  File "/opt/reach/hooks/sentry.py", line 21, in wrapped_f
    return f(*args, **kwargs)
  File "./extract_refs_task.py", line 56, in execute
    for split_references, parsed_references in refs:
  File "/opt/reach/refparse/refparse.py", line 183, in yield_structured_references
    doc.section
  File "/usr/local/lib/python3.6/site-packages/deep_reference_parser/split_section.py", line 78, in split
    doc = nlp(text)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 392, in __call__
    Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 1154040 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

The error is raised when trying to split very large references sections into separate references.
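
For illustration, here is a minimal sketch of the kind of workaround the title describes: cap the number of characters handed to spaCy before splitting. The `MAX_CHARS` constant and the `safe_split` helper are hypothetical names for this sketch, not the actual code in deep_reference_parser:

```python
import spacy

# spaCy v2 refuses input longer than nlp.max_length (1,000,000 characters
# by default) because the v2 parser/NER can need roughly 1GB of temporary
# memory per 100,000 characters -- exactly the E088 error above.
MAX_CHARS = 1_000_000  # hypothetical cap, matching spaCy's default limit

def safe_split(text, nlp):
    """Hypothetical helper: truncate text so nlp() never raises E088.

    Anything past MAX_CHARS is silently dropped, which is why this is
    only a temporary fix -- references at the end of a very long
    section are lost.
    """
    return nlp(text[:MAX_CHARS])

# Usage: a tokenizer-only pipeline is enough to demonstrate the cap.
nlp = spacy.blank("en")
doc = safe_split("some reference text " * 100_000, nlp)  # 2M chars in
print(len(doc))  # tokens from the first 1M characters only
```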

Next steps for a better fix:
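
One plausible direction (my assumption, prompted by the E088 message rather than anything stated in this PR) would be to process the text in fixed-size windows instead of truncating it, so no references are dropped:

```python
def split_in_windows(text, nlp, window=100_000):
    """Hypothetical alternative: yield one spaCy Doc per window of text.

    Unlike truncation, nothing is dropped; each window stays well under
    nlp.max_length. The open problem is choosing window boundaries that
    do not cut an individual reference in half.
    """
    for start in range(0, len(text), window):
        yield nlp(text[start:start + window])
```

The E088 message also notes that raising `nlp.max_length` is probably safe when the parser and NER components are not in the pipeline, which may be another option here.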

Assumptions:
