wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning

Use fewer characters in text to be split #506

Closed · lizgzil closed this 4 years ago

lizgzil commented 4 years ago

Description

This is a temporary fix to the current version of the deep reference parser used by Reach, so that we no longer hit the following error:

Traceback (most recent call last):
  File "./extract_refs_task.py", line 104, in <module>
    extracter.execute()
  File "/opt/reach/hooks/sentry.py", line 21, in wrapped_f
    return f(*args, **kwargs)
  File "./extract_refs_task.py", line 56, in execute
    for split_references, parsed_references in refs:
  File "/opt/reach/refparse/refparse.py", line 183, in yield_structured_references
    doc.section
  File "/usr/local/lib/python3.6/site-packages/deep_reference_parser/split_section.py", line 78, in split
    doc = nlp(text)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 392, in __call__
    Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 1154040 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

The error is raised when trying to split very large references sections into separate references.
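
For illustration, here is a minimal sketch of the kind of workaround the title describes: cap the number of characters handed to spaCy before splitting. The `MAX_CHARS` constant and the `safe_split` helper are hypothetical names for this sketch, not the actual code in deep_reference_parser:

```python
import spacy

# spaCy v2 refuses input longer than nlp.max_length (1,000,000 characters
# by default) because the v2 parser/NER can need roughly 1GB of temporary
# memory per 100,000 characters -- exactly the E088 error above.
MAX_CHARS = 1_000_000  # hypothetical cap, matching spaCy's default limit

def safe_split(text, nlp):
    """Hypothetical helper: truncate text so nlp() never raises E088.

    Anything past MAX_CHARS is silently dropped, which is why this is
    only a temporary fix -- references at the end of a very long
    section are lost.
    """
    return nlp(text[:MAX_CHARS])

# Usage: a tokenizer-only pipeline is enough to demonstrate the cap.
nlp = spacy.blank("en")
doc = safe_split("some reference text " * 100_000, nlp)  # 2M chars in
print(len(doc))  # tokens from the first 1M characters only
```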

Next steps for a better fix:
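
One plausible direction (my assumption, prompted by the E088 message rather than anything stated in this PR) would be to process the text in fixed-size windows instead of truncating it, so no references are dropped:

```python
def split_in_windows(text, nlp, window=100_000):
    """Hypothetical alternative: yield one spaCy Doc per window of text.

    Unlike truncation, nothing is dropped; each window stays well under
    nlp.max_length. The open problem is choosing window boundaries that
    do not cut an individual reference in half.
    """
    for start in range(0, len(text), window):
        yield nlp(text[start:start + window])
```

The E088 message also notes that raising `nlp.max_length` is probably safe when the parser and NER components are not in the pipeline, which may be another option here.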

Assumptions:
