wo / paperscraper

tracking and parsing new philosophy papers on the internet

prevent crashes on huge documents #82

Closed: wo closed this issue 8 years ago

wo commented 8 years ago

http://www.socsci.uci.edu/~jabarret/bio/publications/bookpdf.pdf is a 398MB scanned book that causes a memory error in requests:

    File "/home/wo/opp-tools/bin/../opp/scraper.py", line 155, in scrape
      process_link(li)
    File "/home/wo/opp-tools/bin/../opp/scraper.py", line 211, in process_link
      r = li.fetch(url=url, only_if_modified=not(force_reprocess))
    File "/home/wo/opp-tools/bin/../opp/scraper.py", line 799, in fetch
      if not r.text:
    File "/usr/lib/python3/dist-packages/requests/models.py", line 711, in text
      encoding = self.apparent_encoding
    File "/usr/lib/python3/dist-packages/requests/models.py", line 598, in apparent_encoding
      return chardet.detect(self.content)['encoding']
    File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 30, in detect
      u.feed(aBuf)
    File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 128, in feed
      if prober.feed(aBuf) == constants.eFoundIt:
    File "/usr/lib/python3/dist-packages/chardet/charsetgroupprober.py", line 64, in feed
      st = prober.feed(aBuf)
    File "/usr/lib/python3/dist-packages/chardet/sbcharsetprober.py", line 72, in feed
      aBuf = self.filter_without_english_letters(aBuf)
    File "/usr/lib/python3/dist-packages/chardet/charsetprober.py", line 57, in filter_without_english_letters
      aBuf = re.sub(b'([A-Za-z])+', b' ', aBuf)
    File "/usr/lib/python3.4/re.py", line 179, in sub
      return _compile(pattern, flags).sub(repl, string, count)
    MemoryError
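As the traceback shows, the crash is triggered by accessing `r.text`: when the response declares no charset, requests computes `apparent_encoding` by running `chardet.detect(self.content)` over the entire 398MB body, and chardet's regex pass exhausts memory. A minimal sketch of a guard against this; the helper name `safe_text` and the 10MB cap are my own illustration, not anything in the codebase:

```python
def safe_text(r, max_size=10 * 1024 * 1024):
    """Return r.text only when the body is small and declared as text;
    otherwise return None so the caller treats the response as binary.
    The 10MB cap is an arbitrary illustrative value."""
    content_type = r.headers.get('Content-Type', '')
    if len(r.content) > max_size or not content_type.startswith('text'):
        return None
    return r.text
```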

I should either catch that exception, or check the file size first (ideally before downloading anything) and refuse to handle ridiculously large documents.
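A minimal sketch of the size-check option, assuming requests; the function name `fetch_with_size_limit` and the 50MB cap are illustrative, not from the codebase. It tries a cheap HEAD request for Content-Length first, then streams the download with a running byte count in case the header is missing or wrong:

```python
import requests

MAX_FILESIZE = 50 * 1024 * 1024  # arbitrary 50MB cap for illustration

def fetch_with_size_limit(url, max_size=MAX_FILESIZE):
    """Download url, refusing documents larger than max_size bytes.
    Returns the raw body as bytes."""
    # Cheap first pass: a HEAD request often reports Content-Length,
    # which lets us refuse the document without downloading anything.
    try:
        head = requests.head(url, allow_redirects=True, timeout=30)
        size = int(head.headers.get('Content-Length', 0))
        if size > max_size:
            raise ValueError('refusing %s: %d bytes' % (url, size))
    except requests.RequestException:
        pass  # some servers reject HEAD; fall through to streaming

    # Stream the body and abort as soon as the cap is exceeded,
    # since Content-Length can be absent or wrong.
    r = requests.get(url, stream=True, timeout=30)
    chunks, total = [], 0
    for chunk in r.iter_content(chunk_size=64 * 1024):
        total += len(chunk)
        if total > max_size:
            r.close()
            raise ValueError('refusing %s: body exceeds %d bytes'
                             % (url, max_size))
        chunks.append(chunk)
    return b''.join(chunks)
```

Catching MemoryError would also work as a last resort, but checking the size up front avoids wasting bandwidth on a 398MB download that will be discarded anyway.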