webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Invalid WARC record #317

Closed: rgaudin closed this issue 6 months ago

rgaudin commented 1 year ago

The following run completed, reaching the size threshold.

Running browsertrix-crawler crawl: crawl --waitUntil load --depth -1 --timeout 90 --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --sizeLimit 4294967296 --diskUtilization 90 --timeLimit 7200 --url https://www.wikidata.org/wiki/Q4414 --userAgentSuffix Youzim.it+contact+zimfarm@kiwix.org --cwd /output/.tmp0lehhxlr --statsFilename /output/crawl.json
{"logLevel":"info","timestamp":"2023-05-22T06:50:10.260Z","context":"general","message":"Browsertrix-Crawler 0.10.0-beta.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T06:50:10.264Z","context":"general","message":"Seeds","details":[{"url":"https://www.wikidata.org/wiki/Q4414","include":["/^https?:\\/\\/www\\.wikidata\\.org\\/wiki\\//"],"exclude":[],"scopeType":"prefix","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}

...

{"logLevel":"info","timestamp":"2023-05-22T08:12:00.817Z","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.wikidata.org/wiki/Q63144794","workerid":0}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:00.821Z","context":"general","message":"Size threshold reached 4302013563 >= 4294967296, stopping","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:00.844Z","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:00.844Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.484Z","context":"general","message":"Saving crawl state to: /output/.tmp0lehhxlr/collections/crawl-20230522065007735/crawls/crawl-20230522081200-177caf49d5fa.yaml","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.710Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":977,"total":22667,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.712Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.754Z","context":"general","message":"Crawl status: interrupted","details":{}}

For some reason, some of the produced WARC files are invalid (not readable via warcio):

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/warcio/recordloader.py", line 224, in _detect_type_load_headers
    rec_headers = self.warc_parser.parse(stream, statusline)
  File "/usr/local/lib/python3.10/dist-packages/warcio/statusandheaders.py", line 270, in parse
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: ÄiŒPQFNՃ¾À‘¨|%MÑ5"†¥Ø%ÍKÏäoq¾‘ÿ¼ÁdóY€–´…Wj³ß*TËJzvØE1$æD*eX€W(hÒ3ò>yD̼1¸’¥à“*îïx$I”ÍŒ0:Q2ÖqQô©ÄÀ˾ª[²d¾ª¯ŒþéÈ+@‘i¿^5˜_N*þäºÏ ™žP\Å#‘Vÿä÷u ÏB¸¶â©2%†Â<Œ÷;fÎ֌,ç‡â”&Ô/-™´û៝½[ÃIÅH÷.ŒÜZ[tDäëÎ

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 514, in <module>
    zimit()
  File "/usr/bin/zimit", line 419, in zimit
    return warc2zim(warc2zim_args)
  File "/usr/local/lib/python3.10/dist-packages/warc2zim/main.py", line 819, in warc2zim
    return warc2zim.run()
  File "/usr/local/lib/python3.10/dist-packages/warc2zim/main.py", line 374, in run
    self.find_main_page_metadata()
  File "/usr/local/lib/python3.10/dist-packages/warc2zim/main.py", line 453, in find_main_page_metadata
    for record in self.iter_all_warc_records():
  File "/usr/local/lib/python3.10/dist-packages/warc2zim/main.py", line 450, in iter_all_warc_records
    yield from iter_warc_records(self.inputs)
  File "/usr/local/lib/python3.10/dist-packages/warc2zim/main.py", line 739, in iter_warc_records
    for record in buffering_record_iter(ArchiveIterator(fh), post_append=True):
  File "/usr/local/lib/python3.10/dist-packages/cdxj_indexer/bufferiter.py", line 17, in buffering_record_iter
    for record in record_iter:
  File "/usr/local/lib/python3.10/dist-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/usr/local/lib/python3.10/dist-packages/warcio/archiveiterator.py", line 257, in _next_record
    record = self.loader.parse_record_stream(self.reader,
  File "/usr/local/lib/python3.10/dist-packages/warcio/recordloader.py", line 86, in parse_record_stream
    _detect_type_load_headers(stream,
  File "/usr/local/lib/python3.10/dist-packages/warcio/recordloader.py", line 229, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: ÄiŒPQFNՃ¾À‘¨|%MÑ5"†¥Ø%ÍKÏäoq¾‘ÿ¼ÁdóY€–´…Wj³ß*TËJzvØE1$æD*eX€W(hÒ3ò>yD̼1¸’¥à“*îïx$I”ÍŒ0:Q2ÖqQô©ÄÀ˾ª[²d¾ª¯ŒþéÈ+@‘i¿^5˜_N*þäºÏ ™žP\Å#‘Vÿä÷u ÏB¸¶â©2%†Â<Œ÷;fÎ֌,ç‡â”&Ô/-™´û៝½[ÃIÅH÷.ŒÜZ[tDäëÎ
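(A quick way to find which files are affected: a minimal sketch using warcio, the same library that raised the exception above; the path argument is whatever WARC you want to test.)

from warcio.archiveiterator import ArchiveIterator
from warcio.exceptions import ArchiveLoadFailed

def is_valid_warc(path):
    # Returns True if every record in the WARC parses, False otherwise.
    try:
        with open(path, "rb") as fh:
            for record in ArchiveIterator(fh):
                # force a full read so truncated payloads are caught too
                record.content_stream().read()
        return True
    except ArchiveLoadFailed as exc:
        print(f"{path}: {exc}")
        return False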
rgaudin commented 1 year ago

This was with 0.10.0-beta.0.

tw4l commented 6 months ago

This should no longer happen in 1.1.x and later: we added checks to ensure that WARC records are fully written before the crawler shuts down, as long as the shutdown is graceful. Feel free to reopen if you encounter this again!
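For illustration, a minimal Python sketch of the pattern described above; this is not browsertrix-crawler's actual code (the crawler itself is written in JavaScript), just the general idea of flushing and closing the WARC output before exiting on a graceful interrupt:

import os
import signal
import sys

class SafeWarcOutput:
    # illustrative only: append serialized records and guarantee they reach disk
    def __init__(self, path):
        self._fh = open(path, "ab")

    def write_record(self, serialized: bytes):
        self._fh.write(serialized)

    def close(self):
        self._fh.flush()
        os.fsync(self._fh.fileno())  # ensure buffered bytes are on disk
        self._fh.close()

warc = SafeWarcOutput("crawl.warc.gz")  # hypothetical output path

def graceful_shutdown(signum, frame):
    # finish writing before exiting instead of dying mid-record
    warc.close()
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)
signal.signal(signal.SIGINT, graceful_shutdown)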