webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

warc.gz files created by grab-site throw multiple errors when adding to a collection #878

Closed RomeSilvanus closed 11 months ago

RomeSilvanus commented 11 months ago

Describe the bug

Apologies if this is not an issue with pywb at all. I have a lot of warc.gz files made with grab-site. When I try to import them, the following errors pop up:

  1. WARNING: Record not followed by newline, perhaps Content-Length is invalid
  2. Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: 5e11
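
For reference, the second message says the parser expected a line beginning with a WARC version string and found `5e11` instead. Below is a minimal standard-library sketch of that kind of check, applied only to the first line of a file. It is not warcio's implementation, and `first_line_is_warc` is a hypothetical helper name; the version prefixes are copied from the message above.

```python
import gzip

# Version prefixes copied from the error message above.
WARC_PREFIXES = (b"WARC/1.1", b"WARC/1.0", b"WARC/0.17", b"WARC/0.18")


def first_line_is_warc(path):
    """Return True if the file's first line starts with a known WARC version.

    Hypothetical helper: it only inspects the first line, so it cannot
    detect a damaged record later in the file.
    """
    with open(path, "rb") as raw:
        magic = raw.read(2)
    # Gzip streams start with the bytes 0x1f 0x8b; anything else is read as-is.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as stream:
        first_line = stream.readline()
    return first_line.startswith(WARC_PREFIXES)
```

A file that fails this check is not a WARC at its very first record. A file that passes can still contain a broken record further in.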

I did a search in the Issues but nothing came up.

(Rather than opening a separate issue, I also want to suggest here that the docker image should have an ENV option to enable recording.)

Steps to reproduce the bug

Make a warc with grab-site.

Expected behavior

A better error message telling me whether this is actually bad and can lead to problems, or whether it is just a warning. Also, adding large files takes hours, so a progress indicator would be nice.

Environment

Additional context

Remainder: b'https://www.googletagmanager.com/gtag/gtm.js 1615152312 - - VZLOUV6FFUIVHTWDHTXZDT4TLUWTV4KN 1289 678505637 wikifur.com-2021-03-07-2f8602d4-00000.warc.gz <urn:uuid:f97d7a08-e20b-4547-ab0b-c280424232f6>\n'

    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 261612
    Remainder: b'https://www.googletagmanager.com/gtag/gtm.js 1615152313 - - VZLOUV6FFUIVHTWDHTXZDT4TLUWTV4KN 1289 678509424 wikifur.com-2021-03-07-2f8602d4-00000.warc.gz <urn:uuid:461e3e9f-a4d1-4c6a-86ef-3ef36c45068d>\n'
2023-12-12 20:57:45,258: [INFO]: Copied /data/Furry/wikifur.com/warc/wikifur.com-2021-02-06-6e3ef9af-meta.warc.gz to /webarchive/collections/archive/archive
2023-12-12 20:58:21,052: [INFO]: Copied /data/Furry/wikifur.com/warc/wikifur.com-2021-03-07-2f8602d4-00000.warc.gz to /webarchive/collections/archive/archive
2023-12-12 20:58:48,377: [INFO]: Copied /data/Furry/wikifur.com/warc/wikifur.com-2021-03-07-2f8602d4-meta.warc.gz to /webarchive/collections/archive/archive
2023-12-12 20:59:08,146: [INFO]: Copied /data/Furry/wikifur.com/warc/wikifur.com-2021-02-06-6e1cb86a-meta.warc.gz to /webarchive/collections/archive/archive
2023-12-12 20:59:42,724: [INFO]: Copied /data/Furry/wikifur.com/warc/wikifur.com-2021-02-06-6e1cb86a-00000.warc.gz to /webarchive/collections/archive/archive
2023-12-12 21:00:12,469: [INFO]: Copied /data/Furry/sinnerdragon.keenspace.com/warc/sinnerdragon.keenspace.com-2021-06-13-7bc59e6e-00000.warc.gz to /webarchive/collections/archive/archive
2023-12-12 21:00:29,630: [INFO]: Copied /data/Furry/sinnerdragon.keenspace.com/warc/sinnerdragon.keenspace.com-2021-06-13-7bc59e6e-meta.warc.gz to /webarchive/collections/archive/archive
2023-12-12 21:01:02,173: [INFO]: Copied /data/Furry/sinnerdragon.keenspace.com/warc/sinnerdragon.keenspace.com-2021-06-13-7bc59e6e.cdx to /webarchive/collections/archive/archive
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 23
    Remainder: b'http://sinnerdragon.keenspace.com/ 1623609354 - - 7SX2O6A5JWZKPMPSN5Z26MQMMPXJVAKN 1553 2839 sinnerdragon.keenspace.com-2021-06-13-7bc59e6e-00000.warc.gz <urn:uuid:37a84b1d-177e-4497-84e7-4aea7022816b>\n'
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 436
    Remainder: b'http://sinnerdragon.keenspace.com/sitemap.xml 1623609354 - - 5UJOZFP2XPLTZTZDX5WEEJF4WZBXZ6HE 1563 6264 sinnerdragon.keenspace.com-2021-06-13-7bc59e6e-00000.warc.gz <urn:uuid:3a494ccf-8511-4ac2-8672-7a8e9d54033d>\n'
2023-12-12 21:02:09,113: [INFO]: Copied /data/Furry/theyiffgallery.com/warc/theyiffgallery.com-2021-06-05-ecfffb33-00048.warc.gz to /webarchive/collections/archive/archive
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 96905049
    Remainder: b'idn\'t Happen Right Before</a><span class=menuInfoCat title="7 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [7] </span></li><li> <a href="index?/category/4603">Chilly Weather</a><span class=menuInfoCat title="4 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [4] </span></li><li class="liClosed"> <a href="index?/category/4601">Toriero</a><span class=menuInfoCatByChild title="10 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8 1 \xe4\xb8\xaa\xe5\xad\x90\xe7\x9b\xb8\xe5\x86\x8c\xe4\xb8\xad"> [10] </span><ul><li> <a href="index?/category/4602">Part II</a><span class=menuInfoCat title="10 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [10] </span></li></ul></li><li> <a href="index?/category/4600">Meeting Serenox</a><span class=menuInfoCat title="7 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [7] </span></li><li> <a href="index?/category/4599">What happens in the Changing room..</a><span class=menuInfoCat title="4 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [4] </span></li><li class="liClosed"> <a href="index?/category/4584">Silver Soul</a><span class=menuInfoCatByChild title="1125 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8 15 \xe4\xb8\xaa\xe5\xad\x90\xe7\x9b\xb8\xe5\x86\x8c\xe4\xb8\xad"> [1125] </span><ul><li> <a href="index?/category/6787">Volume XII</a><span class=menuInfoCat title="96 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [96] </span></li><li> <a href="index?/category/6541">Volume XI</a><span class=menuInfoCat title="94 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [94] </span></li><li> <a href="index?/category/6394">Volume X</a><span class=menuInfoCat title="107 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [107] 
</span></li><li> <a href="index?/category/6195">Volume IX</a><span class=menuInfoCat title="100 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [100] </span></li><li> <a href="index?/category/6094">Volume VIII Extra</a><span class=menuInfoCat title="8 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [8] </span></li><li> <a href="index?/category/5917">Volume VIII</a><span class=menuInfoCat title="97 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [97] </span></li><li class="liClosed"> <a href="index?/category/5723">Bonus</a><span class=menuInfoCatByChild title="9 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8 1 \xe4\xb8\xaa\xe5\xad\x90\xe7\x9b\xb8\xe5\x86\x8c\xe4\xb8\xad"> [9] </span><ul><li> <a href="index?/category/5724">Operation Executed</a><span class=menuInfoCat title="9 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [9] </span></li></ul></li><li> <a href="index?/category/5722">Volume VII</a><span class=menuInfoCat title="92 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [92] </span></li><li> <a href="index?/category/5562">Volume VI</a><span class=menuInfoCat title="100 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [100] </span></li><li> <a href="index?/category/5364">Volume V - Abyss</a><span class=menuInfoCat title="94 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [94] </span></li><li> <a href="index?/category/5178">Volume IV - Duality</a><span class=menuInfoCat title="99 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [99] </span></li><li> <a href="index?/category/4982">Volume III - Shadows</a><span class=menuInfoCat title="98 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [98] </span></li><li> <a 
href="index?/category/4710\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [15] </span></li><li> <a href="index?/category/5820">Chapter 1</a><span class=menuInfoCat title="34 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [34] </span></li></ul></li><li> <a href="index?/category/4844">Dungeon</a><span class=menuInfoCat title="5 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [5] </span></li><li> <a href="index?/category/4843">A Primal Desire</a><span class=menuInfoCat title="7 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [7] </span></li><li> <a href="index?/category/4842">Playing with mating spells</a><span class=menuInfoCat title="17 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [17] </span></li><li> <a href="index?/category/4832">Relax! Hot Springs Adventure</a><span class=menuInfoCat title="25 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [25] </span></li><li> <a href="index?/category/4817">The Silk Sash</a><span class=menuInfoCat title="7 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [7] </span></li><li> <a href="index?/category/4771">A day by the lake</a><span class=menuInfoCat title="11 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [11] </span></li><li class="liClosed"> <a href="index?/category/4767">Teenage Kicks</a><span class=menuInfoCatByChild title="27 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8 2 \xe4\xb8\xaa\xe5\xad\x90\xe7\x9b\xb8\xe5\x86\x8c\xe4\xb8\xad"> [27] </span><ul><li> <a href="index?/category/5791">Original</a><span class=menuInfoCat title="11 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [11] </span></li><li> <a href="index?/category/5790">Color</a><span class=menuInfoCat title="16 
\xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [16] </span></li></ul></li><li class="liClosed"> <a href="index?/category/4759">Zell Usagi</a><span class=menuInfoCatByChild title="24 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8 2 \xe4\xb8\xaa\xe5\xad\x90\xe7\x9b\xb8\xe5\x86\x8c\xe4\xb8\xad"> [24] </span><ul><li> <a href="index?/category/4761">With Text</a><span class=menuInfoCat title="12 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [12] </span></li><li> <a href="index?/category/4760">Plain</a><span class=menuInfoCat title="12 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [12] </span></li></ul></li><li> <a href="index?/category/4754">Lovers</a><span class=menuInfoCat title="16 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [16] </span></li><li class="liClosed"> <a href="index?/category/4747">Wishes</a><span class=menuInfoCatByChild title="55 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8 2 \xe4\xb8\xaa\xe5\xad\x90\xe7\x9b\xb8\xe5\x86\x8c\xe4\xb8\xad"> [55] </span><ul><li> <a href="index?/category/6547">Chapter 2</a><span class=menuInfoCat title="25 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [25] </span></li><li> <a href="index?/category/6546">Chapter 1</a><span class=menuInfoCat title="30 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [30] </span></li></ul></li><li> <a href="index?/category/4746">Culture Shock</a><span class=menuInfoCat title="11 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [11] </span></li><li> <a href="index?/category/4745">Sir Yes Sir!</a><span class=menuInfoCat title="3 \xe7\x9b\xb8\xe5\x86\x8c \xe5\x9c\xa8\xe6\xad\xa4\xe7\x9b\xb8\xe5\x86\x8c\xe9\x87\x8c"> [3] </span></li><li> <a href="index?/category/4744">Unexpect\r\n'
2023-12-12 21:02:15,641: [ERROR]: Error while indexing file theyiffgallery.com-2021-06-05-ecfffb33-00048.warc.gz, Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/warcio/recordloader.py", line 224, in _detect_type_load_headers
    rec_headers = self.warc_parser.parse(stream, statusline)
  File "/usr/local/lib/python3.8/site-packages/warcio/statusandheaders.py", line 270, in parse
    raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: 5e11

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pywb-2.7.4-py3.8.egg/pywb/indexer/cdxindexer.py", line 306, in write_multi_cdx_index
    for entry in entry_iter:
  File "/usr/local/lib/python3.8/site-packages/pywb-2.7.4-py3.8.egg/pywb/indexer/archiveindexer.py", line 342, in __call__
    for entry in entry_iter:
  File "/usr/local/lib/python3.8/site-packages/pywb-2.7.4-py3.8.egg/pywb/indexer/archiveindexer.py", line 215, in join_request_records
    for entry in entry_iter:
  File "/usr/local/lib/python3.8/site-packages/pywb-2.7.4-py3.8.egg/pywb/indexer/archiveindexer.py", line 148, in create_record_iter
    for record in raw_iter:
  File "/usr/local/lib/python3.8/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/usr/local/lib/python3.8/site-packages/warcio/archiveiterator.py", line 257, in _next_record
    record = self.loader.parse_record_stream(self.reader,
  File "/usr/local/lib/python3.8/site-packages/warcio/recordloader.py", line 85, in parse_record_stream
    (the_format, rec_headers) = (self.
  File "/usr/local/lib/python3.8/site-packages/warcio/recordloader.py", line 229, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: 5e11

anjackson commented 11 months ago

Judging by this line...

2023-12-12 21:01:02,173: [INFO]: Copied /data/Furry/sinnerdragon.keenspace.com/warc/sinnerdragon.keenspace.com-2021-06-13-7bc59e6e.cdx to /webarchive/collections/archive/archive

...it looks like you're adding .cdx files to the collection, and pywb is trying to interpret them as WARC files. You should only add the .warc.gz files, as pywb generates a new .cdxj file in a different folder.
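
One way to avoid picking up the .cdx files is to select files by suffix before adding them. The following is a minimal sketch, assuming the WARCs sit somewhere under a single directory tree; `find_warcs` and `WARC_SUFFIXES` are hypothetical names, not part of pywb.

```python
from pathlib import Path

# Suffixes treated as WARC files; .cdx and everything else is skipped.
WARC_SUFFIXES = (".warc", ".warc.gz")


def find_warcs(root):
    """Return sorted paths under root whose names end in .warc or .warc.gz."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.endswith(WARC_SUFFIXES)
    )
```

Each returned path could then be passed to `wb-manager add <collection> <path>`, so that only WARC files reach the collection.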

RomeSilvanus commented 11 months ago

Alright, I really didn't see this. My fault; I've made some changes.