webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node
MIT License

Issue with parsing #18

Closed guerra321 closed 3 years ago

guerra321 commented 3 years ago

Hello there,

There seems to be an issue with the parser in both warcio.js and warcio when it comes to files produced by node-warc. Even though the Content-Length is invalid, the warning is not printed. This is not an issue in replayweb.page, since that renders the page correctly, so I'm thinking it's an issue with this parser (and obviously with node-warc as well, but that's unrelated). Just letting you know.

ikreymer commented 3 years ago

Do you have an example of a WARC file that you could share that reproduces this issue?

guerra321 commented 3 years ago

Yes I do. Where do I upload the file?

ikreymer commented 3 years ago

You should be able to attach it to this issue, or you could also email it to info [at] webrecorder.net. How big is the file?

guerra321 commented 3 years ago

Here is the file.

x4W45Vf.warc.zip

ikreymer commented 3 years ago

Thanks, I've tested the WARC and am getting plenty of warnings about the content-length:

...
{"offset":60741,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/img/home/dere.png"}
Content-Length Too Small: Record not followed by newline, Remainder Length: 210, Offset: 108325
{"offset":108559}
Content-Length Too Small: Record not followed by newline, Remainder Length: 10, Offset: 129810
{"offset":129895}
Content-Length Too Small: Record not followed by newline, Remainder Length: 150, Offset: 143440
{"offset":143673}
Content-Length Too Small: Record not followed by newline, Remainder Length: 8, Offset: 145145
{"offset":145155,"warc-target-uri":"https://yanderesimulator.com/img/home/yan.png"}
{"offset":145790,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/img/home/yan.png"}
Content-Length Too Small: Record not followed by newline, Remainder Length: 230, Offset: 211756
{"offset":212057}
Content-Length Too Small: Record not followed by newline, Remainder Length: 129, Offset: 230532
{"offset":230681}
...

warcio.js (and Python warcio) will attempt to parse the records where possible, even if the Content-Length is too small. replayweb.page will attempt to replay the resources as well. I think this is working as expected?
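
To illustrate the kind of check behind the "Record not followed by newline" warning, here is a minimal Python sketch. It is not the warcio or warcio.js implementation. It only assumes that a WARC record's content block is followed by two CRLF pairs, and the function name `check_record_end` and the recovery strategy (scanning forward for the next `WARC/` version line) are hypothetical simplifications.

```python
CRLF2 = b"\r\n\r\n"


def check_record_end(data: bytes, block_start: int, content_length: int):
    """Return (next_offset, remainder) for one record.

    remainder is 0 when the declared Content-Length is followed directly
    by the record trailer. Otherwise it is the number of extra bytes
    skipped before the trailer of the next record boundary was found.
    This is an illustrative simplification, not warcio's algorithm.
    """
    block_end = block_start + content_length
    if data[block_end:block_end + len(CRLF2)] == CRLF2:
        return block_end + len(CRLF2), 0

    # Tolerant path: the declared length was too small, so look for the
    # trailer that precedes the next record's version line.
    boundary = data.find(CRLF2 + b"WARC/", block_end)
    if boundary == -1:
        # No further record: treat the rest of the input as remainder.
        return len(data), len(data) - block_end
    return boundary + len(CRLF2), boundary - block_end


# Hypothetical record whose header declares 5 bytes but carries 11.
data = b"WARC/1.0\r\nContent-Length: 5\r\n\r\nhello world\r\n\r\nWARC/1.0\r\n"
block_start = data.index(CRLF2) + len(CRLF2)
next_offset, remainder = check_record_end(data, block_start, 5)
if remainder:
    print(f"Content-Length too small, remainder: {remainder}, next record at {next_offset}")
```

A tolerant reader built this way keeps going after a bad length and reports the mismatch, which matches the behaviour described above: records are still parsed and the warning is emitted alongside.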

guerra321 commented 3 years ago

I forgot to mention that while I did get the content-length warnings on the other resources, the very first response record (the HTML response) didn't display any content-length warning for me. Does it detect that there was a content-length error only after the record has been parsed?

ikreymer commented 3 years ago

The errors are detected as the WARC is parsed. The errors are printed to stderr, while the index is printed to stdout. If both are redirected to the same file, I think the errors are printed as expected. With the WARC file you sent, I ran:

warcio.js index ./x4W45Vf.warc &> ./output

The output file then contains the following, including the error on the first response record:

{"offset":0,"warc-type":"warcinfo"}
{"offset":344,"warc-type":"request","warc-target-uri":"https://yanderesimulator.com/"}
{"offset":947,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/"}
Content-Length Too Small: Record not followed by newline, Remainder Length: 3, Offset: 10337
{"offset":10344,"warc-type":"request","warc-target-uri":"https://yanderesimulator.com/dist/css/style.min.css"}
{"offset":11001,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/dist/css/style.min.css"}
{"offset":44297,"warc-type":"request","warc-target-uri":"https://yanderesimulator.com/dist/js/main.min.js"}
{"offset":44948,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/dist/js/main.min.js"}
{"offset":49298,"warc-type":"request","warc-target-uri":"https://yanderesimulator.com/dist/images/logo.svg"}
{"offset":49951,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/dist/images/logo.svg"}
{"offset":60094,"warc-type":"request","warc-target-uri":"https://yanderesimulator.com/img/home/dere.png"}
{"offset":60741,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/img/home/dere.png"}
Content-Length Too Small: Record not followed by newline, Remainder Length: 210, Offset: 108325
{"offset":108559}
...
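
Since the index goes to stdout and the warnings go to stderr, a combined file like the one above interleaves two kinds of lines. The following Python sketch separates them again. It assumes only the line formats visible in the output above, and `split_combined_output` is a hypothetical helper, not part of warcio.js or warcio.

```python
import json
import re

# Matches the warning format shown in the output above.
WARNING_RE = re.compile(
    r"^Content-Length Too Small: Record not followed by newline, "
    r"Remainder Length: (\d+), Offset: (\d+)$"
)


def split_combined_output(text: str):
    """Split combined stdout/stderr text into index entries and warnings."""
    index_entries = []
    warnings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "...":
            continue  # skip blanks and elided sections
        if line.startswith("{"):
            index_entries.append(json.loads(line))  # index line from stdout
            continue
        match = WARNING_RE.match(line)
        if match:
            warnings.append(
                {"remainder": int(match.group(1)), "offset": int(match.group(2))}
            )
    return index_entries, warnings


# Three lines copied from the output above.
SAMPLE = """
{"offset":947,"warc-type":"response","warc-target-uri":"https://yanderesimulator.com/"}
Content-Length Too Small: Record not followed by newline, Remainder Length: 3, Offset: 10337
{"offset":10344,"warc-type":"request","warc-target-uri":"https://yanderesimulator.com/dist/css/style.min.css"}
"""

entries, warnings = split_combined_output(SAMPLE)
print(len(entries), "index entries,", len(warnings), "warnings")
```

If only stdout is captured or displayed, the warning lines never reach the file, which would explain seeing index entries without the warning on the first response record.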

I suppose there could be an option to terminate on the first error, though I'm not sure that's as useful as finishing and printing the errors. Python warcio also detects this error.

If there aren't any other issues, I think we can close this one.

guerra321 commented 3 years ago

Oh, OK. I think I get it now. Sorry about this, I misunderstood how the software worked. Thanks.