webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node
MIT License

Invalid Warc Files #22

Open jlarmstrongiv opened 3 years ago

jlarmstrongiv commented 3 years ago

Originated from https://github.com/webrecorder/warcio.js/issues/21#issuecomment-816835171

Files (links expire in 7 days):

Validators:

App:

ikreymer commented 3 years ago

The booya.warc file has a resource record with no Content-Type, which is breaking the warcat validation. ReplayWeb.page and other tools are more lenient. Did you mean to use a resource record here instead of a response? If so, it should set a Content-Type header.

WARC/1.0
WARC-Target-URI: https://swapi.dev/api/planets/1
WARC-Date: 2021-04-09T17:15:34Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:0751dafe-046a-4bea-a519-7b6c184a4de7>
WARC-Payload-Digest: sha-256:ae44afda086df85dfef397de89f1e108aa6eb0d5d1739777749a178cce3f02dd
WARC-Block-Digest: sha-256:ae44afda086df85dfef397de89f1e108aa6eb0d5d1739777749a178cce3f02dd
Content-Length: 821

{"name":"Tatooine","rotation_period":"23","orbital_period":"304","diameter":"10465","climate":"arid","gravity":"1 standard","terrain":"desert","surface_water":"1","population":"200000","residents":["http://swapi.dev/api/people/1/","http://swapi.dev/api/people/2/","http://swapi.dev/api/people/4/","http://swapi.dev/api/people/6/","http://swapi.dev/api/people/7/","http://swapi.dev/api/people/8/","http://swapi.dev/api/people/9/","http://swapi.dev/api/people/11/","http://swapi.dev/api/people/43/","http://swapi.dev/api/people/62/"],"films":["http://swapi.dev/api/films/1/","http://swapi.dev/api/films/3/","http://swapi.dev/api/films/4/","http://swapi.dev/api/films/5/","http://swapi.dev/api/films/6/"],"created":"2014-12-09T13:50:49.641000Z","edited":"2014-12-20T20:58:18.411000Z","url":"http://swapi.dev/api/planets/1/"}
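A check for this kind of record can be sketched in a few lines of plain JavaScript. This is only an illustration, not part of warcio.js or warcat: `parseWarcHeaderBlock` and `isMissingContentType` are hypothetical names, and the parser assumes an uncompressed record with no folded (continuation) header lines.

```javascript
// Hypothetical helper: parse the header block of one WARC record (the lines
// before the first blank line) into a version string and a map of fields.
function parseWarcHeaderBlock(recordText) {
  const [headerBlock] = recordText.split(/\r?\n\r?\n/);
  const [versionLine, ...fieldLines] = headerBlock.split(/\r?\n/);
  const fields = new Map();
  for (const line of fieldLines) {
    const idx = line.indexOf(":");
    if (idx > 0) {
      // Field names are matched case-insensitively, so store them lower-cased.
      fields.set(line.slice(0, idx).trim().toLowerCase(), line.slice(idx + 1).trim());
    }
  }
  return { version: versionLine.trim(), fields };
}

// Hypothetical check: true when the record declares a non-empty block
// but carries no Content-Type field.
function isMissingContentType(recordText) {
  const { fields } = parseWarcHeaderBlock(recordText);
  const length = Number(fields.get("content-length") || 0);
  return length > 0 && !fields.has("content-type");
}

// Abbreviated version of the record above (payload shortened).
const record = [
  "WARC/1.0",
  "WARC-Target-URI: https://swapi.dev/api/planets/1",
  "WARC-Type: resource",
  "Content-Length: 821",
  "",
  '{"name":"Tatooine"}',
].join("\r\n");

console.log(isMissingContentType(record));
```

Running this against each record in a file would flag the ones a strict validator is likely to reject for the same reason.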

warcio.js probably should just default to application/octet-stream though, as per: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#content-type
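Until such a default exists in the library, a caller could apply the fallback before building the record. The sketch below assumes the WARC headers are passed around as a plain object; `withDefaultContentType` is a hypothetical helper, not a warcio.js function.

```javascript
// Hypothetical helper, not part of warcio.js: returns a copy of the WARC
// headers with a Content-Type added when none is present.
function withDefaultContentType(warcHeaders, fallback = "application/octet-stream") {
  // Header names are compared case-insensitively.
  const hasContentType = Object.keys(warcHeaders).some(
    (name) => name.toLowerCase() === "content-type"
  );
  return hasContentType
    ? { ...warcHeaders }
    : { ...warcHeaders, "Content-Type": fallback };
}

const headers = withDefaultContentType({
  "WARC-Type": "resource",
  "WARC-Target-URI": "https://swapi.dev/api/planets/1",
});

console.log(headers["Content-Type"]);
```

An existing Content-Type is left untouched, so a more specific type such as application/json still wins when the caller knows it.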

warctools is very old and does not support WARC 1.1. Your first warcinfo record is WARC 1.1, while the rest are 1.0; changing that record to 1.0 will actually make the file pass.
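A quick way to spot mixed versions like this is to count the version lines in the file. The following is a rough sketch for an uncompressed WARC already read into a string; `countWarcVersions` is a hypothetical name, and the scan is naive in that a payload line consisting only of `WARC/x.y` would also be counted.

```javascript
// Hypothetical helper: count how many records declare each WARC version.
function countWarcVersions(warcText) {
  const counts = {};
  // Match lines that consist only of a version marker, e.g. "WARC/1.0".
  for (const match of warcText.matchAll(/^WARC\/(\d+\.\d+)\r?$/gm)) {
    counts[match[1]] = (counts[match[1]] || 0) + 1;
  }
  return counts;
}

// Two empty records: a WARC 1.1 warcinfo followed by a WARC 1.0 resource.
const sample = [
  "WARC/1.1", "WARC-Type: warcinfo", "Content-Length: 0", "", "",
  "WARC/1.0", "WARC-Type: resource", "Content-Length: 0", "", "",
].join("\r\n");

const versions = countWarcVersions(sample);
if (Object.keys(versions).length > 1) {
  console.log("Mixed WARC versions found:", versions);
}
```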

I realize additional examples will make it easier to use; I will try to add them when I have a chance!

jlarmstrongiv commented 3 years ago

Thanks so much for looking into this!

In my initial tests, opening the file with The Unarchiver still failed, but I’ll be able to try more combinations late tonight or tomorrow. Are there any other items or sample files I could check?

Yes, I was working on building flexible methods for saving resources related to the page that aren’t requests or responses. For compatibility, I can also try saving as a response type and see if that works.

More examples are always welcome. Most of what I have learned so far came from the test cases and the readme.

jlarmstrongiv commented 3 years ago

I made the changes to calculate the warcHeaders { "Content-Type": "mime/type" } on each of my resources. I also tried removing my resources, but both The Unarchiver and jwattools still choked. How would you rate jwattools, @ikreymer? I’m not really sure what else to check or what’s different from the working node-warc version.

Demo file: https://share.fromtheexchange.space/file/space-fromtheexchange-share/booya-no-resources.warc