webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Undocumented and non-standardised default Content-Type application/warc-record #92

Open JustAnotherArchivist opened 5 years ago

JustAnotherArchivist commented 5 years ago

warcio uses a default Content-Type value for WARC records of application/warc-record. This MIME type is not documented or specified anywhere; the WARC spec only mentions application/warc as the MIME type for WARC files and application/warc-fields for warcinfo and metadata records (though it is ambiguous on whether that is required or recommended).

ikreymer commented 4 years ago

I'm not sure what a better option would be here. It is a fallback used when no other Content-Type is specified and/or the record is non-standard. The application/warc-fields type is for warcinfo-style fields, which this is not, and application/warc makes sense as the Content-Type of the WARC file itself, but not for the payload of a record. I suppose it could be application/octet-stream, but that would imply that it's binary.

JustAnotherArchivist commented 4 years ago

The Content-Type header is optional, so omitting it would be one option. application/octet-stream also seems sensible to me. WARC is a byte-oriented file format, so any payload must also be a collection of bytes. While the underlying data could be bit-based, it must be padded to bytes, which makes the container an octet-stream again. The WARC specification also mentions:

If the media type remains unknown, the reader should treat it as type “application/octet-stream”.

Personally, I think omitting the header would be the best option.
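
To illustrate the reader-side rule quoted from the specification, here is a minimal sketch, assuming the record's named fields have already been parsed into a plain mapping. The helper name `effective_content_type` and the constant `DEFAULT_MEDIA_TYPE` are hypothetical and are not part of warcio's API. It shows that a writer could omit the header entirely while a reader still arrives at application/octet-stream.

```python
from typing import Mapping

# Fallback named in the WARC specification for records of unknown media type.
DEFAULT_MEDIA_TYPE = "application/octet-stream"


def effective_content_type(headers: Mapping[str, str]) -> str:
    """Return the record's media type, falling back when it is absent or empty.

    `headers` is assumed to be an already-parsed mapping of WARC named fields;
    this is an illustrative helper, not a warcio function.
    """
    # WARC field names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "content-type" and value.strip():
            return value.strip()
    return DEFAULT_MEDIA_TYPE


# A record written without a Content-Type header is read as octet-stream.
print(effective_content_type({"WARC-Type": "resource"}))
# An explicit Content-Type is returned unchanged.
print(effective_content_type({"Content-Type": "text/html"}))
```

With this kind of fallback on the reading side, omitting the header and writing application/octet-stream explicitly lead to the same interpretation of the payload.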