Open wumpus opened 6 years ago
I suppose a check for trailing CR/LF in headers could be added to prevent accidentally invalid headers. Then again, this hasn't been something that's happened, and it's generally understood that headers should not have newlines in them.
The new record_http semantics make it easier to create WARC records from network traffic automatically.
Did you have something in mind specifically? My only thought is checking warc headers for trailing newlines perhaps.
warc headers is one, http_headers is another to check. Since these are multi-line things, it's easy to imagine a programmer screwing up and putting \r\n on the end. I think both of those would cause problems. I suppose I could write a test that demonstrates that...
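For illustration, here is a minimal sketch of the kind of check being discussed. `find_newline_headers` is a hypothetical helper, not part of warcio, and it assumes headers are passed as a list of `(name, value)` string tuples.

```python
def find_newline_headers(headers):
    """Return the (name, value) pairs whose name or value contains CR or LF.

    Hypothetical helper for illustration only; not a warcio API.
    """
    bad = []
    for name, value in headers:
        # A bare CR or LF anywhere in a header breaks the line-based framing.
        if any(ch in name or ch in value for ch in ("\r", "\n")):
            bad.append((name, value))
    return bad


headers = [
    ("WARC-Type", "response"),
    ("WARC-Target-URI", "http://example.com/\r\n"),  # accidental trailing CRLF
]
print(find_newline_headers(headers))
```

A writer could raise an error, or strip and warn, when the returned list is non-empty. Which of those is preferable is a design choice this sketch does not settle.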
I was recently bitten by this problem in multiple ways:
Both are stripped at some point when juggling records, which changes the digests that were computed before the stripping and makes verification fail afterwards.
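To make the failure mode concrete, here is a small sketch of why stripping breaks verification. It assumes a SHA-1 digest rendered as `sha1:` plus base32, which is a common form for WARC digest headers. The `sha1_digest` helper and the sample bytes are made up for illustration.

```python
import base64
import hashlib


def sha1_digest(payload: bytes) -> str:
    """Format a SHA-1 digest as 'sha1:<base32>' (assumed digest style)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


original = b"hello world\r\n"           # bytes as first recorded
stripped = original.rstrip(b"\r\n")     # bytes after a later step strips the CRLF

recorded = sha1_digest(original)        # digest stored when the record was written
recomputed = sha1_digest(stripped)      # digest a verifier would compute afterwards

print(recorded == recomputed)           # the two digests no longer match
```

Any step that changes even one byte between computing the digest and writing the record will show up this way on verification.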
I think those are separate bugs from what we were just discussing, but bugs nonetheless -- can you provide a WARC with records having these 'features' and a test that round-trips it and complains that it's changed?
Given this whitespace-related header bug that crept into the August 2018 Common Crawl crawl, it would be nice if it was somewhat difficult to create broken WARC files using warcio.
I see a couple of possible issues: