Open wumpus opened 5 years ago
Work in progress; now a pull request: https://github.com/webrecorder/warcio/pull/66
$ warcio test test/data/*.warc.gz test/data/*.warc
test/data/example-bad-non-chunked.warc.gz
saw exception
ERROR: non-chunked gzip file detected, gzip block continues
beyond single record.
This file is probably not a multi-member gzip but a single gzip file.
To allow seek, a gzipped WARC must have each record compressed into
a single gzip member and concatenated together.
This file is likely still valid and can be fixed by running:
warcio recompress <path/to/file> <path/to/new_file>
skipping rest of file
test/data/example-resource.warc.gz
WARC-Record-ID <urn:uuid:6e7f60da-2c7b-11e7-aaf7-0242ac120007>
WARC-Type resource
digest pass
comment: unknown field, no validation performed Warc-Referer https://webrecorder.io/temp-GRWZVUTV/temp/test/record/http://example.com/
comment: unknown field, no validation performed Warc-User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
test/data/example.warc.gz
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
WARC-Type revisit
digest present but not checked
recommendation: missing recommended header WARC-Refers-To
comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
test/data/example-wget-bad-target-uri.warc.gz
WARC-Record-ID <urn:uuid:CEF11DC9-8D86-4F4B-9B8C-2235515B4537>
WARC-Type request
digest pass
error: uri must not be within <> warc-target-uri <http://example.com/>
error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
WARC-Record-ID <urn:uuid:FD8A6D04-AF8A-4A36-A889-8094487CDF2D>
WARC-Type response
payload digest failed sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
error: uri must not be within <> warc-target-uri <http://example.com/>
error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
WARC-Record-ID <urn:uuid:E5AC383F-F107-47BC-99B7-176FD8DE6E94>
WARC-Type metadata
digest pass
error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
WARC-Record-ID <urn:uuid:543BCA4F-A305-4383-B511-0BCF23F7AD8D>
WARC-Type resource
digest pass
error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
WARC-Record-ID <urn:uuid:CCD67DB5-13FA-447B-BF05-BF1BDC8ED3A0>
WARC-Type resource
digest pass
error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
test/data/example-wrong-chunks.warc.gz
saw exception Invalid WARC record, first line: <!doctype html>
skipping rest of file
test/data/post-test.warc.gz
WARC-Record-ID <urn:uuid:59a6b068-cbc2-4767-9525-33043d2709c7>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:5eb8ee92-cda1-4503-a7a3-c63f1ab6515b>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:c79a62e3-5a4b-450d-a093-3a7fefa09664>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
test/data/example-digest-bad.warc
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
payload digest failed: sha1:1112H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
test/data/example-iana.org-chunked.warc
WARC-Record-ID <urn:uuid:c46fbf5f-0876-4652-a348-e9b6c322eabb>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
test/data/example-trunc.warc
WARC-Record-ID <urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>
WARC-Type response
block digest failed: sha1:DR5MBP7OD3OPA7RFKWJUD4CTNUQUGFC5
payload digest failed sha1:G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK
WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 2560
Remainder: b'\x00\x00\r\n'
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
test/data/example.warc
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
WARC-Type revisit
digest present but not checked
recommendation: missing recommended header WARC-Refers-To
comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
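For anyone puzzled by the first error in the output above: a .warc.gz is expected to be a concatenation of gzip members, one per record, so that a reader can seek straight to a record's offset. Here is a rough standard-library sketch of the difference. `count_gzip_members` is a hypothetical helper written for this illustration; it is not part of warcio, and it ignores edge cases such as trailing padding.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members in a byte string (hypothetical helper)."""
    count = 0
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated gzip member")
        count += 1
        # Whatever follows the finished member is the start of the next one.
        data = decompressor.unused_data
    return count


records = [b"record one\r\n\r\n", b"record two\r\n\r\n"]

# One member per record, concatenated: each record can be located by offset.
per_record = b"".join(gzip.compress(r) for r in records)

# One member around everything: the case the error message complains about.
single = gzip.compress(b"".join(records))

print(count_gzip_members(per_record), count_gzip_members(single))
```

Both byte strings decompress to the same content, which is why the message says the file is likely still valid and only needs recompressing.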
I built a thing that tests a WARC for standards conformance. The CLI is similar to "warcio check". It's 440 lines of code so far, likely to be around 1,000 when done.
It will need an extended period of testing and tweaking while it's run against everything in the ecosystem that generates WARCs. Discussion might be ... vigorous. I'm currently labeling things as "not standard conforming", "following/not following recommendations", and "comments". Hopefully not too many hairs will be split.
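To make the three labels concrete, here is a minimal sketch of how findings could be sorted into them. It is not the code in the pull request: `Level`, `KNOWN_FIELDS`, and `check_headers` are hypothetical names, the rules simply mirror a few messages from the output above, and the field list is written from memory of the WARC 1.1 named fields, so it should be checked against the spec.

```python
from enum import Enum


class Level(Enum):
    ERROR = "error"                    # not standard conforming
    RECOMMENDATION = "recommendation"  # not following a recommendation
    COMMENT = "comment"                # informational only


# Assumption: named fields as recalled from WARC 1.1; verify before relying on it.
KNOWN_FIELDS = {name.lower() for name in (
    "WARC-Type", "WARC-Record-ID", "WARC-Date", "Content-Length",
    "Content-Type", "WARC-Concurrent-To", "WARC-Block-Digest",
    "WARC-Payload-Digest", "WARC-IP-Address", "WARC-Refers-To",
    "WARC-Refers-To-Target-URI", "WARC-Refers-To-Date", "WARC-Target-URI",
    "WARC-Truncated", "WARC-Warcinfo-ID", "WARC-Filename", "WARC-Profile",
    "WARC-Identified-Payload-Type", "WARC-Segment-Number",
    "WARC-Segment-Origin-ID", "WARC-Segment-Total-Length",
)}


def check_headers(headers):
    """Return a list of (Level, message) findings for one record's headers."""
    h = {name.lower(): value for name, value in headers.items()}
    findings = []
    rec_type = h.get("warc-type")
    uri = h.get("warc-target-uri", "")

    if uri.startswith("<") and uri.endswith(">"):
        findings.append((Level.ERROR, "uri must not be within <>"))

    is_http = uri.lower().startswith(("http://", "https://"))
    if rec_type == "request" and is_http and "warc-ip-address" not in h:
        findings.append((Level.ERROR,
                         "WARC-IP-Address should be used for http and https requests"))

    if rec_type == "revisit" and "warc-refers-to" not in h:
        findings.append((Level.RECOMMENDATION,
                         "missing recommended header WARC-Refers-To"))

    for name in headers:
        if name.lower() not in KNOWN_FIELDS:
            findings.append((Level.COMMENT,
                             "unknown field, no validation performed: " + name))
    return findings


example = {
    "WARC-Type": "request",
    "WARC-Target-URI": "http://example.com/",
    "Warc-Referer": "http://example.com/",
}
for level, message in check_headers(example):
    print(f"{level.value}: {message}")
```

Keeping the level separate from the message would let a caller decide which levels should fail a test run, which is where I expect most of the hair-splitting to happen.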
Does this belong in warcio? My hope is that it will be commonly used; with luck, that means the entire web archiving ecosystem will keep warcio installed and make it part of their testing processes.