webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Warc tester #59

Open wumpus opened 5 years ago

wumpus commented 5 years ago

I built a thing that tests a WARC for standards conformance. The CLI is similar to "warcio check". It's 440 lines of code so far, and likely to be around 1,000 when done.

It will need an extended period of testing and tweaking as it's run against everything in the ecosystem that generates WARCs. Discussion might be ... vigorous. I'm currently labeling things as "not standard conforming", "following/not following recommendations", and "comments". Hopefully not too many hairs will be split.
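
For illustration only, here is a minimal sketch of how those three labels might be modeled. It is not the code in the tester: the names `Level`, `Finding`, `KNOWN_FIELDS`, and `check_headers` are hypothetical, headers are assumed to arrive as a plain dict, and the three rules are just examples borrowed from messages that appear in the output later in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    ERROR = "error"                    # not standard conforming
    RECOMMENDATION = "recommendation"  # not following a recommendation
    COMMENT = "comment"                # informational only


@dataclass
class Finding:
    level: Level
    message: str
    detail: str = ""


# Hypothetical, deliberately tiny subset of known header names (lowercased).
KNOWN_FIELDS = {"warc-type", "warc-record-id", "warc-date", "content-length",
                "warc-target-uri", "warc-refers-to"}


def check_headers(headers):
    """Return a list of Findings for one record's WARC headers (a plain dict)."""
    findings = []
    lowered = {name.lower(): value for name, value in headers.items()}

    # Example of a "not standard conforming" rule.
    uri = lowered.get("warc-target-uri", "")
    if uri.startswith("<") and uri.endswith(">"):
        findings.append(Finding(Level.ERROR, "uri must not be within <>", uri))

    # Example of a "not following recommendations" rule.
    if lowered.get("warc-type") == "revisit" and "warc-refers-to" not in lowered:
        findings.append(Finding(Level.RECOMMENDATION,
                                "missing recommended header", "WARC-Refers-To"))

    # Example of a "comment": fields this sketch does not know how to validate.
    for name in headers:
        if name.lower() not in KNOWN_FIELDS:
            findings.append(Finding(Level.COMMENT,
                                    "unknown field, no validation performed", name))
    return findings
```

Keeping the level separate from the message means a caller could choose to fail a test run only on `Level.ERROR` while still printing recommendations and comments.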

Does this belong in warcio? My hope is that it will be commonly used; with luck that means the entire web archiving ecosystem will keep warcio installed and make it part of their testing processes.

wumpus commented 5 years ago

Work in progress -- now a pull request: https://github.com/webrecorder/warcio/pull/66

$ warcio test test/data/*.warc.gz test/data/*.warc
test/data/example-bad-non-chunked.warc.gz
  saw exception 
    ERROR: non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-member gzip but a single gzip file.

    To allow seek, a gzipped WARC must have each record compressed into
    a single gzip member and concatenated together.

    This file is likely still valid and can be fixed by running:

    warcio recompress <path/to/file> <path/to/new_file>
  skipping rest of file
test/data/example-resource.warc.gz
  WARC-Record-ID <urn:uuid:6e7f60da-2c7b-11e7-aaf7-0242ac120007>
    WARC-Type resource
    digest pass
    comment: unknown field, no validation performed Warc-Referer https://webrecorder.io/temp-GRWZVUTV/temp/test/record/http://example.com/
    comment: unknown field, no validation performed Warc-User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
test/data/example.warc.gz
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example-wget-bad-target-uri.warc.gz
  WARC-Record-ID <urn:uuid:CEF11DC9-8D86-4F4B-9B8C-2235515B4537>
    WARC-Type request
    digest pass
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:FD8A6D04-AF8A-4A36-A889-8094487CDF2D>
    WARC-Type response
    payload digest failed sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:E5AC383F-F107-47BC-99B7-176FD8DE6E94>
    WARC-Type metadata
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
  WARC-Record-ID <urn:uuid:543BCA4F-A305-4383-B511-0BCF23F7AD8D>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
  WARC-Record-ID <urn:uuid:CCD67DB5-13FA-447B-BF05-BF1BDC8ED3A0>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
test/data/example-wrong-chunks.warc.gz
  saw exception Invalid WARC record, first line: <!doctype html>
  skipping rest of file
test/data/post-test.warc.gz
  WARC-Record-ID <urn:uuid:59a6b068-cbc2-4767-9525-33043d2709c7>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:5eb8ee92-cda1-4503-a7a3-c63f1ab6515b>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:c79a62e3-5a4b-450d-a093-3a7fefa09664>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-digest-bad.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    payload digest failed: sha1:1112H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-iana.org-chunked.warc
  WARC-Record-ID <urn:uuid:c46fbf5f-0876-4652-a348-e9b6c322eabb>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-trunc.warc
  WARC-Record-ID <urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>
    WARC-Type response
    block digest failed: sha1:DR5MBP7OD3OPA7RFKWJUD4CTNUQUGFC5
    payload digest failed sha1:G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 2560
    Remainder: b'\x00\x00\r\n'
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
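
On the first error in the output above (the non-chunked gzip file): the difference between "one gzip member per record" and "one gzip stream for the whole file" can be shown with the standard library alone. This is a sketch under assumptions, not how warcio detects the problem: `count_gzip_members` is a hypothetical helper, the records are placeholder bytes rather than real WARC records, and counting members does not by itself prove that member boundaries line up with record boundaries.

```python
import gzip
import zlib


def count_gzip_members(data):
    """Count gzip members in a byte string by decompressing one member at a time."""
    members = 0
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)
        d.decompress(data)
        if not d.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one, if any.
        data = d.unused_data
    return members


# Placeholder payloads standing in for three serialized WARC records.
records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]

# Seekable layout: each record is its own gzip member, concatenated together.
per_record = b"".join(gzip.compress(r) for r in records)

# Non-seekable layout: all records compressed as a single gzip stream.
whole_file = gzip.compress(b"".join(records))

print(count_gzip_members(per_record))  # 3
print(count_gzip_members(whole_file))  # 1
```

With the per-record layout, a reader can seek to the byte offset of any member and start decompressing there; with a single stream it has to decompress from the beginning.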