JustAnotherArchivist opened 5 years ago
Is this a bug in the standard instead of a bug in warcio and many other tools? Can you point out any common tool that actually obeys the standard?
There are a couple of examples where the standard was changed to match the tools, this might be a new one.
I see from your links that this has been discussed by the WARC standards folks, but it sure doesn't seem like the subset of tools that preserve transfer encoding have adopted that digest algorithm.
One thing this does inspire me to do is to add some code to `warcio check` and `warcio test` to check for and comment on this situation.
From the discussion I linked, it seems like the intention was as written, not as implemented. So in that sense, I'd say it's not a bug in the standard.
I haven't investigated every tool; I just did some validation of WARCs created with wpull and warcio, both of which had this issue. From a glance over the relevant code, it seems like wget does remove chunking for the payload digest. However, it also looks like it doesn't handle compression correctly (with which wget has had issues before). Due to the lack of a suitable testing setup, I haven't been able to test that yet though.
Well, if you'd like to look further, the develop branch of warcio supports `warcio check`, which can be used to also check digests -- albeit with warcio's current opinion that digests are computed on the bytes without removing transfer encoding. You can also install the branch from my `warcio test` pullreq (which is a work in progress) and give `warcio test` a spin!
It's my hope that "warcio test" will lead the community to interact about all of these conflicts between the standard and the tools people actually use.
Thanks for flagging this! Hm, I wonder if any tools actually do what the standard suggests in this regard? I think the content-encoding would also need to be decoded.
The change in warcio should be simple, I think just:
```diff
- for buf in self._iter_stream(record.raw_stream):
+ for buf in self._iter_stream(record.content_stream()):
```
Of course, this would invalidate previous payload digests, so I'm not sure it's the right thing to do, especially if other tools also behave the same way now...
Also, I wonder what the implications of this are... presumably the problem would be for deduplication: if the same content is served with different transfer-encodings and/or content-encodings, it would unnecessarily be fetched again? Or if chunks of different sizes are used for the same content? I wonder how likely/prevalent this is in practice.
I ask this because making such a change will make the payload digests mismatch and affect deduplication with any existing records that have transfer- and content-encoding, while keeping it as is might affect some deduplication.
These issues are exactly why I suspect the standard is what's going to change. Those "bad" digests are all over petabytes of response and revisit records, and the cdx indices ...
@wumpus Yes, I want to play around with this further. I need to set up a test HTTP server that can produce all the different relevant combinations (unchunked+uncompressed [which should be unproblematic in all implementations], chunked+uncompressed, unchunked+compressed, and chunked+compressed). httpbin unfortunately can't do it; it only has compression options for content encoding, not for transfer encoding, and it doesn't seem to be possible to combine it with chunking either. I guess I'll just write something that serves some hardcoded example data for each of these.
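Something along these lines, perhaps. This is only a rough sketch of the fixture server I have in mind; the payload, paths, and port are arbitrary illustrations, not taken from any existing tool:

```python
import gzip
import socketserver

PAYLOAD = b"Hello, WARC digest test!\n"

def chunked(body, chunk_size=8):
    """Apply HTTP/1.1 chunked transfer encoding to a byte string."""
    out = b""
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

# The four combinations discussed above, keyed by request path.
ROUTES = {
    b"/plain":        (PAYLOAD, []),
    b"/chunked":      (chunked(PAYLOAD), [b"Transfer-Encoding: chunked"]),
    b"/gzip":         (gzip.compress(PAYLOAD), [b"Content-Encoding: gzip"]),
    b"/chunked-gzip": (chunked(gzip.compress(PAYLOAD)),
                       [b"Transfer-Encoding: chunked", b"Content-Encoding: gzip"]),
}

class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        path = self.rfile.readline().split()[1]
        while self.rfile.readline().strip():  # drain the request headers
            pass
        body, extra = ROUTES.get(path, (b"not found\n", []))
        headers = [b"HTTP/1.1 200 OK", b"Content-Type: text/plain",
                   b"Connection: close"] + extra
        if b"Transfer-Encoding: chunked" not in extra:
            headers.append(b"Content-Length: %d" % len(body))
        self.wfile.write(b"\r\n".join(headers) + b"\r\n\r\n" + body)

if __name__ == "__main__":
    socketserver.TCPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

Pointing the various WARC writers at those four URLs should make the digest differences easy to compare.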
@ikreymer No, the `entity-body` in RFC 2616 is the (potentially content-encoded) body before applying transfer encoding (if any), so only the transfer encoding needs to be stripped, but the content encoding needs to stay intact. At least that's what the standards say. Whether that actually makes sense, I'm not sure.
I guess the intention is indeed deduplication when the chunk boundaries or something in the compression changes. I've definitely seen varying chunks returned from web servers for identical content, which isn't too surprising since those chunks are often based on a buffer which may get flushed before being full depending on a variety of factors (current CPU/disk speed, concurrent activity, etc.). I have little experience with compression transfer encoding. But I wonder now whether such deduplication is really a good idea. WARC is all about preserving HTTP requests and responses as accurately as possible. But if revisit records might not actually be identical duplicates, that's a loss of information, namely that the transfer encoding from that server is variable rather than constant. That's only a tiny bit of information and probably not at all useful on its own, but it still seems a bit odd.
I'm aware of the implications for existing archives. But unfortunately, those files do not conform to the standard as it's written currently, and I believe that's a significant issue that needs to be solved, whether that requires changes in the software or in the standard. Based on the discussion linked above (or rather, the fact that this exact issue was discussed fairly extensively before), I assumed that a change of the standard is not wanted/needed. But I realise now that this may very well just be an issue of the people working on the standard not knowing how the existing implementations work in this regard.
I'll soon continue investigating how other tools handle this, and file an issue on the WARC specification repo instead if it turns out that the majority of relevant tools don't adhere to the written standard, i.e. if the standard should likely be changed.
I'm planning on writing a jumbo article on what "warcio test" finds, and this will end up being a part of that, if you don't beat me to it. I've got a repo with a pile of warcs from all of the major tools, etc etc.
Thanks for looking into this more deeply. At first glance, removing transfer-encoding but not content-encoding seems to make even less sense than keeping both or removing both.
I wanted to add, the IIPC hosts a weekly open source web archiving call on Tuesdays, alternating between 9am and 6pm EST currently. I think this would be a great conversation to bring up there as well for more immediate feedback. The next one is tomorrow, April 29 at 9am EST, following week will be 6pm. I can also mention this issue on the IIPC slack channel as well if you'd like.
This came up for me recently as well; it broke my expectations about how CDX indexes worked, in that when I fetched files via wayback (without replay) and wrote out to disk, the on-disk blob's digest didn't match the digest in the corresponding CDX line. Our CDX build process simply copies the WARC content digest header as-is without verification or re-calculation, and the browser (and/or wayback earlier) strips any encoding before writing to disk.
Here's a URL which consistently induces the ambiguity: http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf
A notable tool which does "follow the spec" is warcprox, used as part of brozzler, as of 2017 (https://github.com/internetarchive/warcprox/commit/3a0f6e0947)
That's an easy problem to see because many http clients will undo chunked encoding and decompress by default. So you can't depend on the checksum not changing. And even if the community agreed about transfer encoding and checksums, decompression is still going to change it.
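For example, with curl (illustration only; this assumes the server still replies with `Transfer-Encoding: chunked` for that URL):

```sh
URL=http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf
curl -s "$URL" | sha1sum          # chunk framing removed by curl before output
curl -s --raw "$URL" | sha1sum    # raw wire bytes, chunk framing intact -> different digest
```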
@wumpus Sounds great! I remember your request on the wpull repository from a few months ago. Great to hear that this project of yours is coming along nicely. :-) I assume those WARCs all cover very different content, right? It would be very interesting to do a WARC-writing tool intercomparison where all tools (both current and common versions, I think, but not all historical releases) are run against the same sample server. Perhaps the small static server I hinted at could be used for this, so that it'd be "hey, you're maintaining a software that makes HTTP requests and writes to WARC, could you run it against this server and this list of URLs please?" instead of "oh, I need to hunt down an old version of this OS and that library to get this to run". I'll be looking into writing such a server soon.
@ikreymer At first glance, it seems silly, yes. But Transfer-Encoding is intended for encoding the data on-the-fly whereas Content-Encoding is at-rest compression. So the content-encoded data should never change between requests unless the source data has changed whereas the transfer encoding may vary freely. So I guess it kind-of makes sense.
This is the first time I've heard about that call, despite working with WARCs daily for two years, although I must admit that I never searched for things like this either. Is it publicly accessible?
Please do spread it wherever it makes sense. We should also probably find a better place for this discussion than here, but I have no idea where that might be.
@bnewbold That example URL uses chunked encoding, so it makes perfect sense that this causes issues. Thanks for pointing out how warcprox behaves!
@JustAnotherArchivist Yes, I see your point, though I suppose both could just as easily be applied on the fly, e.g. with nginx's `gzip on` and `chunked_transfer_encoding on` options.
The IIPC call is open to everyone and there is more info in the IIPC slack at: https://iipc.slack.com #oh-sos channel (I can invite you if you'd like, PM your email)
@bnewbold Good to know that warcprox does follow the standard; I was about to look at what it does.
So, I decided to run a few quick tests using different tools and got 3 different digests! However, they were consistent across multiple tries of the same tool.
I tried:

1) pywb and webrecorder.io, which produce the same digest, but as it turns out actually strip out the transfer encoding before writing to the WARC.

2) The new warcio capture_http semantics, e.g.:

```python
from warcio.capture_http import capture_http
import requests

with capture_http('2-warcio.warc.gz') as fh:
    requests.get('http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf')
```

This does preserve the chunked encoding in the WARC record.

3) warcprox, which also preserves the chunked encoding.

At first glance, the chunking in 3) and 2) appeared to be the same, though the digests were different...
Trying each tool multiple times produces the same results, so that's good news.
And running `warcio extract --payload` on each produced exactly the same payloads as well, which is also good.
But clearly a few things are off, such as 1) stripping out the transfer encoding altogether, while 2) and 3) should be matching. Also, `warcio check` appears to be getting confused by the chunked encoding, hitting the 'digest present but not checked' condition (@wumpus, if you have a chance to take a look at that).
I've attached three files here: 1-pywb.warc.gz, 2-warcio.warc.gz, 3-warcprox.warc.gz. Will try to investigate further.
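For anyone poking at these attachments, here is a small sketch (my own snippet, not part of warcio's CLI) that recomputes a digest over the decoded payload and compares it with the stored header; as far as I can tell, warcio's content_stream() strips the transfer encoding:

```python
import base64
import hashlib
from warcio.archiveiterator import ArchiveIterator

def b32_sha1(data):
    # warcio/wget-style digest string: "sha1:" + base-32 of the raw digest
    return 'sha1:' + base64.b32encode(hashlib.sha1(data).digest()).decode('ascii')

with open('2-warcio.warc.gz', 'rb') as fh:
    for record in ArchiveIterator(fh):
        if record.rec_type != 'response':
            continue
        stored = record.rec_headers.get_header('WARC-Payload-Digest')
        decoded = record.content_stream().read()  # payload with the encoding removed
        print('stored  :', stored)
        print('decoded :', b32_sha1(decoded))
```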
Thanks, will contact you shortly.
1-pywb.warc.gz is arguably the worst because it preserves the Transfer-Encoding header while decoding the body. That means the data consumer will have to detect and work around that. It means that the payload digest is according to spec though.
2-warcio.warc.gz has the expected, spec-violating "payload with transfer encoding" hash.
3-warcprox.warc.gz has the (arguably) correct "payload with transfer encoding removed" hash. But it writes the hash in hexadecimal form rather than the base-32 one which is common in other WARCing tools.
Looks like 3-warcprox.warc.gz has an incorrect hash for the request record though: it is equal to the block hash when it should be sha1('') since there is no request body.
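For reference, the two digest representations in play (plain Python):

```python
>>> import base64, hashlib
>>> hashlib.sha1(b'').hexdigest()                          # hex form, as warcprox writes it
'da39a3ee5e6b4b0d3255bfef95601890afd80709'
>>> base64.b32encode(hashlib.sha1(b'').digest()).decode()  # base-32 form used by most other tools
'3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ'
```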
I thought my latest pullreq (which got pulled into the devel branch) fixed that bug, and whatever I have installed locally sees digest pass. tests/data/example-iana.org-chunked.warc works fine for me, too, although it is excluded from test_check_digest_examples.py for some reason. Maybe you haven't pulled devel since that merge?
(Your unchunking code wasn't reading the last couple of bytes... ;-) )
Hmm, perhaps this should move to #75 or a separate issue, but running `warcio check -v` with the latest develop (https://github.com/webrecorder/warcio/commit/3fffe1e85b2c447faf6cd81943f3196820151af2) on OSX py3.6, I get:
```
$ warcio check -v 1-pywb.warc.gz
1-pywb.warc.gz
    WARC-Record-ID <urn:uuid:763e3df6-6af3-11e9-a87a-dca90482c8ee> response
        digest pass
    WARC-Record-ID <urn:uuid:d5f051b6-e65d-4f1a-90e8-668081f70d97> request
        digest pass

$ warcio check -v 2-warcio.warc.gz
2-warcio.warc.gz
    WARC-Record-ID <urn:uuid:0db77d79-6df6-4aa5-bd94-4969a360b0f6> response
        digest present but not checked
    WARC-Record-ID <urn:uuid:edc48184-9595-4e7f-a408-26e78c63cb83> request
        digest pass

$ warcio check -v 3-warcprox.warc.gz
3-warcprox.warc.gz
    WARC-Record-ID <urn:uuid:379bd24c-b8b2-4d5d-885e-81a44557f52f> warcinfo
        no digest to check
    WARC-Record-ID <urn:uuid:8258e6d3-8289-4fd4-a5b9-37fce48723ab> response
        digest present but not checked
    WARC-Record-ID <urn:uuid:c05cabbf-6dda-441d-b617-f5be7d25c70a> request
        payload digest failed: sha1:6d2916e9d1884ba8e3f0698b3f9a6f42c2457bb3
```
Somehow, the `digest present but not checked` condition is still being reached... This doesn't repro for you?
Whoops, I fixed the bug in my develop-warcio-test branch, which is still a work in progress:
```diff
  if not chunk_size:
      # chunk_size 0 indicates end of file
+     final_data = self.stream.read(2)
+     assert(final_data == b'\r\n')
      self.all_chunks_read = True
      self._process_read(b'')
      return
```
I'll prepare another "tweak" PR and make sure there's a test... hm and that assert needs to get wrapped in a try/except like the other ones in this method: https://github.com/webrecorder/warcio/pull/77
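A minimal check of the case that patch addresses, assuming ChunkedDataReader can be imported from warcio.bufferedreaders as in current versions:

```python
from io import BytesIO
from warcio.bufferedreaders import ChunkedDataReader

src = BytesIO(b"5\r\nHello\r\n0\r\n\r\nnext record...")
reader = ChunkedDataReader(src)
assert reader.read() == b"Hello"
# If I'm reading the patch right, the CRLF after the terminating 0-length chunk
# has now been consumed too, so whatever follows the chunked body is untouched:
print(src.read())  # expected: b'next record...'
```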
Regarding the choice made in the WARC standard, IMHO it makes as much sense as anything. My understanding is that, in theory, `content-encoding` is supposed to inform you about the file in much the same way that `content-type` does. If you download a `content-encoding: gzip` URL with `wget` or `curl -O`, it saves it gzipped, as it arrived over the wire. (Try it with https://httpbin.org/gzip.)
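For example (assuming httpbin's /gzip endpoint still behaves this way):

```sh
curl -s https://httpbin.org/gzip | file -               # gzip compressed data, saved as it arrived
curl -s --compressed https://httpbin.org/gzip | file -  # JSON text, decoded by curl itself
```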
On the other hand, `wget` or `curl -O` of a transfer-encoded URL decodes it before saving. `Transfer-Encoding` is a hop-by-hop header, unlike `Content-Encoding`. A confounding factor is that `content-encoding: gzip` is often used where `transfer-encoding: gzip` might make more sense.
> Transfer-Encoding is a hop-by-hop header, that is applied to a message between two nodes, not to a resource itself. Each segment of a multi-node connection can use different Transfer-Encoding values. If you want to compress data over the whole connection, use the end-to-end Content-Encoding header instead.

(That last sentence might explain why we rarely see `transfer-encoding: gzip`.)
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding
@JustAnotherArchivist Oh ouch, thanks for pointing out that warcprox writes an incorrect WARC-Payload-Digest header on request records!
I'm not sure what the best route forward is. I still intend to initiate a software comparison about this, both for WARC writing and reading tools. However, I think we could only really call this a bug in the standard if all common tools used the same wrong digest, i.e. if there were a consensus on how it should be done. We've already established that at least one common tool (warcprox) follows the standard, which leads me to think that the best way may be to fix all other tools to do so rather than amending the standard. This may be painful, but on the other hand, this issue was not noticed for nearly a decade since the initial WARC/1.0 specification release, so clearly digest validation is not something that is commonly done (correctly).
For the examples I mentioned at the top, one common tool was different and it was still considered a bug by IIPC... wget started putting <> around the URI in WARC-Target-URI, but no other tool did, and the standard was changed to not have brackets.
My suspicion is that every tool that doesn't remove transfer encoding is doing this wrong, other than warcprox.
And yes, digest validation isn't commonly done, that's why I put it in warcio! Along the way I discovered several other bugs in the standard. I opened tickets in the correct place for each of them; this repo isn't the place for the IIPC community to have a discussion about possible standards bugs.
Yeah, I agree that this isn't the place. This issue turned from "warcio is writing invalid digests" to "(nearly) everything is writing standard-violating digests" very quickly. Later I didn't want to open an issue on the warc-specifications repository without data on how the most common tools behave. And then I didn't have time for that analysis. Anyway, I'll move this discussion elsewhere.
warcio should probably keep doing its current digest calculation for now to avoid another mess like with wget's `<URI>`, i.e. consider this issue on hold.
If this issue does get raised in another venue a link to it from here would be greatly appreciated!
I think the correct place is the IIPC issue tracker, examples: https://github.com/iipc/warc-specifications/issues/49 https://github.com/iipc/warc-specifications/issues/33 https://github.com/iipc/warc-specifications/issues/48 ... ideally one of us interested people would do a bit of a survey to see which tools keep transfer-encoding and then what they do for the digest. So far I've been too lazy to do that.
Yep, that's exactly what I meant by "data on how the most common tools behave" above. I did start working on that back in April but then got distracted by other things. To be continued soon™.
Hi all, apparently nobody cares much about this issue, but it's actually quite concerning. I think warcio shouldn't offer the possibility to check a WARC's validity if it is not capable of properly calculating the Payload-Digest: it makes warcio report false information, and a crawler writing proper WARC files would fail the check because of warcio not being able to read and write WARC files entirely as defined by the specs.
If nobody wants to fix this issue, that's fine, I don't personally use warcio and this is one of the reasons why, but at least `warcio check` shouldn't be a thing. https://github.com/webrecorder/warcio#check
It's not that 'nobody cares'; this was an ambiguity in the standard, and we needed to be careful about making a backwards-incompatible change like this... Unfortunately, there hasn't been much discussion of this elsewhere for a while... But yes, we should figure out a solution one way or another. Probably the safest thing to do would be to add a command-line switch to alter the behavior, rather than changing the default behavior. We can take another look...
> It's not that 'nobody cares'; this was an ambiguity in the standard, and we needed to be careful about making a backwards-incompatible change like this... Unfortunately, there hasn't been much discussion of this elsewhere for a while...
I understand that it is a mistake that is fairly easy to make because it's not clearly stated in the specs, but it's not ambiguous, because `entity-body` is clearly defined in RFC 2616.
> But yes, we should figure out a solution one way or another. Probably the safest thing to do would be to add a command-line switch to alter the behavior, rather than changing the default behavior. We can take another look...
I think the priority is to stop offering the ability to verify WARCs with warcio until this is fixed. And I definitely think the default behavior should be fixed; the "good way" shouldn't be optional, it should be the default and the only way to do it.
If I understand all the messages above, it sounds like people are expecting the standard to change to match what tools like warcio are doing; that is not happening right now and it won't happen tomorrow. We can't know if it will happen one day, but in the meantime I think warcio should be fixed, and you should stop offering `warcio check` until it is.
I also understand that fixing warcio isn't easy. I really do. That's why I put an emphasis on removing the `check` capability until you figure out how to fix it.
JWAT and https://github.com/nlnwa/warchaeology also mark these digests as incorrect.
I thought the purpose of these digests was to validate the content as it is in the payload. It makes no sense at all to have a digest that needs extra computation in order to validate the payload. It just adds another useless layer with more possibilities for errors in writers and validators.
This is the first time since writing JWAT that I've been made aware of this digest issue. So as pointed out elsewhere, there are massive amounts of existing archives with exact payload digests that could be affected by changing how digests are calculated.
Wget has this issue as well. At Archive Team we use the modified version of Wget at https://github.com/ArchiveTeam/wget-lua. This is used to archive petabytes of data.
This modified Wget is now updated (https://github.com/ArchiveTeam/wget-lua/issues/13) and, as of version 1.21.3-at.20230623.01, calculates the correct `WARC-Payload-Digest` value, that is, with the transfer encoding stripped for the calculation of the digest, while preserving the transfer encoding in the WARC record block.
Over the next days this will be rolled out for all Archive Team distributed Warrior projects.
Coming back to this after a few years, I think the best solution is to strip out the Transfer-Encoding altogether when writing the WARCs, so that WARCs are written with the Transfer-Encoding removed from the data, the header replaced with something like X-Removed-Transfer-Removed (we might have discussed that in another issue), and the payload digest matching the data with no Transfer-Encoding.
After all these years, haven't really seen a need to access the transfer-encoded data, so I agree with @nclarkekb that it just creates more work for anyone using the data.
WARCs created via other capture methods, such as browser extensions / warcio.js, also have the encoding already removed. The idea that the WARC data represents raw network traffic has not been true since the advent of TLS and HTTP/2 and HTTP/3, so we might as well represent the 'canonical' text-based HTTP response, with no transfer encoding applied and no transfer-encoding header, so that the data makes sense.
I also remember that (brief) discussion, but don't know where it happened. I thought it was in iipc/warc-specifications but couldn't find it there. It's certainly a discussion worth having, probably over there.
However, any tools that currently strip the transfer encoding from the record body, transform the HTTP headers, or write HTTP/2 or HTTP/3 data to WARC are undeniably violating the standard. This is not ambiguous in the WARC spec: only HTTP/1.1 with the exact HTTP-layer-level data as sent/received is compatible with WARC versions 1.0 and 1.1. And since this is a format intended for long-term archival, spec compliance is crucial in my opinion.
(Side note: perhaps the IIPC repo should also contain a list of known WARC spec violations found in the wild. It would fit in with the implementation recommendations and the annotated spec, I think. Apart from this issue right here, the angle brackets misunderstanding would also belong there.)
@ikreymer I very strongly disagree with this idea, and I agree with what @JustAnotherArchivist noted.
If the WARCs captured with "browser extensions / warcio.js" already have the encoding removed, then these implementations are in violation of the standard and do not create proper WARCs. Whether this is an issue is up to the individual using the tool, but they should be aware of it, if the issue is known.
Previously (https://github.com/webrecorder/warcio/issues/74#issuecomment-487775071) the argument was made that
> I ask this because making such a change will make the payload digests mismatch and affect deduplication with any existing records that have transfer- and content-encoding, while keeping it as is might affect some deduplication.
If the suggestion is to strip the Transfer-Encoding entirely, store this in the WARC, and calculate the payload digest on the stripped content anyway, then what is stopping us from just calculating the payload digest on the stripped body anyway, while writing the raw body (including Transfer-Encoding) to WARC?
The WARC standard notes clearly what should be stored, and what the digests should be calculated over. For many implementations the standard was not accurately read/analysed. I would say it would be best if implementations are fixed whenever possible and simply start calculating the correct digests moving forward.
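To make that alternative concrete, here is a rough sketch (entirely my own illustration, not taken from any tool) of digesting the de-chunked entity-body while still writing the raw, transfer-encoded bytes into the record block:

```python
import base64
import hashlib

def dechunk(raw):
    """Remove HTTP/1.1 chunked transfer encoding from a complete body."""
    out, pos = b"", 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol].split(b";")[0], 16)  # chunk-size, ignoring extensions
        if size == 0:
            return out
        out += raw[eol + 2:eol + 2 + size]
        pos = eol + 2 + size + 2  # skip chunk data plus trailing CRLF

def digest_and_block(raw_body, chunked=True):
    # Digest the entity-body (transfer encoding removed, content encoding kept) ...
    payload = dechunk(raw_body) if chunked else raw_body
    digest = 'sha1:' + base64.b32encode(hashlib.sha1(payload).digest()).decode()
    # ... but keep the exact bytes that came over the wire as the record block.
    return digest, raw_body

digest, block = digest_and_block(b"5\r\nHello\r\n6\r\n WARC!\r\n0\r\n\r\n")
print(digest)  # digest of b"Hello WARC!", while the block keeps the chunk framing
```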
> The idea that the WARC data represents raw network traffic has not been true since the advent of TLS and HTTP/2 and HTTP/3
We should agree on a solution for this in https://github.com/iipc/warc-specifications, and move forward with that.
> (Side note: perhaps the IIPC repo should also contain a list of known WARC spec violations found in the wild. It would fit in with the implementation recommendations and the annotated spec, I think. Apart from this issue right here, the angle brackets misunderstanding would also belong there.)
Yes, I agree with this. When people use tools that write WARCs, they will likely assume that these WARCs are entirely following the WARC standard. Some tools do not. Sometimes that is known, sometimes it is yet to be discovered. I think it would be good to make clear to people who use WARC tools in what ways these tools do not follow the WARC standard, and possibly why they do not follow the WARC standard. They can then decide for themselves whether that is a problem, possibly with recommendations from the IIPC on the long term aspect.
Weird that after 4 years of discussion, no one has yet looked at IA's and CC's WARCs to see what they do with this issue.
Just putting in my 2c, as a developer / maintainer of several different versions of crawlers and the Wayback Machine at Internet Archive... the spec's choice of digesting the body with the transfer encoding removed (whether it carries `Content-Encoding: gzip` or not) makes sense to me, because that's how most basic HTTP clients would work (they never fail to decode Transfer-Encoding, but may not handle Content-Encoding).
Hey, so we are now more than a year since the last message, and almost 5 years since this issue was opened; the damage done has been huge, so I wonder: is there still no plan for fixing this library? Thanks.
Per WARC/1.0 spec section 5.9, the payload is the entity-body. The entity-body is the HTTP body without transfer encoding, per section 4.3 of RFC 2616. (In the newer RFC 723x family, it's called "payload body" instead and is defined in section 3.3 of RFC 7230.)
Just to be clear to avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.
warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.