webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Incorrect WARC-Payload-Digest values when transfer encoding is present #74

Open JustAnotherArchivist opened 5 years ago

JustAnotherArchivist commented 5 years ago

Per WARC/1.0 spec section 5.9:

The payload of an application/http block is its ‘entity-body’ (per [RFC2616]).

The entity-body is the HTTP body without transfer encoding per section 4.3 in RFC 2616. (In the newer RFC 723x family, it's called "payload body" instead and is defined in section 3.3 of RFC 7230.)

Just to be clear, to avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server, with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.

warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.
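A minimal sketch of the difference (the chunked body and the dechunk helper below are invented for illustration; this is not warcio's code):

import hashlib

# Raw HTTP body exactly as the server sent it, using chunked transfer coding.
raw_body = b'4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n'

def dechunk(data):
    # Naive chunked-transfer decoder, for illustration only.
    out, pos = b'', 0
    while True:
        nl = data.index(b'\r\n', pos)
        size = int(data[pos:nl], 16)
        if size == 0:
            return out
        out += data[nl + 2:nl + 2 + size]
        pos = nl + 2 + size + 2  # skip the chunk data and its trailing CRLF

# The WARC record block stores raw_body unchanged. Per the spec, the
# WARC-Payload-Digest is computed over the entity-body instead:
payload_digest = hashlib.sha1(dechunk(raw_body)).hexdigest()  # spec behaviour
naive_digest = hashlib.sha1(raw_body).hexdigest()             # what many tools compute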

wumpus commented 5 years ago

Is this a bug in the standard instead of a bug in warcio and many other tools? Can you point out any common tool that actually obeys the standard?

There are a couple of examples where the standard was changed to match the tools; this might be a new one.

I see from your links that this has been discussed by the WARC standards folks, but it sure doesn't seem like the subset of tools that preserve transfer encoding have adopted that digest algorithm.

One thing this does inspire me to do is to add some code to "warcio check" and "warcio test" to check for and comment on this situation.

JustAnotherArchivist commented 5 years ago

From the discussion I linked, it seems like the intention was as written, not as implemented. So in that sense, I'd say it's not a bug in the standard.

I haven't investigated every tool; I just did some validation of WARCs created with wpull and warcio, both of which had this issue. From a glance over the relevant code, it seems like wget does remove chunking for the payload digest. However, it also looks like it doesn't handle compression correctly (with which wget has had issues before). Due to the lack of a suitable testing setup, I haven't been able to test that yet though.

wumpus commented 5 years ago

Well, if you'd like to look farther, the develop branch of warcio supports "warcio check", which can be used to also check digests -- albeit with warcio's current opinion that digests are computed on the bytes without removing transfer encoding. You can also install the branch from my "warcio test" pullreq (which is a work in progress) and give "warcio test" a spin!

It's my hope that "warcio test" will lead the community to interact about all of these conflicts between the standard and the tools people actually use.

ikreymer commented 5 years ago

Thanks for flagging this! Hm, I wonder if any tools actually do what the standard suggests in this regard? I think the content-encoding would also need to be decoded.

The change in warcio should be simple, I think just:

- for buf in self._iter_stream(record.raw_stream):
+ for buf in self._iter_stream(record.content_stream()):

Of course, this would invalidate previous payload digests, so I'm not sure if it's the right thing to do, especially if other tools also behave the same way now..
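For context, a quick sketch with warcio's existing reader API (the file name is hypothetical); note that content_stream() decodes the content-encoding as well as the chunking, which, as discussed below, is more than the spec's payload definition calls for:

from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as fh:  # hypothetical input file
    for record in ArchiveIterator(fh):
        if record.rec_type == 'response':
            payload = record.content_stream().read()  # chunking and content-encoding removed
            break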

Also, I wonder what the implications of this are.. presumably the problem would be for deduplication: if the same content is served with different transfer-encodings and/or content-encodings, it would unnecessarily be fetched again? Or if chunks of different sizes are used for the same content? I wonder how likely/prevalent this is in practice.

I ask this because making such a change will make the payload digests mismatch and affect deduplication with any existing records that have transfer- and content- encoding, while keeping it as is might affect some deduplication.

wumpus commented 5 years ago

These issues are exactly why I suspect the standard is what's going to change. Those "bad" digests are all over petabytes of response and revisit records, and the cdx indices ...

JustAnotherArchivist commented 5 years ago

@wumpus Yes, I want to play around with this further. I need to set up a test HTTP server that can produce all the different relevant combinations (unchunked+uncompressed [which should be unproblematic in all implementations], chunked+uncompressed, unchunked+compressed, and chunked+compressed). httpbin unfortunately can't do it; it only has compression options for content encoding, not for transfer encoding, and it doesn't seem to be possible to combine it with chunking either. I guess I'll just write something that serves some hardcoded example data for each of these.
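A rough sketch of what that could look like for one combination (chunked transfer coding plus gzip content encoding); the port and the payload here are arbitrary:

import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = gzip.compress(b'hello world\n')  # content encoding applied "at rest"

class Handler(BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'  # required for chunked responses

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Encoding', 'gzip')
        self.send_header('Transfer-Encoding', 'chunked')
        self.end_headers()
        # Frame the gzipped body as two chunks plus the terminator by hand.
        for part in (BODY[:5], BODY[5:]):
            self.wfile.write(b'%x\r\n%s\r\n' % (len(part), part))
        self.wfile.write(b'0\r\n\r\n')

HTTPServer(('127.0.0.1', 8080), Handler).serve_forever()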

@ikreymer No, the entity-body in RFC 2616 is the (potentially content-encoded) body before applying transfer encoding (if any), so only the transfer encoding needs to be stripped, but the content encoding needs to stay intact. At least that's what the standards say. Whether that actually makes sense, I'm not sure.

I guess the intention is indeed deduplication when the chunk boundaries or something in the compression changes. I've definitely seen varying chunks returned from web servers for identical content, which isn't too surprising since those chunks are often based on a buffer which may get flushed before being full depending on a variety of factors (current CPU/disk speed, concurrent activity, etc.). I have little experience with compression transfer encoding. But I wonder now whether such deduplication is really a good idea. WARC is all about preserving HTTP requests and responses as accurately as possible. But if revisit records might not actually be identical duplicates, that's a loss of information, namely that the transfer encoding from that server is variable rather than constant. That's only a tiny bit of information and probably not at all useful on its own, but it still seems a bit odd.

I'm aware of the implications for existing archives. But unfortunately, those files do not conform to the standard as it's currently written, and I believe that's a significant issue that needs to be solved, whether that requires changes in the software or in the standard. Based on the discussion linked above (or rather, the fact that this exact issue was discussed fairly extensively before), I assumed that a change of the standard is not wanted/needed. But I realise now that this may very well just be an issue of the people working on the standard not knowing how the existing implementations work in this regard.

I'll soon continue investigating how other tools react to this, and I'll file an issue on the WARC specification repo instead if it turns out that the majority of relevant tools don't adhere to the written standard, i.e. if the standard should likely be changed.

wumpus commented 5 years ago

I'm planning on writing a jumbo article on what "warcio test" finds, and this will end up being a part of that, if you don't beat me to it. I've got a repo with a pile of warcs from all of the major tools, etc etc.

ikreymer commented 5 years ago

Thanks for looking into this more deeply. At first glance, removing transfer-encoding but not content-encoding seems to make even less sense than keeping both or removing both.

I wanted to add that the IIPC hosts a weekly open source web archiving call on Tuesdays, currently alternating between 9am and 6pm EST. I think this would be a great conversation to bring up there as well for more immediate feedback. The next one is tomorrow, April 29 at 9am EST; the following week it will be at 6pm. I can also mention this issue on the IIPC slack channel if you'd like.

bnewbold commented 5 years ago

This came up for me recently as well; it broke my expectations about how CDX indexes worked, in that when I fetched files via wayback (without replay) and wrote out to disk, the on-disk blob's digest didn't match the digest in the corresponding CDX line. Our CDX build process simply copies the WARC content digest header as-is without verification or re-calculation, and the browser (and/or wayback earlier) strips any encoding before writing to disk.

Here's a URL which consistently induces the ambiguity: http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf

A notable tool which does "follow the spec" is warcprox, used as part of brozzler, as of 2017 (https://github.com/internetarchive/warcprox/commit/3a0f6e0947)

wumpus commented 5 years ago

That's an easy problem to see because many http clients will undo chunked encoding and decompress by default. So you can't depend on the checksum not changing. And even if the community agreed about transfer encoding and checksums, decompression is still going to change it.

JustAnotherArchivist commented 5 years ago

@wumpus Sounds great! I remember your request on the wpull repository from a few months ago. Great to hear that this project of yours is coming along nicely. :-) I assume those WARCs all cover very different content, right? It would be very interesting to do a WARC-writing tool intercomparison where all tools (both current and common versions, I think, but not all historical releases) are run against the same sample server. Perhaps the small static server I hinted at could be used for this, so that it'd be "hey, you're maintaining a software that makes HTTP requests and writes to WARC, could you run it against this server and this list of URLs please?" instead of "oh, I need to hunt down an old version of this OS and that library to get this to run". I'll be looking into writing such a server soon.

@ikreymer At first glance, it seems silly, yes. But Transfer-Encoding is intended for encoding the data on-the-fly whereas Content-Encoding is at-rest compression. So the content-encoded data should never change between requests unless the source data has changed whereas the transfer encoding may vary freely. So I guess it kind-of makes sense.

This is the first time I've heard about that call, despite working with WARCs daily for two years, although I must admit that I never searched for things like this either. Is it publicly accessible?

Please do spread it wherever it makes sense. We should also probably find a better place for this discussion than here, but I have no idea where that might be.

@bnewbold That example URL uses chunked encoding, so it makes perfect sense that this causes issues. Thanks for pointing out how warcprox behaves!

ikreymer commented 5 years ago

@JustAnotherArchivist Yes, I see your point, though I suppose both could just as easily be applied on-the-fly, eg. with nginx's gzip on and chunked_transfer_encoding on options.

The IIPC call is open to everyone and there is more info in the IIPC slack at: https://iipc.slack.com #oh-sos channel (I can invite you if you'd like, PM your email)

@bnewbold Good to know that warcprox does follow the standard; I was about to look at what it does.

So, I decided to run a few quick tests using different tools and got 3 different digests! However, they were consistent across multiple tries of the same tool.

I tried:

1) pywb and webrecorder.io, which produce the same digest but, as it turns out, actually strip out the transfer-encoding before writing to the WARC.

2) The new warcio capture_http semantics, eg:

from warcio.capture_http import capture_http
import requests

with capture_http('2-warcio.warc.gz') as fh:
    requests.get('http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf')

This does preserve the chunked encoding in the WARC record.

3) warcprox, which also preserves the chunked encoding.

At first glance, the chunking in 2) and 3) appeared to be the same, though the digests were different..

Trying each tool multiple times produces the same results, so that's good news. And running warcio extract --payload on each produced exactly the same payloads as well, also good.

But clearly a few things are off, such as 1) stripping out the transfer-encoding altogether, while 2) and 3) should be matching. Also, warcio check appears to be getting confused by the chunked encoding, hitting the 'digest present but not checked' condition. (@wumpus if you have a chance to take a look at that)

I've attached three files here. Will try to investigate further: 1-pywb.warc.gz, 2-warcio.warc.gz, 3-warcprox.warc.gz

JustAnotherArchivist commented 5 years ago

Thanks, will contact you shortly.

1-pywb.warc.gz is arguably the worst because it preserves the Transfer-Encoding header while decoding the body. That means the data consumer will have to detect and work around that. It does mean that the payload digest is according to spec, though.

2-warcio.warc.gz has the expected, spec-violating "payload with transfer encoding" hash.

3-warcprox.warc.gz has the (arguably) correct "payload with transfer encoding removed" hash. But it writes the hash in hexadecimal form rather than the base-32 form which is common in other WARCing tools.

JustAnotherArchivist commented 5 years ago

Looks like 3-warcprox.warc.gz has an incorrect hash for the request record though: it is equal to the block hash when it should be sha1('') since there is no request body.
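For reference, the expected value is quick to compute (base-32 being the form most WARC tools write):

import base64, hashlib

# sha1 of zero bytes, in the base-32 form common in WARC files.
print('sha1:' + base64.b32encode(hashlib.sha1(b'').digest()).decode())
# -> sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ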

wumpus commented 5 years ago

I thought my latest pullreq (which got pulled into the devel branch) fixed that bug, and whatever I have installed locally sees digest pass. tests/data/example-iana.org-chunked.warc works fine for me, too, although it is excluded from test_check_digest_examples.py for some reason. Maybe you haven't pulled devel since that merge?

wumpus commented 5 years ago

(Your unchunking code wasn't reading the last couple of bytes... ;-) )

ikreymer commented 5 years ago

Hmm, perhaps this should move to #75 or a separate issue, but running warcio check -v with latest develop (https://github.com/webrecorder/warcio/commit/3fffe1e85b2c447faf6cd81943f3196820151af2) on OSX py3.6, I get:

$ warcio check -v 1-pywb.warc.gz 
1-pywb.warc.gz
  WARC-Record-ID <urn:uuid:763e3df6-6af3-11e9-a87a-dca90482c8ee> response
    digest pass
  WARC-Record-ID <urn:uuid:d5f051b6-e65d-4f1a-90e8-668081f70d97> request
    digest pass
$ warcio check -v 2-warcio.warc.gz 
2-warcio.warc.gz
  WARC-Record-ID <urn:uuid:0db77d79-6df6-4aa5-bd94-4969a360b0f6> response
    digest present but not checked
  WARC-Record-ID <urn:uuid:edc48184-9595-4e7f-a408-26e78c63cb83> request
    digest pass
$ warcio check -v 3-warcprox.warc.gz 
3-warcprox.warc.gz
  WARC-Record-ID <urn:uuid:379bd24c-b8b2-4d5d-885e-81a44557f52f> warcinfo
    no digest to check
  WARC-Record-ID <urn:uuid:8258e6d3-8289-4fd4-a5b9-37fce48723ab> response
    digest present but not checked
  WARC-Record-ID <urn:uuid:c05cabbf-6dda-441d-b617-f5be7d25c70a> request
    payload digest failed: sha1:6d2916e9d1884ba8e3f0698b3f9a6f42c2457bb3

Somehow, the digest present but not checked condition is still being reached... This doesn't repro for you?

wumpus commented 5 years ago

Whoops, I fixed the bug in my develop-warcio-test branch, which is still a work in progress:

         if not chunk_size:
             # chunk_size 0 indicates end of file
+            final_data = self.stream.read(2)
+            assert(final_data == b'\r\n')
             self.all_chunks_read = True
             self._process_read(b'')
             return

I'll prepare another "tweak" PR and make sure there's a test... hm and that assert needs to get wrapped in a try/except like the other ones in this method: https://github.com/webrecorder/warcio/pull/77
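For reference, the two missing bytes are the CRLF that terminates the zero-size last chunk (invented example body):

# Chunked body layout; the fix above consumes the final b'\r\n'.
chunked = (b'4\r\nWiki\r\n'  # chunk: size line, data, CRLF
           b'0\r\n'          # zero-size chunk signals the end
           b'\r\n')          # terminating CRLF that was left unread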

nlevitt commented 5 years ago

Regarding the choice made in the WARC standard, IMHO it makes as much sense as anything. My understanding is that, in theory, content-encoding is supposed to inform you about the file in much the same way that content-type does. If you download a content-encoding: gzip url with wget or curl -O, it saves it gzipped, as it arrived over the wire. (Try it with https://httpbin.org/gzip).

On the other hand, wget or curl -O of a transfer-encoded url decodes it before saving. Transfer-encoding is a hop-by-hop header, unlike content-encoding. A confounding factor is that content-encoding: gzip is often used where transfer-encoding: gzip might make more sense.

Transfer-Encoding is a hop-by-hop header, that is applied to a message between two nodes, not to a resource itself. Each segment of a multi-node connection can use different Transfer-Encoding values. If you want to compress data over the whole connection, use the end-to-end Content-Encoding header instead.

(That last sentence might explain why we rarely see transfer-encoding: gzip) https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding
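A quick way to see the hop-by-hop/end-to-end difference, sketched with requests against that same httpbin endpoint (the client undoes the transfer coding but keeps the content coding):

import gzip
import requests

r = requests.get('https://httpbin.org/gzip', stream=True)
wire = r.raw.read()  # chunking already undone by the client; gzip kept
print(r.headers.get('Content-Encoding'))  # 'gzip'
print(gzip.decompress(wire)[:40])         # the decoded JSON payload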

nlevitt commented 5 years ago

@JustAnotherArchivist oh ouch, thanks for pointing out that warcprox writes an incorrect WARC-Payload-Digest header on request records

JustAnotherArchivist commented 5 years ago

#97 brought this to my attention again. Sorry, I haven't really had time the past few months to investigate further.

I'm not sure what the best route forward is. I still intend to initiate a software comparison about this, both for WARC writing and reading tools. However, I think we could only really call this a bug in the standard if all common tools used the same wrong digest, i.e. if there were a consensus on how it should be done. We've already established that at least one common tool (warcprox) follows the standard, which leads me to think that the best way may be to fix all other tools to do so rather than amending the standard. This may be painful, but on the other hand, this issue was not noticed for nearly a decade since the initial WARC/1.0 specification release, so clearly digest validation is not something that is commonly done (correctly).

wumpus commented 5 years ago

For the examples I mentioned at the top, one common tool was different and it was still considered a bug by IIPC... wget started putting <> around the URI in WARC-Target-URI, but no other tool did, and the standard was changed to not have brackets.

My suspicion is that every tool that doesn't remove transfer encoding is doing this wrong, other than warcprox.

And yes, digest validation isn't commonly done, that's why I put it in warcio! Along the way I discovered several other bugs in the standard. I opened tickets in the correct place for each of them; this repo isn't the place for the IIPC community to have a discussion about possible standards bugs.

JustAnotherArchivist commented 5 years ago

Yeah, I agree that this isn't the place. This issue turned from "warcio is writing invalid digests" to "(nearly) everything is writing standard-violating digests" very quickly. Later I didn't want to open an issue on the warc-specifications repository without data on how the most common tools behave. And then I didn't have time for that analysis. Anyway, I'll move this discussion elsewhere.

warcio should probably keep doing its current digest calculation for now to avoid another mess like with wget's <URI>, i.e. consider this issue on hold.

edsu commented 5 years ago

If this issue does get raised in another venue a link to it from here would be greatly appreciated!

wumpus commented 5 years ago

I think the correct place is the IIPC issue tracker, examples: https://github.com/iipc/warc-specifications/issues/49 https://github.com/iipc/warc-specifications/issues/33 https://github.com/iipc/warc-specifications/issues/48 ... ideally one of us interested people would do a bit of a survey to see which tools keep transfer-encoding and then what they do for the digest. So far I've been too lazy to do that.

JustAnotherArchivist commented 5 years ago

Yep, that's exactly what I meant by "data on how the most common tools behave" above. I did start working on that back in April but then got distracted by other things. To be continued soon™.

CorentinB commented 2 years ago

Hi all, apparently nobody cares much about this issue, but it's actually quite concerning. I think that warcio shouldn't offer the possibility to check a WARC's validity if it is not capable of properly calculating the Payload-Digest; it makes warcio report false information, and a crawler writing proper WARC files would fail the check because warcio is not able to read and write WARC files entirely as defined by the specs.

If nobody wants to fix that issue, it's fine; I don't personally use warcio, and this is one of the reasons why. But at least warcio check shouldn't be a thing. https://github.com/webrecorder/warcio#check

ikreymer commented 2 years ago

Hi all, apparently nobody cares much about this issue, but it's actually quite concerning. I think that warcio shouldn't offer the possibility to check a WARC's validity if it is not capable of properly calculating the Payload-Digest; it makes warcio report false information, and a crawler writing proper WARC files would fail the check because warcio is not able to read and write WARC files entirely as defined by the specs.

If nobody wants to fix that issue, it's fine; I don't personally use warcio, and this is one of the reasons why. But at least warcio check shouldn't be a thing. https://github.com/webrecorder/warcio#check

It's not that 'nobody cares'; this was an ambiguity in the standard, and we needed to be careful in making a backwards-incompatible change like this.. Unfortunately, there hasn't been much discussion of this elsewhere for a while.. But yes, we should figure out a solution one way or another. Probably the safest thing to do would be to add a command line switch to alter the behavior, rather than changing the default behavior. We can take another look..

CorentinB commented 2 years ago

It's not that 'nobody cares'; this was an ambiguity in the standard, and we needed to be careful in making a backwards-incompatible change like this.. Unfortunately, there hasn't been much discussion of this elsewhere for a while..

I understand that it is a mistake that is fairly easy to make because it's not clearly stated in the specs, but it's not ambiguous, because entity-body is clearly defined in RFC 2616.

But yes, we should figure out a solution one way or another. Probably the safest thing to do would be to add a command line switch to alter the behavior, rather than changing the default behavior. We can take another look..

I think the priority is to stop offering the ability to verify WARCs with warcio until this is fixed. And I definitely think the default behavior should be fixed: the "good way" shouldn't be optional, it should be the default and the only way to do it.

If I understand all the messages above, it sounds like people are expecting the standard to change to match what tools like warcio are doing; that is not happening right now and it won't happen tomorrow. We can't know if it will happen one day, but in the meantime I think warcio should be fixed, and you should stop offering warcio check until then.

I also understand that fixing warcio isn't easy. I really do. That's why I put an emphasis on removing the check capability until you figure out how to fix it.

nclarkekb commented 1 year ago

JWAT and https://github.com/nlnwa/warchaeology also mark these digests as incorrect.

I thought the purpose of these digests was to validate the content as it is in the payload. It makes no sense at all to have a digest that needs extra computation in order to validate the payload. It just adds another useless layer with more possibilities for errors in writers and validators.

This is the first time since writing JWAT that I've been made aware of this digest issue. So, as pointed out elsewhere, there are massive amounts of existing archives with exact payload digests that could be affected by changing how digests are calculated.

Arkiver2 commented 1 year ago

Wget has this issue as well. At Archive Team we use the modified version of Wget at https://github.com/ArchiveTeam/wget-lua. This is used to archive petabytes of data.

This modified Wget is now updated (https://github.com/ArchiveTeam/wget-lua/issues/13) and, as of version 1.21.3-at.20230623.01, calculates the correct WARC-Payload-Digest value: that is, with transfer encoding stripped for the calculation of the digest, while preserving the transfer encoding in the WARC record block.

Over the next few days this will be rolled out for all Archive Team distributed Warrior projects.

ikreymer commented 1 year ago

Coming back to this after a few years, I think the best solution is to strip out the Transfer-Encoding altogether when writing the WARCs, so that WARCs are written with the Transfer-Encoding removed from the data and the header replaced with something like X-Removed-Transfer-Removed (we might have discussed that in another issue), and so that the payload digest matches the data with no Transfer-Encoding.

After all these years, I haven't really seen a need to access the transfer-encoded data, so I agree with @nclarkekb that it just creates more work for anyone using the data.

WARCs created via other capture methods, such as browser extensions / warcio.js, also have the encoding already removed. The idea that the WARC data represents raw network traffic has not been true since the advent of TLS, HTTP/2, and HTTP/3, so we might as well represent the 'canonical' text-based HTTP response, with no transfer encoding applied and no transfer-encoding header, so that the data makes sense.
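A purely illustrative sketch of that proposal (the helper is hypothetical, not warcio API; dechunk refers to the toy decoder sketched earlier in this thread):

# Illustrative only: normalize a response before writing it to a WARC,
# as proposed above. The marker header name follows the suggestion above.
def normalize_response(headers, raw_body):
    headers = dict(headers)
    te = headers.pop('Transfer-Encoding', None)
    if te == 'chunked':
        raw_body = dechunk(raw_body)
        headers['X-Removed-Transfer-Removed'] = te
    return headers, raw_body  # block digest and payload digest now agree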

JustAnotherArchivist commented 1 year ago

I also remember that (brief) discussion, but don't know where it happened. I thought it was in iipc/warc-specifications but couldn't find it there. It's certainly a discussion worth having, probably over there.

However, any tools that currently strip the transfer encoding from the record body, transform the HTTP headers, or write HTTP/2 or HTTP/3 data to WARC are undeniably violating the standard. This is not ambiguous in the WARC spec: only HTTP/1.1 with the exact HTTP-layer-level data as sent/received is compatible with WARC versions 1.0 and 1.1. And since this is a format intended for long-term archival, spec compliance is crucial in my opinion.

(Side note: perhaps the IIPC repo should also contain a list of known WARC spec violations found in the wild. It would fit in with the implementation recommendations and the annotated spec, I think. Apart from this issue right here, the angle brackets misunderstanding would also belong there.)

Arkiver2 commented 1 year ago

@ikreymer I very strongly disagree with this idea, and I agree with what @JustAnotherArchivist noted.

If the WARCs captured with "browser extensions / warcio.js" already have encoding removed, then these implementations are in violation of the standard and do not create proper WARCs. Whether this is an issue is up to the individual using the tool, but they should be aware of it, if the issue is known.

Previously (https://github.com/webrecorder/warcio/issues/74#issuecomment-487775071) the argument was made that

I ask this because making such a change will make the payload digests mismatch and affect deduplication with any existing records that have transfer- and content- encoding, while keeping it as is might affect some deduplication.

If the suggestion is to strip Transfer-Encoding entirely, store this in the WARC, and calculate the payload digest anyway on the stripped content, then what is stopping us from just calculating the payload digest on the stripped body, while writing the raw body (including Transfer-Encoding) to WARC?

The WARC standard notes clearly what should be stored, and what the digests should be calculated over. For many implementations the standard was not accurately read/analysed. I would say it would be best if implementations are fixed whenever possible and simply start calculating the correct digests moving forward.

The idea that the WARC data represents raw network traffic has not been true since the advent of TLS, HTTP/2, and HTTP/3

We should agree on a solution for this in https://github.com/iipc/warc-specifications, and move forward with that.

(Side note: perhaps the IIPC repo should also contain a list of known WARC spec violations found in the wild. It would fit in with the implementation recommendations and the annotated spec, I think. Apart from this issue right here, the angle brackets misunderstanding would also belong there.)

Yes, I agree with this. When people use tools that write WARCs, they will likely assume that these WARCs are entirely following the WARC standard. Some tools do not. Sometimes that is known, sometimes it is yet to be discovered. I think it would be good to make clear to people who use WARC tools in what ways these tools do not follow the WARC standard, and possibly why they do not follow the WARC standard. They can then decide for themselves whether that is a problem, possibly with recommendations from the IIPC on the long term aspect.

wumpus commented 1 year ago

Weird that after 4 years of discussion, no one has yet looked at IA's and CC's warcs to see what they do with this issue.

kngenie commented 1 year ago

Just putting my 2c, as a developer / maintainer of several different versions of crawlers and wayback machine at Internet Archive...

CorentinB commented 7 months ago

Hey, so it is now more than a year since the last message, and almost 5 years since this issue was opened; the damage done has been huge, so I wonder: still no plan for fixing this library? Thanks.