Closed zuny26 closed 4 years ago
@zuny26 it appears that kremlin.warc contains response records without HTTP headers, i.e. they contain only the body of the response.
Currently warcio expects response records to contain HTTP headers, followed by the response body, if any.
We are looking into how to detect this case.
Do you know what tool generated these warcs?
The WARCs were produced by Bitextor using httrack. That was some time ago though, and I think now the same script produces correct WARCs in the first place (without the HTTP headers however).
I still think that Cyrillic and Greek characters might have something to do with this, because the 'greenpeace.canada' WARC (from the same page as before) does not include HTTP headers and works fine with warcio recompress.
I tested simply reading and writing a couple of the Internet Archive WARC files and I am getting the same error. It looks like some records have non-ASCII characters:
HTTP/1.1 200 OK
Date: Tue, 16 Dec 2014 18:16:19 GMT
Server: Apache
X-Powered-By: PHP/5.5.19
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Pingback: http://XXXXXXXXXXXXXXXXXXXXX.XXXXXXX/XXXXXXXXXXXXXXXX.php
Link: <http://XXXXXXXXXXXXXXXXXXXXX.XXXXXXX/>; rel=shortlink
Set-Cookie: PHPSESSID=cd0cad18a3c95669f3a097438811f7bc; path=/
Strict-Transport-Security: “max-age=31536000″
Connection: close
Content-Type: text/html; charset=UTF-8
Note the two typographic quotation marks around the Strict-Transport-Security value.
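To illustrate why those two characters matter, here is a minimal sketch in plain Python (it is not warcio code, and the helper name `non_encodable_chars` is hypothetical). It assumes the header value is wrapped in U+201C and U+2033, as it appears above, and lists the characters that a single-byte encoding such as Latin-1 cannot represent. Which encoding warcio actually uses when writing is not stated in this thread, so Latin-1 is only an assumption here.

```python
def non_encodable_chars(value, encoding="latin-1"):
    """Return (index, char) pairs in `value` that `encoding` cannot represent."""
    bad = []
    for index, char in enumerate(value):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, char))
    return bad


# Assumed reconstruction of the Strict-Transport-Security value shown above:
# U+201C (left double quotation mark) ... U+2033 (double prime).
header_value = "\u201cmax-age=31536000\u2033"

for index, char in non_encodable_chars(header_value):
    print(f"position {index}: U+{ord(char):04X} cannot be encoded as latin-1")
```

A check like this could be run by the caller before handing headers to a writer, which is in the spirit of the user-side validation suggested in this thread.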
I guess that the solution for this could be that warcwriter implements a flag to avoid writing an HTTP header (if not given and not found in the input payload), so the HTTP header checks are done by the user before calling warcwriter. Alternatively, warcwriter could handle this UnicodeEncodeError exception so that no HTTP header is written when the header is invalid.
Also, I tested @zuny26's WARCs and printed record.http_headers, and I saw that the warcio reader is reading the initial lines of the actual HTML payload as headers, so part of the payload is "lost". The UnicodeEncodeError pops up when those lines include Greek or Russian text from the HTML.
That means that warcio doesn't properly detect HTTP headers (or the absence of them) when reading. This is a very important issue because, AFAIK, the WARC draft doesn't require records to contain HTTP headers.
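One possible way to detect the absence of HTTP headers is sketched below. This is an assumption about how such a check could look, not how warcio does it: the hypothetical helper `looks_like_http_response` only tests whether the record block begins with an HTTP status line.

```python
import re

# Matches the start of a status line such as "HTTP/1.1 200 OK" or "HTTP/2 404".
_STATUS_LINE = re.compile(rb"HTTP/\d(?:\.\d)? \d{3}(?:[ \r\n]|$)")


def looks_like_http_response(block):
    """Return True if the raw record block starts with an HTTP status line."""
    return _STATUS_LINE.match(block) is not None


with_headers = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
body_only = "<html><body>Привет</body></html>".encode("utf-8")

print(looks_like_http_response(with_headers))
print(looks_like_http_response(body_only))
```

A record that fails this check could then be treated as payload only, instead of having its first lines consumed as headers.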
The issue with the original WARC is that its records are WARC-Type: response records but do not have HTTP headers. warcio assumes that response records always have HTTP headers, while resource records do not. This has been the general practice with WARC usage, though I think the standard could be clearer about that.
Now, for the recompress operation, it doesn't need to parse the http headers at all.
Probably the best option for a WARC like this example is to convert the response records to resource records to make the WARC more standard.
@lpla it sounds like your issue is slightly different. Do you have an example of such a WARC? Currently, headers are assumed to be utf-8 compatible, though latin-1 parsing will also be attempted if that fails.
This should make warcio crash in the same way as the original Internet Archive file I am working on.
Thank you @ikreymer, using WARC-Type: resource works well for these situations.
There is still one issue, however. While working with WARC files from the Internet Archive, I have found that some WARCs have non-ASCII characters in the HTTP headers, because the status line message is in a non-English language. You mentioned that the headers are assumed to be utf-8 compatible, so I'm not sure if the UnicodeEncodeError exception is expected in this case.
Here is an example of such a header:
HTTP/1.1 404 Artículo no encontrado
Server: nginx/1.6.2
Date: Thu, 18 Dec 2014 20:33:29 GMT
Content-Type: text/html; charset=utf-8
Connection: close
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Cache-Control: no-cache
Pragma: no-cache
Set-Cookie: ca4b745bd1bd9a76a1a437f8f0c63eb3=fc01e9a89495205620b06ee56599725e; path=/
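The "utf-8 first, then latin-1" parsing mentioned earlier in the thread can be sketched roughly as follows. This is only an illustration of the fallback idea in plain Python, with a hypothetical helper name; it is not taken from warcio's source.

```python
def decode_header_bytes(raw):
    """Decode raw header bytes as utf-8, falling back to latin-1 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


status_line = "HTTP/1.1 404 Artículo no encontrado"

# The same status line as it might arrive in either encoding.
print(decode_header_bytes(status_line.encode("utf-8")))
print(decode_header_bytes(status_line.encode("latin-1")))
```

Decoding succeeds for both byte sequences. An encoding error would only arise later, if the decoded text were re-encoded with a codec that cannot represent "í", such as ASCII.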
@lpla the issue you have is now fixed in the 1.7.2 release. @zuny26 the status line is just passed on as is, and not really used by warcio. I think it should probably be OK (assuming the browser just ignores it as well).
I was using the warcio recompress command line tool to fix some incorrect (not individually compressed) WARC files and I have stumbled onto a UnicodeEncodeError exception. I assume the reason for this bug is that the WARCs that I used contain Cyrillic and Greek characters. However, I don't suppose that is the expected behavior.
The WARCs that I used can be found here. Specifically, kremlin.warc.gz and primeminister.warc.xz are the WARC files in question.
This is the exact error that I've gotten: