webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

UnicodeEncodeError when using 'warcio recompress' #95

Closed zuny26 closed 4 years ago

zuny26 commented 4 years ago

I was using the warcio recompress command-line tool to fix some incorrectly compressed WARC files (the records were not individually compressed) and ran into a UnicodeEncodeError exception. I assume the cause is that the WARCs I used contain Cyrillic and Greek characters. However, I don't think this is the expected behavior.

The WARCs that I used can be found here. Specifically, kremlin.warc.gz and primeminister.warc.xz are the WARC files in question.

This is the exact error that I've gotten:

Exception Details:
Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 168, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 130-136: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 105, in __call__
    count = self.load_and_write(stream, cmd.output)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/cli.py", line 145, in load_and_write
    writer.write_record(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 368, in write_record
    self._write_warc_record(self.out, record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 248, in _write_warc_record
    self._set_header_buff(record)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/warcwriter.py", line 240, in _set_header_buff
    headers_buff = record.http_headers.to_ascii_bytes(self.header_filter)
  File "/home/elsa/bitextorenv/lib/python3.7/site-packages/warcio/statusandheaders.py", line 172, in to_ascii_bytes
    string = string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4517-4520: ordinal not in range(128)

N0taN3rd commented 4 years ago

@zuny26 it appears that kremlin.warc contains response records without HTTP headers, i.e. they contain only the body of the response. Currently, warcio expects response records to contain HTTP headers followed by the response body, if any. We are looking into how to detect this case.

Do you know what tool generated these WARCs?

zuny26 commented 4 years ago

The WARCs were produced by Bitextor using httrack. That was some time ago, though, and I think the same script now produces correct WARCs in the first place (still without the HTTP headers, however).

I still think that Cyrillic and Greek characters might have something to do with this, because the 'greenpeace.canada' WARC (from the same page as before) does not include HTTP headers and works fine with warcio recompress.

lpla commented 4 years ago

I tried simply reading and writing a couple of Internet Archive WARC files and I am getting the same error. It looks like some records have non-ASCII characters in their HTTP headers:

HTTP/1.1 200 OK
Date: Tue, 16 Dec 2014 18:16:19 GMT
Server: Apache
X-Powered-By: PHP/5.5.19
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Pingback: http://XXXXXXXXXXXXXXXXXXXXX.XXXXXXX/XXXXXXXXXXXXXXXX.php
Link: <http://XXXXXXXXXXXXXXXXXXXXX.XXXXXXX/>; rel=shortlink
Set-Cookie: PHPSESSID=cd0cad18a3c95669f3a097438811f7bc; path=/
Strict-Transport-Security: “max-age=31536000″
Connection: close
Content-Type: text/html; charset=UTF-8

Note the two non-ASCII quotation marks around the Strict-Transport-Security value.
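To make the failure concrete, here is a minimal sketch in plain Python (no warcio involved) that locates the offending characters in a header line and shows that ASCII encoding fails while UTF-8 encoding succeeds. The helper name `non_ascii_chars` is made up for illustration, and the closing mark is assumed to be a double prime (U+2033), which is how it appears in the header above.

```python
def non_ascii_chars(line):
    """Return (index, character) pairs for every non-ASCII character in line."""
    return [(i, ch) for i, ch in enumerate(line) if ord(ch) > 127]


# The header value from above, with the two typographic marks written as escapes:
# U+201C (left double quotation mark) and U+2033 (double prime, assumed).
header = 'Strict-Transport-Security: \u201cmax-age=31536000\u2033'

offenders = non_ascii_chars(header)

try:
    header.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    # Same exception type as in the traceback: ASCII cannot represent these marks.
    ascii_ok = False

# UTF-8 can encode any Unicode string, so this call does not raise.
utf8_bytes = header.encode('utf-8')
```

A check like this could be run over the headers of a record before writing it, to find out which header triggers the exception.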

I guess one solution would be for warcwriter to implement a flag that skips writing an HTTP header (when none is given and none is found in the input payload), so that the HTTP header checks are done by the user before calling warcwriter. Another would be to handle this UnicodeEncodeError exception in warcwriter so that no HTTP header is written when the header is invalid.

Also, I tested @zuny26's WARCs and printed record.http_headers, and I saw that the warcio reader treats the initial lines of the actual HTML payload as headers, so part of the payload is "lost". The UnicodeEncodeError is raised when those lines include Greek or Russian text from the HTML.

That means that warcio doesn't properly detect HTTP headers (or their absence) when reading. This is a very important issue because, AFAIK, the WARC draft doesn't require records to contain HTTP headers.
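One possible way to detect the absence of HTTP headers, sketched here purely as an assumption and not as what warcio does, is to check whether the record block starts with an HTTP status line before trying to parse headers. The function name `looks_like_http_response` and the regular expression are hypothetical.

```python
import re

# Matches the start of an HTTP response status line, e.g. b"HTTP/1.1 200 OK".
STATUS_LINE = re.compile(rb'^HTTP/\d(\.\d)? \d{3}')


def looks_like_http_response(block):
    """Heuristic: does this record block begin with an HTTP status line?"""
    return STATUS_LINE.match(block) is not None


# A block that starts with HTTP headers, and one that is only an HTML body.
with_headers = b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>'
body_only = '<html><body>Привет</body></html>'.encode('utf-8')

has_headers = looks_like_http_response(with_headers)
body_has_headers = looks_like_http_response(body_only)
```

With a check like this, a block that fails the test could be treated as payload only, so that its first lines are not consumed as headers.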

ikreymer commented 4 years ago

The issue with the original WARCs is that the records are WARC-Type: response records but do not have HTTP headers. warcio assumes that response records always have HTTP headers, while resource records do not. This has been the general practice with WARC usage, though I think the standard could be clearer about that.

Now, the recompress operation doesn't need to parse the HTTP headers at all. Probably the best option for a WARC like this example is to convert the records from response to resource to make the WARC more standard.

@lpla it sounds like your issue is slightly different. Do you have an example of such a WARC? Currently, headers are assumed to be UTF-8 compatible, though Latin-1 parsing will also be attempted if that fails.
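The decoding order described here (UTF-8 first, then Latin-1) can be sketched in plain Python as follows. This illustrates the general fallback pattern only and is not warcio's actual code; `decode_header_bytes` and the `X-Title` header are invented for the example.

```python
def decode_header_bytes(raw):
    """Decode raw header bytes as UTF-8, falling back to Latin-1."""
    try:
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this never raises.
        return raw.decode('latin-1'), 'latin-1'


# The same made-up header line in two different encodings.
utf8_header = 'X-Title: café'.encode('utf-8')
latin1_header = 'X-Title: café'.encode('latin-1')

from_utf8 = decode_header_bytes(utf8_header)
from_latin1 = decode_header_bytes(latin1_header)
```

Note that this only covers reading. The traceback in this issue comes from the writing side, where the decoded string is encoded back to ASCII.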

lpla commented 4 years ago

broken.warc.gz

This should make warcio crash in the same way as the original Internet Archive file I am working on.

zuny26 commented 4 years ago

Thank you @ikreymer, using WARC-Type: resource works well for these situations.

There is still one issue, however. While working with WARC files from the Internet Archive, I have found that some WARCs have non-ASCII characters in the HTTP headers because the status line message is in a language other than English. You mentioned that the headers are assumed to be UTF-8 compatible, so I'm not sure whether the UnicodeEncodeError exception is expected in this case.

Here is an example of such a header:

HTTP/1.1 404 Artículo no encontrado
Server: nginx/1.6.2
Date: Thu, 18 Dec 2014 20:33:29 GMT
Content-Type: text/html; charset=utf-8
Connection: close
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Cache-Control: no-cache
Pragma: no-cache
Set-Cookie: ca4b745bd1bd9a76a1a437f8f0c63eb3=fc01e9a89495205620b06ee56599725e; path=/

ikreymer commented 4 years ago

@lpla the issue you have is now fixed in the 1.7.2 release. @zuny26 the status line is just passed on as is and not really used by warcio. I think it should probably be OK (assuming the browser also just ignores it).