webrecorder / warcio.js

JS Streaming WARC IO optimized for Browser and Node

Headers with non-ASCII characters silently disappear #81

Open wolfgang42 opened 3 days ago

wolfgang42 commented 3 days ago

Steps to reproduce

Take this example WARC, produced with wget --no-verbose -O /dev/null --max-redirect=0 --warc-file=example 'https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)'. It contains the line:

Location: https://wiki.archlinux.jp/index.php/Arch_User_Repository?rdfrom=https%3A%2F%2Fwiki.archlinux.org%2Findex.php%3Ftitle%3DAUR_Metadata_%28%25E6%2597%25A5%25E6%259C%25AC%25E8%25AA%259E%29%26redirect%3Dno#AUR_メタデータ

Actual behavior

However, trying to read that header produces null:

import {WARCParser} from 'warcio'

for await (const record of new WARCParser(process.stdin)) {
    if (record.warcType !== 'response') continue
    console.log(record.httpHeaders.headers.get('Location'))
}
// => null

Expected behavior

I am unclear on what the “correct” behavior here is. RFC 7230 leads me to believe that the server probably shouldn't have sent a non-ASCII value to begin with, but since it did, this header should presumably be interpreted as ISO-8859-1 mojibake. I can’t find anything in the Fetch spec that describes header values as anything other than a “byte sequence”, though in practice Chrome, Firefox, and Node all limit values to U+00FF and below. Unfortunately, since this behavior is in the wild, I have to handle it anyway.
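
For illustration, the limit is easy to observe with the Fetch Headers class directly (Node 18+ or a browser console; the header names here are made up):

const h = new Headers()
h.set('X-Ok', '\u00ff')       // fine: U+00FF is the top of the ByteString range
h.set('X-Bad', 'メタデータ')  // throws a TypeError in Chrome, Firefox, and Node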

However, I do know that the header silently disappearing is very confusing and definitely not what I want to happen; it makes this problem hard to debug. I would have much preferred a crash, which would have let me immediately identify the cause.

Workaround

Setting keepHeadersCase: true when creating the WARCParser makes it use a plain Map instead of a Headers object, and the Map doesn’t have the same limitation on values. Unfortunately, this loses case-insensitive lookup.
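
For reference, a minimal sketch of the workaround (assuming, as above, that the option is passed in WARCParser's options object and that it makes httpHeaders.headers a Map keyed by the original casing):

import {WARCParser} from 'warcio'

const parser = new WARCParser(process.stdin, {keepHeadersCase: true})
for await (const record of parser) {
    if (record.warcType !== 'response') continue
    // With a plain Map, lookups are case-sensitive, so the key must match
    // the casing that appeared on the wire.
    console.log(record.httpHeaders.headers.get('Location'))
}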

Cause

The silent disappearance appears to be because of this code:

https://github.com/webrecorder/warcio.js/blob/fb1ff9c3d88a54fdc0acb2670ca25966b04fa95e/src/lib/statusandheaders.ts#L142-L150

...which is swallowing the error (“TypeError: Cannot convert argument to a ByteString because the character at index 204 has a value of 12513 which is greater than 255.”), along with anything else that might go wrong.

This behavior seems to have been there since 40610030ca54b7a0f0713aad52fc18ef60578721:

https://github.com/webrecorder/warcio.js/blob/40610030ca54b7a0f0713aad52fc18ef60578721/src/statusandheaders.js#L80-L85

...and since that was the initial commit, there's no obvious explanation for why it exists.

Suggested fix

This is enough for my use case:

-try { 
+// If this crashes with an error like "String contains non ISO-8859-1 code point"
+// or "character has a value which is greater than 255", the response contains a
+// header which cannot be represented by the Headers class; enable `keepHeadersCase`
+// to use a Map (which does not have this restriction) instead.
   if (canAppend && name.toLowerCase() === "set-cookie") { 
     headers.append(name, value); 
   } else { 
     headers.set(name, value); 
   } 
-} catch (_e) { 
-  // ignore 
-} 

However, this may be a problem for users who have code that depends on the current behavior, are fine with it, and don't want to rewrite their code to handle lookups in a Map. For that use case it might be necessary to thread through a flag (call it discardInvalidHeaders, say) that keeps the current behavior; a rough sketch follows. (Either way, I think the defaults should change, which would mean a bump to the semver major version.)
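
A rough sketch of what that flag could look like at the same spot (discardInvalidHeaders is a hypothetical name, not an existing warcio.js option):

try {
  if (canAppend && name.toLowerCase() === "set-cookie") {
    headers.append(name, value);
  } else {
    headers.set(name, value);
  }
} catch (e) {
  // Only swallow the error when the caller has explicitly opted in;
  // otherwise surface it so invalid headers are no longer silently dropped.
  if (!this.discardInvalidHeaders) {
    throw e;
  }
}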

ikreymer commented 3 days ago

Thanks for reporting this! Yes, I think the initial approach was to skip invalid headers, but as you mention, that's not a good approach. I think a better approach, and what we've done in the past, is to re-encode the header as a latin1 / ASCII-compatible encoding, which should always be possible.

It appears to be what the standard Node fetch implementation does as well: https://github.com/nodejs/undici/pull/1560

Here's a proposed approach, using a locally defined decodeLatin1 so that it can work in browsers as well:

import assert from 'node:assert'

// Decode a byte buffer with the 1:1 byte-to-code-point mapping, so every
// resulting code unit is <= U+00FF and the string is a valid ByteString.
function decodeLatin1(buf) {
  let str = "";
  for (let i = 0; i < buf.length; i++) {
    str += String.fromCharCode(buf[i]);
  }
  return str;
}

const d = decodeLatin1(new TextEncoder().encode("https://wiki.archlinux.jp/index.php/Arch_User_Repository?rdfrom=https%3A%2F%2Fwiki.archlinux.org%2Findex.php%3Ftitle%3DAUR_Metadata_%28%25E6%2597%25A5%25E6%259C%25AC%25E8%25AA%259E%29%26redirect%3Dno#AUR_メタデータ"));

// This matches the header from Node's fetch:
const a = await fetch("https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)", {redirect: "manual"});
assert(a.headers.get("location") === d);
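
For concreteness, here is one way the re-encoding could be hooked into the spot where headers are set (a sketch only; toByteString is an illustrative helper, not part of warcio.js):

// Re-encode only when needed: if every code unit already fits in a byte,
// the value is a valid ByteString and can be stored as-is.
function toByteString(value) {
  for (let i = 0; i < value.length; i++) {
    if (value.charCodeAt(i) > 0xff) {
      return decodeLatin1(new TextEncoder().encode(value));
    }
  }
  return value;
}

headers.set(name, toByteString(value));
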
wolfgang42 commented 2 days ago

Wow, thanks for the quick response!

Aha, not sure why re-encoding didn’t occur to me but I think it makes perfect sense.

One minor nit: “latin1” is unfortunately an ambiguous encoding name. Node uses it for the 1:1 byte-to-code-point mapping, but modern web browsers follow the WHATWG Encoding spec, which defines it as a synonym for windows-1252 instead (see the snippet below). Even when it refers to the ISO 8859-1 standard, it's sometimes interpreted as an encoding which is technically missing definitions for the C0/C1 control characters (the values for them are simply left undefined); the whole thing is a mess. I usually just call what we want here the “codepoint identity mapping”, since there doesn't seem to be an unambiguous official name for this encoding.
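
For instance, the two meanings disagree on the C1 range (a quick demonstration; TextDecoder follows the WHATWG label, while Buffer is Node-only):

// Byte 0x93 is U+0093 (a C1 control) under the 1:1 mapping, but the WHATWG
// "latin1" label is an alias for windows-1252, which maps it to U+201C (“).
const bytes = new Uint8Array([0x93])
console.log(new TextDecoder('latin1').decode(bytes)) // '\u201C' (windows-1252)
console.log(Buffer.from(bytes).toString('latin1'))   // '\u0093' (1:1 mapping)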

I can submit a PR for this at some point but it may take me a while to get around to it, so feel free to just make the change if you have the time before I do.