Open wolfgang42 opened 3 days ago
Thanks for reporting this! Yes, I think the initial approach was to skip invalid header, but that's not a good approach, as you mention. I think a better approach, and what we've done in the past, is to re-encode the header as latin1 / ascii compatible encoding, which should always be possible.
It appears to be what the standard Node fetch implementation does as well: https://github.com/nodejs/undici/pull/1560
Here's a proposed approach, using a locally defined decodeLatin1 so that it can work in browsers as well:
function decodeLatin1(buf) {
let str = "";
for (let i = 0; i < buf.length; i++) {
str += String.fromCharCode(buf[i]);
}
return str;
}
d = decodeLatin1(new TextEncoder().encode("https://wiki.archlinux.jp/index.php/Arch_User_Repository?rdfrom=https%3A%2F%2Fwiki.archlinux.org%2Findex.php%3Ftitle%3DAUR_Metadata_%28%25E6%2597%25A5%25E6%259C%25AC%25E8%25AA%259E%29%26redirect%3Dno#AUR_メタデータ"));
// this matches the headers from node fetch:
a = await fetch("https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)", {redirect: "manual"})
assert(a.headers.get("location") === d)
Wow, thanks for the quick response!
Aha, not sure why re-encoding didn’t occur to me but I think it makes perfect sense.
One minor nit: “latin1” is unfortunately an ambiguous encoding name; Node uses it for the 1:1 byte:unicode mapping, but modern web browsers follow the WhatWG spec which defines it as a synonym for Windows-1252 instead. Even when it refers to the ISO 8859-1 standard it’s sometimes interpreted as an encoding which is technically missing definitions for the C0/C1 control characters (the values for them are simply left undefined); the whole thing is a mess. I usually just call what we want here the “codepoint identity mapping” since there doesn’t seem to be an unambiguous official name for this encoding.
I can submit a PR for this at some point but it may take me a while to get around to it, so feel free to just make the change if you have the time before I do.
Steps to reproduce
Take this example WARC, produced with
wget --no-verbose -O /dev/null --max-redirect=0 --warc-file=example 'https://wiki.archlinux.org/title/AUR_Metadata_(%E6%97%A5%E6%9C%AC%E8%AA%9E)'
. It contains the line:Actual behavior
However, trying to read that header produces
null
:Expected behavior
I am unclear on what the “correct” behavior here is (RFC 7230 leads me to believe that the server probably shouldn't have sent a non-ASCII value to begin with, but since it did this header should have been interpreted as ISO-8859-1 mojibake; I can’t find anything in the Fetch spec that describes these values as anything other than a “byte sequence” though in practice Chrome, Firefox, and Node all limit values to U+00FF and below). Unfortunately since this behavior is in the wild I have to handle it anyway.
However, I do know that the header silently disappearing is very confusing and definitely not what I want to happen; it makes this problem hard to debug. I would have much preferred a crash, which would have let me immediately identify the cause.
Workaround
Setting
keepHeadersCase: true
when creating theWARCParser
uses a plainMap
instead of aHeaders
, which doesn’t have the same limitation on values. Unfortunately this loses the case-insensitive aspect.Cause
The silent disappearance appears to be because of this code:
https://github.com/webrecorder/warcio.js/blob/fb1ff9c3d88a54fdc0acb2670ca25966b04fa95e/src/lib/statusandheaders.ts#L142-L150
...which is swallowing the error (“TypeError: Cannot convert argument to a ByteString because the character at index 204 has a value of 12513 which is greater than 255.”), along with anything else that might go wrong.
This behavior seems to have been there since 40610030ca54b7a0f0713aad52fc18ef60578721:
https://github.com/webrecorder/warcio.js/blob/40610030ca54b7a0f0713aad52fc18ef60578721/src/statusandheaders.js#L80-L85
...and since that was the initial commit, there's no obvious explanation for why it exists.
Suggested fix
This is enough for my use case:
However, this may be a problem for users who have code that depends on the current behavior, are fine with that, and don't want to have to rewrite their code to handle lookups in a Map. For that use case it might be necessary to thread through a flag (call it
discardInvalidHeaders
, say) to keep the current behavior. (Either way I think the defaults should change, which would be a bump to the semver major version.)