zigdon / xkcd-Bucket

Bucket is the channel bot for #xkcd
http://wiki.xkcd.com/irc/Bucket

literal[*] does not work with Unicode characters #91

Open loudaslife opened 7 years ago

loudaslife commented 7 years ago

Using the literal[*] command with a trigger consisting of a Unicode character will create a .txt file link that returns a 404 error. Reproduced on both Firefox and Chromium.

<loudaslife> literal[*] ☃
<Bucket> loudaslife: Here's the full list (3): http://carabiner.peeron.com/xkcd/bucket/literal_%E2%98%83.txt

Firefox and Chromium both resolve %E2%98%83 to ☃ in the address bar automatically, but the 404 page contains a URL with completely different characters.

The requested URL /xkcd/bucket/literal_â˜ƒ.txt was not found on this server.
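For reference, here is a minimal Python sketch of the percent-encoding round trip involved in that link. It only illustrates how %E2%98%83 relates to the snowman; it is not Bucket's code (Bucket is written in Perl), and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

trigger = "☃"  # U+2603 SNOWMAN, the trigger used in the report

# Percent-encode the character's UTF-8 bytes, matching the link Bucket posted.
encoded = quote(trigger)
print(encoded)  # %E2%98%83

# Decoding those bytes as UTF-8 (what the browsers do) gives the snowman back.
decoded = unquote(encoded)
print(decoded == trigger)  # True
```

So the link itself looks like a correct UTF-8 encoding of the trigger, which matches what both browsers show in the address bar.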

This is also reproducible with non-snowman characters, like .

dgw commented 7 years ago

I suspect this is down to the configuration of the webserver that handles carabiner.peeron.com, rather than an issue with Bucket itself. If the link Bucket generates resolves to the correct character, then the problem is with the webserver's interpretation of it.

loudaslife commented 7 years ago

Some discussion in #xkcd suggests that Bucket's logs actually have the same encoding error. One of the ops copied and pasted a snippet of the log:

<barometz> loudaslife asked in loudaslife to dump out âH^

âH^ is the same incorrect parsing that the 404 page gives when you ask for .

EDIT: I just realized that âH^ is NOT the same incorrect parsing as before. It's completely different from what either of the characters I tried turned into. So either Bucket's log transcoding problem is different from the webserver's, or something was messed up on barometz's end.

loudaslife commented 7 years ago

http://string-functions.com/encodingerror.aspx is a nifty little tool, and it determines that the webserver problem is UTF-8 being read as Windows-1252. It does not come up with a possible encoding error for the âH^ bucket log string that was pasted in #xkcd.
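The "UTF-8 read as Windows-1252" diagnosis can be reproduced in a few lines of Python. This is just a sketch of that one mis-decoding, using the snowman as the test character; it says nothing about where in the chain the misreading happens.

```python
# UTF-8 bytes of the snowman: E2 98 83
snowman_utf8 = "☃".encode("utf-8")

# Reading those bytes as Windows-1252 yields three unrelated characters.
misread = snowman_utf8.decode("cp1252")
print(misread)  # â˜ƒ

# The damage is reversible as long as nothing else touched the text.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == "☃")  # True
```

The same round trip applied to âH^ does not lead back to a snowman, which fits the tool finding no match for that string.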

dgw commented 7 years ago

Text encoding is a real bane… Would be interesting to know if barometz got that log line from Bucket's log file on the server or from their own client. That's the sort of thing I see happen a lot to HexChat users who haven't configured their setup correctly (charset/fonts). That really is a nifty little tool, but I had the same lack of results you did with finding a path from either or ☃ to âH^.

I added a test factoid to my own instance and generated a literal dump. Same issue. Carabiner uses Apache2, and I'm using nginx, so I actually don't think it's the webserver. ls shows a pretty useless filename (literal_â??.txt). vi literal_☃.txt (the filename that results from pasting ☃ and tabbing) says I'm editing "literal_â<98><83>.txt".
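One reading that would be consistent with the ls output (a printable â followed by two unprintable characters) is that the filename got encoded to UTF-8 twice. That is only a guess, not something confirmed from Bucket's source. A Python sketch of what such double encoding would produce:

```python
import unicodedata

# The name one would expect on disk: plain UTF-8.
correct_name = "literal_☃.txt".encode("utf-8")

# Hypothetical failure: the UTF-8 bytes are treated as Latin-1 characters
# and then encoded to UTF-8 a second time.
double_encoded = correct_name.decode("latin-1").encode("utf-8")

print(correct_name)    # b'literal_\xe2\x98\x83.txt'
print(double_encoded)  # b'literal_\xc3\xa2\xc2\x98\xc2\x83.txt'

# Viewed as UTF-8, the three characters after "literal_" are a lowercase
# letter (â) followed by two control characters.
shown = double_encoded.decode("utf-8")
categories = [unicodedata.category(ch) for ch in shown[8:11]]
print(categories)  # ['Ll', 'Cc', 'Cc']
```

A UTF-8 terminal would print the â and typically replace the two control characters with ?, which looks like literal_â??.txt.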

At this point, it seems that I was wrong before about it not being a Bucket issue. I did vi literal_☃.txt in the literal directory and ended up with a [New File]. Adding text and saving, then visiting the link that Bucket generated, showed the file just fine in my browser—no 404 error any more.
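To illustrate why a correctly named file fixes the link, here is a small Python sketch using a temporary directory. The "mangled" name assumes the dump was written under a double-encoded filename, which is a hypothesis rather than something verified in Bucket; byte paths are used so the result does not depend on the locale.

```python
import os
import tempfile
from urllib.parse import unquote_to_bytes

with tempfile.TemporaryDirectory() as tmp:
    directory = os.fsencode(tmp)

    # Name the webserver looks up after percent-decoding the link.
    requested = unquote_to_bytes("literal_%E2%98%83.txt")

    # Hypothetical name the dump was actually written under.
    mangled = requested.decode("latin-1").encode("utf-8")
    with open(os.path.join(directory, mangled), "wb") as handle:
        handle.write(b"dump written under the wrong name\n")

    # The lookup misses: this is the 404.
    found_before = os.path.exists(os.path.join(directory, requested))

    # Creating the file by hand under the right name makes the lookup succeed.
    with open(os.path.join(directory, requested), "wb") as handle:
        handle.write(b"file created manually\n")
    found_after = os.path.exists(os.path.join(directory, requested))

    # Both files now sit side by side under different byte names.
    listing = sorted(os.listdir(directory))

print(found_before, found_after, len(listing))
```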

More investigation into Bucket's order of operations is warranted, but for now that's what I've found.

dgw commented 7 years ago

Furthermore, I just played around in Perl's debugger mode performing a trivial open-and-append on a file named literal_☃.txt, with no issues. It even shows up correctly in ls output, unlike the one generated by Bucket.

Curiouser and curiouser… It's probably worth pointing out that Perl's documentation says, "There are still several places where Unicode isn't fully supported, such as in filenames." (perlunicode docs) … But if it works in the debugger, shouldn't it work in Bucket?