Avoid double UTF-8 encoding in zmb

mattias-p commented 8 months ago

Purpose

This PR fixes a problem in how zmb handles UTF-8 data in its arguments.

Context

Fixes #1138.

Changes

Arguments are received as octet string. Before this change each argument octet would be interpreted as a native character (e.g. per ISO 8859-1) and encoded into UTF-8 before being injected into the JSON-RPC request. If the arguments were already UTF-8 encoded this would result in double UTF-8 encoding. After this change the argument octets are injected in verbatim into the JSON-RPC request.

How to test this PR

You should be able to specify domain names with U-label to zmb like this:

$ zmb start_domain_test --domain råttgift

mattias-p commented 8 months ago

I agree that I'm assuming that the terminal is using UTF-8, or rather that the user by whatever means manages to place UTF-8 encoded bytes into @ARGV. An improvement could be to validate that the bytes actually conform to UTF-8.

I also agree that its difficult to know what encoding is actually used for the bytes in @ARGV, and if we knew that it would be nice to perform the proper conversion.

Additionally I agree that it works for the wrong reasons. I'm tricking JSON::PP::encode into believing that it's dealing with Unicode strings whose encoding it shouldn't touch, when in fact it is dealing with UTF-8 encoded strings whose encoding it shouldn't touch. Luckily all the JSON syntax it generates happens to be in ASCII which has identical encoding in UTF-8.

~I guess the law abiding way would be to decode the UTF-8 byte strings taken from @ARGV before and let JSON::PP perform the UTF-8 encoding. Normally the law protects you from harm. But in the odd case when there's no harm in breaking the law, and when following the law entails having more code to write and maintain, more opportunity for bugs and less efficiency, I'm not sure I care so much for following the law.~

~Maybe I should just add some code comments to document the hack? And I just realized that I haven't thought about the UTF8 flag at all. If that gets set differently depending on the utf8() setting, that needs to be tweaked as well.~

Upon reflection I realized that if we validate the encoding of @ARGV we might as well keep the decoded strings and pass them to the JSON encoder and have it re-encode them. That's slightly less efficient but not terribly so, and we respect the JSON encoder contract. The increase in code to write and maintain is negligible.

mattias-p commented 6 months ago

I've updated this PR in response to @marc-vanderwal's comment as described in my last comment. Please review again.

MichaelTimbert commented 3 months ago

This PR works as expected on Ubuntu 20.04.6 and 22.04.4. Work fine with U-labels & A-Labels.

zonemaster / zonemaster-backend