Open sumpfralle opened 1 year ago
I must admit I never got into the XML specs proper; the parsers I tried up to now were all happy to eat whatever UTF-8 I feed them, and I (naively?) expected that putting in the XML header as we have it now should suffice:
<?xml version="1.0" encoding="UTF-8"?>
I'd be happy to make the necessary changes to make the output compliant, but it would help if someone who knows about these things can hold my hand and tell me how to do it though :)
> the parsers I tried up to now were all happy to eat whatever UTF-8 I feed them
Yes, that's how it is supposed to work: utf-8 should never be a problem.
But here it is about malformed filenames, which cannot be represented as UTF-8.
In my case it seems to be caused by some directories that were created a long time ago (one of them is 17 years old). I guess the encoding of filenames was not standardised to UTF-8 back then, but followed the currently active locale (here: iso8859-15).
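To illustrate why the legacy encoding matters, here is a minimal sketch. The helper names make_doc and parses are made up for illustration, and the sample element only mimics the shape of duc's output. The same name, foo-ö, is encoded once as UTF-8 and once as iso8859-15, and only the UTF-8 variant is accepted by Python's etree parser.

```python
import xml.etree.ElementTree as ET


def make_doc(name_bytes: bytes) -> bytes:
    # Hypothetical helper: wrap raw filename bytes in a tiny XML document
    # that resembles a single <ent> element from the export.
    header = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return header + b'<ent name="' + name_bytes + b'" />'


def parses(doc: bytes) -> bool:
    # True if the parser accepts the document, False if it rejects it.
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


modern_name = "foo-ö".encode("utf-8")       # b'foo-\xc3\xb6'
legacy_name = "foo-ö".encode("iso8859-15")  # b'foo-\xf6'

print(parses(make_doc(modern_name)))  # valid UTF-8 is accepted
print(parses(make_doc(legacy_name)))  # the lone byte 0xf6 is rejected
```

The byte 0xf6 can never appear in well-formed UTF-8, so the parser has no way to interpret the second document.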
From my point of view the problem has two layers:
1. Filenames may contain arbitrary byte sequences; the only bytes that cannot occur are the path separator (/) and the end-of-string indicator (\0). Desktop environments usually take care to restrict input to valid UTF-8 strings, but we (duc) cannot rely on that.
2. duc needs to represent non-UTF-8 sequences in its name and root attributes, but an XML document declared as UTF-8 may only contain valid UTF-8 character sequences. Thus duc needs to escape problematic sequences. Ideally it should be possible for a duc user to transform these weird byte sequences (taken from the XML file) back into their original state, in order to map these names to the real files in the filesystem. One solution for this problem is the "surrogates" concept in Unicode (as far as I understand it): a reserved range of Unicode code points is used for representing "unparseable" bytes in a reversible fashion.

In my case I am sanitizing the output of duc xml with the following bit of Python code (following up on the example snippets in my original post; here I am only transforming the name attributes, but the root attribute should be handled in the same way):
duc xml --database db.duc | python3 -c 'import re, sys; regex = re.compile(b"name=\"([^\"]+)\""); converter = lambda m: b"name=\"" + m.groups()[0].decode(errors="surrogateescape").encode(errors="backslashreplace") + b"\""; raw_xml = sys.stdin.buffer.read(); sys.stdout.buffer.write(regex.sub(converter, raw_xml))'
(I hope my example is not too hard to read for non-native Python speakers.)
The result (happily accepted by xmllint and Python's etree parser):
<?xml version="1.0" encoding="UTF-8"?>
<duc root="." size_apparent="266752" size_actual="274432" count="2">
<ent name="db.duc" size_apparent="266752" size_actual="274432" />
<ent name="foo-\udcf6" size_apparent="0" size_actual="0" />
</duc>
In short:
1. decode the raw filename bytes with the surrogateescape option
2. encode the result with the backslashreplace option

Here surrogateescape and backslashreplace may be Python-specific. I cannot tell which kinds of transformation are available in the libraries you are using for duc.
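For readers who find the one-liner dense, the two steps can be sketched as a pair of small functions. The names escape_name and unescape_name are hypothetical, and the reverse direction is an assumption about how a consumer of the XML might recover the original bytes; it is not something duc provides.

```python
import re


def escape_name(raw: bytes) -> bytes:
    # Step 1: bytes that are not valid UTF-8 become lone surrogates
    # (U+DC80..U+DCFF) instead of raising an error.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Step 2: surrogates cannot be encoded as UTF-8, so they are written
    # as ASCII escapes such as \udcf6; everything else stays UTF-8.
    return text.encode("utf-8", errors="backslashreplace")


# Matches the escapes produced above for the raw bytes 0x80..0xff.
_ESCAPE = re.compile(rb"\\udc([89a-f][0-9a-f])")


def unescape_name(escaped: bytes) -> bytes:
    # Reverse mapping: \udcXX turns back into the single raw byte 0xXX.
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), escaped)


original = b"foo-\xf6"
escaped = escape_name(original)
print(escaped)
print(unescape_name(escaped) == original)
```

Two caveats with this sketch: a filename that literally contains text like \udcf6 would be indistinguishable from an escaped byte, and XML-special characters such as &, < and " are not handled here.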
First of all: I am sorry for bringing up such a tricky problem :)
I guess it is up to you as the maintainer to decide whether the edge case of "malformed" (non-UTF-8) input is worth handling (with regard to the users of duc). If all the users just want the fancy graphs (which is what brought me to duc), then it may be irrelevant. But if some users try to base other things on duc (that's me here: I want to generate an improved visualization of the storage used in maildirs in order to help users find and remove their dark data), then you may want to handle these edge cases, too.
Sorry that I cannot offer more detailed advice regarding XML or encoding details. My knowledge is limited in these fields.
Anyway: thank you for your time!
Right; that all makes perfect sense, I hadn't realized we were talking about malformed file names here. The problem here is that duc is basically oblivious to encodings; it simply does not care about what the data means, the file names are just a sequence of bytes. If it happens to be valid UTF-8, that's nice, but duc does not care. XML does care however, so when exporting we should take proper care to emit valid UTF-8.
The problem is how to do this nice and proper in plain C. We would at least need to interpret all names as UTF-8 to see when a sequence of bytes results in invalid UTF-8, and find another way to encode these bytes. I'm not comfortable pulling in some large dependency like iconv, but I guess we should be able to whip up something lightweight for this.
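As a rough idea of how lightweight such a check can be, here is a sketch that validates UTF-8 using nothing but byte comparisons and escapes every byte that does not belong to a well-formed sequence. It is written in Python to match the other snippets in this thread, but it avoids the built-in codecs on purpose so that the logic could be ported to C more or less line by line. The function names and the \udcXX output format (borrowed from the Python example above) are assumptions, not an existing duc interface.

```python
def valid_sequence_length(data: bytes, i: int) -> int:
    # Length of the well-formed UTF-8 sequence starting at data[i],
    # or 0 if no well-formed sequence starts there.
    b0 = data[i]
    if b0 <= 0x7F:
        return 1
    # (length, allowed range of the second byte) depends on the lead byte.
    if 0xC2 <= b0 <= 0xDF:
        length, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xE0:
        length, lo, hi = 3, 0xA0, 0xBF  # excludes overlong encodings
    elif b0 == 0xED:
        length, lo, hi = 3, 0x80, 0x9F  # excludes UTF-16 surrogates
    elif 0xE1 <= b0 <= 0xEF:
        length, lo, hi = 3, 0x80, 0xBF
    elif b0 == 0xF0:
        length, lo, hi = 4, 0x90, 0xBF  # excludes overlong encodings
    elif 0xF1 <= b0 <= 0xF3:
        length, lo, hi = 4, 0x80, 0xBF
    elif b0 == 0xF4:
        length, lo, hi = 4, 0x80, 0x8F  # excludes values above U+10FFFF
    else:
        return 0  # stray continuation byte or a byte never used in UTF-8
    if i + length > len(data):
        return 0  # sequence is cut off at the end of the name
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(i + 2, i + length):
        if not 0x80 <= data[k] <= 0xBF:
            return 0
    return length


def escape_invalid_utf8(data: bytes) -> bytes:
    # Copy well-formed sequences unchanged; escape every other byte.
    out = bytearray()
    i = 0
    while i < len(data):
        n = valid_sequence_length(data, i)
        if n:
            out += data[i:i + n]
            i += n
        else:
            out += b"\\udc%02x" % data[i]
            i += 1
    return bytes(out)


print(escape_invalid_utf8(b"foo-\xf6"))
```

This only covers the UTF-8 side of the problem. XML 1.0 additionally forbids most control characters, and &, < and " still need the usual entity escaping inside attribute values.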
I stumbled upon an issue with some local filenames (e.g. containing German umlauts). At least Python's xml library refuses to load the data generated by duc if special characters in filenames are involved.
The following procedure demonstrates the issue (comments are included):
I do not know the XML specification in detail, but I think non-trivial characters (everything outside of 7-bit ASCII?) need to be escaped.
Python would emit the following for the above special character:
What do you think?
Thank you for maintaining duc!