zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0

malformed xml output for non-ascii filenames #302

Open sumpfralle opened 1 year ago

sumpfralle commented 1 year ago

I stumbled upon an issue with some local filenames (e.g. ones containing German umlauts). At least Python's xml library refuses to load the data generated by duc if such special characters are involved.

The following procedure demonstrates the issue (comments are included):

# 1. assemble a filename containing a special character
[temp]:/tmp/cdt-Q4kGwX# filename="foo-$(printf '\xf6')"

# 2. create a file with that name
[temp]:/tmp/cdt-Q4kGwX# touch "$filename"

# 3. show content of the directory
[temp]:/tmp/cdt-Q4kGwX# find
.
./foo-?
[temp]:/tmp/cdt-Q4kGwX# ls -l
insgesamt 0
-rw-r--r-- 1 user user 0 30. Okt 05:49 'foo-'$'\366'

# 4. generate a duc database containing that file
[temp]:/tmp/cdt-Q4kGwX# duc index --database db.duc .

# 5. Python's XML parser refuses to load the XML data emitted by `duc` (pointing at the special character)
[temp]:/tmp/cdt-Q4kGwX# duc xml --database db.duc | python3 -c 'import sys; from xml.etree.ElementTree import XMLParser; XMLParser().feed(sys.stdin.buffer.read())'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 4, column 16

# 6. show the emitted xml data including the special character (badly rendered by my terminal)
[temp]:/tmp/cdt-Q4kGwX# duc xml --database db.duc
<?xml version="1.0" encoding="UTF-8"?>
<duc root="." size_apparent="266752" size_actual="274432" count="2">
 <ent name="db.duc" size_apparent="266752" size_actual="274432" />
 <ent name="foo-�" size_apparent="0" size_actual="0" />
</duc>

# 7. show the emitted xml data in detail
[temp]:/tmp/cdt-Q4kGwX# duc xml --database db.duc | hexdump -C
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  64 75 63 20 72 6f 6f 74  |F-8"?>.<duc root|
00000030  3d 22 2e 22 20 73 69 7a  65 5f 61 70 70 61 72 65  |="." size_appare|
00000040  6e 74 3d 22 32 36 36 37  35 32 22 20 73 69 7a 65  |nt="266752" size|
00000050  5f 61 63 74 75 61 6c 3d  22 32 37 34 34 33 32 22  |_actual="274432"|
00000060  20 63 6f 75 6e 74 3d 22  32 22 3e 0a 20 3c 65 6e  | count="2">. <en|
00000070  74 20 6e 61 6d 65 3d 22  64 62 2e 64 75 63 22 20  |t name="db.duc" |
00000080  73 69 7a 65 5f 61 70 70  61 72 65 6e 74 3d 22 32  |size_apparent="2|
00000090  36 36 37 35 32 22 20 73  69 7a 65 5f 61 63 74 75  |66752" size_actu|
000000a0  61 6c 3d 22 32 37 34 34  33 32 22 20 2f 3e 0a 20  |al="274432" />. |
000000b0  3c 65 6e 74 20 6e 61 6d  65 3d 22 66 6f 6f 2d f6  |<ent name="foo-.|
000000c0  22 20 73 69 7a 65 5f 61  70 70 61 72 65 6e 74 3d  |" size_apparent=|
000000d0  22 30 22 20 73 69 7a 65  5f 61 63 74 75 61 6c 3d  |"0" size_actual=|
000000e0  22 30 22 20 2f 3e 0a 3c  2f 64 75 63 3e 0a        |"0" />.</duc>.|
000000ee

I do not know the XML specification in detail, but I think non-trivial characters (everything outside of 7-bit ASCII?) need to be escaped.

Python would emit the following for the above special character:

[temp]:/tmp/cdt-Q4kGwX# python3 -c 'import xml.etree.ElementTree as ET; print(ET.tostring(ET.Element("ent", {"name": "foo-ö"})).decode())'
<ent name="foo-&#246;" />

What do you think?

Thank you for maintaining duc!

zevv commented 1 year ago

I must admit I never got into the XML specs proper; the parsers I tried up to now were all happy to eat whatever UTF-8 I fed them, and I (naively?) expected that putting in the XML header as we have it now would suffice:

<?xml version="1.0" encoding="UTF-8"?>

I'd be happy to make the necessary changes to make the output compliant, but it would help if someone who knows about these things could hold my hand and tell me how to do it :)

sumpfralle commented 1 year ago

the parsers I tried up to now were all happy to eat whatever UTF-8 I fed them

Yes, that's how it is supposed to work: UTF-8 should never be a problem.

But this is about malformed filenames that cannot be represented as UTF-8. In my case the problem seems to be caused by some directories that were created a long time ago (one of them is 17 years old). I guess filename encoding was not standardised to UTF-8 back then, but followed the currently active locale (here: ISO 8859-15).

What is the problem?

From my point of view the problem has two layers:

  1. Filesystems tolerate any binary ("sequence of bytes") representation of a filename, as long as it does not contain the path separator (/) or the end-of-string indicator (\0). Desktop environments usually take care to restrict input to valid UTF-8 strings, but we (duc) cannot rely on that.
  2. duc needs to represent non-UTF-8 byte sequences in its name and root attributes, but an XML document declared as UTF-8 may only contain valid UTF-8 character data. Thus duc needs to escape the problematic sequences. Ideally a duc user should be able to transform these escaped byte sequences (taken from the XML file) back into their original state, in order to map the names to the real files in the filesystem. The solution to this problem is the "surrogates" concept in Unicode (as far as I understand it): a reserved range of code points is used to represent undecodable bytes in a reversible fashion (see the sketch after this list).
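
To make the surrogates idea concrete, here is a minimal round-trip sketch (assuming Python 3; this is not duc code) using the example file from above:

# raw file name bytes as returned by the OS
raw = b"foo-\xf6"
# surrogateescape maps the invalid byte 0xf6 to the lone surrogate U+DCF6
text = raw.decode("utf-8", errors="surrogateescape")   # 'foo-\udcf6'
# encoding with the same error handler restores the original bytes exactly
assert text.encode("utf-8", errors="surrogateescape") == raw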

My "solution" (in Python)

In my case I am sanitizing the output of duc xml with the following bit of Python code (following up on the example snippets in my original post; here I only transform the name attributes, but the root attribute should be handled in the same way):

duc xml --database db.duc | python3 -c 'import re, sys; regex = re.compile(b"name=\"([^\"]+)\""); converter = lambda m: b"name=\"" + m.groups()[0].decode(errors="surrogateescape").encode(errors="backslashreplace") + b"\""; raw_xml = sys.stdin.buffer.read(); sys.stdout.buffer.write(regex.sub(converter, raw_xml))'

(I hope my example is not too hard to read for non-native Python speakers.)

The result (happily accepted by xmllint and Python's etree parser):

<?xml version="1.0" encoding="UTF-8"?>
<duc root="." size_apparent="266752" size_actual="274432" count="2">
 <ent name="db.duc" size_apparent="266752" size_actual="274432" />
 <ent name="foo-\udcf6" size_apparent="0" size_actual="0" />
</duc>

In short:

  1. the input path name (the raw bytes returned by the OS) is decoded into a (Unicode) string with the surrogateescape error handler
  2. the resulting string (containing Unicode "surrogates" that represent the malformed bytes) is encoded back into a byte representation suitable as XML attribute content with the backslashreplace error handler

surrogateescape and backslashreplace may be Python-specific; I cannot tell which kinds of transformations are available in the libraries you are using for duc.
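
For completeness, a hedged sketch of the reverse direction (again assuming Python 3 on the consuming side, and assuming that file names do not themselves contain literal backslashes, which the unicode_escape step below would misinterpret): a consumer of the sanitized XML can recover the original on-disk bytes from the escaped name attribute.

import xml.etree.ElementTree as ET

# one sanitized entry as produced above; the attribute value contains a literal "\udcf6"
entry = ET.fromstring('<ent name="foo-\\udcf6" size_apparent="0" size_actual="0" />')
name = entry.get("name")                                   # 'foo-\udcf6' with a literal backslash
escaped = name.encode("ascii", errors="backslashreplace")  # any real non-ASCII characters become \-escapes too
with_surrogates = escaped.decode("unicode_escape")         # turns "\udcf6" into the surrogate U+DCF6
original = with_surrogates.encode("utf-8", errors="surrogateescape")
print(original)                                            # b'foo-\xf6' -- the name as stored on disk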

Conclusion

First of all: I am sorry for bringing up such a tricky problem :)

I guess it is up to you as the maintainer to decide whether the edge case of "malformed" (non-UTF-8) input is worth handling, with regard to the users of duc. If all users just want the fancy graphs (that is what brought me to duc), then it may be irrelevant. But if some users build other things on top of duc (that's me here: I want to generate an improved visualization of the storage used in maildirs, to help users find and remove their dark data), then you may want to handle these edge cases, too.

Sorry that I cannot offer more detailed advice regarding XML or encoding details; my knowledge in these fields is limited.

Anyway: thank you for your time!

zevv commented 1 year ago

Right; that all makes perfect sense, I hadn't realized we were talking about malformed file names here. The problem is that duc is basically oblivious to encodings; it simply does not care what the data means, and file names are just sequences of bytes. If a name happens to be valid UTF-8, that's nice, but duc does not care. XML does care, however, so when exporting we should take proper care to emit valid UTF-8.

The question is how to do this nicely and properly in plain C. We would at least need to interpret all names as UTF-8, detect where a sequence of bytes is not valid UTF-8, and find another way to encode those bytes. I'm not comfortable pulling in a large dependency like iconv, but I guess we should be able to whip up something lightweight for this.
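
Not to preempt the actual implementation, but the loop described above is small enough to model first. Here is a rough Python sketch of the scan-and-escape idea (the \xHH escape is just one possible choice, nothing decided in this thread); the utf8_len check is the part that would become a short hand-written validator in C:

# walk the raw name byte by byte, copy valid UTF-8 sequences through
# unchanged and replace every other byte with a reversible escape

def utf8_len(b: bytes, i: int) -> int:
    """Length of the valid UTF-8 sequence starting at b[i], or 0 if invalid."""
    def cont(k):
        return i + k < len(b) and 0x80 <= b[i + k] <= 0xBF
    c = b[i]
    if c <= 0x7F:
        return 1
    if 0xC2 <= c <= 0xDF and cont(1):
        return 2
    if 0xE0 <= c <= 0xEF and cont(1) and cont(2):
        # reject overlong forms (E0 80..9F) and UTF-16 surrogates (ED A0..BF)
        if (c == 0xE0 and b[i + 1] < 0xA0) or (c == 0xED and b[i + 1] > 0x9F):
            return 0
        return 3
    if 0xF0 <= c <= 0xF4 and cont(1) and cont(2) and cont(3):
        # reject overlong forms (F0 80..8F) and code points above U+10FFFF (F4 90..)
        if (c == 0xF0 and b[i + 1] < 0x90) or (c == 0xF4 and b[i + 1] > 0x8F):
            return 0
        return 4
    return 0

def escape_name(raw: bytes) -> str:
    out, i = [], 0
    while i < len(raw):
        n = utf8_len(raw, i)
        if n:
            out.append(raw[i:i + n].decode("utf-8"))  # valid sequence: keep as-is
            i += n
        else:
            out.append("\\x%02x" % raw[i])            # invalid byte: escape it
            i += 1
    return "".join(out)

print(escape_name(b"foo-\xf6"))        # foo-\xf6  (now valid UTF-8 text)
print(escape_name("foo-ö".encode()))   # foo-ö     (valid names pass through untouched)

The usual XML escaping of &, < and " would of course still happen separately, as it presumably does today.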