zeux / pugixml

Light-weight, simple and fast XML parser for C++ with XPath support
http://pugixml.org/
MIT License
4.01k stars 728 forks source link

xmldocument::save uses non-transformed encoding parameter #617

Closed aral-matrix closed 4 months ago

aral-matrix commented 4 months ago

In pugixml.cpp / xml_document::save, the impl::xml_buffered_writer constructor invoked for buffered_writer uses get_write_encoding on parameter encoding, which can transform xml_encoding::encoding_auto into a specific encoding (e.g. encoding_utf8), stored in buffered_writer.encoding.

Accordingly, buffered_writer.encoding should be evaluated, while currently only the unmodified function parameter encoding is checked.

Pull request #616 _xmldocument::save - use encoding as interpreted by get_writeencoding… fixes this.

The same pull request also adds explicit UTF-8 encoding to the target file, but because this can change how an empty document looks, this requires a change to the tests/test_document.cpp, which I don't know how to make because the tests are hard to read due to macros.

zeux commented 4 months ago

Accordingly, buffered_writer.encoding should be evaluated, while currently only the unmodified function parameter encoding is checked.

What user visible consequences does this have without the change to add utf8?

Why should the default behavior be to emit encoding=utf-8 for every document? That’s already what the XML specification says should be the default.

aral-matrix commented 4 months ago

What user visible consequences does this have without the change to add utf8?

As I said in the pull request (forgot to mention it explicitly here): it's a minor bugfix - in the current implementation, I don't see a problem - however, if get_write_encoding ever gets modified to interpret e.g. encoding_auto as something that is specifically checked in xml_document::save to emit an encoding property, the check would fail / the buffered_writer would write a different encoding than the document header would specify.

That's why it's a bug without consequences in the current implementation, but it's still a bug as far as I can tell.

Why should the default behavior be to emit encoding=utf-8 for every document? That’s already what the XML specification says should be the default.

Well I suppose should is not the right word - it's just a matter of preference in my case - when working with XLSX documents created in e.g. LibreOffice, the encoding=utf-8 gets stripped when saving a modified file with pugixml.

I just thought that this initial xml tag has basically zero impact on performance, so explicit encoding is better than implicit. It's your project - leave that change out if you don't like it :)

zeux commented 4 months ago

Ok - I don't know that it matters either way, but feel free to submit a PR that changes the encoding check to buffered_writer.encoding.

I do not think the default output should be changed. Stylistic preferences can be accommodated using a documented method in this case: https://pugixml.org/docs/manual.html#saving.declaration - and the default should be the minimal output that results in unambiguous decoding result, which is why there's a special case for latin1 but not for any other encoding - every other encoding can/should use BOM or default to UTF8 without any extras.

I'm going to close this either way, feel free to change the PR to only change the encoding if you need that change.

aral-matrix commented 4 months ago

Actually, I am not entirely sure how to patch this - I just pushed (to my fork) a change as you suggest, and this change also breaks the checks - which I do not understand, because all I changed is two checks on encoding to check buffered_writer.encoding instead, and with line no. 3630 in get_write_encoding: if (encoding != encoding_auto) return encoding; this should be exactly the desired behavior?

Before I send you a pull request, I was hoping you could help me understand why my forked version breaks these validations:

Test document_save_file_wide failed: doc contents does not match L"<?xml version=\"1.0\"?><node/>" at tests/test_document.cpp:776
Test document_save_file failed: doc contents does not match L"<?xml version=\"1.0\"?><node/>" at tests/test_document.cpp:762
FAILURE: 2 out of 964 tests failed.

The "Test document_save_file_wide" failed makes be wonder whether line 3621 is the one that triggers a differing behavior under test with my path. Is there maybe an actual bug here?