Closed aral-matrix closed 4 months ago
Accordingly, buffered_writer.encoding should be evaluated, while currently only the unmodified function parameter encoding is checked.
What user visible consequences does this have without the change to add utf8?
Why should the default behavior be to emit encoding=utf-8 for every document? That’s already what the XML specification says should be the default.
What user visible consequences does this have without the change to add utf8?
As I said in the pull request (forgot to mention it explicitly here): it's a minor bugfix - in the current implementation, I don't see a problem - however, if get_write_encoding
ever gets modified to interpret e.g. encoding_auto
as something that is specifically checked in xml_document::save
to emit an encoding property, the check would fail / the buffered_writer would write a different encoding than the document header would specify.
That's why it's a bug without consequences in the current implementation, but it's still a bug as far as I can tell.
Why should the default behavior be to emit encoding=utf-8 for every document? That’s already what the XML specification says should be the default.
Well I suppose should is not the right word - it's just a matter of preference in my case - when working with XLSX documents created in e.g. LibreOffice, the encoding=utf-8 gets stripped when saving a modified file with pugixml.
I just thought that this initial xml tag has basically zero impact on performance, so explicit encoding is better than implicit. It's your project - leave that change out if you don't like it :)
Ok - I don't know that it matters either way, but feel free to submit a PR that changes the encoding check to buffered_writer.encoding.
I do not think the default output should be changed. Stylistic preferences can be accommodated using a documented method in this case: https://pugixml.org/docs/manual.html#saving.declaration - and the default should be the minimal output that results in unambiguous decoding result, which is why there's a special case for latin1 but not for any other encoding - every other encoding can/should use BOM or default to UTF8 without any extras.
I'm going to close this either way, feel free to change the PR to only change the encoding if you need that change.
Actually, I am not entirely sure how to patch this -
I just pushed (to my fork) a change as you suggest, and this change also breaks the checks - which I do not understand, because all I changed is two checks on encoding
to check buffered_writer.encoding
instead, and with line no. 3630 in get_write_encoding:
if (encoding != encoding_auto) return encoding;
this should be exactly the desired behavior?
Before I send you a pull request, I was hoping you could help me understand why my forked version breaks these validations:
Test document_save_file_wide failed: doc contents does not match L"<?xml version=\"1.0\"?><node/>" at tests/test_document.cpp:776
Test document_save_file failed: doc contents does not match L"<?xml version=\"1.0\"?><node/>" at tests/test_document.cpp:762
FAILURE: 2 out of 964 tests failed.
The "Test document_save_file_wide" failed makes be wonder whether line 3621 is the one that triggers a differing behavior under test with my path. Is there maybe an actual bug here?
In
pugixml.cpp
/xml_document::save
, theimpl::xml_buffered_writer
constructor invoked forbuffered_writer
usesget_write_encoding
on parameterencoding
, which can transformxml_encoding::encoding_auto
into a specific encoding (e.g.encoding_utf8
), stored in buffered_writer.encoding.Accordingly,
buffered_writer.encoding
should be evaluated, while currently only the unmodified function parameterencoding
is checked.Pull request #616 _xmldocument::save - use encoding as interpreted by get_writeencoding… fixes this.
The same pull request also adds explicit UTF-8 encoding to the target file, but because this can change how an empty document looks, this requires a change to the
tests/test_document.cpp
, which I don't know how to make because the tests are hard to read due to macros.