xoofx / SharpYaml

SharpYaml is a .NET library for YAML compatible with CoreCLR
Other
336 stars 61 forks source link

Force double quotes on integers and null in string #112

Closed jnystad closed 2 years ago

jnystad commented 2 years ago

Fixes #106

Also fixes quoting for integer values in strings.

I guess the double.TryParse was an optimization to avoid parsing long string values if unnecessary? If so, this might impact performance when serializing a lot of longer string values. Might add similar parsing attempts for bools and integers (null check is trivial), but not sure if C# parsers match the YAML spec exactly.

xoofx commented 2 years ago

Hm, I think that it was to avoid Schema.TryParse which is a lot more costly than the early exit double.TryParse

jnystad commented 2 years ago

Yes, that's what I thought. Any suggestion to fix? Perhaps parsing a substring would be sufficient.

nickcampau commented 2 years ago

I just ran into an issue that was caused by this issue yesterday. To my utter amazement this PR had just been opened a few hours before. I want to shoutout @jnystad 🥇 your timing couldn't have been more perfect. I was literally cloning down the repo to look into fixing this issue when I saw this PR.

I'd like to suggest that this PR not wait for any further performance enhancement modifications since it addresses the main issue of YAML non-compliance. Instead a new PR be created to improve the performance (if possible) to investigate those concerns.

xoofx commented 2 years ago

I'd like to suggest that this PR not wait for any further performance enhancement modifications since it addresses the main issue of YAML non-compliance. Instead a new PR be created to improve the performance (if possible) to investigate those concerns.

This behavior has been around for the past 5 years, so there is no urgency to land this PR.

Yes, that's what I thought. Any suggestion to fix? Perhaps parsing a substring would be sufficient.

Could you run a benchmark before/after using BenchmarkDotNet on a significant yml file to see the impact?

jnystad commented 2 years ago

Given the customizable nature of SharpYaml, and specifically since Schema in SerializerContext can be overridden I think any optimizations added here may come into conflict with custom Schema handling and therefore do not belong here.


Also, I'm not sure I have the overview of the library to be certain I'm creating useful performance tests.

A preliminary investigation using Stopwatch and loops indicate that the actual Serialize step has a negligible performance cost compared to initializing a new Serializer (i.e. creating a new Serializer and running Serialize takes ~500ms on a large object while the next thousand with the same Serializer takes ~5ms in total).

Also, I don't think the regexes used by the real parser are significantly slower than any other check one can do, at least not in a scale that matters. They probably return fast enough regardless of string length considering the limited matches they produce.

But as I said, I don't know the internal mechanics enough to determine if I'm actually testing this properly (as in, is there some caching where feeding the same deserialized object to the same Serializer somehow skips the actual serialization?).


Regarding the urgency, I wonder how existing users of SharpYaml and YamlDotNet (which has the same flaw with at least integer values) live with this. Perhaps they override the default behaviour (like I have done where I'm currently using SharpYaml until this fix is merged).

xoofx commented 2 years ago

A preliminary investigation using Stopwatch and loops indicate that the actual Serialize step has a negligible performance cost compared to initializing a new Serializer (i.e. creating a new Serializer and running Serialize takes ~500ms on a large object while the next thousand with the same Serializer takes ~5ms in total).

Afair, It is recommended to use a Serializer per thread and keep it around, as it is storing metadata information that you really don't want to recompute.

Also, I don't think the regexes used by the real parser are significantly slower than any other check one can do, at least not in a scale that matters. They probably return fast enough regardless of string length considering the limited matches they produce.

Right, this could be also further optimized with .NET 7+ regex optimizations in the future.


Anyway, thanks for the fix, I'm gonna merge it. The regression in perf might be negligeable and if someone notice it, we will have a candidate to fix it 😅

jnystad commented 2 years ago

Thanks!