yaml / pyyaml

Canonical source repository for PyYAML
MIT License

'0008' and '0009' strings are not single-quoted when dumping. #740

Closed jeshow closed 11 months ago

jeshow commented 11 months ago

I'm using pyyaml version 6.0.1 and I've discovered that digit-only strings that start with at least one '0' and end in '8' or '9' are treated differently from those ending in any other digit: they are dumped without quotes. This is problematic in that other implementations (namely yaml-cpp) then fail to properly interpret the data as a string.

After diving into the resolver.py code, it looks like this is because of this regex, which matches '0007' but does not match '0008' or '0009'.

Resolver.add_implicit_resolver(
        'tag:yaml.org,2002:int',
        re.compile(r'''^(?:[-+]?0b[0-1_]+
                    |[-+]?0[0-7_]+
                    |[-+]?(?:0|[1-9][0-9_]*)
                    |[-+]?0x[0-9a-fA-F_]+
                    |[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$''', re.X),
        list('-+0123456789'))
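As a quick check of that reading, here is a minimal sketch that applies the same pattern with only the standard `re` module. The helper name `looks_like_yaml11_int` is made up for illustration and is not part of PyYAML.

```python
import re

# The YAML 1.1 int pattern quoted above from resolver.py.
YAML11_INT = re.compile(r'''^(?:[-+]?0b[0-1_]+
            |[-+]?0[0-7_]+
            |[-+]?(?:0|[1-9][0-9_]*)
            |[-+]?0x[0-9a-fA-F_]+
            |[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$''', re.X)


def looks_like_yaml11_int(text):
    """Hypothetical helper: True if the plain scalar matches the int pattern."""
    return YAML11_INT.match(text) is not None


for key in ('0007', '0008', '0009', '0010'):
    # '0007' and '0010' match the octal branch; '0008' and '0009' match nothing.
    print(key, looks_like_yaml11_int(key))
```

Only strings that match the pattern need quoting to stay strings, which would explain why '0008' and '0009' are emitted plain.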

Here is a minimal example:

import yaml
fields = {'0007': {'key': 'val'}, '0008': {'key': 'val'}, '0009': {'key': 'val'}, '0010': {'key': 'val'}}
with open('tmp.yaml', 'w') as stream:
  yaml.safe_dump(fields, stream)

The resulting file looks like:

'0007':
  key: val
0008:
  key: val
0009:
  key: val
'0010':
  key: val

jeshow commented 11 months ago

There is a similar issue when parsing this data from text:

import yaml
text = '''
0007: {key: val}
0008: {key: val}
0009: {key: val}
0010: {key: val}
'''
data = yaml.safe_load(text)
print(data.keys())

This produces dict_keys([7, '0008', '0009', 8]).

nitzmahone commented 11 months ago

You're running afoul of the YAML 1.1 base-8 integer representation. This is all "legit" through that lens, since PyYAML currently only supports YAML 1.1. It sucks, which is why octals got revamped in YAML 1.2 (and there are numerous ways to disable/bypass that behavior), but it's not a bug. Since it sounds like you're interopping with another YAML implementation that's reading those as 0-padded base-10 ints (1.2 behavior), you'd probably do better to quote or !!str-tag them in your documents, and ensure that the PyYAML emitting side is emitting strings, not ints (strings are subject to the 1.1-octal-aware quoting behavior).

Closing as "not a bug, just an unfortunate reality until PyYAML grows proper 1.2 support".
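For readers who want to see the quoting and tagging suggestion in action, here is a small sketch of how PyYAML loads the same value written plain, quoted, and `!!str`-tagged. The key names `plain`, `quoted`, and `tagged` are invented for the example. `yaml.BaseLoader` is shown as one possible way to bypass implicit typing, with the caveat that it turns every scalar into a string, not just the zero-padded ones.

```python
import yaml

text = """
plain: 0010
quoted: '0010'
tagged: !!str 0010
"""

# safe_load applies the YAML 1.1 resolvers: the plain scalar becomes octal 8,
# while the quoted and explicitly tagged scalars stay strings.
data = yaml.safe_load(text)
print(data)

# BaseLoader performs no implicit type resolution, so every scalar is a string.
raw = yaml.load(text, Loader=yaml.BaseLoader)
print(raw)
```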

jeshow commented 11 months ago

Thank you @nitzmahone, for the answer and suggestion! For anyone finding this in the future, I realized the real problem was much simpler.

My process involved emitting YAML from pyyaml, ingesting that with yaml-cpp, emitting a new file from yaml-cpp, and then ingesting that with pyyaml. In short, pyyaml -> yaml-cpp -> pyyaml.

The real problem occurred in the yaml-cpp -> pyyaml step, because my version of yaml-cpp treated all 000x elements as 0-prefixed strings. It would then write them as such with the default format -- without quotes.

When pyyaml went to process the fields, however, it would read the unquoted values that look like octal integers as integers, converting 0007 to 7 and 0010 to 8, but leaving 0008 and 0009 as strings (because 8 and 9 are not valid octal digits, so those values don't match the integer pattern).

With @nitzmahone's suggestion above, I modified my yaml-cpp encoder to explicitly set the default string tag for the affected elements:

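static Node encode(const ExampleStructure& rhs)
{
  Node node;
  node["data"] = rhs.data;
  node["name"] = rhs.name;
  // Tag the name explicitly as a string so that pyyaml does not
  // resolve values such as 0007 as octal integers.
  node["name"].SetTag("tag:yaml.org,2002:str");
  return node;
}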

That tagged the appropriate data elements in the YAML emitted by yaml-cpp, and then pyyaml could properly interpret it again.
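On the PyYAML side, a similar effect can be had by forcing quotes on every string when dumping. This is a sketch under assumptions, not something from the thread: `QuotedStrDumper` and `represent_quoted_str` are made-up names, and quoting all strings is a blunt choice that may be more than a given pipeline needs.

```python
import yaml


class QuotedStrDumper(yaml.SafeDumper):
    """Hypothetical SafeDumper subclass that single-quotes every string."""


def represent_quoted_str(dumper, value):
    # Emit the string with an explicit single-quoted style so that other
    # parsers cannot resolve it as a number.
    return dumper.represent_scalar('tag:yaml.org,2002:str', value, style="'")


# add_representer copies the representer table, so SafeDumper is unaffected.
QuotedStrDumper.add_representer(str, represent_quoted_str)

fields = {'0007': 'a', '0008': 'b'}
dumped = yaml.dump(fields, Dumper=QuotedStrDumper)
print(dumped)
```

With this dumper, '0008' and '0009' are quoted the same way as '0007', so a reader that follows either YAML 1.1 or 1.2 rules should see strings.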