Closed jeshow closed 11 months ago
There is a similar issue when parsing this data from text:
import yaml
text = '''
0007: {key: val}
0008: {key: val}
0009: {key: val}
0010: {key: val}
'''
data = yaml.safe_load(text)
print(data.keys())
This produces dict_keys([7, '0008', '0009', 8]).
You're running afoul of the YAML 1.1 base-8 integer representation. This is all "legit" through that lens, since PyYAML currently only supports YAML 1.1. It sucks, which is why octals got revamped in YAML 1.2 (and there are numerous ways to disable/bypass that behavior), but it's not a bug. Since it sounds like you're interoperating with another YAML implementation that's reading those as 0-padded base-10 ints (1.2 behavior), you'd probably do better to quote or !!str-tag them in your documents, and ensure that the PyYAML emitting side is emitting strings, not ints (which will be subject to the 1.1-octal-aware quoting behavior).
Closing as "not a bug, just an unfortunate reality until PyYAML grows proper 1.2 support".
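As a rough illustration of the quoting and tagging suggestion above, here is a minimal sketch that reuses the keys from the earlier example. It assumes PyYAML's safe_load and shows the three spellings (double quotes, single quotes, and the !!str tag) side by side:

```python
import yaml

# Same keys as before, but each one is quoted or explicitly tagged as a string.
text = '''
"0007": {key: val}
'0008': {key: val}
!!str 0009: {key: val}
!!str 0010: {key: val}
'''

data = yaml.safe_load(text)
keys = list(data.keys())
print(keys)  # ['0007', '0008', '0009', '0010']
```

With any of these spellings the implicit integer resolution is skipped, so all four keys stay strings.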
Thank you @nitzmahone, for the answer and suggestion! For anyone finding this in the future, I realized the real problem was much simpler.
My process involved emitting YAML from pyyaml, ingesting that with yaml-cpp, emitting a new file from yaml-cpp, and then ingesting that with pyyaml. In short, pyyaml -> yaml-cpp -> pyyaml.
The real problem occurred in the yaml-cpp -> pyyaml step, because my version of yaml-cpp treated all 000x elements as 0-prefixed strings. It would then write them as such with the default format, that is, without quotes.
When pyyaml went to process the fields, however, it would read them as octal integers where it could, converting elements like 0007 to 7 and 0010 to 8, but leaving 0008 and 0009 as strings (because 8 and 9 are not valid base-8 digits, so those values are never interpreted as octal).
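The split between 0007/0010 and 0008/0009 can be sketched with the standard library alone. The two patterns below are a simplified transcription of the octal and decimal branches of the YAML 1.1 integer rule, not PyYAML's actual resolver code, and resolve_yaml11 is a hypothetical helper name used only for illustration:

```python
import re

# Simplified YAML 1.1 integer branches (assumption: binary, hex, and
# sexagesimal forms are left out, since they do not matter here).
OCTAL = re.compile(r'^[-+]?0[0-7_]+$')
DECIMAL = re.compile(r'^[-+]?(?:0|[1-9][0-9_]*)$')

def resolve_yaml11(scalar):
    """Return the value a YAML 1.1 loader would plausibly give a plain scalar."""
    if OCTAL.match(scalar):
        return int(scalar.replace('_', ''), 8)   # leading 0 means base 8
    if DECIMAL.match(scalar):
        return int(scalar.replace('_', ''))
    return scalar                                # no match: stays a string

for key in ('0007', '0008', '0009', '0010'):
    print(key, '->', repr(resolve_yaml11(key)))
```

Because 0008 and 0009 contain a digit outside 0-7, neither pattern matches and the scalar falls through as a string, while 0010 is read as octal 10, which is 8.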
With @nitzmahone's suggestion above, I modified my yaml-cpp encoder to explicitly set the default string tag for the spurious elements:
static Node encode(const ExampleStructure& rhs)
{
    Node node;
    node["data"] = rhs.data;
    node["name"] = rhs.name;
    // Tag the name explicitly as a string so a 0-prefixed value is not re-read as an octal int.
    node["name"].SetTag("tag:yaml.org,2002:str");
    return node;
}
That tagged the appropriate data elements in the YAML emitted by yaml-cpp, and then pyyaml could properly interpret it again.
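For the reading side, a small sketch of how PyYAML treats a tagged value compared with an untagged one. The exact text yaml-cpp writes for the tag is not shown in this thread, so both the shorthand (!!str) and verbatim (!<...>) spellings below are assumptions about what the emitted file might contain:

```python
import yaml

# Hypothetical one-line documents standing in for the yaml-cpp output.
docs = {
    "untagged": "name: 0007\n",
    "shorthand tag": "name: !!str 0007\n",
    "verbatim tag": "name: !<tag:yaml.org,2002:str> 0007\n",
}

# Load each document and keep only the value of the "name" field.
results = {label: yaml.safe_load(doc)["name"] for label, doc in docs.items()}

for label, value in results.items():
    print(f"{label}: {value!r}")
```

Only the untagged form is resolved implicitly and comes back as the integer 7; either tag spelling yields the string '0007'.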
I'm using pyyaml version 6.0.1 and I've discovered that strings that end in '8' and '9' and are prefixed with at least one '0' are treated differently than strings ending with any other digit. This is problematic in that other interpreters (namely yaml-cpp) then fail to properly interpret the data as a string. After diving into the resolver.py code, it looks like this is because of this regex, which matches '0007' but does not match '0008' or '0009'. Here is a minimal example:
The resulting file looks like:
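A minimal sketch that reproduces the behavior described, assuming PyYAML's default safe_dump settings. The dictionary contents are made up for illustration and are not the original example:

```python
import yaml

# All keys are Python strings; only some of them look like YAML 1.1 octal ints.
data = {"0007": "a", "0008": "b", "0009": "c", "0010": "d"}

out = yaml.safe_dump(data)
print(out)

# Reloading with PyYAML itself round-trips, because the octal-looking
# keys were quoted on the way out.
reloaded = yaml.safe_load(out)
print(reloaded == data)
```

In the dumped text, 0007 and 0010 are wrapped in single quotes, since as plain scalars they would be read back as ints, while 0008 and 0009 are written without quotes. A YAML 1.2 style reader may then treat those two unquoted keys as decimal integers.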