yaml / pyyaml

Canonical source repository for PyYAML
MIT License

Result of safe_load_all is not writable with safe_dump_all #779

Closed rwcgi closed 5 months ago

rwcgi commented 5 months ago

I have a file testfile2.yml:

{"f1":"field1", "v1":"value1"}
{"f1":"field2", "v1":"value2"}

(This is the result of an Elastic REST API call. Note that the data is not wrapped in a list.) I want to munge this file (in practice, change machine names as I move data between environments) and write the result to a new file. Reading it is fine; writing it is the problem. Having read the docs, I expected to be able to use safe_dump_all.

#!/usr/bin/python

import yaml
import json

filename="testfile2.yml"
newfilename=filename+".new"

with open(filename, 'r') as file:

    try:
        # Reading it in works ok
        yamldata = yaml.safe_load_all(file)

        print("type: %s" % (type(yamldata)))  # Gives "type: <class 'generator'>"

        with open(newfilename, 'w') as newfile:
            print("Writing:")

            # https://pyyaml.org/wiki/PyYAMLDocumentation
            # yaml.dump(yamldata, newfile, default_flow_style=True)
            # Results in:
            # TypeError: cannot pickle 'generator' object

            # yaml.safe_dump_all(yamldata, newfile, default_flow_style=True)
            # Results in:
            # GAH!
            # expected '<document start>', but found '{'
            #   in "testfile2.yml", line 2, column 1
            # Tried various options including explicit_end, default_style, default_flow_style

            # newfile.write(yamldata)
            # Results in:
            #   TypeError: write() argument must be str, not generator

            # https://docs.python.org/3/library/json.html
            # json.dumps(yamldata, indent=4)
            # Results in:
            #   "TypeError: Object of type generator is not JSON serializable"
    except yaml.YAMLError as e:
        print("GAH!")
        print(e)
        exit(1)

print("Done!")

Right now I'd love for someone to tell me "oh, you've made a mistake..." because I'm up against the clock (just hit midnight in the UK so off to bed now) so do say if that's the case.

Thanks

perlpunk commented 5 months ago

First of all, the example YAML is not valid. If you want two documents, you need a separator, ---, between them:

{"f1":"field1", "v1":"value1"}
---
{"f1":"field2", "v1":"value2"}

(Read more about documents)

Second, be aware that safe_load_all returns a generator. That is why the load appears to succeed: it is lazy, so the error only shows up once you start processing the result. For experimenting, I suggest turning it into a list first: l = list(yamldata)

But this code should work if the file is actually valid:

>>> import yaml
>>> s="""
... {"f1":"field1", "v1":"value1"}
... ---
... {"f1":"field2", "v1":"value2"}
... """
>>> yamldata = yaml.safe_load_all(s)
>>> d=yaml.safe_dump_all(yamldata)
>>> print(d)
f1: field1
v1: value1
---
f1: field2
v1: value2

rwcgi commented 5 months ago

Thanks very much for the quick reply and guidance.

My difficulty is that I have many thousands of these files, exported by an Elastic REST API (too many to edit by hand). I guess that API should produce a valid format (i.e. include "---" separators), but currently it doesn't. What also seems odd to me is that safe_load_all does not fail for the same reason, i.e. the missing "---" separators. Aren't the load and dump methods inconsistent in their behaviour?

Maybe I'll have to resort to just processing as a text file unless there are some clever/ugly hacks? It's a one-off migration so done is better than perfect in this case - an ugly solution won't offend me! :-)

Many thanks.

perlpunk commented 5 months ago

safe_load_all is lazy, as I said. It returns a generator object. That object is basically "empty", meaning PyYAML hasn't even started to parse the file yet. Only when you consume the generator does it start parsing the file, document by document. You can do that by iterating over the generator, or by requesting a list, in which case the generator is iterated until it is exhausted and the results are returned as a list. But you're not doing that. Instead, you pass the generator straight to safe_dump_all. This method accepts a list of data structures, but it also accepts a generator object, which it then iterates over. That's why it looks as if safe_dump_all is producing a parsing error.
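To make the lazy behaviour concrete, here is a minimal sketch, assuming the file content from the original report is held in a string (the names text, docs, parsed, and error are only for illustration). The safe_load_all call itself returns without complaint; the parser error only surfaces once something iterates over the generator, which is what safe_dump_all ends up doing.

```python
import types

import yaml

# Two flow mappings with no '---' between them, as in testfile2.yml.
text = '{"f1":"field1", "v1":"value1"}\n{"f1":"field2", "v1":"value2"}\n'

# This returns immediately: nothing has been parsed yet.
docs = yaml.safe_load_all(text)
is_lazy = isinstance(docs, types.GeneratorType)

# Consuming the generator is what actually drives the parser,
# so this is where the invalid stream is detected.
try:
    parsed = list(docs)
    error = None
except yaml.YAMLError as exc:
    parsed = None
    error = exc

print("lazy generator:", is_lazy)
print("error on iteration:", type(error).__name__)
```

Wrapping the call in list(...) right after loading moves the error to the line where the data is read, which makes the cause much easier to spot.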

Second, what the REST API returns is not YAML, and not plain JSON either. I believe it is the format called JSON Lines: every JSON object is serialized on its own line, which makes it possible to return multiple JSON objects. YAML has always supported multiple documents via the --- marker. YAML is a superset of JSON, but not of JSON Lines.

In this case it should be very easy to just insert a --- after every line. But since the returned data seems to be JSON (Lines format), why not read the file yourself and load every line as JSON? You can then still serialize the result as YAML.
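If you go the JSON route, something along these lines might do for a one-off migration. It is only a sketch: jsonl_to_yaml is a made-up helper name, the sample data is taken from the report above, and it assumes every non-empty line is a complete JSON object.

```python
import io
import json

import yaml


def jsonl_to_yaml(lines):
    """Parse one JSON object per non-empty line and return multi-document YAML.

    Hypothetical helper for illustration; `lines` is any iterable of strings,
    for example an open text file.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    # Any munging (such as renaming machines) would go here, on plain dicts.
    return yaml.safe_dump_all(records)


# Stand-in for open("testfile2.yml"), so the example is self-contained.
sample = io.StringIO(
    '{"f1":"field1", "v1":"value1"}\n'
    '{"f1":"field2", "v1":"value2"}\n'
)

out = jsonl_to_yaml(sample)
print(out)
```

Because safe_dump_all writes the --- marker between documents itself, the result can be read back with safe_load_all. If the output should stay in the original one-object-per-line form instead, json.dumps on each record would be the more direct choice.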

rwcgi commented 5 months ago

Thank you @perlpunk