yaml / pyyaml

Canonical source repository for PyYAML
MIT License

Multiline strings are ugly after dumping #240

Open neumond opened 5 years ago

neumond commented 5 years ago
>>> lines = """
... a
... b
... c
... """
>>> lines
'\na\nb\nc\n'
>>> print(yaml.dump({'a': lines}, default_flow_style=False))
a: '

  a

  b

  c

  '

>>> # although roundtrip of dump-load is correct
>>> yaml.load(yaml.dump({'a': lines}))
{'a': '\na\nb\nc\n'}

This could be output much more compactly:

a: |
  a
  b
  c
ingydotnet commented 5 years ago

Or shorter still:

a: "a\nb\nc\n"

Most dumpers have to guess how to dump a string. This is one of those cases where it's hard to guess.

I do agree the single quote style is ugly here.

If someone wants to create a PR for scalar style guessing (with lots of tests), we'd be happy to review and consider integrating it. Be aware that the same logic needs to be implemented in both pyyaml and libyaml.
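For reference, here is a minimal sketch of how one string renders under each scalar style the dumper can produce. It uses the `default_style` argument of `yaml.dump`, which forces a single style onto every scalar (keys included), so it is a way to compare outputs rather than a fix. The `data` value and the names in `styles` are only illustrative.

```python
import yaml

data = {"a": "a\nb\nc\n"}

# None lets the emitter guess; the other values force a style on every scalar.
styles = {"guessed": None, "single": "'", "double": '"', "literal": "|"}

rendered = {
    name: yaml.dump(data, default_style=style, default_flow_style=False)
    for name, style in styles.items()
}

for name, text in rendered.items():
    print(f"--- {name}")
    print(text)
    # Whatever the style, the document loads back to the same Python value.
    assert yaml.safe_load(text) == data
```

All four renderings carry the same data, which is why the choice is a presentation question and has to be guessed or configured.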

ysaakpr commented 5 years ago
def str_presenter(dumper, data):
    try:
        dlen = len(data.splitlines())
        if (dlen > 1):
            return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    except TypeError as ex:
        return dumper.represent_scalar('tag:yaml.org,2002:str', data)
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

I tried registering this with add_representer, but the results are not always consistent. Some strings come out properly block-quoted, but others, although they are multiline, still come out in the a: "a\nb\nc\n" style.

If anyone can give me a hint about where the change needs to be made, I can put some time into a PR.

ysaakpr commented 5 years ago

My issue was related to https://github.com/yaml/pyyaml/issues/121. It's painful to spend so much time finding out the reason.

schollii commented 4 years ago

Nice. If there is a fundamental issue, we could check how another language that has YAML load/dump (like Go) supports it. Or ruamel.yaml.

melezhik commented 2 years ago

Actually, this issue is the only reason I can't use pyyaml on my project. Changing from the | multiline format to the line1 \n line2 format is not acceptable for me, as it breaks the readable representation for people using the GitOps manifests in our repository, where helm values are passed as YAML multiline strings. I have to use ruamel.yaml instead for now, which has other drawbacks for me ... sigh

ingydotnet commented 2 years ago

OK I've added this to https://github.com/yaml/pyyaml/projects/9

So we'll look at that for the next release, though I can't say when that will happen.

schollii commented 2 years ago

Some suggestions:

Separate the concerns of representation and data; otherwise you will end up with a mess of code that is hard to maintain.

E.g. the | symbol is presentation: it is metadata about the layout of the data. Similarly, references are metadata. And I think most people would consider comments to be data in the context of YAML, but they can also be treated as metadata because they provide info about the surrounding data or context.

So at load time the metadata must be loaded and stored, and then used at output time.
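To illustrate the point about load-time metadata: as far as I can tell, PyYAML already records the scalar style on its intermediate node graph and only drops it when constructing Python objects. The sketch below compares the two paths using `yaml.compose` and `yaml.serialize`; the `text` sample is made up, and comments are not kept on either path.

```python
import yaml

text = "a: |\n  x\n  y\n"

# Loading to Python objects keeps only the value: a plain str with no style.
loaded = yaml.safe_load(text)
redumped = yaml.safe_dump(loaded)

# Composing stops at the node graph, where each scalar node still has a style.
root = yaml.compose(text)
key_node, value_node = root.value[0]
reserialized = yaml.serialize(root)

print(repr(loaded["a"]))  # the value only
print(value_node.style)   # the style the parser recorded for the scalar
print(redumped)           # style re-guessed by the emitter
print(reserialized)       # style taken from the node
```

Working at the node level is much less convenient than plain dicts and strings, so this is an observation about where the information lives, not a recommended workflow.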

And as mentioned, there may be third-party open source libraries out there, e.g. in Java, Go or JavaScript (or even in Python), that have already solved this problem. This is one of the purposes of open source: sharing knowledge. There is no reason to reinvent the wheel here; use them as inspiration.

melezhik commented 2 years ago

@schollii I am not sure you understand the issue. Let me repeat: pyyaml changes the original YAML markup on its own when doing a dump. It shouldn't do this, or at least this behavior should be configurable.

Our programs do not depend on the presentation layer; the people who read and edit the YAML files do.

ingydotnet commented 2 years ago

@melezhik right. Since we are probably adding a better config system in the next release, the rough plan is to add a config option for the preferred format of multiline strings. I also suspect it should be easy to configure this with a custom function that provides the preference.

jgunstone commented 2 years ago

found this on StackOverflow as a quick fix for my requirement:

parsed = {'fdir_root': '/mnt/c/engDev/git_mf/ipyrun/tests/examples/line_graph_batch',
 'fpth_config': '/mnt/c/engDev/git_mf/ipyrun/tests/examples/line_graph_batch/config-shell_handler.json',
 'title': '# Plot Straight Lines\n### example RunApp',
 'configs': []}

import yaml
def str_presenter(dumper, data):
    """configures yaml for dumping multiline strings
    Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data"""
    if len(data.splitlines()) > 1:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)
yaml.representer.SafeRepresenter.add_representer(str, str_presenter) # to use with safe_dump

s = yaml.dump(parsed, indent=2)  # , sort_keys=True)
print(s)

>>> configs: []
>>> fdir_root: /mnt/c/engDev/git_mf/ipyrun/tests/examples/line_graph_batch
>>> fpth_config: /mnt/c/engDev/git_mf/ipyrun/tests/examples/line_graph_batch/config-shell_handler.json
>>> title: |-
>>>   # Plot Straight Lines
>>>   ### example RunApp
ingydotnet commented 2 years ago

Also https://github.com/yaml/pyyaml/issues/121#issuecomment-1018117110
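The StackOverflow snippet above registers the representer globally, so every `yaml.dump` and `yaml.safe_dump` call in the process is affected. A possible alternative, sketched here, is to attach the representer to a `SafeDumper` subclass and pass it explicitly. `BlockStyleDumper` is a hypothetical name, and `parsed` is a shortened copy of the data above.

```python
import yaml


def str_presenter(dumper, data):
    """Use literal block style for strings that contain a newline."""
    if "\n" in data:
        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
    return dumper.represent_scalar("tag:yaml.org,2002:str", data)


class BlockStyleDumper(yaml.SafeDumper):
    """Hypothetical name: a SafeDumper subclass with its own representers."""


# add_representer copies the representer table onto this class before adding,
# so yaml.SafeDumper itself is left unchanged.
BlockStyleDumper.add_representer(str, str_presenter)

parsed = {"title": "# Plot Straight Lines\n### example RunApp", "configs": []}

scoped = yaml.dump(parsed, Dumper=BlockStyleDumper, indent=2)
default = yaml.safe_dump(parsed, indent=2)

print(scoped)   # title uses a literal block
print(default)  # unaffected: title is still quoted
```

The trade-off is that callers have to remember to pass `Dumper=BlockStyleDumper`; the global registration needs no changes at call sites.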

cjw296 commented 2 years ago

Slight tweak, better handles strings ending in a newline and might be a bit faster:

def str_presenter(dumper, data):
    """configures yaml for dumping multiline strings
    Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data"""
    if data.count('\n') > 0:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)
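A quick way to see where the two multiline checks differ, using plain Python and no YAML. The helper names are made up for this comparison.

```python
def is_multiline_splitlines(s):
    # Check used in the earlier snippets.
    return len(s.splitlines()) > 1


def is_multiline_count(s):
    # Check used in the tweak above.
    return s.count("\n") > 0


samples = ["one line", "one line\n", "two\nlines", "two\nlines\n"]

for s in samples:
    print(repr(s), is_multiline_splitlines(s), is_multiline_count(s))
```

The checks disagree only for a single line that ends in a newline: `splitlines()` yields one element there, so the earlier check leaves it to the default style, while `count('\n')` sends it to block style. Note also that `splitlines()` treats `\r` and a few other characters as line boundaries, which `count('\n')` does not.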
Sir-Fancy commented 6 months ago

Sorry to necro this, but wanted to save others the headache. This solution only works if you do not have trailing spaces on any of your lines. If there is a trailing space somewhere, you'll see the original behavior of the string getting all messed up and "\n"s everywhere. To prevent this from accidentally occurring, you can strip them with this modification to @cjw296's code:

def str_presenter(dumper, data):
    if data.count('\n') > 0:
        data = "\n".join([line.rstrip() for line in data.splitlines()])  # Remove any trailing spaces, then put it back together again
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

This behavior took longer than I'd care to admit to track down.
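For anyone who wants to reproduce the fallback: the `style='|'` argument is only a request, and the emitter appears to refuse block style when a space directly precedes a line break. A small sketch, using a hypothetical `DemoDumper` subclass so that nothing is registered globally:

```python
import yaml


def str_presenter(dumper, data):
    # Same check as @cjw296's version, without any stripping.
    if data.count("\n") > 0:
        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
    return dumper.represent_scalar("tag:yaml.org,2002:str", data)


class DemoDumper(yaml.SafeDumper):
    """Hypothetical subclass so the demo does not alter the global dumpers."""


DemoDumper.add_representer(str, str_presenter)

clean = yaml.dump({"a": "x\ny\n"}, Dumper=DemoDumper)
trailing = yaml.dump({"a": "x \ny\n"}, Dumper=DemoDumper)  # space before the newline

print(clean)     # block style is used
print(trailing)  # falls back to a quoted scalar with escaped newlines
```

Both outputs still load back to the original strings; only the presentation differs.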

manugarri commented 4 months ago

Is there a plan to add this to the library, maybe as a style? It's a very common use case, and it's a shame we have to resort to these hacks to get proper YAML.

nirs commented 2 months ago
        data = "\n".join([line.rstrip() for line in data.splitlines()])  # Remove any trailing spaces, then put it back together again
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')

This converts "|" blocks to "|-". To preserve the style we can use:

        # Remove any trailing spaces messing up the output.
        block = "\n".join([line.rstrip() for line in data.splitlines()])
        if data.endswith("\n"):
            block += "\n"
        return dumper.represent_scalar("tag:yaml.org,2002:str", block, style="|")
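Putting the fragments from this thread together, a complete version might look like the sketch below. It is assembled from the snippets above and is not an official PyYAML feature. Note that it is lossy by design: trailing whitespace is removed from every line, so the dumped value no longer round-trips to exactly the original string.

```python
import yaml


def str_presenter(dumper, data):
    """Dump multiline strings as literal blocks, dropping trailing spaces."""
    if "\n" in data:
        # Trailing whitespace on a line makes the emitter fall back to quoting.
        block = "\n".join(line.rstrip() for line in data.splitlines())
        # splitlines() drops the final newline; restore it so "|" stays "|".
        if data.endswith("\n"):
            block += "\n"
        return dumper.represent_scalar("tag:yaml.org,2002:str", block, style="|")
    return dumper.represent_scalar("tag:yaml.org,2002:str", data)


yaml.add_representer(str, str_presenter)             # used by yaml.dump
yaml.SafeDumper.add_representer(str, str_presenter)  # used by yaml.safe_dump

with_newline = yaml.safe_dump({"a": "x \ny\n"})
without_newline = yaml.safe_dump({"a": "x \ny"})

print(with_newline)     # literal block, final newline kept
print(without_newline)  # literal block with the strip indicator
```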
perlpunk commented 2 months ago

Here's a PR for discussion: