HTML-like labels should be opt-in

bayersglassey-zesty commented 2 years ago

I'm starting this based on this Prefect issue: https://github.com/PrefectHQ/prefect/issues/5656

Long story short, in graphviz library's quoting.py, there is this regex:

HTML_STRING = re.compile(r'<.*>$', re.DOTALL)

...which incorrectly matches strings like '<lambda> <map>'.

To reproduce:

mkdir graphviz_test
cd graphviz_test
python3.9 -m venv env
. env/bin/activate
pip install -U pip
pip install graphviz
cat >test.py <<EOF
from graphviz import Digraph

if __name__ == '__main__':

    graph = Digraph()
    graph.node('a', '<lambda> <map>')
    graph.view()
EOF
python test.py

...this results in:

graphviz.backend.execute.CalledProcessError: Command '[PosixPath('dot'), '-Kdot', '-Tpdf', '-O', 'Digraph.gv']' returned non-zero exit status 1. [stderr: b"Error: Digraph.gv: syntax error in line 2 near ']'\n"]

...and the generated Digraph.gv looks like:

digraph {
    a [label=<lambda> <map>]
}

bayersglassey-zesty commented 2 years ago

There is a workaround: you can wrap your labels in graphviz.quoting.nohtml. I guess this issue is basically the reason for that function's existence, so I take it this will be marked "wontfix". But it would be really nice if things were the other way around: by default, we assume that strings are not HTML, and if you want to mark one as being HTML, then maybe you wrap it in graphviz.quoting.html.

xflr6 commented 2 years ago

Thanks for the detailed report.

This is indeed working as documented, see https://graphviz.readthedocs.io/en/stable/manual.html#quoting-and-html-like-labels

Hint: In case you are want to pass arbitrary user input (e.g. for labels), you might want to consider using graphviz.escape() instead of graphviz.nohtml(), which also disables backslash-escapes: https://graphviz.readthedocs.io/en/stable/manual.html#backslash-escapes (unless you want to be able use them).

Side note: use the official public API names (which are also simpler): https://graphviz.readthedocs.io/en/stable/api.html#quoting-escaping (all directly under graphviz).

And here is the hacky regex the graphviz library is using to detect such labels, which incorrectly matches e.g. '<lambda> <map>'... in ~/.venvs/3.9/prefect/lib/python3.9/site-packages/graphviz/quoting.py:

This interpretation of <...> as HTML-like label comes from upstream Graphviz, see https://graphviz.org/doc/info/lang.html:

import graphviz

graphviz.Source('''
    digraph {
        a [label=<lambda> <map>]
    }
''').render()
Error: <stdin>: syntax error in line 3 near ']'

IIUC you might be implying that graphviz.quoting.quote() should rather detect that lambda> <map is not valid XML by trying to parse the input input as XML and adding quotes in case that fails. IMHO that would be un-Pythonic ('In the face of ambiguity, refuse the temptation to guess.'). AFAIU, it would not resolve the ambiguity because we still need a way to render something that is angled-bracketed and inside looks like valid XML literally instead of rendering it as an HTML-like label (e.g. graphviz.nohtml('<>')).

I completely agree that this is something for users to stumble upon and it would probably better for HTML-like labels to be opt-in. The same reasoning might apply to backslash-escapes, should they be opt-in as well? #53 is another wart that probably requires a backwards-incompatible behavioural change or maybe duplicating some arguments or function names to provide for the different use cases (treat strings literally vs. let them be interpreted).

My plan/proposal is to address all these shortcomings that might involve subtle API/behavioural changes in a single version jump that might be allowed break backwards-compatibility, e.g. for the 1.0 or maybe even 2.0 version of this project. As you see, there is a lot to consider. Sorry for the inconvenience.

bayersglassey-zesty commented 2 years ago

Thank you for all the links to documentation! I learned some things about DOT today... Graphviz "record shapes" are pretty interesting. O_o I would never have imagined that such a feature existed!

The same reasoning might apply to backslash-escapes, should they be opt-in as well?

I think so!

IUC you might be implying that graphviz.quoting.quote() should rather detect that lambda> <map is not valid XML by trying to parse the input input as XML and adding quotes in case that fails.

No, actually I was surprised to see a regex looking for HTML in the first place! In all the time I've used the graphviz library, I had assumed it was doing something very simple with labels, like "if it contains special characters, quote it, otherwise don't bother". I came to use the graphviz library as a Python programmer who wanted to render simple graphs as quickly as possible. The graphviz library is the solution commonly suggested online. I started using it without knowing anything about Graphviz and DOT; later, I learned enough to understand that graphviz is just building output to feed to the dot command, and how to inspect the .gv files generated by Digraph.render. It's fairly easy to generate DOT oneself, but quoting rules can get tricky, so I had assumed that one of the main benefits of using graphviz was for it to take care of that for you. So I trusted it to take any string I gave it, and quote it so that it became a node/edge name.

Now that I know about the idea of id:port:compass, which is very fancy, I think it would have been nice if port and compass had been kwargs, as opposed to things parsed out of the name; or even if name was allowed to be any of: 'id', ('id,), ('id', 'port'), or ('id', 'port', 'compass'). (Edit: in fact, I now realize that I have run into this before!.. I was trying to use URLs as node IDs, and of course the colon in "http://" was resulting in everything after http: being quoted, which was about the most confusing behaviour I had ever seen!) As for HTML labels... that's a cool feature (especially when you see the results you can get with "record shapes"), but I bet most "Python programmers who want to render simple graphs as quick as possible" are never going to suspect that such a things exists, until it randomly bites them (as seems to have been the case with the Prefect library).

My plan/proposal is to address all these shortcomings that might involve subtle API/behavioural changes in a single version jump that might be allowed break backwards-compatibility, e.g. for the 1.0 or maybe even 2.0 version of this project. As you see, there is a lot to consider. Sorry for the inconvenience.

That sounds excellent. It's good to know you were already thinking about this stuff; and I'm sorry for the inconvenience of having people like me who use the library for ages without reading through all of the documentation first. :)

bayersglassey-zesty commented 2 years ago

Actually, I think it might be nice to mention graphviz.escape in a little "information bubble" in this section of the docs: https://graphviz.readthedocs.io/en/stable/manual.html#basic-usage ...I see that backslash-escaping and HTML-like labels are mentioned a bit further down, but I believe a casual "Python programmer who wants to render simple graphs as quickly as possible" is going to see those sections and actively ignore them, believing them to be too fancy for their needs.

So the first code example one sees on that page is:

dot = graphviz.Digraph('round-table', comment='The Round Table')

...and for most people, I think the takeaway is "I can pass an arbitrary string as the name, e.g. 'round-table'". But if there was an example there showing escape in action, people might think "ah ok, I probably want to always use escape until I get to the point where I want to learn about fancier features".

dot = graphviz.Digraph(graphviz.escape('round-table'), comment='The Round Table')

xflr6 commented 2 years ago

Thanks, and +1 to your proposal to add a better visible warning about possivle quoting/escaping gotchas. See 25043ed2b59bfdc1ac60db10a2d526132ed2a4ce.

I learned enough to understand that graphviz is just building output to feed to the dot command [...] quoting rules can get tricky, so I had assumed that one of the main benefits of using graphviz was for it to take care of that for you

Yeah, this is a very simple library (only around a thousand lines) and +1 the main thing it helps with is quoting/escaping. Because I did not manage to get that completely right on the first try and at least some users might depend on the ability to interpret e.g. backslash escapes, we need to be careful with fixing that (major version change).

if name was allowed to be any of: 'id', ('id,), ('id', 'port') , or ('id', 'port', 'compass').

+1. I have thought about that earlier: add support for tuples as a backwards-compatible first step towards fixing #53, so we can change the behaviour in a second step and require the use of tuples for port/compass later. Feel free to add that proposal to the issue thread (with the introduction of type annotations, Python provides nicer ways to document that by now as well). Let me know in case you would like to contribute on this feature.

bayersglassey-zesty commented 2 years ago

See https://github.com/xflr6/graphviz/commit/25043ed2b59bfdc1ac60db10a2d526132ed2a4ce.

Ah, that's exactly what I had in mind! Thanks for that. (There is one typo I noticed, "wraping" should be "wrapping".)

I have thought about that earlier: add support for tuples as a backwards-compatible first step towards fixing https://github.com/xflr6/graphviz/issues/53, so we can change the behaviour in a second step and require the use of tuples for port/compass later. Feel free to add that proposal to the issue thread (with the introduction of type annotations, Python provides nicer ways to document that by now as well). Let me know in case you would like to contribute on this feature.

Okay -- I'll mention it in there, and actually yes I would be interested in seeing if I can add tuple support as a first step :)

xflr6 commented 2 years ago

Thanks, fixed the typo in 4aa4d2441552d72925388e7ce2232c818e91d751.

GregIthaca commented 4 months ago

I'm seconding the call for more documentation, and for documentation in the places it is most critical. I came here to file an issue because as far as what I could tell from the API reference G.node(name: str, label: Optional[str] = None...) suggests that this is just a python string, and the behavior I was seeing looked like a bug. I would personally agree that assuming HTML parsing by default is a "bug" in the sense that assuming that angle-brackets around a string necessarily means it is HTML is a foolhardy assumption given how many other meanings they have. But well-placed documentation would at least help.

There's nothing there that hints at the possibility that there is a Quoting and HTML-like labels kind of functionality that reacts differently to different strings. If I was a novice reading the documentation from beginning to end, I might have known that, but an experienced programmer often just goes to the API documentation and looks for examples of how to invoke the functions.

Thus I had no idea why the first of these node labels had no quotes (and was of course syntactic garbage that led the renderer to barf) while the second one did have quotes:

        n23 [label=<&T as core::fmt::Display>]
        n24 [label="<&T as core::fmt::Display>::fmt"]

xflr6 commented 4 months ago

There's nothing there that hints at the possibility that there is a Quoting and HTML-like labels kind of functionality that reacts differently to different strings.

There is a warning at the end of README.md and https://graphviz.readthedocs.io/en/stable/manual.html#basic-usage (added in 25043ed2b59bfdc1ac60db10a2d526132ed2a4ce). See https://github.com/xflr6/graphviz/issues/168#issuecomment-1100677538 on backwards compatibility.

GregIthaca commented 4 months ago

I'm not saying it's not documented; once I found this thread I was able to find the documentation (hence link it above). I'm saying that someone who bookmarks the API documentation and goes there to try to understand the behavior doesn't get any hints.

It could be as simple as editing the description for the argument to say something like: Caption to be displayed (defaults to the node name). This string is parsed according to various patterns described in the basic usage section, which affect how it is rendered in the GraphViz file.

xflr6 commented 4 months ago

someone who bookmarks the API documentation

Oh okay, fair point: how about https://github.com/xflr6/graphviz/compare/master...api_docs_escape?

xflr6 commented 4 months ago

Added callouts:

bayersglassey-zesty commented 4 months ago

Thanks for this!

xflr6 / graphviz

HTML-like labels should be opt-in #168