Some comments on the parser

MichaelMure commented 4 years ago

This is more of an FYI than a real issue.

I'm writing a Markdown renderer specialized for the terminal and I'm in the process of migrating to goldmark. If you are curious, you can have a look at https://github.com/MichaelMure/go-term-markdown/pull/20 (not merged yet).

First things first, thanks for your work :)

I just wanted to document for you some of the struggles I had. Feel free to dismiss this entirely, it's just ideas thrown over the fence:

- As you can see with https://github.com/MichaelMure/go-term-markdown/blob/3c6f0c6934c14f102f846ad2c83713140897afa5/markdown.go#L15-L48, it's a little complex to create a new parser or even to understand how to do it (I had to read the code for a long time before it clicked). It'd be nice if the parser package exposed a helper to create a default parser or a GFM one.
- I know that you don't need this to render HTML, but exposing the numbering of headings and list items in the AST would be very convenient. I'm working around it by computing those from the AST, but that's not very convenient.
- In a similar fashion, exposing the nesting depth of Blockquote and List nodes would be convenient.
- Link reference blocks are represented in the AST as a TextBlock with no lines, attached to the root. I'm able to detect that, but a specific node type would make things much cleaner.
- Inline HTML is split in the AST into one RawHTML node per HTML tag. In my case that's a problem because I want to interpret some of that HTML (lists, for example) and render it as well. This means I would first have to reconstruct the HTML section before interpreting it. It'd be nice if the parser were able to recognize those sections and output a single RawHTML node. In the following example, item1 and item2 would not be Text nodes but part of that single RawHTML node, because the enclosing tags are still open at that point.

    Paragraph {
        RawText: "foo <ul><li>item1</li><li>item2</li></ul>"
        HasBlankPreviousLines: true
        Text: "foo "
        RawHTML {
            RawText: <ul>
        }
        RawHTML {
            RawText: <li>
        }
        Text: "item1"
        RawHTML {
            RawText: </li>
        }
        RawHTML {
            RawText: <li>
        }
        Text: "item2"
        RawHTML {
            RawText: </li>
        }
        RawHTML {
            RawText: </ul>
        }
    }
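
To make the last point concrete, here is a rough sketch of the reconstruction step described above. It works on a hypothetical flat node list, not on goldmark's real ast types: `Node`, `IsHTML`, `tagDelta` and `mergeRawHTML` are all invented for illustration. It only tracks nesting depth, does not check that tag names match, and treats void tags such as `<br>` as opening tags.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for an inline AST node.
// It is NOT goldmark's ast.Node.
type Node struct {
	IsHTML bool   // true for a RawHTML-like node holding a single tag
	Text   string // tag source or plain text
}

// tagDelta reports how one tag changes the nesting depth: -1 for a closing
// tag, 0 for self-closing tags, comments and declarations, +1 otherwise.
func tagDelta(tag string) int {
	switch {
	case strings.HasPrefix(tag, "</"):
		return -1
	case strings.HasPrefix(tag, "<!"), strings.HasPrefix(tag, "<?"), strings.HasSuffix(tag, "/>"):
		return 0
	default:
		return 1
	}
}

// mergeRawHTML folds each run that starts at an opening tag and ends at the
// closing tag bringing the depth back to zero into one HTML node. Text seen
// while the depth is non-zero becomes part of that node.
func mergeRawHTML(nodes []Node) []Node {
	var out []Node
	var buf strings.Builder
	depth := 0
	for _, n := range nodes {
		if depth == 0 && !n.IsHTML {
			out = append(out, n) // ordinary text outside any HTML run
			continue
		}
		buf.WriteString(n.Text)
		if n.IsHTML {
			depth += tagDelta(n.Text)
			if depth < 0 {
				depth = 0 // stray closing tag: do not go negative
			}
		}
		if depth == 0 {
			out = append(out, Node{IsHTML: true, Text: buf.String()})
			buf.Reset()
		}
	}
	if buf.Len() > 0 { // unclosed tags: keep whatever was collected
		out = append(out, Node{IsHTML: true, Text: buf.String()})
	}
	return out
}

func main() {
	nodes := []Node{
		{false, "foo "},
		{true, "<ul>"}, {true, "<li>"}, {false, "item1"}, {true, "</li>"},
		{true, "<li>"}, {false, "item2"}, {true, "</li>"}, {true, "</ul>"},
	}
	for _, n := range mergeRawHTML(nodes) {
		fmt.Printf("%v %q\n", n.IsHTML, n.Text)
	}
	// Prints:
	// false "foo "
	// true "<ul><li>item1</li><li>item2</li></ul>"
}
```

A renderer could run something like this over a paragraph's children before deciding how to interpret the HTML, at the cost of re-deriving information the parser already had.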

Thank you!

MichaelMure commented 4 years ago

On the subject of links, it seems that there is no way in the AST to distinguish a complete link ([text](/url/)) from a reference ([ref]). Also a nice to have I think.

MichaelMure commented 4 years ago

Another thing you might find interesting. I needed to visualize the possible cases in the AST, how node types relate to each other, so I generated this diagram from my test cases. It's not perfect, some cases are missing (notably how much garbage can be added in a link's text) but it's helpful. Might be handy for your documentation.

jschaf commented 4 years ago

> On the subject of links, it seems that there is no way in the AST to distinguish a complete link [text](/url/) from a reference ([ref]). Also a nice to have I think.

~Not the author but I'm currently digging through the source code to implement citations, e.g. [pg 3, @bibtex-key]. I think the trouble is that link references are implemented as a paragraph transformer. I'm guessing that's probably the case because goldmark can't know if there's a corresponding definition for a short reference. The commonmark demo shows that for:~

    [foo]

    [bar]

    [bar]: example.com

[foo] is the literal text [foo] and [bar] is a LinkReference.

Edit

After a bit more digging, I think what's actually happening for links is:

1. Goldmark parses all blocks with block parsers.
2. Goldmark runs the link reference paragraph transformer. That transformer stores link reference lines, e.g. [foo]: http://example.com, in the parser context.
3. Goldmark runs the inline parsers, including the link parser. The link parser handles short reference links, e.g. [foo], by checking the parser context. If a matching definition exists, the short reference link is promoted to a full link.

So the request is that, instead of promoting a short reference link to a full link, the parser should keep it distinct with something like a ShortRefLink node?
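
The three steps above can be sketched as a tiny two-pass resolver. This is a toy model and not goldmark's code: the regular expressions, the `LinkKind` values (including `ShortRefLink`, the name floated above) and `resolve` are invented for illustration, and it ignores label normalization, titles and most of the CommonMark rules.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// LinkKind is a hypothetical tag that keeps the two link forms apart.
type LinkKind int

const (
	InlineLink   LinkKind = iota // [text](/url/)
	ShortRefLink                 // [ref], resolved through a definition
)

type Link struct {
	Kind LinkKind
	Text string
	URL  string
}

var (
	// A definition line such as "[bar]: example.com".
	defRe = regexp.MustCompile(`^\[([^\]]+)\]:\s*(\S+)\s*$`)
	// "[text]" optionally followed by "(url)".
	linkRe = regexp.MustCompile(`\[([^\]]+)\](?:\(([^)\s]*)\))?`)
)

func resolve(src string) []Link {
	refs := map[string]string{}
	var body []string
	// Pass 1: collect definitions into a context map, wherever they appear.
	for _, line := range strings.Split(src, "\n") {
		if m := defRe.FindStringSubmatch(line); m != nil {
			refs[strings.ToLower(m[1])] = m[2]
			continue
		}
		body = append(body, line)
	}
	// Pass 2: scan the remaining lines for links and consult the map.
	var links []Link
	for _, line := range body {
		for _, m := range linkRe.FindAllStringSubmatch(line, -1) {
			if strings.HasSuffix(m[0], ")") {
				links = append(links, Link{InlineLink, m[1], m[2]})
			} else if url, ok := refs[strings.ToLower(m[1])]; ok {
				// Record the kind instead of silently promoting it.
				links = append(links, Link{ShortRefLink, m[1], url})
			}
			// Otherwise "[foo]" has no definition and stays literal text.
		}
	}
	return links
}

func main() {
	src := "[foo]\n\n[bar]\n\n[bar]: example.com\n\n[text](/url/)"
	for _, l := range resolve(src) {
		fmt.Printf("kind=%d text=%s url=%s\n", l.Kind, l.Text, l.URL)
	}
}
```

Because the first pass finishes before any link is examined, [bar] resolves even though its definition comes later in the input, while [foo] is left alone. Keeping the kind on the resulting link is the only difference from promoting it outright.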

yuin commented 4 years ago

As you might already know, glamour is a markdown renderer for terminals that uses goldmark, and glow is driven by glamour. I think they will be helpful for you.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

yuin / goldmark

Some comments on the parser #131
