xoofx / markdig

A fast, powerful, CommonMark compliant, extensible Markdown processor for .NET
BSD 2-Clause "Simplified" License
4.37k stars 453 forks source link

Consuming Markdig from inside Roslyn #636

Open CyrusNajmabadi opened 2 years ago

CyrusNajmabadi commented 2 years ago

Hi all!

I'm a member of the dotnet/Roslyn team, and I work primarily on the IDE side of things adding and improving the features users interact with when using C# or VB in VS and its siblings. I'm interested in prototyping out support for markdown in C#/VB strings and comments. e.g. an experience where you might be able to write something like:

// markdown
// ... markdown follows ...

And have a pleasant editing/viewing experience for this comment now (e.g. classification, brace highlighting, indentation, completion, etc.).

I was interested in using Markdig for this for numerous reasons, including (but not limited to) it's claims about conformance, large amount of supported extensions for common ecosystem patterns, and the use of a "roundtrippable' API to help ensure that features can map the data model back to the original text the user wrote.

One thing that is not clear to me though is if the system can operate on text abstractions that aren't System.String based. e.g. in Roslyn itself we commonly use data-representations for user code that are not strings. Including, but not limited to, Span<char>, ITextSnapshot (a VS concept), SourceText (a roslyn abstraction over a fixed length sequence of characters, etc. etc.

This is especially important if we want to light things up within a string, where there often isn't a single character in text for each conceptual character that a tool like markdig would be parsing. For example, in a string, a user may have written \u0060 instead of `. Internally we 'virtualize' this, representing this as a single character (actually a System.Text.Rune) that maps to a span in teh document that is 6 characters long.

Is it possible to wrap such abstractions in a lightweight fashion such that markdig can consume them and them speak in relation to the wrapped construct?

Thanks very much!

xoofx commented 2 years ago

Hey Cyrus,

I'm interested in prototyping out support for markdown in C#/VB strings and comments.

I would love to use Markdown for comments, so your questions are very welcome! But upfront, I can say that it might be challenging. 🙂

the use of a "roundtrippable' API to help ensure that features can map the data model back to the original text the user wrote.

So for roundtrip, afair, it is supporting mainly the core of CommonMark but almost no extensions. It will mostly depend which extensions exactly you will support. One related topic is that docfx has added many extensions (e.g to link types) and it will require likely some sync with them to see what can be supported or not.

One thing that is not clear to me though is if the system can operate on text abstractions that aren't System.String based. e.g. in Roslyn itself we commonly use data-representations for user code that are not strings. Including, but not limited to, Span, ITextSnapshot (a VS concept), SourceText (a roslyn abstraction over a fixed length sequence of characters, etc. etc.

So today, we don't have an abstraction but mostly an implementation LineReader that could be abstracted as it returns a StringSlice per line , but it comes with an underlying assumption. That the StringSlice it produced per line can use an existing string to point into it (can be a substring), but it can't support for a same line multiple underlying strings. The rest of Markdown is entirely based on this StringSlice abstraction, so if you have other requirements, you might not be able to use directly Markdig like this (but e.g would have to extract comments to a string or string lines before going back and forth, which might not be ideal, but I don't have other options)

Making LineReader abstractable should be feasible without losing too much perf (depending on whether we want to keep a struct and so we would have to make the MarkdownParser using generics for it)

cc: @MihaZupan other thoughts?

CyrusNajmabadi commented 2 years ago

Making LineReader abstractable should be feasible without losing too much perf (depending on whether we want to keep a struct and so we would have to make the MarkdownParser using generics for it)

I can def look into this. Note: there are ways to be abstractable, but still have perf. For example, using this pattern:

interface ILineReader
{
    // add the pieces the parser needs
}

// then later:

private static void ProcessBlocks<TLineReader>(BlockProcessor blockProcessor, TLineReader lineReader)
    where TLineReader : struct, ILineReader
{
    // access some ILineReader member off of lineReader
}

This means you get a specialized generic impl for each different line-reader struct impl you have that wraps some underlying data and then exposes it. Even though it looks like you're calling through an interface, the runtime will know the concrete struct type and will call directly into that non-virtual member (which it can also optimize with inlining and the like).

I can look into seeing how doable this is and if it explodes into crazy complexity, or is relatively self contained.

CyrusNajmabadi commented 2 years ago

So for roundtrip, afair, it is supporting mainly the core of CommonMark but almost no extensions. It will mostly depend which extensions exactly you will support. One related topic is that docfx has added many extensions (e.g to link types) and it will require likely some sync with them to see what can be supported or not.

Actually, i misspoke. We don't care as much about roundtripping (we're unlikely to support editing the rendered objects, and then persisting back to code). What is more relevant is this bit: "Abstract Syntax Tree with precise source code location for syntax tree" :)

Do you know if extensions generally are good about supporting that aspect of the system? Thanks!

CyrusNajmabadi commented 2 years ago

I can look into seeing how doable this is and if it explodes into crazy complexity, or is relatively self contained.

Ah, unfortunately it's not selfcontained. I can see that the markdig api exposes these publicly deeply, and even represents StringSlice in a way that directly exposes its underlying string as a readonly field :(.

That will make things unfortunately much more challenging. Will have to think on ways to represent that. I presume breaking changes in the overall API are not an option here? :)

xoofx commented 2 years ago

there are ways to be abstractable, but still have perf. [...] This means you get a specialized generic impl

Yep, I mentioned the usage of generics exactly for this optimization (this is heavily used in other parts of Markdig).

Do you know if extensions generally are good about supporting that aspect of the system?

Many extensions are extending CommonMark constructs, so I believe their source location should be accurate. For the others, it might varies. Precise Source location is used in all extensions but it hasn't been tested extensively, so it might require a few fixes here and there.

But it all depends which extensions you will use. There is a wild zoo of extensions in Markdig. Not all of them will be relevant for comments. (And it is more likely that you might want to support the same set than docfx)

Ah, unfortunately it's not selfcontained. I can see that the markdig api exposes these publicly deeply, and even represents StringSlice in a way that directly exposes its underlying string as a readonly field :(.

Yeah, StringSlice is not abstractable and used all over the place, so we can't afford to have a class abstraction for it and generics propagation would be super painful. That's why I mentioned in the preamble that in that case it might require to have intermediate resynthesized line strings...

I presume breaking changes in the overall API are not an option here? :)

Markdig is still using 0.x.y meaning that we can break API, even though we try to avoid that. Problem would be more how invasive, maintainable and perf friendly the changes would be.

MihaZupan commented 2 years ago

Hey, having Markdown support sounds really cool. Would that extend to things like adding StringSyntaxAttribute.Markdown?

As Alexandre mentioned, source locations should be accurate, or if not can be fixed relatively easily. We can always increase test coverage 😄

This is especially important if we want to light things up within a string, where there often isn't a single character in text for each conceptual character that a tool like markdig would be parsing. For example, in a string, a user may have written \u0060 instead of `. Internally we 'virtualize' this, representing this as a single character (actually a System.Text.Rune) that maps to a span in teh document that is 6 characters long.

It may be easiest to do the translation before passing any input to Markdig instead of extending the parser to accept arbitrary string-like sources. I assume the common case is that there are no such transformations necessary, and all you need is a direct substring of source code?

That is, for source code like foo \u002Abar*, allocate a string "foo *bar*" as input for Markdig. Once you have the positions in the input, you can translate those back to their corresponding locations in the source code. The EmphasisInline is located at the [4-8] span, which corresponds to [4-13] in source code.

I wouldn't be surprised if this ends up being faster than customizing Markdig, as doing that would likely mean losing out on some optimizations like vectorization. Doing the input analysis in bulk should be faster than interrupting the parsing logic for each character. A good example of this is that for each input we will first do an IndexOf('\0') to see if we have to replace null chars. This will be insanely fast on a string, while being very slow if done per-character.

As an example, parsing "this is a **Markdown** comment that _may_ find it's way into some `C#` source code" on my machine with pretty bad single-core perf and all extensions enabled takes ~3 micro seconds. The overhead of parsing just "foo" is under 1 μs.

It may be possible to make Markdig work with a substring instead of forcing a string allocation for input. That wouldn't be a game changer for perf though, as parsing is an order of magnitude more expensive than the substring allocation. Accepting input that isn't backed by a contiguous block of chars in memory would likely be more expensive than doing that allocation upfront.

CyrusNajmabadi commented 2 years ago

Unfortunately (especially as I believe markdown is whitespace sensitive), the common case is needing to exclude things (like leading whitespace for raw strings, or // for comments).

That said, does markdig support a system where we can supply lines one at a time?

MihaZupan commented 2 years ago

As long as all the lines (StringSlices) shared the same underlying string, we could probably make it work. But if they could be different string instances, StringSlice would have to be replaced with something where you could override behavior.

Do you have any data on how expensive such reallocations are for Roslyn? My guess is that in this case while it would be an interesting optimization to try and avoid them, it shouldn't make or break the perf.

xoofx commented 2 years ago

As long as all the lines (StringSlices) shared the same underlying string, we could probably make it work. But if they could be different string instances, StringSlice would have to be replaced with something where you could override behavior.

Hm, but you can return a StringSlice for each line that doesn't point to the same line afair. Or I forgot the detail and we assume that string slices point to the same string?

That said, does markdig support a system where we can supply lines one at a time?

That should be exactly what the LineReader does. the implementation assume a same input string, but another implementation could provide a StringSlice that uses different a difference string for each line.

xoofx commented 2 years ago

e.g In the case of Roslyn, a StringSlice feeded to Markdig could be the remaining line after the ///

MihaZupan commented 2 years ago

One case where we actually check that the instances are the same is here, where we extend an existing LiteralInline: https://github.com/xoofx/markdig/blob/89a10ee76b516e1f7d304c2c78785cfa2d58bbc7/src/Markdig/Parsers/Inlines/LiteralInlineParser.cs#L74

Other than that, it's mostly that the offsets we store are offsets into the original string. If every line used its own offsets, we would have to double-check what happens if we ever index into the StringSlice.Text outside of the current line (I don't know that we do), or what happens to source positions/parsing logic if offsets jump backwards.

xoofx commented 2 years ago

One case where we actually check that the instances are the same is here, where we extend an existing LiteralInline:

Right, but that's an optimization and if it doesn't point to the same string, it will create a different Literal, but that should still be fine in terms of rendering or semantic.

we would have to double-check what happens if we ever index into the StringSlice.Text

I haven't checked the code, but that should be fixable if we have anything like this. Afair, the design for StringSlice was not to assume a same underlying string, and if we do in some places for optimization purpose, we can hopefully fix it.