Intercepting builtin tags?

xoofx / markdig

A fast, powerful, CommonMark compliant, extensible Markdown processor for .NET

BSD 2-Clause "Simplified" License

Intercepting builtin tags? #63

Open yetanotherchris opened 8 years ago

yetanotherchris commented 8 years ago

In the readme you mention you can plug into the core parsing:

Even the core Markdown/CommonMark parsing is pluggable, so it allows to disable builtin Markdown/Commonmark parsing (e.g Disable HTML parsing) or change behaviour (e.g change matching # of a headers with @)

Is there an example of this you could share? I'm looking specifically for image and link tags (as I mentioned in your blog post) - I want to rewrite the urls.

The MarkdownSharp way of doing this is to hack the source, for example add an event handler call inside private string DoImages(string text). Given your architecture I'm guessing it's a lot less messy in Markdig.

xoofx commented 8 years ago

There are currently no public callbacks specifically for post-processing image links (or any block/inline elements, in fact). The only callback exposed is MarkdownPipelineBuilder.DocumentProcessed, from which you can post-process a MarkdownDocument.

You can also do this by calling Markdown.Parse directly, post-processing the document, and then rendering it with an HtmlRenderer.

Then you can iterate over the inline link elements like this: doc.Descendants().OfType<LinkInline>()

That's the easiest solution for now, though not the most efficient one, as the Descendants() method walks through all blocks and inlines only to return the ones you are interested in.
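A minimal sketch of the DocumentProcessed route, assuming the goal is to prefix relative link and image URLs. The class name, the helper names, and the "starts with http" test are hypothetical choices for illustration; only DocumentProcessed, Descendants(), LinkInline.Url and Markdown.ToHtml come from Markdig itself.

```csharp
using System;
using System.Linq;
using Markdig;
using Markdig.Syntax;
using Markdig.Syntax.Inlines;

public static class DocumentProcessedSketch
{
    // Hypothetical helper: builds a pipeline whose DocumentProcessed callback
    // prefixes every non-absolute link or image URL with baseUrl.
    public static MarkdownPipeline BuildPipeline(string baseUrl)
    {
        var builder = new MarkdownPipelineBuilder();

        // The callback runs once per parsed document, after inline processing.
        builder.DocumentProcessed += document =>
        {
            foreach (var link in document.Descendants().OfType<LinkInline>())
            {
                // Assumption: anything not starting with "http" is relative.
                if (!string.IsNullOrEmpty(link.Url) &&
                    !link.Url.StartsWith("http", StringComparison.OrdinalIgnoreCase))
                {
                    link.Url = baseUrl + link.Url;
                }
            }
        };

        return builder.Build();
    }

    // Hypothetical convenience wrapper around Markdown.ToHtml.
    public static string ToHtml(string markdown, string baseUrl) =>
        Markdown.ToHtml(markdown, BuildPipeline(baseUrl));
}
```

Because the callback is attached to the pipeline, every caller that parses with that pipeline gets the rewrite, at the cost of one extra full walk of the tree per document.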

While developing markdig, I tried to add some callbacks, but I was not really satisfied with the impact they had (in terms of performance, in terms of the verbosity they induce for extension developers, etc.). The problem is that some extensions sometimes transform the tree, and I was not sure how to handle this nicely. Things like:

a block could be created first and we get an event, then another extension replaces it with another one: what should we do? Should we send an event saying that a previously created block doesn't exist anymore?
would a callback want to iterate over the tree being built (get the parent of the element, etc.), even if it is still not complete/stable?

So there is some more thinking/work to be done in order to support this kind of scenario efficiently... I might be able to have a look later this week.

Kryptos-FR commented 8 years ago

I'm also interested in this. Currently I can render a MarkdownDocument into XAML (as text) with my custom renderer.

But for creating a WPF document (i.e. an instance of the FlowDocument class), using a renderer is not ideal: some post-processing and transformation is required. Should I work directly on the syntax tree inside the MarkdownDocument?

xoofx commented 8 years ago

@Kryptos-FR I'm not sure the feature requested here (selective callbacks without having to revisit the tree) would help your work. In your case, you need to traverse all block and inline elements and create a WPF tree from them. The renderer mostly provides a visitor infrastructure, but you can roll your own if it doesn't match your process. If you find no way to do this efficiently with the current API, or there is something missing in the renderer API that could be changed to help you, feel free to open another issue and we will look at this problem separately.

yetanotherchris commented 8 years ago

Thanks for the pointer. The AST is actually fine for my needs, although a walker (similar to the pattern ANTLR uses) might be a good strategy going forward. The way it works now is fairly intuitive; it just needs a few docs. I'll happily add some examples.

Here's how I got it working for now. I haven't tested it with large documents yet, but I can't see there being an issue.

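using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Markdig;
using Markdig.Renderers;
using Markdig.Renderers.Html;
using Markdig.Syntax;
using Markdig.Syntax.Inlines;

class Program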
{
    static void Main(string[] args)
    {
        var doc = Markdown.Parse("This [link test](http://www.google.com) is a text with some *emphasis*");

        Walk(doc);

        var builder = new StringBuilder();
        var textwriter = new StringWriter(builder);

        var renderer = new HtmlRenderer(textwriter);
        renderer.Render(doc);

        Console.WriteLine(builder.ToString());
        Console.WriteLine("");
        Console.WriteLine("Press any key...");
        Console.ReadKey();
    }

    static void Walk(MarkdownObject markdownObject)
    {
        foreach (MarkdownObject child in markdownObject.Descendants())
        {
            // A LinkInline can be either an image or an <a href="...">
            LinkInline link = child as LinkInline;
            if (link != null)
            {
                HtmlAttributes attributes = link.GetAttributes();
                if (attributes == null)
                {
                    attributes = new HtmlAttributes();
                    attributes.Classes = new List<string>();
                }

                if (attributes.Classes == null)
                {
                    attributes.Classes = new List<string>();
                }

                attributes.Classes.Add("btn");
                attributes.Classes.Add("btn-primary");

                link.SetAttributes(attributes);
                Console.WriteLine(link.Url);
            }
        }
    }
}

Edit by @MihaZupan: Remove the recursive call to Walk that would cause N^2 visits.

jasel-lewis commented 5 years ago

@xoofx First off, LOVE Markdig - THANK YOU!

+1 for me on this topic as well. I'm using Markdig within an ASP.NET MVC app and would like to manipulate the URLs generated for an inline image. The static Markdown content resides within a route construct and I'd like to pass in the controller and action names so that I can just use the image's filename in the Markdown (i.e. ![Alternate Image Title](filename.jpg)*Image Caption*) and get a full absolute path in the HTML output.

I was super excited when I noticed the GetDynamicUrl property of a LinkInline and I see what AutoIdentifierExtension is doing with it, but my hopes were dashed when I noticed the InlineProcessor does not fire any events such as the Closed event that AutoIdentifierExtension is using on the HeadingBlockParser.

I read your reply above and I understand the complexities involved. I'll probably end up just walking the Descendants as provided in the code sample posted by @yetanotherchris (thanks, @yetanotherchris!!). Nevertheless, it would be SUPER nice to hook into the processors using delegates to manipulate certain properties of the differing Syntaxes.

MihaZupan commented 5 years ago

@jasel-lewis Does the Func<string, string> LinkRewriter exposed on the HtmlRenderer solve your use case? renderer.LinkRewriter = link => "somethingElse/" + link;

I feel that post-processing the MarkdownDocument at the end is more appropriate for such changes.

You should know that now there is a Descendants<T>() method available to make simple modifications easier. It is currently missing an overload where T: Inline, but that is a simple PR change away.
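A minimal end-to-end sketch of the LinkRewriter suggestion, assuming a default pipeline. The class name, method name and prefix parameter are hypothetical; LinkRewriter is a Func<string, string>, so the callback receives only the URL text and not the node it came from.

```csharp
using System.IO;
using Markdig;
using Markdig.Renderers;

public static class LinkRewriterSketch
{
    // Hypothetical helper: renders markdown to HTML, prepending `prefix`
    // to every URL the renderer writes.
    public static string Render(string markdown, string prefix)
    {
        var pipeline = new MarkdownPipelineBuilder().Build();
        var writer = new StringWriter();

        var renderer = new HtmlRenderer(writer)
        {
            // Called with the URL string only; there is no LinkInline here,
            // so links and images cannot be told apart inside the callback.
            LinkRewriter = url => prefix + url
        };
        pipeline.Setup(renderer);

        renderer.Render(Markdown.Parse(markdown, pipeline));
        writer.Flush();
        return writer.ToString();
    }
}
```

Calling Render("[page](page) ![pic](pic.png)", "somethingElse/") rewrites both the href and the src, which is convenient when every URL needs the same treatment and a limitation when only some do.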

jasel-lewis commented 5 years ago

@MihaZupan Nice find! ...but unfortunately, no. Using LinkRewriter rewrites every link (even header references). I want to solely rewrite image links (because they exist in static-content folders that are physically buried within the MVC construct). There is no way to tell (with LinkRewriter) if the link currently being rewritten belongs to a LinkInline.

I like the way you think, however. I may create a PR which does something similar and adds a LinkRewriter delegate property to the LinkInlineRenderer because you can do something like this: htmlRenderer.ObjectRenderers.Find<LinkInlineRenderer>();.

As an extension to my prior post, this is how I went about modifying @yetanotherchris's solution, just in case any future onlooker cares:

private void Walk(MarkdownObject markdownObject)
{
    var links = markdownObject
        .Descendants()
        .OfType<LinkInline>()
        .Where(l => l.IsImage && !l.Url.StartsWith("http"));

    foreach (var link in links)
    {
        link.GetDynamicUrl = () =>
            Markdig.Helpers.HtmlHelper.Unescape(this.absoluteUrlPath + link.Url);
    }
}

Note: The !l.Url.StartsWith("http") is to make an educated guess to ensure we didn't already assign a static URL within the Markdown syntax.

Note2: The code was originally a recursive function, per the code above from @yetanotherchris. As @MihaZupan pointed out, the Walk(child) call causes a huge and unnecessary performance issue. I got rid of it and narrowed the number of objects being inspected with the Linq query (until Descendants<T>() gets fleshed out for Inlines).

MihaZupan commented 5 years ago

Descendants already walks through all the child nodes. Doing it again recursively means you're visiting nodes N^2 times.
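The difference can be observed by counting visits. A small sketch, with hypothetical class and method names, comparing a single Descendants() pass against a Walk-style loop that recurses into every node it is handed:

```csharp
using System.Linq;
using Markdig;
using Markdig.Syntax;

public static class VisitCountSketch
{
    // One pass: Descendants() already yields every nested block and inline.
    public static int CountSinglePass(MarkdownObject root) =>
        root.Descendants().Count();

    // The recursive pattern: each yielded node is walked again from scratch,
    // so nested nodes are counted once for every Walk call that reaches them.
    public static int CountRecursive(MarkdownObject root)
    {
        int visits = 0;
        foreach (MarkdownObject child in root.Descendants())
        {
            visits++;
            visits += CountRecursive(child);
        }
        return visits;
    }

    // Hypothetical convenience method: parses markdown and returns both counts.
    public static (int SinglePass, int Recursive) Compare(string markdown)
    {
        MarkdownDocument doc = Markdown.Parse(markdown);
        return (CountSinglePass(doc), CountRecursive(doc));
    }
}
```

For any document with nesting, the recursive count is larger than the single-pass count, and the gap widens as nesting gets deeper.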

JamesQMurphy commented 4 years ago

@jasel-lewis Thank you for posting that code! I'm just wondering if you or @yetanotherchris or anyone else considered hooking into the RendererBase.ObjectWriteBefore event. This event, along with the ObjectWriteAfter event, does make the MarkdownObject available, allowing you to determine if it's a LinkInline or not.

This is the approach I took:

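// Assumes `pipeline` (a MarkdownPipeline) and `absoluteUrlPath` (a string)
// are fields of the enclosing class.
public string RenderHtml(string markdown)
{
    if (markdown == null) throw new ArgumentNullException(nameof(markdown));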

    var writer = new StringWriter();
    var renderer = new Markdig.Renderers.HtmlRenderer(writer);
    renderer.ObjectWriteBefore += Renderer_ObjectWriteBefore;
    pipeline.Setup(renderer);

    var document = Markdown.Parse(markdown, pipeline);
    renderer.Render(document);
    writer.Flush();
    return writer.ToString();
}

private void Renderer_ObjectWriteBefore(Markdig.Renderers.IMarkdownRenderer arg1, Markdig.Syntax.MarkdownObject obj)
{
    var link = obj as Markdig.Syntax.Inlines.LinkInline;
    if (link != null && link.IsImage && !(link.Url.StartsWith("http")))
    {
        link.Url = Markdig.Helpers.HtmlHelper.Unescape(this.absoluteUrlPath + link.Url);
    }
}

@xoofx Also a big fan of Markdig, so let me echo @jasel-lewis 's thanks! 😃

MihaZupan commented 4 years ago

I haven't considered it before, but I suppose it should work just as well.

I personally prefer post-processing the AST prior to rendering like so:

MarkdownDocument document = Markdown.Parse(markdown, pipeline);

foreach (LinkInline link in document.Descendants().OfType<LinkInline>())
{
    if (link.IsImage && !link.Url.StartsWith("http"))
    {
        link.Url = HtmlHelper.Unescape("https://base.com/" + link.Url);
    }
}

renderer.Render(document);

Where document.Descendants().OfType<LinkInline>() can become document.Descendants<LinkInline>() with a trivial PR.