wooorm / markdown-rs

CommonMark compliant markdown parser in Rust with ASTs and extensions
https://docs.rs/markdown/1.0.0-alpha.18/markdown/
MIT License
836 stars 41 forks source link

GFM strikethrough causes nested attention sequences to be considered just text data #84

Closed rawilder closed 7 months ago

rawilder commented 8 months ago

Here is the test to show what I am talking about.

markdown from test ~~foo __*a*__~~ = foo a

assert_eq!(
    to_html_with_options("~~foo __*a*__~~", &Options::gfm())?,
    "<p><del>foo <strong><em>a</em></strong></del></p>",
    "should support emphasis within strong emphasis within strikethrough w/ `*` in `_`"
);

I've already written the code to fix this, but I also refactored portion of the code to help me understand exactly where I was getting it wrong, so I am opening an issue to share findings and discuss. I eventually landed on a reimplementation of the determining of an AttentionSequence's open/close state using something maybe too close to the GFM spec's wording.

The root cause of the above test failing was tildes not being considered punctuation. I added a kind_before_index which copied its sibling (that already considered ascii punctuation as valid, which includes tilde).

After this though, it caused some other tests to break. Fast forward a bit of debugging and I found that my implementation was actually technically correct but Github's UI is able to handle something that their own spec doesn't necessarily support, but was being tested for in this repo I imagine for the exact reason that Github's UI supports it.

markdown from test e ***~~xxx~~***yyy = e xxxyyy other parsers (my ide, and some live example sites) were showing me: e **xxx**\yyy

assert_eq!(
    to_html_with_options("e ***~~xxx~~***yyy", &Options::gfm())?,
    "<p>e <em><strong><del>xxx</del></strong></em>yyy</p>",
    "interplay"
);

Now I went through my new implementation to see if I could get this test to pass while maintaining the test at the top. I tweaked one of the refactor's new functions, is_right_flanking_delimiter_run, to treat preceding tildes as Other, else the new flow that includes ascii punctuation.

Happy to open the PR if asked, just didn't want to surprise anyone with an out of the blue PR.

wooorm commented 8 months ago

Hey! Thanks for your patience!

I eventually landed on a reimplementation of the determining of an AttentionSequence's open/close state using something maybe too close to the GFM spec's wording.

This might, or might not, be useful here! Does it pass all the tests?

In JS, here’s how it works: https://github.com/micromark/micromark/blob/929275e2ccdfc8fd54adb1e1da611020600cc951/packages/micromark-core-commonmark/dev/lib/attention.js#L261-L268 I think I implemented it here a bit differently? Either because I figured out a smarter way. Or perhaps indeed because I missed a bug—this bug.

Github's UI is able to handle something that their own spec doesn't necessarily support, but was being tested for in this repo I imagine for the exact reason that Github's UI supports it.

Yes, CM and GFM are specs that are super limited. As in, pretty useless.

markdown-rs and micromark are reverse engineered with tons of things checked against their actual implementation.

So, which way to go here, depends more on how the CM dingus works and GH works here in comments/readmes, then on what their specs say!

rawilder commented 7 months ago

With that context then, I'll open the simpler PR which simply addresses the bug rather than the refactor that makes it read more like the spec.

rawilder commented 7 months ago

https://github.com/wooorm/markdown-rs/pull/88