wooorm / markdown-rs

CommonMark compliant markdown parser in Rust with ASTs and extensions
https://docs.rs/markdown/1.0.0-alpha.21/markdown/
MIT License
966 stars 53 forks source link

Serialize mdast to markdown #127

Closed bnchi closed 1 month ago

bnchi commented 3 months ago

fixes #64

TODO handlers :

bnchi commented 2 months ago

While working on this, I noticed that the JS implementation relies heavily on regular expressions in some places. I believe we can rewrite some parts of the code in Rust instead of depending on the regex crate. Although the regex crate guarantees linear time complexity, we can eliminate the need for the regex engine in places where simpler Rust code would work. Maybe we can benchmark first but this won't happen until I move over most of the tests into Rust.

what do you think @wooorm ?

One more question what do you think make more sense here for the maintainability of this utility :

  1. To create a new workspace for mdast to markdown ?
  2. To create a separate module to group most of the related code for this util ?
wooorm commented 2 months ago

While the core of markdown-rs is no-std + alloc, that’s for the use case where you only really do markdown -> html, straight. Here we are working with ASTs. I think it’s fine to depend on regexes there. And, both JS and Rust will need to be maintained and kept in sync. So, fine to improve things if it’s fast, just, that has to be backported. I’d recommend going with regexes first. To focus on porting the code. Instead of changing the code. That becomes confusing. Hard to manage.

As for workspace vs module, I have to particular preference currently. You?

bnchi commented 2 months ago

Since we're working with an AST and this is a utility a workspace sounds better to me this will help to isolate the tests when we quickly run cargo test and version this util by it's own too.

It's not that hard to change that decision later on, I updated the PR to use a workspace instead and kept the _nostd constrain.

bnchi commented 2 months ago

I'm working on the definition handler and the handler depends on https://github.com/micromark/micromark/tree/main/packages/micromark-util-decode-string to decode a markdown string, I've looked up the code in this crate and found out you've indeed ported some parts of micromark utils.

I was wondering if it's ok to make the util from within the markdown crate pub in order to leverage functions such as decode_named and decode_numeric and add some needed constants within util/constant that the decode-string relies on. Since this is being developed in a workspace markdown::util needs to be pub.

@wooorm

wooorm commented 2 months ago

Some of the utils are already pub because they‘re used in https://github.com/wooorm/mdxjs-rs. So, having some utils pub is fine. Also, putting them in separate crates is also fine.

bnchi commented 2 months ago

@wooorm This is ready for review more things might need to be done :

I've made it ready for review to see if you think I've missed something before addressing some of the points above. Thanks !

bnchi commented 2 months ago

By looking at the CI it doesn't seem like the tests are running for this project since it's in a separate workspace we probably need to update the CI jobs to include _mdast_util_tomarkdown in the pipeline.

bnchi commented 2 months ago

I updated the CI jobs and added some missing tests I've also did some refactoring, I don't think I'm going to be adding anything new to this PR unless there's some required changes. I might make a follow up PR to improve the performance if possible or if I spotted some bugs.

For un-handled nodes which are GFM nodes like tables and the others not sure if I should add the handlers here or in some other project because the JS implementation doesn't seem to have them.

Thanks !

wooorm commented 2 months ago

Thanks for your contributed work! I remember that there were some questions you still had, about the tracker for example, but can’t seem to find it now. I was still supposed to answer that. As for the tracker: that isn’t used much in the defaults currently. But it is needed to serialize JSX. In the future, with this tracker, mdast-util-to-markdown itself could become smarter and wrap lines at 80 characters for example. That now only works for JSX (from MDX).

As for the Node extensions, and whether they should be supported here: I do think they should be here. In JavaScript, code size is very important. And plugability of extensions into some main thing is easy. In rust, code size is less important, and plugability is very hard. That is why GFM and frontmatter and MDX as syntax extensions are also in markdown-rs, instead of external packages.

bnchi commented 2 months ago

Can you confirm whether you took over all the tests, without change to input/output?

I did move over all the tests for the handlers except for core and round-trips tests.

I made sure to move the tests by hand and didn't trust an LLM to do the conversion as I initially planned. Since I did it by hand I can guarantee that 90% of the tests are an exact match though I might have missed something. I can spend some time to do an audit after a merge or if there's anyone from the community that can help with reviewing the tests that will be very much appreciated!

One handler tests which I didn't have the time to complete is list item I worked on that today and I just moved what seem to be a hit for every code point here : https://github.com/wooorm/markdown-rs/pull/127/commits/9bfa391ca9027913278685ee89af2741e10dccb8#diff-5b1c4c5fdf4be7bbe31465522517f284b165d6370d4c5ff954df594855a9869a

bnchi commented 2 months ago

I remember that there were some questions you still had, about the tracker for example, but can’t seem to find it now. I was still supposed to answer that.

I asked a few questions correct but I removed them before pushing this for review, I figured most of them out while doing the rest of the work and I don't have the time to address most of them later on if any needed more work, I used most of free time on this PR and would love to close it ASAP.


One more thing I wanted to mention, if needed I'll find time later on to do any work related to this feature that is if you don't have the time to address an issue just @ me and I'll be happy to help !

Thank you for keeping up with me to get this to completion.

bnchi commented 1 month ago

I completed copying over the tests so now all of the tests are a match, I also did a quick review everything looks fine to me, defiantly a few performance improvements and refactors can be done in a separate PR but now since the tests are moved over it's easy to make changes without worrying about something breaking.

I've also added comments as proposed for some data structures.

augustkline commented 1 month ago

Just a heads-up, serializing a yaml node to markdown gives an error. I don't know if that was intentional, but if not you might be missing a yaml handler

augustkline commented 1 month ago

Thank you so much for this PR btw! Really needed this feature

bnchi commented 1 month ago

wooorm knows more about this than me, but I think Yaml and Toml are not part of the CommonMark specs they're part of the frontmatter extension this PR ports over the JS implementation here mdast-util-to-markdown

In order for Yaml and Toml to work we need to add support for frontmatter serialization I believe the JS version is here mdast-util-frontmatter

augustkline commented 1 month ago

Ah gotcha! Thanks for the info & links!

ChristianMurphy commented 1 month ago

Some extensions have built in parsers that can be turned on with an option. Including frontmatter! https://github.com/wooorm/markdown-rs?tab=readme-ov-file#extensions

It could make sense to support serialization for those tools. But this PR is pretty sizable already. I'd lean towards having extensions in follow up PRs.

ChristianMurphy commented 1 month ago

Also additional testing, using a fuzzer to to validate that content can round trip.

In other words markdown can be parsed to a tree, be serialized again, and parsed again should have the same tree structure (though positions and exact characters may change).

That technique found a lot of quirks in the original micromark/remark parser and serialized. And could be valuable to repeat here. But is likely outside the scope of this PR.

References: About fuzzing: https://rust-fuzz.github.io/book/ Current parser fuzz test: https://github.com/wooorm/markdown-rs/tree/main/fuzz

wooorm commented 1 month ago

But this PR is pretty sizable already. I'd lean towards having extensions in follow up PRs.

Right yes, that indeed. Somewhere above @Bnchi and I talked about this too and concluded the same.

I believe @Bnchi has used up a lot of the time they have to work on this. So if @augustkline or others are up for follow up PRs, I’d love that!

wooorm commented 1 month ago

If I run cargo clippy --all-features --all-targets --workspace, I get:

error[E0599]: no method named `len` found for struct `regex::Match` in the current scope
  --> mdast_util_to_markdown/src/handle/inline_code.rs:53:47
   |
53 |                         && &value[position..m.len()] == "\n"
   |                                               ^^^ method not found in `Match<'_>`

error[E0599]: no method named `len` found for struct `regex::Match` in the current scope
   --> mdast_util_to_markdown/src/state.rs:131:62
    |
131 |                         .map(|captured_group| captured_group.len())
    |                                                              ^^^ method not found in `Match<'_>`

error[E0599]: no method named `len` found for struct `regex::Match` in the current scope
   --> mdast_util_to_markdown/src/state.rs:498:53
    |
498 |             start = full_match.start() + full_match.len();
    |                                                     ^^^ method not found in `Match<'_>`

What am I missing 🤔

wooorm commented 1 month ago

Ah, seems like that landed in regex 1.8.0

wooorm commented 1 month ago

Only thing perhaps is some docs in this readme here? And, how should this new code be used? Should it be published? As alpha?

Enoumy commented 1 month ago

Thank you so much for writing this PR!

bnchi commented 1 month ago

Only thing perhaps is some docs in this readme here?

Yeah I can add instructions on how to use this code in a readme

Should it be published? As alpha?

Alpha sounds alright, what do you think ?

wooorm commented 1 month ago

Sounds good to me! :+1: