Closed bnchi closed 1 month ago
While working on this, I noticed that the JS implementation relies heavily on regular expressions in some places. I believe we can rewrite some parts of the code in Rust instead of depending on the regex crate. Although the regex crate guarantees linear time complexity, we can eliminate the need for the regex engine in places where simpler Rust code would work. Maybe we can benchmark first but this won't happen until I move over most of the tests into Rust.
what do you think @wooorm ?
One more question what do you think make more sense here for the maintainability of this utility :
While the core of markdown-rs
is no-std + alloc
, that’s for the use case where you only really do markdown -> html, straight.
Here we are working with ASTs. I think it’s fine to depend on regexes there.
And, both JS and Rust will need to be maintained and kept in sync. So, fine to improve things if it’s fast, just, that has to be backported.
I’d recommend going with regexes first. To focus on porting the code. Instead of changing the code. That becomes confusing. Hard to manage.
As for workspace vs module, I have to particular preference currently. You?
Since we're working with an AST and this is a utility a workspace sounds better to me this will help to isolate the tests when we quickly run cargo test and version this util by it's own too.
It's not that hard to change that decision later on, I updated the PR to use a workspace instead and kept the _nostd constrain.
I'm working on the definition handler and the handler depends on https://github.com/micromark/micromark/tree/main/packages/micromark-util-decode-string to decode a markdown string, I've looked up the code in this crate and found out you've indeed ported some parts of micromark utils.
I was wondering if it's ok to make the util from within the markdown crate pub in order to leverage functions such as decode_named
and decode_numeric
and add some needed constants within util/constant
that the decode-string
relies on. Since this is being developed in a workspace markdown::util
needs to be pub.
@wooorm
Some of the utils are already pub because they‘re used in https://github.com/wooorm/mdxjs-rs. So, having some utils pub is fine. Also, putting them in separate crates is also fine.
@wooorm This is ready for review more things might need to be done :
I've made it ready for review to see if you think I've missed something before addressing some of the points above. Thanks !
By looking at the CI it doesn't seem like the tests are running for this project since it's in a separate workspace we probably need to update the CI jobs to include _mdast_util_tomarkdown in the pipeline.
I updated the CI jobs and added some missing tests I've also did some refactoring, I don't think I'm going to be adding anything new to this PR unless there's some required changes. I might make a follow up PR to improve the performance if possible or if I spotted some bugs.
For un-handled nodes which are GFM nodes like tables and the others not sure if I should add the handlers here or in some other project because the JS implementation doesn't seem to have them.
Thanks !
Thanks for your contributed work!
I remember that there were some questions you still had, about the tracker for example, but can’t seem to find it now. I was still supposed to answer that.
As for the tracker: that isn’t used much in the defaults currently. But it is needed to serialize JSX. In the future, with this tracker, mdast-util-to-markdown
itself could become smarter and wrap lines at 80 characters for example. That now only works for JSX (from MDX).
As for the Node extensions, and whether they should be supported here: I do think they should be here.
In JavaScript, code size is very important. And plugability of extensions into some main thing is easy.
In rust, code size is less important, and plugability is very hard.
That is why GFM and frontmatter and MDX as syntax extensions are also in markdown-rs
, instead of external packages.
Can you confirm whether you took over all the tests, without change to input/output?
I did move over all the tests for the handlers except for core and round-trips tests.
I made sure to move the tests by hand and didn't trust an LLM to do the conversion as I initially planned. Since I did it by hand I can guarantee that 90% of the tests are an exact match though I might have missed something. I can spend some time to do an audit after a merge or if there's anyone from the community that can help with reviewing the tests that will be very much appreciated!
One handler tests which I didn't have the time to complete is list item I worked on that today and I just moved what seem to be a hit for every code point here : https://github.com/wooorm/markdown-rs/pull/127/commits/9bfa391ca9027913278685ee89af2741e10dccb8#diff-5b1c4c5fdf4be7bbe31465522517f284b165d6370d4c5ff954df594855a9869a
I remember that there were some questions you still had, about the tracker for example, but can’t seem to find it now. I was still supposed to answer that.
I asked a few questions correct but I removed them before pushing this for review, I figured most of them out while doing the rest of the work and I don't have the time to address most of them later on if any needed more work, I used most of free time on this PR and would love to close it ASAP.
One more thing I wanted to mention, if needed I'll find time later on to do any work related to this feature that is if you don't have the time to address an issue just @ me and I'll be happy to help !
Thank you for keeping up with me to get this to completion.
I completed copying over the tests so now all of the tests are a match, I also did a quick review everything looks fine to me, defiantly a few performance improvements and refactors can be done in a separate PR but now since the tests are moved over it's easy to make changes without worrying about something breaking.
I've also added comments as proposed for some data structures.
Just a heads-up, serializing a yaml node to markdown gives an error. I don't know if that was intentional, but if not you might be missing a yaml handler
Thank you so much for this PR btw! Really needed this feature
wooorm knows more about this than me, but I think Yaml and Toml are not part of the CommonMark specs they're part of the frontmatter extension this PR ports over the JS implementation here mdast-util-to-markdown
In order for Yaml and Toml to work we need to add support for frontmatter serialization I believe the JS version is here mdast-util-frontmatter
Ah gotcha! Thanks for the info & links!
Some extensions have built in parsers that can be turned on with an option. Including frontmatter! https://github.com/wooorm/markdown-rs?tab=readme-ov-file#extensions
It could make sense to support serialization for those tools. But this PR is pretty sizable already. I'd lean towards having extensions in follow up PRs.
Also additional testing, using a fuzzer to to validate that content can round trip.
In other words markdown can be parsed to a tree, be serialized again, and parsed again should have the same tree structure (though positions and exact characters may change).
That technique found a lot of quirks in the original micromark/remark parser and serialized. And could be valuable to repeat here. But is likely outside the scope of this PR.
References: About fuzzing: https://rust-fuzz.github.io/book/ Current parser fuzz test: https://github.com/wooorm/markdown-rs/tree/main/fuzz
But this PR is pretty sizable already. I'd lean towards having extensions in follow up PRs.
Right yes, that indeed. Somewhere above @Bnchi and I talked about this too and concluded the same.
I believe @Bnchi has used up a lot of the time they have to work on this. So if @augustkline or others are up for follow up PRs, I’d love that!
If I run cargo clippy --all-features --all-targets --workspace
, I get:
error[E0599]: no method named `len` found for struct `regex::Match` in the current scope
--> mdast_util_to_markdown/src/handle/inline_code.rs:53:47
|
53 | && &value[position..m.len()] == "\n"
| ^^^ method not found in `Match<'_>`
error[E0599]: no method named `len` found for struct `regex::Match` in the current scope
--> mdast_util_to_markdown/src/state.rs:131:62
|
131 | .map(|captured_group| captured_group.len())
| ^^^ method not found in `Match<'_>`
error[E0599]: no method named `len` found for struct `regex::Match` in the current scope
--> mdast_util_to_markdown/src/state.rs:498:53
|
498 | start = full_match.start() + full_match.len();
| ^^^ method not found in `Match<'_>`
What am I missing 🤔
Ah, seems like that landed in regex 1.8.0
Only thing perhaps is some docs in this readme here? And, how should this new code be used? Should it be published? As alpha?
Thank you so much for writing this PR!
Only thing perhaps is some docs in this readme here?
Yeah I can add instructions on how to use this code in a readme
Should it be published? As alpha?
Alpha sounds alright, what do you think ?
Sounds good to me! :+1:
fixes #64
TODO handlers :