wooorm / markdown-rs

CommonMark compliant markdown parser in Rust with ASTs and extensions
https://docs.rs/markdown/1.0.0-alpha.18/markdown/
MIT License

Performance: larger MDX files are unmanageably slow to parse #113

Open robsimmons opened 2 months ago

robsimmons commented 2 months ago

Bad news @wooorm: micromark/micromark#169, or something like it, is an issue here as well. I haven't yet grokked your edit maps, but either they don't solve the quadratic-complexity parsing problem or there's a separate performance bug in markdown-rs.

The constant factors are better, but the asymptotic complexity means that we're back to 60-second parse times on files that are just an order of magnitude bigger than the ones that caused 60-second parses in micromark-js (with quadratic growth, a 10× larger input costs 100× the time, which swallows the constant-factor improvement).

For comparison, the "JS" lines on these graphs show micromark's performance using subtokenize 2.0.1, which picks up the fix in micromark/micromark#171.

[Figures: parse time (ms) vs. size, on linear and log-log axes]

Data collection sources

Run with `node main.mjs` and `cargo run --release`.

package.json

```json
{
  "dependencies": {
    "mdast-util-from-markdown": "^2.0.0",
    "micromark-extension-mdxjs": "^3.0.0"
  }
}
```

main.mjs

```js
import { fromMarkdown } from "mdast-util-from-markdown";
import { mdxjs } from "micromark-extension-mdxjs";
import crypto from "node:crypto";

/* Generate test data: `size` MDX components, each wrapping a template
   literal containing 10,000 lines of random UUIDs */
const NUM_LINES_IN_CODE_SEG = 10000;
function generateTestData(size) {
  let file = [];
  for (let i = 0; i < size; i++) {
    file.push("", "<DummyComponent code={`");
    for (let j = 0; j < NUM_LINES_IN_CODE_SEG; j++) {
      file.push(crypto.randomUUID());
    }
    file.push("`} />", "");
  }
  return file.join("\n");
}

/* Return the number of milliseconds needed to parse a test case */
function testPerformance(size) {
  const file = generateTestData(size);
  const start = performance.now();
  fromMarkdown(file, { extensions: [mdxjs()] });
  const end = performance.now();
  return end - start;
}

/* Three repetitions over sizes 4, 5, 6, 8, ..., 39 */
for (let reps = 0; reps < 3; reps++) {
  for (let size = 4; size < 40; size = Math.floor(size * 1.35)) {
    console.log(`${size}, ${testPerformance(size)}`);
  }
}
```

Cargo.toml

```toml
[package]
name = "tmp"
version = "0.1.0"
edition = "2021"

[dependencies]
markdown = "1.0.0-alpha.16"

[dependencies.uuid]
version = "1.8.0"
features = [
    "v4",       # Lets you generate random UUIDs
    "fast-rng", # Use a faster (but still sufficiently random) RNG
]
```

src/main.rs

```rust
const NUM_LINES_IN_CODE_SEG: i32 = 10000;

fn main() -> Result<(), String> {
    // Three repetitions over sizes 4, 5, 6, 8, ..., 39, mirroring main.mjs.
    for _ in 0..3 {
        let mut reps = 3;
        loop {
            reps = ((reps as f64) * 1.35) as i32;
            if reps > 40 {
                break;
            }
            // Build `reps` MDX components, each wrapping a template
            // literal containing 10,000 lines of random UUIDs.
            // (`0..reps` rather than `1..reps`, so the segment count
            // matches main.mjs.)
            let mut result = String::new();
            for _ in 0..reps {
                result.push_str("\n<DummyComponent code={`\n");
                for _ in 0..NUM_LINES_IN_CODE_SEG {
                    result.push_str(&uuid::Uuid::new_v4().to_string());
                    result.push('\n');
                }
                result.push_str("`}/>\n\n");
            }
            // Time only the parse, not the test-data generation.
            let start = std::time::Instant::now();
            let mdast = markdown::to_mdast(&result, &markdown::ParseOptions::mdx())?;
            let duration = start.elapsed().as_millis();
            println!("{},{},{:?}", reps, duration, mdast.position());
        }
    }
    Ok(())
}
```
robsimmons commented 2 months ago

Figured out how to get Rust profiling working: 99% of the time for that benchmark is spent in the `add_impl` function of `src/util/edit_map.rs`. It seems like this loop is creating the same quadratic behavior as the repeated slice operations in the JavaScript implementation. Maybe the edit map concept can be modified to avoid the quadratic repeated splices, but right now it seems like it merely delays them.
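
To illustrate the shape of the problem, here's a minimal sketch of why repeated mid-vector insertions go quadratic, and what a single-pass rebuild buys instead. (This is only the underlying idea; the names `events`, `edits`, and the loop structure are illustrative, not markdown-rs's actual `edit_map` API.)

```rust
use std::time::Instant;

fn main() {
    let n: usize = 100_000; // number of existing events (illustrative)
    let k: usize = 10_000;  // number of insertions to apply (illustrative)

    // Quadratic: every mid-vector insert shifts the whole tail, so k
    // inserts into an n-element Vec cost O(n * k) element moves.
    let mut events: Vec<u64> = (0..n as u64).collect();
    let start = Instant::now();
    for i in 0..k {
        events.insert((i * 7) % events.len(), 0);
    }
    println!("repeated inserts:    {:?}", start.elapsed());

    // Linear-ish: collect the edits first, sort them by position, then
    // rebuild the vector in one pass: O(n + k) plus sorting k edits.
    let original: Vec<u64> = (0..n as u64).collect();
    let mut edits: Vec<(usize, u64)> = (0..k).map(|i| ((i * 7) % n, 0)).collect();
    edits.sort_by_key(|&(pos, _)| pos);
    let start = Instant::now();
    let mut rebuilt: Vec<u64> = Vec::with_capacity(n + k);
    let mut pending = edits.into_iter().peekable();
    for (pos, item) in original.into_iter().enumerate() {
        // Emit every pending insertion that lands at this position.
        while pending.peek().map_or(false, |&(p, _)| p == pos) {
            rebuilt.push(pending.next().unwrap().1);
        }
        rebuilt.push(item);
    }
    println!("single-pass rebuild: {:?}", start.elapsed());
}
```

The second half is what an edit map aims to do: defer the splices and apply them all at once. The question is whether `add_impl` itself does linear work per added edit, which would reintroduce the quadratic cost.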

wooorm commented 2 months ago

Definitely possible that there are performance improvements to be made in Rust and edit maps too! Cool that you're investigating! Note that here you are specifically also generating strings that relate to JSX and SWC. It might be that something is happening there.

robsimmons commented 2 months ago

That was worth investigating. What's SWC?

If I comment out

```rust
result.push_str("\n<DummyComponent code={`\n");
```

and

```rust
result.push_str("`}/>\n\n");
```

and change `markdown::ParseOptions::mdx()` to `markdown::ParseOptions::default()`, then I'm literally just parsing a file of line-separated UUIDs. The performance is actually worse, though only by a constant factor:

[Figures: parse time (ms) vs. size, on linear and log-log axes]
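
For reference, a self-contained version of that plain-UUID variant (the fixed `reps = 10` is just an example size; the crates are the same as in the benchmark above):

```rust
const NUM_LINES_IN_CODE_SEG: i32 = 10000;

fn main() -> Result<(), String> {
    let reps = 10; // example size
    // No JSX wrapper and no blank separators: the file is nothing but
    // line-separated UUIDs.
    let mut result = String::new();
    for _ in 0..reps {
        for _ in 0..NUM_LINES_IN_CODE_SEG {
            result.push_str(&uuid::Uuid::new_v4().to_string());
            result.push('\n');
        }
    }
    let start = std::time::Instant::now();
    let mdast = markdown::to_mdast(&result, &markdown::ParseOptions::default())?;
    println!("{},{},{:?}", reps, start.elapsed().as_millis(), mdast.position());
    Ok(())
}
```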

ChristianMurphy commented 2 months ago

> What's SWC?

Speedy Web Compiler (https://github.com/swc-project/swc). It handles much of the JS/JSX parsing inside MDX.

robsimmons commented 2 months ago

Thanks for the Speedy Web Response @ChristianMurphy. Yeah, this and the profiling result suggest that it's the edit maps generating the same repeated-splice behavior the JS implementation was dealing with.

wooorm commented 2 months ago

> The performance is actually worse, though only by a constant factor:

That makes sense, because then it has to parse paragraphs that are 10k lines long. There could be many things in there: links, emphasis, escapes. A JSX element itself is much simpler. The expression you pass is somewhat complex too, as it's JavaScript, which means two parsers have to work together.
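
A quick way to see that paragraph-length effect in isolation (a sketch using the same crates as the benchmark above; the line count of 50,000 is arbitrary):

```rust
fn main() -> Result<(), String> {
    // The same UUID lines, once as a single giant paragraph and once as
    // many one-line paragraphs (blank lines split paragraphs in markdown).
    let lines: Vec<String> = (0..50_000)
        .map(|_| uuid::Uuid::new_v4().to_string())
        .collect();
    let one_paragraph = lines.join("\n");
    let many_paragraphs = lines.join("\n\n");

    for (label, text) in [
        ("one paragraph", &one_paragraph),
        ("many paragraphs", &many_paragraphs),
    ] {
        let start = std::time::Instant::now();
        markdown::to_mdast(text, &markdown::ParseOptions::default())?;
        println!("{}: {:?}", label, start.elapsed());
    }
    Ok(())
}
```

If the cost is quadratic in paragraph length, the single-paragraph case should be dramatically slower even though both inputs contain the same UUIDs.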