yogthos / markdown-clj

Markdown parser in Clojure
Eclipse Public License 1.0

Interpret newline as space #165

Open metasoarous opened 4 years ago

metasoarous commented 4 years ago

I was a bit surprised to find that markdown-clj, by default, treats a newline inside a paragraph as a no-op rather than as a space, in contrast with other markdown processing tools.

While I'd lobby for making newlines behave as spaces, for now is there a way to use parser customizations to achieve the desired result?

Thanks!

metasoarous commented 4 years ago

Hmm... I'm now realizing that, in addition, trailing space characters are being trimmed from the end of each line, so there doesn't seem to be any way to get a space between the lines of a paragraph in an md document without messing with the parser, let alone stay compatible with other parsers.

Would you please consider addressing this?

Thanks again

yogthos commented 4 years ago

Oh yeah, that's one of the limitations of how I originally wrote the parser: it reads input line by line, and I never got around to improving that. I'm definitely open to improving it, but can't promise I'll have the time in the near future. I think the easiest approach would be to handle that here as lines are being read, and to keep reading until a blank line when inside a paragraph. A similar change would be needed for the cljs part as well.
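A rough sketch of the "keep reading until a blank line" idea, in plain Clojure. This is not markdown-clj's actual code: `join-paragraph-lines` is a hypothetical helper, and it deliberately ignores code blocks, lists, and hard line breaks, so it only illustrates how adjacent paragraph lines could be joined with a space.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of markdown-clj: groups raw lines into
;; paragraphs and joins the lines of each paragraph with a single space.
(defn join-paragraph-lines
  [lines]
  (->> lines
       (partition-by str/blank?)         ; runs of blank / non-blank lines
       (remove #(str/blank? (first %)))  ; drop the blank-line separators
       (map #(str/join " " (map str/trim %)))))

(join-paragraph-lines
  (str/split-lines "Random text.\nRandom text.\n\nNext paragraph."))
;; => ("Random text. Random text." "Next paragraph.")
```

Note that `str/blank?` also treats whitespace-only lines as blank, so a line containing only spaces ends a paragraph here too.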

If anybody has time to take a look at this, I can help guide the PR and a release.

gsinclair commented 2 years ago

I came here to report this, but found that it has already been reported, so I thought I'd at least contribute my minimal failing example.

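(require '[markdown.core :refer [md-to-html-string]])

;; s1: two adjacent lines in a single paragraph.
(let [s1 "Random text.\nRandom text.\n"]
  (md-to-html-string s1))
;; s2: an indented code block followed by a truly empty line.
(let [s2 "Random text.\n\n    code block\n\nRandom text.\nRandom text.\n"]
  (md-to-html-string s2))
;; s3: same as s2, but the "blank" line after the code block contains spaces.
(let [s3 "Random text.\n\n    code block\n     \nRandom text.\nRandom text.\n"]
  (md-to-html-string s3))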

If you execute those three forms in the REPL, you will find that in s1 and s2 all paragraphs parse correctly (i.e. newline converts to space) but in s3 the final paragraph does not parse as one would wish. It appears the superfluous space in the blank line after the code block somehow disrupts the subsequent parsing.

That is, the spaces in the line after the code block have an undesirable effect in the parsing of the subsequent paragraph.
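Since s2 parses as expected and s3 differs from it only by the spaces on that one line, a possible stopgap is to blank out whitespace-only lines before handing the text to the parser. The sketch below is only that: `blank-whitespace-only-lines` is a hypothetical helper, not part of markdown-clj, and the inline `(?m)` flag assumes JVM Clojure regexes.

```clojure
(require '[clojure.string :as str])

;; Hypothetical pre-processing step, not part of markdown-clj:
;; replace lines made up only of spaces/tabs with truly empty lines.
(defn blank-whitespace-only-lines
  [s]
  (str/replace s #"(?m)^[ \t]+$" ""))

(blank-whitespace-only-lines
  "Random text.\n\n    code block\n     \nRandom text.\nRandom text.\n")
;; => "Random text.\n\n    code block\n\nRandom text.\nRandom text.\n"
```

The result is identical to s2 above, so it could then be passed to md-to-html-string. One caveat: whitespace-only lines inside a code block are emptied as well.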

Oddsor commented 2 years ago

I took a stab at this issue, in particular the case mentioned by @gsinclair, with a PR here: https://github.com/yogthos/markdown-clj/pull/178

Reading line by line certainly causes some challenges here as @yogthos says! For example, indented code blocks in markdown usually have trailing newlines trimmed, but I'm not sure we can ensure that happens with the current parsing strategy.
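To illustrate why this is awkward line by line: trimming the trailing blank lines is simple once all of a code block's lines are in hand, as in the sketch below, but a line-by-line reader doesn't know a blank line is trailing until it has seen what follows. `drop-trailing-blank-lines` is a hypothetical helper, not something in markdown-clj or in the PR.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of markdown-clj: given all the lines
;; collected for one indented code block, drop the blank ones at the end.
(defn drop-trailing-blank-lines
  [lines]
  (->> lines
       reverse
       (drop-while str/blank?)  ; blank = empty or whitespace-only
       reverse
       vec))

(drop-trailing-blank-lines ["    code block" "" "    more code" "" "     "])
;; => ["    code block" "" "    more code"]
```

Blank lines in the middle of the block are kept; only the ones after the last line of code are removed.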

I'm not exactly an expert in writing parsers, so I don't know what would be a better long term solution to handle these cases 😅

yogthos commented 2 years ago

I think the solution is reasonable with the current state of things. :) The tests pass, so I'm going to say that's good enough, and if a new issue gets opened we can add a new test and fix it then.