Line breaks in convert to plain text

DDAndyChen commented 6 years ago

By using Markdown.ToPlainText method, the following markdown

"## New Employee Name:\r\n First: John \r\n Last: Smith\r\n\r\n## Start Date:\r\n2018-02-07 \r\n\r\n## Position Title:\r\nCustoms Broker"

is converted to

"New Employee Name:\nFirst: John\nLast: SmithStart Date:2018-02-07Position Title:Customs Broker"

some line breaks are dropped, so "Smith" and "Start" become one word, the same as "2018-02-07" and "Position". I think it the output would be better in this:

"New Employee Name:\nFirst: John\nLast: Smith\n\nStart Date:2018-02-07\n\nPosition Title:\nCustoms Broker"

that is

New Employee Name:
First: John 
Last: Smith

Start Date:
2018-02-07  

Position Title:
Customs Broker

xoofx commented 6 years ago

Yep, good catch, likely a bug

hemantkd commented 6 years ago

Hi @xoofx,

I'm interested in fixing this bug. I have cloned the repository and using Visual Studio 2017.

I'm getting the following Build error: "The specified language targets for uap10.0 is missing. Ensure correct tooling is installed.\r\nMissing File: C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\MSBuild\Microsoft\WindowsXaml\v15.0\Microsoft.Windows.UI.Xaml.CSharp.targets Markdig C:\Users\XXX\.nuget\packages\msbuild.sdk.extras\1.0.9\build\netstandard1.0\MSBuildSdkExtras.Common.targets"

Does the Solution absolutely need the UWP tooling enabled to be able to Build successfully?

xoofx commented 6 years ago

Does the Solution absolutely need the UWP tooling enabled to be able to Build successfully?

Yes

hemantkd commented 6 years ago

Hi @xoofx,

Thanks for earlier.

After spending some time and having to manually install the UAP version required by the project using the Visual Studio Installer, finally managed to build the solution successfully.

Spent some time looking at the code and common mark spec too.

[Removed the test scenario from the comment as I have realised that it was incorrect]

xavierdecoster commented 6 years ago

Any plans on fixing this bug? :) Or is there a work-around?

xoofx commented 6 years ago

Any plans on fixing this bug? :) Or is there a work-around?

No, I don't have any spare time left, PR welcome

xavierdecoster commented 6 years ago

Sorry to hear that.

It's easily reproducible using the following test (the top one for HTML is already there, I just replicated it for Markdown.ToPlainText() to reproduce the issue):

    [TestFixture]
    public class TestConfigureNewLine
    {
        [Test]
        [TestCase(/* newLineForWriting: */ "\n",   /* markdownText: */ "*1*\n*2*\n",     /* expected: */ "<p><em>1</em>\n<em>2</em></p>\n")]
        [TestCase(/* newLineForWriting: */ "\n",   /* markdownText: */ "*1*\r\n*2*\r\n", /* expected: */ "<p><em>1</em>\n<em>2</em></p>\n")]
        [TestCase(/* newLineForWriting: */ "\r\n", /* markdownText: */ "*1*\n*2*\n",     /* expected: */ "<p><em>1</em>\r\n<em>2</em></p>\r\n")]
        [TestCase(/* newLineForWriting: */ "\r\n", /* markdownText: */ "*1*\r\n*2*\r\n", /* expected: */ "<p><em>1</em>\r\n<em>2</em></p>\r\n")]
        [TestCase(/* newLineForWriting: */ "!!!" , /* markdownText: */ "*1*\n*2*\n",     /* expected: */ "<p><em>1</em>!!!<em>2</em></p>!!!")]
        [TestCase(/* newLineForWriting: */ "!!!" , /* markdownText: */ "*1*\r\n*2*\r\n", /* expected: */ "<p><em>1</em>!!!<em>2</em></p>!!!")]
        public void TestHtmlOutputWhenConfiguringNewLine(string newLineForWriting, string markdownText, string expected)
        {
            var pipeline = new MarkdownPipelineBuilder()
                .ConfigureNewLine(newLineForWriting)
                .Build();

            var actual = Markdown.ToHtml(markdownText, pipeline);
            Assert.AreEqual(expected, actual);
        }

        [Test]
        [TestCase(/* newLineForWriting: */ "\n",   /* markdownText: */ "*1*\n*2*\n",     /* expected: */ "1\n2\n")]
        [TestCase(/* newLineForWriting: */ "\n",   /* markdownText: */ "*1*\r\n*2*\r\n", /* expected: */ "1\n2\n")]
        [TestCase(/* newLineForWriting: */ "\r\n", /* markdownText: */ "*1*\n*2*\n",     /* expected: */ "1\r\n2\r\n")]
        [TestCase(/* newLineForWriting: */ "\r\n", /* markdownText: */ "*1*\r\n*2*\r\n", /* expected: */ "1\r\n2\r\n")]
        [TestCase(/* newLineForWriting: */ "!!!", /* markdownText: */ "*1*\n*2*\n",     /* expected: */ "1!!!2!!!")]
        [TestCase(/* newLineForWriting: */ "!!!", /* markdownText: */ "*1*\r\n*2*\r\n", /* expected: */ "1!!!2!!!")]
        public void TestPlainTextOutputWhenConfiguringNewLine(string newLineForWriting, string markdownText, string expected)
        {
            var pipeline = new MarkdownPipelineBuilder()
                .ConfigureNewLine(newLineForWriting)
                .Build();

            var actual = Markdown.ToPlainText(markdownText, pipeline);
            Assert.AreEqual(expected, actual);
        }
    }

You'll notice this new test fails.

I noticed it's parsing a single ParagraphBlock, and the LineReader parses 2 lines (*1* and *2*), whilst dropping/ignoring the CRLF characters at the end. (LineReader.cs ln 55)

I've no idea why the last occurrence of the new line character is discarded when parsing to plain text, whereas it's not when parsing to html... Trying to spot the difference.

If you have any idea/pointers where to look at, please share :)

xoofx commented 6 years ago

The problem is not in the parser but in the renderer. The HTML renderer is used today to render as a text. There are likely a few places in the code where we don't output newline, while the HTML doesn't care, the text version would care. I don't like the idea of using the HTML renderer to output plain text, but for some that was the quickest solution. I don't think it is much work in the codebase, but you need to dig into it (HtmlRenderer is quite simple, so it should not be difficult)

neilha commented 5 years ago

Hi @xoofx - thanks for merging pull request. Do you have any timelines for publishing a new version to nuget? Thanks.

bstoked commented 4 years ago

Wondering if this fix has been published to nuget yet? On a project that definitely needs it. :) Thanks.

MihaZupan commented 4 years ago

Last NuGet is from 3 months ago so definitely yes.

xoofx / markdig

Line breaks in convert to plain text #196