yuin / goldmark

:trophy: A markdown parser written in Go. Easy to extend, standard(CommonMark) compliant, well structured.
MIT License
3.68k stars 255 forks source link

Parser does not preserve whitespaces when parsing nested code blocks. #177

Closed bwplotka closed 3 years ago

bwplotka commented 3 years ago

Thank you for the amazing project! 🤗

I think I found a small bug, which is a bit annoying in our markdown formatting project. Particular problem is showcased in this draft PR

  1. What version of goldmark are you using? Checked v1.1.24 and latest 6c741ae251abd461bb7b5ce28e7df7a9306bd005
  2. What version of Go are you using? go version go1.15 linux/amd64
  3. What operating system and processor architecture are you using? go version go1.15 linux/amd64
  4. What did you do?

Parsed nested code block (valid markdown):

* Some items with nested code with strict whitespaces.
  ```Makefile
  include .bingo/Variables.mk

  run:
    @$(GOIMPORTS) <args>

(Note strict whitespace in above md, especially line `<space><space>\t@$(GOIMPORTS) <args>`)

5. What did you expect to see? 

goldmark `renderer.Renderer.Render(...)` method's `n ast.Node` has correct structure. However `lines` in `ast.FencedCodeBlock` has wrong whitespace (somehow codeblock being fenced affects things). 

Particularly: Lines in the node should have exactly the same bytes, so for example line 3 should be  `<space><space>\t@$(GOIMPORTS) <args>`

See repro test below.

6. What did you see instead? 

Line 1 and 3 has some semi-random spaces instead what provided in parsed markdown.

Particularly line 3 has  `<space><space>@$(GOIMPORTS) <args>`. See repro test below.

7. Did you confirm your output is different from [CommonMark online demo](https://spec.commonmark.org/dingus/) or other official renderer correspond with an extension?:
YES
![image](https://user-images.githubusercontent.com/6950331/103169278-2918fc00-4832-11eb-9d44-2c32e8208c09.png)

Repro go test:

```go
package markdown

import (
    "bytes"
    "fmt"
    "io"
    "testing"

    "github.com/yuin/goldmark"
    "github.com/yuin/goldmark/ast"
    "github.com/yuin/goldmark/renderer"
)

type testRenderer struct {
    t *testing.T
}

func (t testRenderer) AddOptions(...renderer.Option) { return }
func (t testRenderer) Render(_ io.Writer, source []byte, n ast.Node) error {
    fencedCodeBlock := n.FirstChild().FirstChild().FirstChild().NextSibling().(*ast.FencedCodeBlock)

    line := fencedCodeBlock.Lines().At(0)
    if val := line.Value(source); !bytes.Equal([]byte("include .bingo/Variables.mk\n"), val) {
        t.t.Errorf("not what we expected, got %q", string(val))
    }
    line = fencedCodeBlock.Lines().At(1)
    if val := line.Value(source); !bytes.Equal([]byte("\n"), val) {
        t.t.Errorf("not what we expected, got %q", string(val)) // BUG1: bug_test.go:28: not what we expected, got "  \n"
    }
    line = fencedCodeBlock.Lines().At(2)
    if val := line.Value(source); !bytes.Equal([]byte("run:\n"), val) {
        t.t.Errorf("not what we expected, got %q", string(val))
    }
    line = fencedCodeBlock.Lines().At(3)
    if val := line.Value(source); !bytes.Equal([]byte("\t@$(GOIMPORTS) <args>\n"), val) {
        t.t.Errorf("not what we expected, got %q", string(val)) // BUG 2: bug_test.go:36: not what we expected, got "  @$(GOIMPORTS) <args>\n"  }
    }
    return nil
}

func TestGoldmarkCodeBlockWhitespaces(t *testing.T) {
    var codeBlock = "```"
    mdContent := []byte(fmt.Sprintf(`* Some item with nested code with strict whitespaces.
  %sMakefile
  include .bingo/Variables.mk

  run:
    @$(GOIMPORTS) <args>
  %s`, codeBlock, codeBlock))

    var buf bytes.Buffer
    if err := goldmark.New(goldmark.WithRenderer(&testRenderer{t: t})).Convert(mdContent, &buf); err != nil {
        t.Fatal(err)
    }
}

I am pretty sure it's somehow easy to fix (:

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bwplotka commented 3 years ago

Still valid.

On Tue, 26 Jan 2021 at 17:37, stale[bot] notifications@github.com wrote:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/yuin/goldmark/issues/177#issuecomment-767707388, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVA3OYMW7DNMKI3G6HYAQ3S334XFANCNFSM4VKXATUQ .

karelbilek commented 3 years ago

It will actually necessitate huge refactor :(

Basically, it treats tabs as spaces, at one point in program it counts the number of spaces, takes tabs as spaces, and then on other point it puts the spaces back, incorrectly. If I get the flow right.

The way I look at the code, when it reads the text in text/reader.go, it first detects the padding (by util/util.go IndentPosition), and then it puts the padding "back" in text/segment.go Value, where it adds Padding number of spaces to left.

This breaks however in code blocks, where there is a difference between tabs and spaces.

maybe I will be able to convince IndentPosition to ignore the \t paddings? but then it will randomly break when someone uses tabs for indenting the items...

karelbilek commented 3 years ago

however I am a bit confused why that doesn't happen without the item list. IndentPosition never gets run in that case. 🤔

karelbilek commented 3 years ago

oh, the util... gets called just in listItemParser. OK

karelbilek commented 3 years ago

Yeah I got why 2 spaces instead of tab now

Goldmark treats tabs like 4 spaces, but normally in text (not in code block) <space><space><tab> is still treated "like four spaces", because the tab "finishes" the spaces

because normally if you have

<space><space><space><space>something
<space><space><tab>something

those are at same width.

so anyway when you have <space><space><tab> in the goimports stuff, the list item parser sees "hey let's treat this as 4 spaces, but 2 spaces are for me as the list, so the code parser gets the remaining two", and it gives the code parser "<space><space>" instead of tab.

Ugh.

No idea how to even approach this

karelbilek commented 3 years ago

I guess there can be a way for the code block parser to "cheat", and to look if the padding spaces are actually not spaces and they are tabs? We still have the buffer, we can look at the "padding positions" if they aren't actually tabs?

~But still, that would mean creating a new method of Segment, something like func (t *Segment) CodeBlockValue(buffer []byte) []byte {, which can break someone that extends goldmark that implements its own Reader~ uh I am stupid no

karelbilek commented 3 years ago

And btw, what would happen if following happens:

*.list
...*..sublist
......```Makefile
[tb].[t]text
......```

what even should be in the code block? .... I will look at commonmark, how it handles this tab/space mess

bwplotka commented 3 years ago

Thanks for debugging!

So: There is a way to hack markdownfmt to.. hide this? 🤔

karelbilek commented 3 years ago

I don't even know what should the correct behavior be :D

bwplotka commented 3 years ago

Oh that's easy: markdownfmt has to deterministic. It's kind of stupid If I reformat md with markdownfmt it will produce X and then if reformat again it produces Y and then again, it's X 🤦🏽

karelbilek commented 3 years ago

So: There is a way to hack markdownfmt to.. hide this? 🤔

I think it will require change in goldmark, possibly not backwards compatible (requiring new version), I am not sure :D

karelbilek commented 3 years ago

Although I am not sure why it is not deterministic in markdownfmt, but I guess that's a different story xD

(sorry @yuin for all the spam)

yuin commented 3 years ago

I'm afraid to say, I'm up to my neck in work every day. So I can not have time for this project. I promise to see this issue in the future.

karelbilek commented 3 years ago

I fixed it in https://github.com/yuin/goldmark/pull/187

However it is a breaking change because I needed to change Reader interface. So strictly speaking we should increase major version and also import path to /v2

edit: hm, seems I fixed only one of the two bugs.

edit2: opened a second PR for the second bug

yuin commented 3 years ago

@karelbilek @bwplotka I've tried to fix this issue in 2ffadcefcf0c8ffe3a0baa3a96441eb3c9045cea with the way simpler than #187.

Could you confirm this issue is fixed?

karelbilek commented 3 years ago

I think this is still needed?

(see the testcases there)

https://github.com/yuin/goldmark/pull/188

there were actually two independent issues. One is the code starting with tab which you fixed (thx!), other is the empty line which is fixed by #188 (which is simple)

yuin commented 3 years ago

I've tested test cases in #188 with current head(2ffadce). It passes all test cases.

karelbilek commented 3 years ago

OK then! thanks. Then we can close. (or maybe add those test cases too but as you wish).

Thanks again.

On Sun, 7 Feb 2021 at 18:33 Yusuke Inuzuka notifications@github.com wrote:

I've tested test cases in #188 https://github.com/yuin/goldmark/pull/188 with current head(2ffadce https://github.com/yuin/goldmark/commit/2ffadcefcf0c8ffe3a0baa3a96441eb3c9045cea). It passes all test cases.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yuin/goldmark/issues/177#issuecomment-774658615, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAZT4KVM2TAHKMGJSZDRE3S5Z275ANCNFSM4VKXATUQ .

yuin commented 3 years ago

@karelbilek, Thanks for your contribution!

bwplotka commented 3 years ago

Amazing!

karelbilek commented 3 years ago

@yuin that branch is still needed.

The tests were wrong, sorry. (It's hard to check as there is whitespace....)

Now I fixed the tests so they are actually testing the bug.