wjdp / htmltest

✅ Test generated HTML for problems
MIT License

Crash parsing HTML document #126

Closed retorquere closed 2 years ago

retorquere commented 5 years ago

htmltest is erroring out when I run it:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1111246]

goroutine 1 [running]:
github.com/wjdp/htmltest/htmldoc.(*Document).Parse(0x0)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmldoc/document.go:47 +0x26
github.com/wjdp/htmltest/htmldoc.(*Document).IsHashValid(0x0, 0xc000364b1e, 0x15, 0x127b200)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmldoc/document.go:112 +0x2b
github.com/wjdp/htmltest/htmltest.(*HTMLTest).checkInternalHash(0xc0000c5200, 0xc000161770)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/check-link.go:297 +0xad
github.com/wjdp/htmltest/htmltest.(*HTMLTest).checkInternal(0xc0000c5200, 0xc000161770)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/check-link.go:271 +0xcb
github.com/wjdp/htmltest/htmltest.(*HTMLTest).checkLink(0xc0000c5200, 0xc00010c700, 0xc00012df10)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/check-link.go:85 +0x3f9
github.com/wjdp/htmltest/htmltest.(*HTMLTest).testDocument(0xc0000c5200, 0xc00010c700)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/htmltest.go:203 +0x193
github.com/wjdp/htmltest/htmltest.(*HTMLTest).testDocuments(0xc0000c5200)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/htmltest.go:182 +0x6a
github.com/wjdp/htmltest/htmltest.Test(0xc000075aa0, 0x1, 0x1, 0x49)
    /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/htmltest.go:142 +0x8fb
main.run(0xc000075aa0, 0xc000075aa0)
    /home/travis/gopath/src/github.com/wjdp/htmltest/main.go:159 +0x1e8
main.main()
    /home/travis/gopath/src/github.com/wjdp/htmltest/main.go:66 +0x298

To Reproduce

Steps to reproduce the behaviour:

  1. Run with config and options …
htmltest --log-level 3

.htmltest.yml

DirectoryPath: "public"
EnforceHTTPS: false
CacheExpires: "6h"
CheckExternal: false
IgnoreDirectoryMissingTrailingSlash: true
IgnoreInternalEmptyHash: true
IgnoreDirs:
- Support

Source files

I haven't been able to narrow it down yet -- my request is for htmltest to print the page it's processing to help narrow it down.

Expected behaviour

Print each page as it's being processed.

wjdp commented 5 years ago

Hey @retorquere you can run htmltest -l0 to log every file as it's tested.

retorquere commented 5 years ago

I'm an idiot. Of course I wanted level 0, sorry.

The offending page is at https://gist.github.com/6c955708ecfa70ff55d363c485f9eb1e

retorquere commented 5 years ago

No wait -- the log ends at

pull-export/index.html
  DOCTYPE html []
 --- pull-export/index.html --> <nil>
testDocument on test/index.html
panic: runtime error: invalid memory address or nil pointer dereference

so which of these two is likely the culprit? pull-export/index.html or test/index.html?

retorquere commented 5 years ago

I have another file on which it consistently crashes, but if I test only that file, it passes.

wjdp commented 5 years ago

It'll be test/index.html there. The debug message "testDocument on…" is the first thing logged when htmltest finishes one document and starts the next, so the panic happened while processing the file named in that last line.

retorquere commented 5 years ago

It crashes on more files now. I have since removed test/index.html, but it still crashes on others. My current run ends with

testDocument on installation/configuration/hidden-preferences/index.html
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1111246]

but it may not be something about that file in particular; if I set DirectoryPath to public/installation/configuration (it's usually set to public), it does not crash.

retorquere commented 5 years ago

My site is available as a tarball on https://0x0.st/z2DV.gz

retorquere commented 5 years ago

(but that tarball was produced on macOS, which means that Support and support are treated as the same file)

wjdp commented 5 years ago

Thanks! I'm on holiday next week but will try to have a look at this at some point in July.

retorquere commented 5 years ago

Thanks! Is there anything I can do in the interim to help debugging this?

retorquere commented 5 years ago

I've tried this on a Linux system and it runs without issue there.

wjdp commented 5 years ago

Ah, that's very interesting. I'm no expert on macOS (the only access I have is through the Travis test runners). Right now, without looking at the code, I unfortunately don't have any ideas.

retorquere commented 5 years ago

No issue. When you're back I'd be happy to run an instrumented version that may give more insight.

danyill commented 4 years ago

I'm seeing the same on Linux (Ubuntu Focal):

node@791983aec7ee:~/antora-base$ ./bin/htmltest -l0 public
htmltest started at 09:18:46 on public
========================================================================
0: DirectoryPath string = public
1: DirectoryIndex string = index.html
2: FilePath string = 
3: FileExtension string = .html
4: CheckDoctype bool = true
5: CheckAnchors bool = true
6: CheckLinks bool = true
7: CheckImages bool = true
8: CheckScripts bool = true
9: CheckMeta bool = true
10: CheckGeneric bool = true
11: CheckExternal bool = true
12: CheckInternal bool = true
13: CheckInternalHash bool = true
14: CheckMailto bool = true
15: CheckTel bool = true
16: CheckFavicon bool = false
17: CheckMetaRefresh bool = true
18: EnforceHTML5 bool = false
19: EnforceHTTPS bool = false
20: IgnoreURLs []interface {} = []
21: IgnoreDirs []interface {} = []
22: IgnoreInternalEmptyHash bool = false
23: IgnoreEmptyHref bool = false
24: IgnoreCanonicalBrokenLinks bool = true
25: IgnoreExternalBrokenLinks bool = false
26: IgnoreAltMissing bool = false
27: IgnoreDirectoryMissingTrailingSlash bool = false
28: IgnoreSSLVerify bool = false
29: IgnoreTagAttribute string = data-proofer-ignore
30: HTTPHeaders map[interface {}]interface {} = map[Accept:*/* Range:bytes=0-0]
31: TestFilesConcurrently bool = false
32: DocumentConcurrencyLimit int = 128
33: HTTPConcurrencyLimit int = 16
34: LogLevel int = 0
35: LogSort string = document
36: ExternalTimeout int = 15
37: StripQueryString bool = true
38: StripQueryExcludes []string = [fonts.googleapis.com]
39: EnableCache bool = true
40: EnableLog bool = true
41: OutputDir string = tmp/.htmltest
42: OutputCacheFile string = refcache.json
43: OutputLogFile string = htmltest.log
44: CacheExpires string = 336h
45: NoRun bool = false
46: VCREnable bool = false
47: Version string = 0.12.1
testDocument on Home/faq.html
Home/faq.html
  DOCTYPE html []
 --- Home/faq.html --> <nil>
  from cache --- Home/faq.html --> https://docs.tpwiki.com/Home/faq.html
  OK --- Home/faq.html --> https://docs.tpwiki.com/Home/faq.html
  from cache --- Home/faq.html --> https://docs.tpwiki.com
  OK --- Home/faq.html --> https://docs.tpwiki.com
  target does not exist --- Home/faq.html --> /oauth2/sign_out
testDocument on Home/index.html
Home/index.html
  DOCTYPE html []
 --- Home/index.html --> <nil>
  from cache --- Home/index.html --> https://docs.tpwiki.com/Home/index.html
  OK --- Home/index.html --> https://docs.tpwiki.com/Home/index.html
  from cache --- Home/index.html --> https://docs.tpwiki.com
  OK --- Home/index.html --> https://docs.tpwiki.com
  target does not exist --- Home/index.html --> /oauth2/sign_out
testDocument on SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html
SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html
  DOCTYPE html []
 --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> <nil>
  from cache --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://docs.tpwiki.com
  OK --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://docs.tpwiki.com
  target does not exist --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> /oauth2/sign_out
  from cache --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://gitlab.tpwiki.com/standard-designs/arc-flash-protection/SEL751_Arc_Flash_Protection_Settings/tree/master
  OK --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://gitlab.tpwiki.com/standard-designs/arc-flash-protection/SEL751_Arc_Flash_Protection_Settings/tree/master
  from cache --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://gitlab.tpwiki.com/standard-designs/arc-flash-protection/SEL751_Arc_Flash_Protection_Settings/issues
  OK --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://gitlab.tpwiki.com/standard-designs/arc-flash-protection/SEL751_Arc_Flash_Protection_Settings/issues
  from cache --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://gitlab.tpwiki.com/standard-designs/arc-flash-protection/SEL751_Arc_Flash_Protection_Settings/compare/master...master
  OK --- SEL751_Arc_Flash_Protection_Settings/unstable/downloads/Downloads.html --> https://gitlab.tpwiki.com/standard-designs/arc-flash-protection/SEL751_Arc_Flash_Protection_Settings/compare/master...master
testDocument on SEL751_Arc_Flash_Protection_Settings/unstable/setting_guide/Setting_Guide.html
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x51c0d7]

goroutine 1 [running]:
github.com/wjdp/htmltest/htmldoc.(*Document).Parse(0x0)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmldoc/document.go:47 +0x37
github.com/wjdp/htmltest/htmldoc.(*Document).IsHashValid(...)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmldoc/document.go:112
github.com/wjdp/htmltest/htmltest.(*HTMLTest).checkInternalHash(0xc0000ce240, 0xc0003210b0)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/check-link.go:325 +0xb0
github.com/wjdp/htmltest/htmltest.(*HTMLTest).checkInternal(0xc0000ce240, 0xc0003210b0)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/check-link.go:299 +0x15d
github.com/wjdp/htmltest/htmltest.(*HTMLTest).checkLink(0xc0000ce240, 0xc0000fe480, 0xc0001ed0a0)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/check-link.go:97 +0x5ec
github.com/wjdp/htmltest/htmltest.(*HTMLTest).testDocument(0xc0000ce240, 0xc0000fe480)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/htmltest.go:204 +0x18c
github.com/wjdp/htmltest/htmltest.(*HTMLTest).testDocuments(0xc0000ce240)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/htmltest.go:183 +0x65
github.com/wjdp/htmltest/htmltest.Test(0xc000013950, 0xc000010018, 0xc0000f9d48, 0x1)
        /home/travis/gopath/src/github.com/wjdp/htmltest/htmltest/htmltest.go:143 +0x89b
main.run(0xc000013950, 0xc000013950)
        /home/travis/gopath/src/github.com/wjdp/htmltest/main.go:159 +0x207
main.main()
        /home/travis/gopath/src/github.com/wjdp/htmltest/main.go:66 +0x268

My system is:

Linux 791983aec7ee 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 x86_64 x86_64 GNU/Linux

running within a Docker container.

Happy to provide further information. The error is fully reproducible: it occurs on every run.

danyill commented 4 years ago

My directory is also called public, but I also tried public2 and 2xxx, and it crashed on both with the same errors.

wjdp commented 3 years ago

This seems to be an issue with parsing HTML. I know this issue is very old but @danyill do you have a copy of the files that caused the crash?

danyill commented 3 years ago

@wjdp Sorry for the slow response, time is getting away on me. I have a copy of a very similar one which also crashes on the latest version of htmltest. I can't share this publicly but am happy to provide it to you. What is the easiest way to send it? Can I email it to your commit address? (It is a 1.5 MB file with embedded images.)

Marshevskyy commented 2 years ago

Hi, @wjdp. I was able to replicate this error.

In my case, I have two pages, and the first has an anchor link into the second.

Page 1: public/docs/dev/index.html

...
<!DOCTYPE html>
<html>
<title>test</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<body>
  <a href="/docs/hello-configuration/#link">
    <code class="language-text">link</code>
  </a>
</body>
</html>
...

Page 2: public/docs/hello-configuration/index.html

...
<!DOCTYPE html>
<html>
<title>test</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<body>
  <h2 id="link" style="position:relative">
    <a href="#link">
    </a>link
  </h2>
</body>
</html>
...

.htmltest.yml:

IgnoreDirectoryMissingTrailingSlash: true
DirectoryPath: "public/"
IgnoreAltMissing: true
OutputDir: "tmp/.htmltest"
OutputCacheFile: "refcache.json"
OutputLogFile: "htmltest.log"
IgnoreDirs:
  - hello

The links are valid: they work fine when I serve the site on localhost or on a server. Is there any workaround? Please let me know if you need more details.

UPDATED (27.10.21): once I remove DirectoryPath: "public/" from .htmltest.yml, it seems to work.

UPDATED (28.10.21):

markmandel commented 2 years ago

Been digging into this while watching tv 😄

I'm quite sure I've narrowed down the culprit:

https://github.com/wjdp/htmltest/blob/d3ffce77f294b7f9ecf8c115b7fe2059d2cc87a5/htmltest/check-link.go#L336

Debugging shows me that hT.documentStore.ResolveRef(ref) can return a response of (nil, false), but the ok value is never checked.

I'm fairly sure I can trigger this issue in one of two ways:

  1. Aim htmltest at an HTML page that has links to parent directories
  2. Aim htmltest at an HTML page that has valid (at least I think they are valid) links to a set of pages ignored via IgnoreDirs.

From there, any method called on the returned nil Document will panic as soon as it touches one of the struct's fields.
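
To make the failure mode above concrete, here is a minimal sketch of why a method call on a nil pointer only blows up once a field is read. The Document type, its parsed field, and the Kind and tryParse helpers are hypothetical stand-ins invented for this example; they are not htmltest's actual code.

```go
package main

import "fmt"

// Document is a hypothetical stand-in for htmldoc.Document.
// The parsed field is invented for this sketch.
type Document struct {
	parsed bool
}

// Kind never touches a field, so calling it on a nil receiver is safe.
func (d *Document) Kind() string {
	return "document"
}

// Parse reads a field through the receiver, so a nil receiver panics here.
func (d *Document) Parse() {
	if d.parsed {
		return
	}
	d.parsed = true
}

// tryParse calls Parse and reports whether the call panicked.
func tryParse(d *Document) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	d.Parse()
	return false
}

func main() {
	var missing *Document // what a failed lookup would hand back
	fmt.Println("Kind on nil receiver:", missing.Kind())
	fmt.Println("Parse on nil receiver panicked:", tryParse(missing))
	fmt.Println("Parse on real document panicked:", tryParse(&Document{}))
}
```

This matches the shape of the traces earlier in the thread, where Parse is entered with a receiver of 0x0 and the fault address is a small offset from nil.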

I'll keep digging, but I wanted to report on my progress in case it spurred someone else to see the correct path through to resolving this issue.

markmandel commented 2 years ago

So the fix for the panic is easy enough: check the ok value returned from ResolveRef (my branch is over here):

https://github.com/markmandel/htmltest/blob/1756c2ea506c42270afb56ca7bed6e9194701a2a/htmltest/check-link.go#L336-L343
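
For readers who don't want to open the branch, the guard described above could look roughly like the sketch below. It is an illustration under assumptions, not the code from the linked commit: ResolveRef is simplified to take a path string, and the Document fields, the checkInternalHash signature, and the returned message strings are all hypothetical.

```go
package main

import "fmt"

// Document is a hypothetical, simplified stand-in for htmldoc.Document.
type Document struct {
	Path   string
	hashes map[string]bool // ids present in the page (invented field)
}

// IsHashValid reports whether the page contains the given id.
func (d *Document) IsHashValid(hash string) bool {
	return d.hashes[hash]
}

// DocumentStore is a hypothetical, simplified document index.
type DocumentStore struct {
	byPath map[string]*Document
}

// ResolveRef returns (nil, false) when no document is known for the path,
// which is the case the thread says was never checked.
func (s *DocumentStore) ResolveRef(path string) (*Document, bool) {
	doc, ok := s.byPath[path]
	return doc, ok
}

// checkInternalHash checks ok before using the document, so a failed
// lookup becomes a reported problem instead of a nil dereference.
func checkInternalHash(s *DocumentStore, path, hash string) string {
	doc, ok := s.ResolveRef(path)
	if !ok {
		return "target does not exist"
	}
	if !doc.IsHashValid(hash) {
		return "hash does not exist"
	}
	return "ok"
}

func main() {
	store := &DocumentStore{byPath: map[string]*Document{
		"docs/hello-configuration/index.html": &Document{
			Path:   "docs/hello-configuration/index.html",
			hashes: map[string]bool{"link": true},
		},
	}}
	fmt.Println(checkInternalHash(store, "docs/hello-configuration/index.html", "link"))
	fmt.Println(checkInternalHash(store, "docs/hello-configuration/index.html", "nope"))
	fmt.Println(checkInternalHash(store, "docs/missing/index.html", "link"))
}
```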

The next issue I ran into is that my reference should resolve but doesn't, because (I assume) the document it points to isn't available in DocumentPathMap, since it matches IgnoreDirs 🤔 Now to work out how that map gets populated.

markmandel commented 2 years ago

Okay! I think I got it working! I had to keep every Document in DocumentStore and add a property saying whether it should be ignored for testing. That way ignored documents can still be resolved as link targets, but are skipped when the tests run.

PR coming shortly!
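
As a rough illustration of the approach described in this comment (the real change is in the PR, not here), a store could record every document and carry a per-document flag. Everything below is a hypothetical sketch: the IgnoreTest field, AddDocument, and TestablePaths are invented names, and the plain prefix match is used only for brevity and is not claimed to be how htmltest matches IgnoreDirs.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in; IgnoreTest is an invented field name.
type Document struct {
	Path       string
	IgnoreTest bool // still resolvable as a link target, but not tested itself
}

// DocumentStore keeps every document, including ignored ones.
type DocumentStore struct {
	Documents []*Document
	byPath    map[string]*Document
}

// NewDocumentStore returns an empty store.
func NewDocumentStore() *DocumentStore {
	return &DocumentStore{byPath: make(map[string]*Document)}
}

// AddDocument records the document and flags it when it sits under an
// ignored directory. Prefix matching is an assumption made for this sketch.
func (s *DocumentStore) AddDocument(path string, ignoreDirs []string) {
	doc := &Document{Path: path}
	for _, dir := range ignoreDirs {
		if strings.HasPrefix(path, dir) {
			doc.IgnoreTest = true
			break
		}
	}
	s.Documents = append(s.Documents, doc)
	s.byPath[path] = doc
}

// ResolveRef finds any known document, ignored or not.
func (s *DocumentStore) ResolveRef(path string) (*Document, bool) {
	doc, ok := s.byPath[path]
	return doc, ok
}

// TestablePaths lists only the documents that should actually be tested.
func (s *DocumentStore) TestablePaths() []string {
	var paths []string
	for _, doc := range s.Documents {
		if !doc.IgnoreTest {
			paths = append(paths, doc.Path)
		}
	}
	return paths
}

func main() {
	ignore := []string{"Support/"}
	store := NewDocumentStore()
	store.AddDocument("index.html", ignore)
	store.AddDocument("Support/index.html", ignore)

	_, ok := store.ResolveRef("Support/index.html")
	fmt.Println("ignored page resolves:", ok)
	fmt.Println("pages to test:", store.TestablePaths())
}
```

The design point is that resolving a link and deciding whether to test a page become two separate questions, so a link into an ignored directory no longer looks like a missing document.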