zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

SelectNodes not matching xpath where attribute name starts-with #518

Closed Kenshinofkin closed 1 year ago

Kenshinofkin commented 1 year ago


1. Description

SelectNodes is returning null when trying to get nodes where attribute name starts-with function is used. I tested the xpath and HTML on a couple of different sites and was able to get nodes back.

XPATH: //*[@*[starts-with(name(), 'on')]]
HTML: <div onclick="alert('test');"></div>
Expected Result: div should be returned from SelectNodes.
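A minimal C# sketch of the reproduction described above, assuming the HtmlAgilityPack NuGet package is referenced. The class name `Repro` and the helper `CountMatches` are hypothetical names used only for illustration; they are not part of the original report.

```csharp
using System;
using HtmlAgilityPack;

public static class Repro
{
    // Returns the number of nodes matched by the XPath expression.
    // HtmlNode.SelectNodes returns null (not an empty collection) when nothing matches.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes?.Count ?? 0;
    }

    public static void Main()
    {
        const string html = "<div onclick=\"alert('test');\"></div>";
        const string xpath = "//*[@*[starts-with(name(), 'on')]]";

        // The reporter expects the div to match; the report says SelectNodes returns null instead.
        Console.WriteLine(CountMatches(html, xpath));
    }
}
```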

2. Exception

No Exception


3. Fiddle or Project

Fiddle

4. Any further technical details

Updated: Include correct xpath used and example fiddle.

elgonzo commented 1 year ago

While i am not one of the authors/maintainers of HtmlAgilityPack (i am just a user), i can tell that there are two issues with the XPath expression you wrote, which is incorrect.

But there is also a bug in HtmlAgilityPack. Before i address the incorrectness of the XPath expression posted in the report, i want to address the HAP bug first, as this is the only thing of interest for the HtmlAgilityPack maintainers. (@Kenshinofkin please therefore leave the issue report open, even though i suggest a workaround below.)

For attribute nodes, the HtmlNodeNavigator.Name property will not return the QName of the attribute node as it should. This makes the XPath function name() misbehave for attribute nodes. When HtmlNodeNavigator navigates to an attribute node, HtmlNodeNavigator.Name will not yield the attribute name but instead return the name of the last navigated element node, which is incorrect.
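The behavior described above can be observed directly by walking a navigator onto an attribute and comparing `Name` with `LocalName`. This is a rough sketch, assuming the HtmlAgilityPack NuGet package is referenced; `NavigatorNameDemo` and `FirstAttributeNames` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Xml.XPath;
using HtmlAgilityPack;

public static class NavigatorNameDemo
{
    // Returns (Name, LocalName) as reported by the navigator while it is
    // positioned on the first attribute of the first node in the document.
    public static (string Name, string LocalName) FirstAttributeNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // HtmlDocument.CreateNavigator() starts on the document node.
        XPathNavigator nav = doc.CreateNavigator();
        if (!nav.MoveToFirstChild() || !nav.MoveToFirstAttribute())
            throw new InvalidOperationException("No element with an attribute found.");

        return (nav.Name, nav.LocalName);
    }

    public static void Main()
    {
        var (name, localName) = FirstAttributeNames("<div onclick=\"alert('test');\"></div>");

        // On an affected HAP version, Name is expected to show the element name
        // rather than the attribute name, while LocalName shows the attribute name.
        Console.WriteLine($"Name = {name}, LocalName = {localName}");
    }
}
```

If both values print the attribute name, the HAP version in use presumably already contains a fix.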


With the HAP bug explained, let's now address your invalid XPath expression. For HAP maintainers, reading the text below is not necessary, as it only addresses syntax errors in the user's XPath expression.

Contrary to your claim that no exception is being thrown, the SelectNodes method of HtmlAgilityPack 1.11.53 will throw a System.Xml.XPath.XPathException with a syntactically incorrect XPath expression like the one in your report. Please pay closer attention when troubleshooting and debugging your projects.

Note that HtmlAgilityPack relies on .NET's own System.Xml.XPath infrastructure, which is and will remain limited to XPath 1.0. So you will have to follow XPath 1.0 syntax rules with respect to .NET's XPath and therefore HAP's XPath capabilities.

The first issue of your XPath expression is that it doesn't select any element node on which the [...] predicate could be applied. //[...] is incorrect syntax. // is not a valid selector (or node test). It is an alias for /descendant-or-self::node()/; note the trailing slash. //[...] is essentially /descendant-or-self::node()/[...], which is invalid syntax. Therefore //[...] needs to be turned into //*[...] to select any nodes.

The second issue is quite similar. @[...] is also invalid syntax. @ is an alias for attribute::, and @[...] is therefore attribute::[...] which again is invalid syntax. As with the first issue, this XPath sub-expression needs to select some attribute nodes the starts-with predicate can be applied to. Therefore @[...] should here be changed to @*[...], which would apply the starts-with predicate to every attribute of each of the nodes selected with //*.
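Both syntax points can be checked without HAP, using only .NET's System.Xml.XPath. The sketch below compiles the invalid and the corrected expression, then shows that the abbreviated form and its expanded form select the same node. It runs against a small well-formed XML snippet rather than HTML, so the HAP name() bug does not come into play; `XPathSyntaxDemo`, `IsValid`, and `Count` are hypothetical names.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class XPathSyntaxDemo
{
    // True if .NET's XPath 1.0 compiler accepts the expression.
    public static bool IsValid(string xpath)
    {
        try
        {
            XPathExpression.Compile(xpath);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }

    // Counts matches against a small well-formed XML document.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root><div onclick=\"x\"/><p class=\"y\"/></root>");
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Missing node tests after // and @: rejected by the compiler.
        Console.WriteLine(IsValid("//[@[starts-with(name(), 'on')]]"));   // False
        // With * as the node test in both places: accepted.
        Console.WriteLine(IsValid("//*[@*[starts-with(name(), 'on')]]")); // True

        // Abbreviated and expanded forms are equivalent.
        Console.WriteLine(Count("//*[@*[starts-with(name(), 'on')]]")); // 1
        Console.WriteLine(Count(
            "/descendant-or-self::node()/child::*[attribute::*[starts-with(name(), 'on')]]")); // 1
    }
}
```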

But even if you fix those two issues, due to the HAP bug i highlighted above, the XPath expression would still not succeed as it uses the name() function, and due to that HAP bug name() will not return attribute names. But the local-name() function will, so using it instead of name() can serve as a workaround until the name() bug in HAP is fixed.

In summary, applying the two corrections of invalid XPath syntax and the local-name() workaround for HAP's name() bug, you should use the XPath expression:

//*[@*[starts-with(local-name(), 'on')]]
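In C#, the workaround expression might be used as sketched below, assuming the HtmlAgilityPack NuGet package is referenced. The second method is not from this thread: it is an alternative that skips XPath entirely and filters with LINQ over HAP's `Descendants()` and `Attributes`. `EventAttributeFinder`, `ViaXPath`, and `ViaLinq` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class EventAttributeFinder
{
    // XPath route: local-name() stands in for name() as the workaround.
    public static List<HtmlNode> ViaXPath(HtmlDocument doc)
    {
        // SelectNodes returns null when nothing matches, so normalize to an empty list.
        var nodes = doc.DocumentNode.SelectNodes("//*[@*[starts-with(local-name(), 'on')]]");
        return nodes?.ToList() ?? new List<HtmlNode>();
    }

    // LINQ route (alternative, not from the thread): no XPath engine involved.
    public static List<HtmlNode> ViaLinq(HtmlDocument doc) =>
        doc.DocumentNode.Descendants()
           .Where(n => n.Attributes.Any(a => a.Name.StartsWith("on", StringComparison.Ordinal)))
           .ToList();

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div onclick=\"alert('test');\"><span class=\"x\">hi</span></div>");

        foreach (HtmlNode node in ViaXPath(doc))
            Console.WriteLine($"XPath: {node.Name}");
        foreach (HtmlNode node in ViaLinq(doc))
            Console.WriteLine($"LINQ:  {node.Name}");
    }
}
```

The LINQ route does not depend on how the navigator reports attribute names, which may make it a reasonable fallback if upgrading HAP is not an option.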


> I tested the xpath and HTML on a couple of different sites and was able to get nodes back.

I don't know what sites you used or what XPath expression you actually tested, and i don't really need to know. The important bit here is that if you really tested the XPath expression as written in your report, make sure you test using the XPath 1.0 spec. If these sites then still return nodes given the XPath expression in your report, then don't use these sites anymore (because your XPath expression is not valid XPath 1.0 syntax, and any proper XPath 1.0 tester/validator should only return errors for your XPath expression and nothing else).

Kenshinofkin commented 1 year ago

@elgonzo this was a mistype. I was trying to quickly create an example using dotnetfiddle.net and was running into issues. The xpath I used was //*[@*[starts-with(name(), 'on')]]. I created a Stack Overflow question beforehand because I am not familiar with xpath, and someone answered that it was a bug in HAP.

elgonzo commented 1 year ago

@Kenshinofkin ah, if only you had linked to your SO question. Would have saved me from writing all this... :stuck_out_tongue_winking_eye: Sincerely, don't worry, it's all good :grinning:

JonathanMagnan commented 1 year ago

Stack Overflow issue for reference: https://stackoverflow.com/questions/77204747/html-agility-pack-not-matching-xpath-where-attribute-name-starts-with

Kenshinofkin commented 1 year ago

@elgonzo It wasn't a waste. At least not on my end. I did learn a couple of things about xpath and about System.Xml.XPath being limited to XPath 1.0. Thank you.