zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

SelectSingleNode cannot select attributes, like in Xml #311

Open wiz0u opened 5 years ago

wiz0u commented 5 years ago

With an XPath expression ending in /@attributeName:

System.Xml.XmlNode.SelectSingleNode correctly returns an attribute node with Name & Value/InnerText matching the attribute.

HtmlAgilityPack.HtmlNode.SelectSingleNode returns the parent HtmlNode (with its attributes) instead of the attribute itself.

The reason is probably that there is no HtmlAttributeNode class yet. I don't know if it's for memory optimization or what, but it might be useful to have these, possibly created on the fly when these nodes get selected.

(I ended up creating this class myself, with an extension method SelectSingleNodeOrAttr, to work around this limitation of HtmlAgilityPack.)
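To make the difference concrete, here is a minimal side-by-side sketch. The markup, the class name AttributeSelectionDemo, and both helper names are made up for illustration, and the HtmlAgilityPack half simply reflects the behaviour reported above rather than anything confirmed against a specific version.

```csharp
using System;
using System.Xml;
using HtmlAgilityPack;

static class AttributeSelectionDemo
{
    // System.Xml: an XPath ending in /@name resolves to the attribute node itself.
    public static XmlNode SelectWithSystemXml(string markup, string xpath)
    {
        var xml = new XmlDocument();
        xml.LoadXml(markup);
        return xml.SelectSingleNode(xpath);
    }

    // HtmlAgilityPack: per the report above, the same XPath yields the owning element.
    public static HtmlNode SelectWithHap(string markup, string xpath)
    {
        var html = new HtmlDocument();
        html.LoadHtml(markup);
        return html.DocumentNode.SelectSingleNode(xpath);
    }

    static void Main()
    {
        // Hypothetical sample markup, well-formed so both parsers accept it.
        const string markup = "<div><input type=\"checkbox\" name=\"agree\" value=\"yes\" /></div>";
        const string xpath = "//input/@value";

        XmlNode xmlResult = SelectWithSystemXml(markup, xpath);
        Console.WriteLine($"System.Xml: {xmlResult.NodeType} {xmlResult.Name} = {xmlResult.Value}");

        HtmlNode hapResult = SelectWithHap(markup, xpath);
        Console.WriteLine($"HAP: {hapResult.NodeType} {hapResult.Name}");

        // With HAP the attribute has to be read from the element in a second step.
        Console.WriteLine(hapResult.GetAttributeValue("value", ""));
    }
}
```

The second step is the crux of the issue: the caller has to know the attribute name separately from the XPath.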

Hrxn commented 5 years ago

Hey, just came here to ask this, basically..

For others searching for this, there is a way to work around it. It may not be the most elegant solution, but it works and is, I assume, the intended way to do this:

$result is what I get from the XPath expression used with HtmlAgilityPack.HtmlNode.SelectSingleNode

(I'm using HAP from PowerShell)

This works as expected so far:

PS D:\Test> $result.GetType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     False    HtmlNode                                 System.Object

Or, the full result

PS D:\Test> $result

Attributes           : {type, name, value, checked}
ChildNodes           : {}
Closed               : True
ClosingAttributes    : {}
EndNode              : HtmlAgilityPack.HtmlNode
FirstChild           :
HasAttributes        : True
HasChildNodes        : False
HasClosingAttributes : False
Id                   :
InnerHtml            :
InnerText            :
<--- Snipped the rest --->

So, I have four attributes, and running $result.Attributes returns them correctly.

And now, if I want the value from the attribute called "value", I can do this:

$result.Attributes[2].Value

and I have the correct value.

And by the way:

PS D:\Test> $result.Attributes[2].GetType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     False    HtmlAttribute                            System.Object

PS D:\Test> $result.Attributes.GetType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     False    HtmlAttributeCollection                  System.Object

PS D:\Test>

So there already is HtmlAttribute and HtmlAttributeCollection, I think these are the right types, and therefore the classes already exist, @wiz0u ?

Not sure. But it would be nice to access an attribute value directly without resorting to a positional index like this, which depends on the attribute order and invites the infamous "off-by-one" error.
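For what it's worth, HtmlAttributeCollection also has a string indexer, so the positional index can be avoided once the element is in hand. A small C# sketch, where the class name AttributeByName, the helper ValueOrNull, and the sample markup are hypothetical:

```csharp
using System;
using HtmlAgilityPack;

static class AttributeByName
{
    // Returns the attribute's value, or null when the node has no such attribute.
    public static string ValueOrNull(HtmlNode node, string attributeName)
    {
        // The string indexer looks the attribute up by name and yields null if it is absent.
        HtmlAttribute attribute = node.Attributes[attributeName];
        return attribute?.Value;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><input type=\"checkbox\" name=\"agree\" value=\"yes\" checked></div>");
        HtmlNode input = doc.DocumentNode.SelectSingleNode("//input");

        Console.WriteLine(ValueOrNull(input, "value"));
        Console.WriteLine(ValueOrNull(input, "missing") ?? "(no such attribute)");
    }
}
```

The same indexer should be usable from PowerShell, since it is an ordinary .NET indexer on the collection.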

wiz0u commented 5 years ago

"So there already is HtmlAttribute and HtmlAttributeCollection, I think these are the right types, and therefore the classes already exist, @wiz0u ?"

But HtmlAttribute does not derive from HtmlNode (yet?), so it can't be returned by SelectSingleNode.

calbucci commented 2 years ago

@wiz0u It has been a couple of years, but do you know if there is a solution to this problem?

I'm creating a generic HTML parser, and I don't know the attributes' names at compile time. It seems that using Navigator has some pros/cons.

Also, unfortunately, XPathExpression doesn't decompose the XPath to indicate whether it ends in an attribute or not.
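One way the Navigator route could look is sketched below. It relies on the standard System.Xml.XPath.XPathNavigator API, which HtmlDocument.CreateNavigator() returns, and inspects NodeType on the result instead of parsing the XPath string. The class name NavigatorAttributeDemo, the helper Evaluate, and the sample markup are hypothetical, and the behaviour is worth verifying against the HAP version in use.

```csharp
using System;
using System.Xml.XPath;
using HtmlAgilityPack;

static class NavigatorAttributeDemo
{
    // Evaluates an XPath through the navigator and reports what it landed on.
    // Returns null when nothing matches.
    public static (XPathNodeType Kind, string Name, string Value)? Evaluate(HtmlDocument doc, string xpath)
    {
        XPathNavigator root = doc.CreateNavigator();
        XPathNavigator hit = root.SelectSingleNode(xpath);
        if (hit == null) return null;

        // For an attribute hit, Name and Value describe the attribute, not its element.
        return (hit.NodeType, hit.Name, hit.Value);
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><input type=\"checkbox\" name=\"agree\" value=\"yes\"></div>");

        foreach (string xpath in new[] { "//input/@value", "//input", "//input/@nope" })
        {
            var result = Evaluate(doc, xpath);
            Console.WriteLine(result == null
                ? $"{xpath}: no match"
                : $"{xpath}: {result.Value.Kind} {result.Value.Name} = '{result.Value.Value}'");
        }
    }
}
```

The trade-off is that the result is a navigator rather than an HtmlNode, so any follow-up work has to stay in the XPathNavigator API.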

blaisemGH commented 8 months ago

For future readers, I was able to select an attribute value with the following approaches.

html.SelectSingleNode("//xpath/to/node").Attributes.AttributesWithName("class")

to extract the attribute class from a single node.

If you are doing multiple nodes, you can do

html.SelectNodes("//xpath/to/node").GetAttributeValue("class", "class")

This will get the value for the attribute class. The second argument is the default value returned when the attribute is missing, which is why I could enter any value for it, like "xyz", and it still ran, as long as it wasn't null. There is no overload for a single argument, though.
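A short C# sketch of the multi-node case follows. Two points to keep in mind: the second argument of GetAttributeValue is the fallback returned when the attribute is absent, and, as far as I can tell, the method lives on HtmlNode rather than on the collection, so in C# it has to be applied per node (PowerShell does that fan-out implicitly). The class name ClassValues, the helper ClassesOf, the "(none)" placeholder, and the markup are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ClassValues
{
    // Collects the class attribute of every node matched by the XPath.
    public static List<string> ClassesOf(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        // By default SelectNodes yields null, not an empty collection, when nothing matches.
        if (nodes == null) return new List<string>();

        // "(none)" is only the fallback for nodes that lack a class attribute.
        return nodes.Select(n => n.GetAttributeValue("class", "(none)")).ToList();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li class=\"a\">one</li><li>two</li><li class=\"b\">three</li></ul>");

        foreach (string value in ClassesOf(doc, "//li"))
            Console.WriteLine(value);
    }
}
```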

As for the person who mentioned PowerShell, if you're in PowerShell you can easily select any attribute by doing $html.SelectNodes("//xpath/to/node").Attributes | Where-Object name -eq 'class' | Select-Object -ExpandProperty Value. This selects the value for the attribute class like the above code.

Note you can reduce the verbosity in PowerShell with aliases and abbreviations, i.e., you can shorten Where-Object to where or ?, Select-Object to select, and -ExpandProperty to -exp. PS has tools that easily traverse any object you can import into the language. The PSParseHTML module provides the AgilityPack type for PowerShell to wield.

Hrxn commented 8 months ago

@blaisemGH You mean this one? https://www.powershellgallery.com/packages/PSParseHTML/