zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

bug: SelectSingleNode not returning anything #547

Closed ghost closed 6 months ago

ghost commented 6 months ago

Hi

.NET Framework 4.8 HtmlAgilityPack 1.11.60

I don't know if I found a bug, but something is not working in the following code. I want to get all the Radio Shows listed on the right of this website: https://www.trancepodcasts.com/

[image]

I use the following code, but nothing is printed at all. I used similar code on other websites and there it worked perfectly. What is going on here? Using F12 in Mozilla Firefox, I found that the elements should be in there.

[image]

HtmlWeb web = new HtmlWeb();

HtmlAgilityPack.HtmlDocument document = web.Load("https://www.trancepodcasts.com/");

HtmlNode node = document.DocumentNode.SelectSingleNode("//ul[class='sub-menu mm-listview']");

if (node != null)
{
    // Select all li nodes within the ul
    HtmlNodeCollection links = node.SelectNodes(".//li");

    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.OuterHtml);
        }
    }
}

I checked and "node" is null in the code example. Why is it null?

ghost commented 6 months ago

Another example on the same website that doesn't do anything when using SelectSingleNode()

Trying to get the highest page number (here "11"):

[image]

Using F12 in Mozilla Firefox, I found it should be in there using the XPath: /html/body/div[3]/div[2]/div[6]/div[1]/div/div[1]/div/div/div[2]/a[3]

The code below doesn't do anything. It also worked on another website. It should print "11"

HtmlWeb web = new HtmlWeb();

HtmlAgilityPack.HtmlDocument document = web.Load("https://www.trancepodcasts.com/a-dream-radio/");

HtmlNode node = document.DocumentNode.SelectSingleNode("/html/body/div[3]/div[2]/div[6]/div[1]/div/div[1]/div/div/div[2]/a[3]");
if (node != null)
{
    Debug.WriteLine(node.InnerText);
}

I checked and "node" is null in the code example. Why is it null?


I tried on another website, and there similar code works: https://www.markusschulz.com/category/gdjb/gdjbtracklists/ It prints "100"

[image]

HtmlWeb web = new HtmlWeb();

HtmlAgilityPack.HtmlDocument document = web.Load("https://www.markusschulz.com/category/gdjb/gdjbtracklists/");

HtmlNode node = document.DocumentNode.SelectSingleNode("/html/body/div[3]/div/div/div[3]/div/div/div[3]/a[7]");
if (node != null)
{
    Debug.WriteLine(node.InnerText);
}

ghost commented 6 months ago

Why does nothing seem to work on https://www.trancepodcasts.com/ while on the other example website it works (using similar code)? Did I find a bug?

JonathanMagnan commented 6 months ago

Hello @trance-babe,

A little bit like your other issue, what appears on the screen is not what has been loaded by HAP.

If you check the page source, you won't find this HTML code: view-source:https://www.trancepodcasts.com/

The HTML code looks more like this:

<ul class="sub-menu">
    <li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-126821"><a href="https://www.trancepodcasts.com/a-dream-radio/">A Dream Radio</a></li>
    <li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-104457"><a href="https://www.trancepodcasts.com/a-state-of-trance/">A State Of Trance</a></li>
    <li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-106361"><a href="https://www.trancepodcasts.com/a-world-into-trance/">A World Into Trance</a></li>
...code...
</ul>

Nothing works on this website due to having dynamic HTML or HTML modified after the page is loaded

Best Regards,

Jon
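As a rough sketch of how HAP can still be used against the HTML the server actually sends, the snippet below parses a trimmed copy of the markup quoted above. The inline HTML string and the class name `StaticMenuExample` are illustrative assumptions, not part of the original issue. Note two differences from the query in the question: the attribute is addressed as `@class` (in XPath a bare `class` refers to a child element named `class`, not the attribute), and the match is on `sub-menu` only, since `mm-listview` does not appear in the served markup shown above.

```csharp
using System;
using HtmlAgilityPack;

class StaticMenuExample
{
    static void Main()
    {
        // Assumption: a shortened copy of the server-sent markup quoted in the thread.
        const string html = @"
<ul class=""sub-menu"">
    <li class=""menu-item""><a href=""https://www.trancepodcasts.com/a-dream-radio/"">A Dream Radio</a></li>
    <li class=""menu-item""><a href=""https://www.trancepodcasts.com/a-state-of-trance/"">A State Of Trance</a></li>
</ul>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // @class addresses the class attribute; contains(...) is a loose substring match,
        // so it would also match other class values that merely contain "sub-menu".
        HtmlNodeCollection links =
            document.DocumentNode.SelectNodes("//ul[contains(@class, 'sub-menu')]/li/a");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string title = link.InnerText.Trim();
                string href = link.GetAttributeValue("href", "");
                Console.WriteLine(title + " -> " + href);
            }
        }
    }
}
```

To try this against the live site, `document.LoadHtml(html)` could be swapped for `new HtmlWeb().Load("https://www.trancepodcasts.com/")`. The served markup may differ from the excerpt above, so the XPath may need adjusting after checking view-source.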

ghost commented 6 months ago

"Nothing works on this website due to having dynamic HTML or HTML modified after the page is loaded"

So I can't use HtmlAgilityPack at all here? Or can I use it on the HTML you gave me?

The other project you talked about: can you tell me which NuGet package I need to install, and can I use that library for this specific website? Is there a code example that does the same as what I am trying to do with HtmlAgilityPack?

JonathanMagnan commented 6 months ago

HAP is meant for parsing HTML as it is served, rather than for working with dynamic HTML.

I believe what you are looking for is Selenium WebDriver: https://riptutorial.com/selenium-webdriver/learn/100000/overview

You can find out how to set it up in the tutorial we made years ago: https://riptutorial.com/selenium-webdriver/learn/100001/setup-selenium

Unfortunately, my time doesn't permit me to help you with it. However, using ChatGPT should get you started.

Best Regards,

Jon

ghost commented 6 months ago

I can't find ANY examples and I don't know ChatGPT...

Can you please give a very short code example that is doing this?

If I use Selenium WebDriver: do I need to work with the HTML as seen in the F12 window of my web browser, or the HTML as seen via "View Page Source" (right click) in my web browser?

HtmlWeb web = new HtmlWeb();

HtmlAgilityPack.HtmlDocument document = web.Load("https://www.trancepodcasts.com/");

HtmlNode node = document.DocumentNode.SelectSingleNode("//ul[class='sub-menu mm-listview']");

if (node != null)
{
    // Select all li nodes within the ul
    HtmlNodeCollection links = node.SelectNodes(".//li");

    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.OuterHtml);
        }
    }
}

JonathanMagnan commented 6 months ago

I'm really sorry,

I would like to help you more, but as I already said, my time is very limited (I barely have time for myself at the moment). Selenium is another tool to learn. In short, it opens a new browser, and you can interact with it.

The first thing you should learn in this case is ChatGPT (or an alternative), and you could even ask your parents for a paid subscription: https://chat.openai.com/

It is becoming a day-to-day tool for anybody who can use it and take advantage of it. It's pretty simple to use: it is a chat box that provides information (which may be good or bad, but for any common subject it's accurate enough).

Best Regards,

Jon
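
For the Selenium route suggested in the thread, a minimal sketch might look like the following. It assumes the `Selenium.WebDriver` NuGet package is referenced and Chrome is installed. The class name `RenderedMenuExample` and the CSS selector are illustrative assumptions based on the class names mentioned in this issue, not something confirmed by the maintainer. Because Selenium drives a real browser, its element queries run against the live DOM (what F12 shows), not the original page source.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class RenderedMenuExample
{
    static void Main()
    {
        // Assumption: Chrome is installed. Depending on the Selenium version,
        // a matching chromedriver binary may need to be provided separately.
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://www.trancepodcasts.com/");

            // FindElements queries the live DOM after scripts have modified it.
            // The selector is an assumption; adjust it after inspecting the page.
            var links = driver.FindElements(By.CssSelector("ul.sub-menu li a"));

            foreach (IWebElement link in links)
            {
                // GetDomAttribute returns the attribute value as written in the markup.
                Console.WriteLine(link.GetDomAttribute("href"));
            }
        } // Disposing the driver closes the browser window.
    }
}
```

If the menu is built by scripts that finish after the page load event, the query may run too early and return nothing; in that case an explicit wait before `FindElements` would likely be needed.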