zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

How to make the DocumentNode.SelectNodes(XPath) for both text and img content together in the correct sequence? #536

Closed Qsama95 closed 9 months ago

Qsama95 commented 9 months ago

I want to convert an HTML file into a text file. The HTML file contains both text and img content, and I would like the text file to keep the text and image information in the same order as in the HTML file. However, at the moment I can only extract a single node type per call to the DocumentNode.SelectNodes(XPath) method. Is there a way to achieve this? Here is my current code:

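        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);
        string plainText = "";

        // First pass: text nodes only (SelectNodes returns null when nothing matches)
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes != null)
        {
            foreach (HtmlNode node in textNodes)
            {
                if (!string.IsNullOrWhiteSpace(node.InnerText))
                {
                    plainText += node.InnerText.Trim(); // Trim extra spaces and add text content
                }
            }
        }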

        // Placeholder for image information
        var imageNodes = doc.DocumentNode.SelectNodes("//img");
        if (imageNodes != null)
        {
            foreach (var imageNode in imageNodes)
            {
                plainText += "[Image: " + imageNode.GetAttributeValue("src", "Unknown") + "]\n"; // Placeholder for image info
            }
        }

elgonzo commented 9 months ago

Try using the union operator |:

//text() | //img
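
As a rough illustration of how the union expression could be applied to the code from the question, here is a minimal sketch. The class name `HtmlToText` and the method `Convert` are hypothetical names chosen for this example, and the newline appended after each text fragment is an assumption about the desired output, not part of the original code. The sketch relies on the union result being returned in document order, which matches what was reported in this thread.

```csharp
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper for illustration only; not part of Html Agility Pack.
public static class HtmlToText
{
    public static string Convert(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        // One query selects both text nodes and <img> elements.
        var nodes = doc.DocumentNode.SelectNodes("//text() | //img");
        if (nodes == null)
        {
            return string.Empty; // SelectNodes returns null when nothing matches
        }

        var plainText = new StringBuilder();
        foreach (HtmlNode node in nodes)
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip whitespace-only text nodes, as in the original code.
                if (!string.IsNullOrWhiteSpace(node.InnerText))
                {
                    // Assumption: one line per text fragment.
                    plainText.Append(node.InnerText.Trim()).Append('\n');
                }
            }
            else if (node.Name == "img")
            {
                // Same placeholder format as in the question.
                plainText.Append("[Image: ")
                         .Append(node.GetAttributeValue("src", "Unknown"))
                         .Append("]\n");
            }
        }
        return plainText.ToString();
    }
}
```

Because each node is handled as it is encountered, an image placeholder lands between the text that precedes it and the text that follows it in the HTML, instead of being appended after all of the text.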

JonathanMagnan commented 9 months ago

Hello @Qsama95,

Let us know if @elgonzo's solution worked for you.

Best Regards,

Jon

Qsama95 commented 9 months ago

@elgonzo yes it works. Thank you! @JonathanMagnan