Html Agility Pack (HAP) is a free, open-source HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT. It is a .NET library that lets you parse "out of the web" HTML files.
I want to convert an HTML file into a text file.
The HTML file contains both text and img elements.
I would like the text file to keep the text and image information in the same order as it appears in the HTML file.
However, with the DocumentNode.SelectNodes(XPath) method I can currently only extract one node type at a time.
Is there a way to achieve this?
Here is my current code:
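using HtmlAgilityPack;

string plainText = string.Empty;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent); // htmlContent holds the HTML source as a string

// First pass: collect every non-empty text node.
// SelectNodes returns null when nothing matches, so check before iterating.
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null)
{
    foreach (HtmlNode node in textNodes)
    {
        if (!string.IsNullOrWhiteSpace(node.InnerText))
        {
            plainText += node.InnerText.Trim(); // trim surrounding whitespace and append the text
        }
    }
}

// Second pass: append a placeholder for each image.
// These all land after the text, so the original order is lost.
var imageNodes = doc.DocumentNode.SelectNodes("//img");
if (imageNodes != null)
{
    foreach (var imageNode in imageNodes)
    {
        plainText += "[Image: " + imageNode.GetAttributeValue("src", "Unknown") + "]\n"; // placeholder for image info
    }
}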
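One direction that might work is a single pass over the tree instead of two separate XPath queries. The sketch below is only an illustration, not a tested solution for every document: it assumes that HtmlNode.Descendants() yields nodes in document order, and the class name HtmlToTextSketch, the method name Convert, and the choice to skip text inside script and style elements are all hypothetical additions of mine.

```csharp
using System.Text;
using HtmlAgilityPack;

public static class HtmlToTextSketch
{
    // Hypothetical helper: walks the document once so that text and
    // image placeholders are emitted in the order they appear.
    public static string Convert(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Assumption: script and style contents are not wanted in the output.
                string parentName = node.ParentNode?.Name;
                if (parentName == "script" || parentName == "style")
                {
                    continue;
                }

                // Decode entities such as &amp; and drop surrounding whitespace.
                string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0)
                {
                    sb.Append(text).Append('\n');
                }
            }
            else if (node.NodeType == HtmlNodeType.Element && node.Name == "img")
            {
                // Same placeholder format as in the code above.
                sb.Append("[Image: ")
                  .Append(node.GetAttributeValue("src", "Unknown"))
                  .Append("]\n");
            }
        }
        return sb.ToString();
    }
}
```

Calling HtmlToTextSketch.Convert("<p>Hello</p><img src=\"a.png\"><p>World</p>") should then produce the lines Hello, [Image: a.png], and World in that order. A StringBuilder is used here only to avoid repeated string concatenation; the ordering comes from the single traversal.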