zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

[HtmlAgilityPack version 1.11.60] request: add HtmlWeb Load() exception #546

Closed ghost closed 6 months ago

ghost commented 7 months ago

Hi

I would like to request that the Load() method of HtmlWeb throw an exception when the URL does not exist. That way I can catch the exception in my code to detect a nonexistent URL. Currently no exception is thrown, so we cannot tell whether the URL exists.

HtmlAgilityPack version 1.11.60

Thanks

JonathanMagnan commented 6 months ago

Hello @trance-babe ,

An exception is already thrown if the domain name doesn't exist. Could you give me an example of a URL that doesn't work?

try
{
    var url = "https://html-agility-pack.net/";

    var web1 = new HtmlAgilityPack.HtmlWeb();
    web1.Load(url);
}
catch(Exception ex)
{
    if(ex.Message.StartsWith("The remote name could not be resolved"))
    {
        // The domain name does not exist
    }
}

Are you talking about an invalid URL that redirects to a 404 page, for example?

Best Regards,

Jon

ghost commented 6 months ago

Thanks. I found that I have to catch System.Net.WebException, because it doesn't work otherwise.

Is it also possible to catch a 404? If so, which exception type (or exception message) should I look for?

try
{
   var url = "https://html-agility-pack.net/";

   var web1 = new HtmlAgilityPack.HtmlWeb();
   web1.Load(url);
}
catch(System.Net.WebException ex)
{
   if(ex.Message.StartsWith("The remote name could not be resolved"))
   {
      // The domain name does not exist
   }
}
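
Matching on ex.Message can be fragile, since exception text may differ between runtimes and system languages. Here is a minimal sketch of an alternative, assuming the failure surfaces as a System.Net.WebException as in the snippet above: inspect WebException.Status instead. LoadErrors and Describe are hypothetical names used only for illustration. Whether HtmlWeb.Load() raises an exception for HTTP errors such as 404 is not confirmed in this thread, so the ProtocolError branch only shows the general WebException pattern.

```csharp
using System;
using System.Net;

public static class LoadErrors
{
    // Hypothetical helper: turns a WebException into a short description.
    public static string Describe(WebException ex)
    {
        // Status identifies transport-level failures such as a failed DNS lookup.
        if (ex.Status == WebExceptionStatus.NameResolutionFailure)
        {
            return "domain not found";
        }

        // For HTTP-level errors, the response (when present) carries the status code.
        if (ex.Status == WebExceptionStatus.ProtocolError
            && ex.Response is HttpWebResponse response)
        {
            return "HTTP " + (int)response.StatusCode;
        }

        return "other: " + ex.Status;
    }
}
```

Inside the catch block above, LoadErrors.Describe(ex) could replace the ex.Message.StartsWith(...) check.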

JonathanMagnan commented 6 months ago

Hello @trance-babe ,

If you want to capture this kind of error, I would recommend using an HttpClient directly and then loading the HTML into HAP.

Something like this:

var url = "https://sqlfiddle.com/test"; // Replace with the URL you want to check
var httpClient = new HttpClient();

// Send a GET request to the URL ... I used .Result since my test method was not async
var response = httpClient.GetAsync(url).Result;

// Check if the status code indicates a not found error (404)
if (response.StatusCode == System.Net.HttpStatusCode.NotFound)
{
    Console.WriteLine("Error 404: Page not found.");
}
else
{
    Console.WriteLine("Page found. Status code: " + response.StatusCode);
    // Optionally, load the HTML to HtmlAgilityPack if needed ... I used .Result since my test method was not async
    var htmlContent = response.Content.ReadAsStringAsync().Result;

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(htmlContent);
}
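
When the calling method can be async, the same check can be written without .Result. This is a minimal sketch under that assumption. PageFetcher, ReadHtmlOrNullAsync and FetchHtmlOrNullAsync are hypothetical names, not part of HAP or HttpClient, and throwing on other non-success status codes is a design choice made here, not something suggested in the thread.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Hypothetical helper: returns the response body, or null when the status is 404.
    public static async Task<string> ReadHtmlOrNullAsync(HttpResponseMessage response)
    {
        if (response.StatusCode == HttpStatusCode.NotFound)
        {
            return null;
        }

        // Throws HttpRequestException for any other non-success status code.
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }

    // Sends a GET request to the URL and applies the check above.
    public static async Task<string> FetchHtmlOrNullAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            return await ReadHtmlOrNullAsync(response);
        }
    }
}
```

A non-null result can then be passed to htmlDoc.LoadHtml(...) exactly as in the snippet above.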

ghost commented 6 months ago

Thanks!

Is there a forum, or can this GitHub repository be used for questions like this too?

JonathanMagnan commented 6 months ago

Depending on your question, Stack Overflow might be more suited: https://stackoverflow.com/questions/tagged/html-agility-pack

We certainly try to help with questions whenever we can, but at the moment we barely have time to provide minimal support.

Best Regards,

Jon