zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License
2.64k stars 375 forks

Error downloading Html - Exception Message #171

Closed: sharathm89 closed this issue 6 years ago

sharathm89 commented 6 years ago

I am trying to scrape this link but am unable to do so.

It throws an exception with the message "Error downloading Html".


public static async Task<HtmlDocument> GetDocument()
{
    HtmlDocument doc = null;
    string url = "https://www.finedininglovers.com/recipes/appetizer/vegan-dishes-white-asparagus/";
    try
    {
        HtmlWeb web = new HtmlWeb();
        doc = await web.LoadFromWebAsync(url);
    }
    catch (Exception ex)
    {
        // Log the failure; the caller receives null in this case.
        Console.WriteLine(ex.Message);
        Console.WriteLine(ex.StackTrace);
    }
    return doc;
}

I tried setting Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 as the UserAgent, but it still does not work.

JonathanMagnan commented 6 years ago

Hello @sharathm89 ,

Unfortunately, the server returns the following error:

{StatusCode: 500, ReasonPhrase: 'Internal Server Error', Version: 1.1, Content: System.Net.Http.StreamContent, Headers:
{
  x-frame-options: DENY
  X-UA-Compatible: IE=Edge
  X-Iinfo: 8-41929732-41929787 SNNN RT(1523536424411 339) q(0 0 0 -1) r(1 1) U11
  X-CDN: Incapsula
  Transfer-Encoding: chunked
  Cache-Control: private
  Date: Thu, 12 Apr 2018 12:33:42 GMT
  Server: 
  Content-Type: text/html; charset=utf-8
}}

However, the non-async version works fine.

HtmlAgilityPack.HtmlDocument doc = null;
string url = "https://www.finedininglovers.com/recipes/appetizer/vegan-dishes-white-asparagus/";

HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
doc = web.Load(url);
var html = doc.DocumentNode.OuterHtml;

So you can use it in the meantime while we investigate the issue.

Best Regards,

Jonathan
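
A response dump like the one above can be obtained without Html Agility Pack by calling the URL through plain HttpClient, which does not throw on a 500 status unless EnsureSuccessStatusCode is called. The following is only a minimal sketch of that idea; the class name ResponseInspector and its two methods are hypothetical helpers for illustration and are not part of HAP.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper (not part of Html Agility Pack): summarizes an HTTP response
// so the status code and headers sent by the server can be inspected.
public static class ResponseInspector
{
    // Builds a readable summary: numeric status, reason phrase, then each response header.
    public static string Describe(HttpResponseMessage response)
    {
        if (response == null) throw new ArgumentNullException(nameof(response));

        var sb = new StringBuilder();
        sb.AppendLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
        foreach (var header in response.Headers)
        {
            sb.AppendLine($"{header.Key}: {string.Join(", ", header.Value)}");
        }
        return sb.ToString();
    }

    // Sends a GET request and returns the summary. GetAsync completes normally for
    // non-success status codes, so an error response can still be examined.
    public static async Task<string> FetchSummaryAsync(string url)
    {
        using (var client = new HttpClient())
        using (var response = await client.GetAsync(url))
        {
            return Describe(response);
        }
    }
}
```

Calling `Console.WriteLine(await ResponseInspector.FetchSummaryAsync(url));` from an async method prints the status line and headers, which helps tell a server-side rejection apart from a parsing problem in the library.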

sharathm89 commented 6 years ago

thanks @JonathanMagnan

JonathanMagnan commented 6 years ago

Hello @sharathm89 ,

Version 1.8.1 has been released.

You should no longer have the issue with the Async method.

Best Regards,

Jonathan

sharathm89 commented 6 years ago

@JonathanMagnan The issue still exists with the latest v1.8.1. Below is the code I tested; the URL is included in it.

The async call throws "An error occurred while sending the request."

The non-async call throws "The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF"

It used to work earlier, but I guess it started failing after the version upgrade.

    class Program
    {
        const string url = "https://www.finedininglovers.com/recipes/appetizer/vegan-dishes-white-asparagus/";
        static void Main(string[] args)
        {
            try
            {
                GetHtmlDocumentAsync().GetAwaiter().GetResult();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);  // An error occurred while sending the request.
            }

            try
            {
                GetHtmlDocument();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);  // The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF
            }
            Console.ReadLine();
        }

        public static async Task<HtmlDocument> GetHtmlDocumentAsync()
        {
            HtmlWeb web = new HtmlWeb();
            return await web.LoadFromWebAsync(url);
        }

        public static HtmlDocument GetHtmlDocument()
        {
            HtmlWeb web = new HtmlWeb();
            return web.Load(url);
        }
    }


JonathanMagnan commented 6 years ago

Hello @sharathm89 ,

Thank you for the additional info.

We will continue to look at it.

Best Regards,

Jonathan

sharathm89 commented 6 years ago

thanks @JonathanMagnan

JonathanMagnan commented 6 years ago

Hello @sharathm89 ,

We tried your code but everything is working on our side ;(

Could you try it and let us know what we are missing?

HtmlAsync.zip

Best Regards,

Jonathan

sharathm89 commented 6 years ago

@JonathanMagnan I tried the same code, but the error only happens sometimes. After reporting the issue, I tried again 3 hours later and it worked, but when I tried 2 days ago I got the same error. Now that I have tried again, it is working.

So it is not occurring every time.

JonathanMagnan commented 6 years ago

Hello @sharathm89 ,

That is probably due to some bot detection that bans an IP that has made too many requests in a very short time.

There is nothing we can do at the moment for such an error ;(

Best Regards,

Jonathan
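
If the failures really do come from rate-based blocking, one client-side mitigation is to space out retries instead of repeating the request immediately. The sketch below assumes that approach; the Retry class and its WithBackoff method are hypothetical names, not part of Html Agility Pack, and nothing here bypasses bot detection. It only waits longer between attempts.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of Html Agility Pack): retries an operation
// with an exponentially growing pause between attempts.
public static class Retry
{
    // Runs `action`; on an exception it sleeps and tries again, up to `maxAttempts`
    // attempts in total. The pause doubles after every failure. The exception from
    // the final attempt is not caught and propagates to the caller.
    public static T WithBackoff<T>(Func<T> action, int maxAttempts, TimeSpan initialDelay)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

With HAP referenced, a call could look like `HtmlDocument doc = Retry.WithBackoff(() => new HtmlWeb().Load(url), 3, TimeSpan.FromSeconds(5));`. The attempt count and delay are arbitrary illustration values; a site that has already banned the IP may need a much longer pause.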

sharathm89 commented 6 years ago

@JonathanMagnan Probably so. In that case, I'll close the issue.