zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

override encoding cannot help when receiving not supported context encoding #326

Open y-code opened 5 years ago

y-code commented 5 years ago

Description

When I tried to load a page from https://www.jamieoliver.com/ with the HtmlWeb.Load method, it failed with an ArgumentException.

It turned out that the response headers from the site include content-encoding: identity. Per HTTP RFC 2616, identity "is used only in the Accept-Encoding header, and SHOULD NOT be used in the Content-Encoding header", so it is no surprise that the Encoding class does not support identity.

Next, I set the OverrideEncoding property to Encoding.UTF8 and called the HtmlDocument.Load method. However, it made no difference and I got the same ArgumentException.

I expected the OverrideEncoding property to make the HtmlWeb class ignore the Content-Encoding in the server's response headers and decode the content with the encoding specified in OverrideEncoding, but that was not the case.

OverrideEncoding does override the server-specified encoding when the encoding name is valid. Ideally, it would also work when the server-specified encoding name is invalid.
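
To make the expected behavior concrete, here is a minimal sketch of the resolution order described above. `EncodingFallback` and `Resolve` are hypothetical names used for illustration only and are not part of HtmlAgilityPack. The UTF-8 fallback for an unknown name is also an assumption, not the library's documented default.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HtmlAgilityPack.
static class EncodingFallback
{
    // Picks the encoding used to decode a response body.
    // 1. An explicit override always wins, so the server name is never parsed.
    // 2. Otherwise the server-supplied name is tried.
    // 3. If that name is missing or unsupported (for example "identity"),
    //    fall back to UTF-8 (an assumption made for this sketch).
    public static Encoding Resolve(string serverEncodingName, Encoding overrideEncoding)
    {
        if (overrideEncoding != null)
        {
            return overrideEncoding;
        }

        if (string.IsNullOrEmpty(serverEncodingName))
        {
            return Encoding.UTF8;
        }

        try
        {
            return Encoding.GetEncoding(serverEncodingName);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return Encoding.UTF8;
        }
    }
}
```

With this ordering, `EncodingFallback.Resolve("identity", Encoding.UTF8)` returns the override without ever passing `identity` to `Encoding.GetEncoding`, which is the behavior I was expecting from OverrideEncoding.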

Exception

Exception message:
System.ArgumentException : 'identity' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
Parameter name: name

Stack trace:
   at System.Text.EncodingTable.GetCodePageFromName(String name)
   at System.Text.Encoding.GetEncoding(String name)
   at HtmlAgilityPack.HtmlWeb.Get(Uri uri, String method, String path, HtmlDocument doc, IWebProxy proxy, ICredentials creds) in /Users/yas/Projects/happyfl/html-agility-pack/src/HtmlAgilityPack.Shared/HtmlWeb.cs:line 1680
   at HtmlAgilityPack.HtmlWeb.LoadUrl(Uri uri, String method, WebProxy proxy, NetworkCredential creds) in /Users/yas/Projects/happyfl/html-agility-pack/src/HtmlAgilityPack.Shared/HtmlWeb.cs:line 2068
   at HtmlAgilityPack.HtmlWeb.Load(Uri uri, String method) in /Users/yas/Projects/happyfl/html-agility-pack/src/HtmlAgilityPack.Shared/HtmlWeb.cs:line 1290
   at HtmlAgilityPack.HtmlWeb.Load(Uri uri) in /Users/yas/Projects/happyfl/html-agility-pack/src/HtmlAgilityPack.Shared/HtmlWeb.cs:line 1189
   at HappyFL.Services.WebSeekers.RecipeSeeker.Scan() in /Users/yas/Projects/happyfl/HappyFL/Services/WebSeekers/RecipeSeeker.cs:line 34
   at HappyFL.Services.WebSeekerService.FindRecipes(Uri url, Nullable`1 cancel, Encoding encode) in /Users/yas/Projects/happyfl/HappyFL/Services/WebSeekerService.cs:line 159
   at HappyFL.Test.WebSeekerServiceTest.TestFindRecipe(String url, ExpectedResultForTestFindRecipe expected) in /Users/yas/Projects/happyfl/HappyFLTest/WebSeekerServiceTest.cs:line 167

Project to reproduce issue

https://github.com/y-code/repro-bug-in-html-agility-pack

Further technical details

JonathanMagnan commented 5 years ago

Hello @y-code ,

Thank you, we will look at this request and your pull request.

Best Regards,

Jonathan



James231 commented 4 years ago

Hi, are there any updates on this?

This issue is affecting a large number of websites (facebook.com being another example). While wrapping my code in try { } catch { } does the trick, it is not ideal.
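
For anyone who needs the workaround in the meantime, a rough sketch of the try/catch approach could look like the following. `SafeLoader` and `LoadAsync` are hypothetical names, and the choice to re-download the page with HttpClient and decode it with a caller-supplied encoding is an assumption about how one might recover, not something the library does itself.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical workaround helper, not part of HtmlAgilityPack.
static class SafeLoader
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<HtmlDocument> LoadAsync(string url, Encoding fallbackEncoding)
    {
        try
        {
            // Normal path: let HtmlWeb fetch and parse the page.
            return new HtmlWeb().Load(url);
        }
        catch (ArgumentException)
        {
            // Thrown when the server reports an unsupported encoding name
            // such as "identity". Fetch the raw bytes and decode them with
            // the caller-supplied encoding instead.
            byte[] body = await Http.GetByteArrayAsync(url);
            var doc = new HtmlDocument();
            doc.LoadHtml(fallbackEncoding.GetString(body));
            return doc;
        }
    }
}
```

The downside is that the page is requested twice whenever the first attempt fails, which is part of why this is not ideal.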

Thanks