zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

The default encoding in HtmlDocument is Encoding.Default which causes problems #480

Closed. StefH closed this issue 2 years ago

StefH commented 2 years ago

1. Description

The default encoding in HtmlDocument is Encoding.Default, which causes problems. Wouldn't it be better to just use UTF-8?

2. Exception

3. Fiddle or Project

var web = new HtmlWeb();
var doc = web.Load("https://www.mstack.nl");

The title should actually be the following, as it appears in the HTML:

<title>MSTACK – next level consulting and development</title>

But when debugging the internals of HtmlWeb, it's:

<title>MSTACK – next level consulting and development</title>

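For readers who want to see the mechanism, here is a minimal sketch of how an en dash gets garbled when UTF-8 bytes are decoded with a single-byte encoding. It uses only the .NET base class library, not HAP. ISO-8859-1 is an assumption standing in for whatever ANSI code page Encoding.Default resolves to on a given machine: on .NET Framework Encoding.Default is the system's ANSI code page, while on .NET Core and later it is UTF-8, so the exact garbled characters will vary by environment.

```csharp
using System;
using System.Text;

// The page title, with the en dash written as an escape (U+2013).
string title = "MSTACK \u2013 next level consulting and development";

// The bytes a server sends when the page is encoded as UTF-8.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(title);

// Assumption: ISO-8859-1 stands in for a single-byte ANSI default code page.
Encoding singleByte = Encoding.GetEncoding("iso-8859-1");

// Decoding with the wrong encoding turns the 3-byte en dash into 3 characters.
string wrong = singleByte.GetString(utf8Bytes);

// Decoding with UTF-8 restores the original text.
string right = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(wrong == title);              // False
Console.WriteLine(right == title);              // True
Console.WriteLine(wrong.Length - title.Length); // 2
```

The two extra characters come from the en dash, which takes three bytes in UTF-8 and is read back as three separate characters by a single-byte decoder.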

tewuapple commented 2 years ago

This is an automatic vacation reply from QQ Mail. Hello, your email has been received. Liangliang

JonathanMagnan commented 2 years ago

Hello @StefH ,

While we 100% understand and can easily reproduce this issue, the answer is pretty simple: we will not change this unless there is a major version.

This library is used by too many people and companies. In my experience, every time we made a small change, even one that makes sense like this request, it broke some people's code without a doubt, and we had to revert it.

You can get the right result by overriding the encoding, like this:

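using System.Text;
using HtmlAgilityPack;

var web = new HtmlWeb();
// Force UTF-8 for this HtmlWeb instance instead of relying on the default encoding.
web.OverrideEncoding = Encoding.UTF8;
var doc = web.Load("https://www.mstack.nl");

// Select the <title> node(s) to check that the text was decoded correctly.
var title = doc.DocumentNode.SelectNodes("//title");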

But as explained, there is currently no chance this becomes the default unless we create a major version that rewrites the HAP library. People expect it to keep working the way it currently works.
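Since the default will not change, the choice of encoding stays with the caller. As a rough sketch of one caller-side policy (not something HAP does, and not the maintainers' recommendation), the hypothetical helper below honors a charset declared in a Content-Type value and falls back to UTF-8 rather than Encoding.Default. `ResolveEncoding` and the sample inputs are illustrative names, not part of HAP. The decoded string could then be handed to `HtmlDocument.LoadHtml`.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper: use the declared charset when it is usable,
// otherwise fall back to UTF-8 instead of Encoding.Default.
static Encoding ResolveEncoding(string contentType)
{
    if (!string.IsNullOrWhiteSpace(contentType) &&
        MediaTypeHeaderValue.TryParse(contentType, out var parsed) &&
        !string.IsNullOrEmpty(parsed.CharSet))
    {
        try
        {
            return Encoding.GetEncoding(parsed.CharSet);
        }
        catch (ArgumentException)
        {
            // Charset name not known to this runtime: use the fallback below.
        }
    }
    return Encoding.UTF8;
}

// Sample response body: the page title encoded as UTF-8, en dash as U+2013.
byte[] body = Encoding.UTF8.GetBytes(
    "<title>MSTACK \u2013 next level consulting and development</title>");

// No charset is declared here, so the helper picks UTF-8.
string html = ResolveEncoding("text/html").GetString(body);

Console.WriteLine(html.Contains("\u2013")); // True
```

One caveat with this sketch: on .NET Core and later, legacy code pages such as windows-1252 are only available after registering the code pages encoding provider, so without it a page declaring such a charset would be decoded with the UTF-8 fallback.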

Best Regards,

Jon

StefH commented 2 years ago

It's clear, thank you. I'll use your code.