zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

Text with angle brackets parsed improperly #348

Open ShayGuy opened 4 years ago

ShayGuy commented 4 years ago

Description

I am scraping a website that includes a select dropdown whose option elements are unclosed. The inner text of one of these elements contains text enclosed in angle brackets. HtmlAgilityPack's parser interprets this text as a start tag and treats everything after it as that tag's content, up to the next closing tag for a higher-level element, which happens to be the </select> tag itself. As a result, the option element with the angle brackets and every option element after it are parsed improperly. Link to minimal fiddle below.

(In fairness, Beautiful Soup seems to handle this page even worse -- without the closing tags, it doesn't even realize any of the option elements have ended. Just nests them until it hits </select>.)

Fiddle

https://dotnetfiddle.net/WBBwNx
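
For readers who cannot open the fiddle, here is a rough sketch of the kind of input described above. The markup is a made-up stand-in, not the real page and not necessarily what the fiddle contains, and the `<bracketed>` text and option values are hypothetical. It only uses `HtmlDocument.LoadHtml`, `SelectNodes`, `GetAttributeValue` and `InnerText`, and it prints whatever the parser produces so the result can be inspected; the exact output may differ between HtmlAgilityPack versions.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup: unclosed <option> elements, one of which has
// inner text wrapped in angle brackets ("<bracketed>").
const string html =
    "<select>" +
    "<option value=\"1\">First" +
    "<option value=\"2\">Second <bracketed>" +
    "<option value=\"3\">Third" +
    "</select>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, so guard before iterating.
var options = doc.DocumentNode.SelectNodes("//option");
if (options != null)
{
    foreach (var option in options)
    {
        // Print each option as the parser sees it, to check whether the
        // bracketed text was kept as text or swallowed as a start tag.
        Console.WriteLine(
            $"value={option.GetAttributeValue("value", "")} text=[{option.InnerText}]");
    }
}

// Dump the parsed tree to see where the later options ended up.
Console.WriteLine(doc.DocumentNode.OuterHtml);
```

If the third option does not show up as its own node with the text "Third", the input is hitting the behavior described in this issue.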

JonathanMagnan commented 4 years ago

Hello @ShayGuy,

Thank you for reporting.

We will look at this and probably apply a solution very similar to the one you suggested.

Best Regards,

Jonathan

