zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

htmlDoc.DocumentNode.InnerText depends on new lines in HTML #561

Closed d668 closed 3 months ago

d668 commented 3 months ago

1. Description

htmlDoc.DocumentNode.InnerText gives inconsistent results depending on whether there is a new line between HTML elements.

See the fiddle. Both outputs should be the same and should not depend on whether there is a new line in the HTML markup.

2. Exception

3. Fiddle or Project

https://dotnetfiddle.net/JOmlX0

4. Any further technical details

JonathanMagnan commented 3 months ago

Hello @d668 ,

Thank you for reporting. However, I do not believe anything will be done now for this.

Making this work correctly would mean changing and understanding more code than the time we can allocate allows, especially as even Chrome and Firefox produce different output depending on whether there is an empty line between the elements or not.

The current InnerText in Chrome is: span1\n\np1\n\nspan1 span2\n\np2\n\nspan2

Notice that span1 and span2 are separated by a space while others have a new line. This case looks easy to handle, but verifying all the InnerText rules would take far more time than we currently have.

But indeed, HAP doesn't provide the same InnerText as a real browser.

Best Regards,

Jon
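The two behaviours discussed above can be illustrated with a small sketch. It is written in Python with only the standard library, and it is neither HAP's code nor a browser's algorithm. The sample markup is an assumption reconstructed from the outputs quoted in this thread (the fiddle's actual content is not shown here), and treating only `<p>` as a block element is a deliberate simplification. `naive_text` joins text nodes verbatim, which is the kind of output the issue complains about, while `block_aware_text` collapses whitespace and puts a blank line around paragraphs.

```python
import re
from html.parser import HTMLParser

# Assumption: only <p> counts as a block element in this sketch. Real
# innerText rules cover many more elements and depend on CSS.
BLOCK_TAGS = {"p"}


class TextCollector(HTMLParser):
    """Collects text either verbatim or with simplified block handling."""

    def __init__(self, block_aware):
        super().__init__(convert_charrefs=True)
        self.block_aware = block_aware
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.block_aware and tag in BLOCK_TAGS:
            self.parts.append("\n\n")  # blank line before a paragraph

    def handle_endtag(self, tag):
        if self.block_aware and tag in BLOCK_TAGS:
            self.parts.append("\n\n")  # blank line after a paragraph

    def handle_data(self, data):
        if self.block_aware:
            # Collapse any whitespace run (including new lines) to one space.
            data = re.sub(r"\s+", " ", data)
        self.parts.append(data)


def naive_text(html):
    """Concatenate text nodes exactly as they appear in the markup."""
    parser = TextCollector(block_aware=False)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def block_aware_text(html):
    """Simplified, browser-like text: spaces inline, blank lines around <p>."""
    parser = TextCollector(block_aware=True)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" *\n *", "\n", text)    # drop spaces next to line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


# Hypothetical markup, guessed from the outputs quoted in this thread.
WITH_NEWLINE = "<span>span1<p>p1</p>span1</span>\n<span>span2<p>p2</p>span2</span>"
WITHOUT_NEWLINE = "<span>span1<p>p1</p>span1</span><span>span2<p>p2</p>span2</span>"

if __name__ == "__main__":
    for markup in (WITH_NEWLINE, WITHOUT_NEWLINE):
        print(repr(naive_text(markup)))
        print(repr(block_aware_text(markup)))
```

For the markup with a new line between the spans, `block_aware_text` yields the same string as the Chrome result quoted above. Without the new line, span1 and span2 end up adjacent with no space, so even this block-aware output still depends on whitespace between inline elements.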

d668 commented 3 months ago

> Notice that span1 and span2 are separated by a space while others have a new line.

You are right, so HAP is actually making two mistakes: inserting a new line between span1 and span2, and not inserting new lines within span1p1span1. Both Chrome and Firefox show it as:

 span1

p1
span1 span2

p2
span2 

> But indeed, HAP doesn't provide the same InnerText as a real browser.

Oh man, and what then? Not the same, but some? It really does look like you just don't have the resources to fix an obvious bug.

JonathanMagnan commented 3 months ago

Hello @d668 ,

Feel free to propose a pull request with the fix ;)

This week we are reviewing and merging some other recently submitted pull requests, so this would be a perfect time.

Best Regards,

Jon

d668 commented 3 months ago

If this is your excuse for not maintaining a project you started, that's lame. I am fine with beautifulsoup

d668 commented 3 months ago

Man, closing the issue with an obvious bug?