Closed: shravan2x closed this issue 1 year ago
Chrome seems to be able to parse HTML like this correctly.
You seem to be mistaken. I find HtmlAgilityPack's parsing behavior in this case to broadly match the parsing behavior of browsers like Chrome and Firefox.
The behavior of Chrome 118.0.5993.118 is equivalent to what you observed with HtmlAgilityPack. Loading your example HTML into Chrome yields this result:
Note how the part of the intended JavaScript code after the first occurrence of </script> is treated as text content of the HTML body, more or less as HAP does. (There may or may not be differences in how the browsers and HAP handle the "stray" second </script> closing tag, but that is not really relevant to the point I am making here.)
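To make the behavior described above concrete, here is a rough sketch of how a parser decides where a script element ends. It is not HAP's or any browser's actual code: the function name splitAtFirstScriptClose and the sample page string are made up for illustration, and the sketch deliberately ignores the extra rules the HTML standard applies once "<!--" appears inside a script.

```javascript
// Simplified sketch: find where an HTML tokenizer would end a <script> element.
// Assumption: the "<!--" escaped-state rules of the HTML standard are ignored.
function splitAtFirstScriptClose(html) {
  const open = /<script\b[^>]*>/i.exec(html);
  if (!open) return null; // no script element at all

  const rest = html.slice(open.index + open[0].length);

  // The tokenizer knows nothing about JS string literals. The first
  // "</script" followed by whitespace, "/" or ">" ends the element.
  const close = /<\/script[\t\n\f\r \/>]/i.exec(rest);
  if (!close) return { script: rest, after: "" };

  const endOfTag = rest.indexOf(">", close.index);
  return {
    script: rest.slice(0, close.index),
    after: endOfTag === -1 ? "" : rest.slice(endOfTag + 1),
  };
}

// Hypothetical page resembling the one in the report.
const page = '<body><script>e.innerHTML="<script></script>";</script></body>';
const result = splitAtFirstScriptClose(page);

console.log(result.script); // e.innerHTML="<script>
console.log(result.after);  // ";</script></body>
```

The second value is the part that ends up as body text, followed by the stray closing tag.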
Firefox 118.0.2 behaves in the same manner as Chrome and HtmlAgilityPack:
Also, in case you aren't already aware of it, the restrictions the HTML standard places on the content of script elements are worth noting in this context: https://html.spec.whatwg.org/multipage/scripting.html#restrictions-for-contents-of-script-elements
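For whoever controls the page source, the usual way to stay within those restrictions is to escape the problematic sequences so that they never appear literally inside the script element. As I read the linked section, it suggests writing "\x3C" in place of "<" for "<!--", "<script" and "</script" when they occur in string literals. A minimal sketch follows. The helper name escapeForInlineScript is hypothetical, and a blind text replacement like this is only safe when the sequences sit inside string literals, not in expressions.

```javascript
// Hypothetical helper: rewrite JS source so it can be embedded in an inline
// <script> element without the HTML parser seeing "<!--", "<script" or "</script".
// Assumption: those sequences only occur inside JS string literals.
function escapeForInlineScript(js) {
  return js
    .replace(/<!--/g, "\\x3C!--")
    .replace(/<(\/?script)/gi, "\\x3C$1");
}

const original = 'e.innerHTML="<script></script>"';
const escaped = escapeForInlineScript(original);

console.log(escaped); // e.innerHTML="\x3Cscript>\x3C/script>"
```

When the escaped statement runs, the JS engine turns \x3C back into "<", so e.innerHTML still receives the same string, while the HTML parser no longer finds a closing tag inside the script.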
P.S.: I am just a user and not associated with the HAP project nor its authors/maintainers.
Hello @shravan2x,
Let me know if the answer from @elgonzo is good.
I also tested it on Firefox and got the same behavior as he did and as HAP does.
Best Regards,
Jon
When an HTML script tag contains JS that includes </script>, the HTML script tag is closed at that point instead. For example,
Here the website has JS like e.innerHTML="<script></script>". HAP interprets the closing script tag inside this JS string as closing the enclosing HTML script tag instead. This results in broken parsing like this for the body tag:
Chrome seems to be able to parse HTML like this correctly. Is there something the library could do as well?