zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

Closing tag in script is recognized as HTML #521

Closed by shravan2x 1 year ago

shravan2x commented 1 year ago

When an HTML script element contains JS that includes the string </script>, HAP treats that string as the closing tag of the HTML script element. For example,

<html lang="en-US" prefix="og: https://ogp.me/ns#" class="" itemtype="https://schema.org/Blog" itemscope="" style="margin-top: 69.1979px !important">
   <body class="aitool-template-default single single-aitool postid-26456 wp-custom-logo wp-embed-responsive footer-on-bottom hide-focus-outline link-style-no-underline content-title-style-normal content-width-normal content-style-boxed content-vertical-padding-show non-transparent-header mobile-non-transparent-header kadence-elementor-colors e-lazyload elementor-default elementor-kit-1339 elementor-page-20666 trigger-position-right quick-links-position-left trigger-size-small trigger-color-blue mysticky-welcomebar-apper e--ua-blink e--ua-chrome e--ua-webkit cookies-not-set" data-elementor-device-mode="desktop">
      <script type="text/javascript">!function(e){fucript"===a?(e=l.createElement("div"),e.innerHTML="<script>
      </script>",e=e.removeChild(e.firstChild)):s)}]));</script>
   </body>
</html>

Here the website has JS like e.innerHTML="<script></script>". HAP treats the </script> inside the JS string literal as the closing tag of the enclosing HTML script element. This results in broken parsing like this for the body tag:

[
  "Text(\n      )",
  "Element(script)",
  "Text(\u0022,e=e.removeChild(e.firstChild)):s)}]));)",
  "Text(\n   )"
]

Chrome seems to be able to parse HTML like this correctly. Is there something the library could do as well?

elgonzo commented 1 year ago

"Chrome seems to be able to parse HTML like this correctly."

You seem to be mistaken. I find HtmlAgilityPack's parsing behavior in this case to broadly match the parsing behavior of browsers like Chrome and Firefox.

Chrome 118.0.5993.118 behaves equivalently to what you observed with HtmlAgilityPack. Loading your example HTML into Chrome yields this result: [screenshot: Chrome]

Note how the part of the intended JavaScript code after the first occurrence of </script> is treated as text content of the HTML body, more or less as HAP does (there may or may not be differences in how the browsers and HAP handle the "stray" second </script> closing tag, but that is not relevant to the point I am making here).

Firefox 118.0.2 behaves in the same manner as Chrome and HtmlAgilityPack: [screenshot: Firefox]
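To illustrate why parsers split the example where they do, here is a minimal sketch of how script content ends at the first </script end tag, regardless of any JS string literal around it. The helper name splitScriptContent is hypothetical. It is not HAP's or any browser's actual code, and it deliberately ignores the "<!--" escape states that the HTML standard also defines.

```javascript
// Simplified illustration only: a hypothetical helper, not HAP or browser code.
// It ignores the "<!--" escaped-script states of the HTML standard.
function splitScriptContent(html) {
  // Find the opening <script ...> tag.
  const open = /<script\b[^>]*>/i.exec(html);
  if (!open) return null;
  const start = open.index + open[0].length;

  // The script content ends at the first "</script" that is followed by
  // whitespace, "/" or ">". JS string quotes play no role here.
  const closeRe = /<\/script(?=[\s\/>])[^>]*>/gi;
  closeRe.lastIndex = start;
  const close = closeRe.exec(html);
  if (!close) return { script: html.slice(start), rest: "" };

  return {
    script: html.slice(start, close.index),
    rest: html.slice(close.index + close[0].length),
  };
}

const sample = '<script>e.innerHTML="<script></script>",done();</script>';
const parts = splitScriptContent(sample);
console.log(parts.script); // e.innerHTML="<script>
console.log(parts.rest);   // ",done();</script>
```

The leftover text in parts.rest corresponds to the stray text node that appears after the script element in the parsed output reported above.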

Also, in case you aren't already aware of it, the restrictions the HTML standard places on the content of script elements are worth noting in this context: https://html.spec.whatwg.org/multipage/scripting.html#restrictions-for-contents-of-script-elements
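The linked section of the standard suggests escaping "<!--", "<script" and "</script" as "\x3C!--", "\x3Cscript" and "\x3C/script" when they appear in literals inside a script. A minimal sketch of that idea follows. The function name escapeForInlineScript is hypothetical, and it assumes its input is the text of a JS string literal (not arbitrary code, where a blind replacement could change meaning).

```javascript
// Hypothetical helper: escapes the sequences the HTML standard warns about
// so that literal text can sit safely inside an inline <script> element.
// Assumption: the input is the text of a JS string literal, not arbitrary code.
function escapeForInlineScript(jsLiteralText) {
  return jsLiteralText
    .replace(/<!--/g, "\\x3C!--")           // "<!--"     -> "\x3C!--"
    .replace(/<(\/?script)/gi, "\\x3C$1");  // "<script"  -> "\x3Cscript"
                                            // "</script" -> "\x3C/script"
}

const unsafe = 'e.innerHTML="<script></script>"';
const safe = escapeForInlineScript(unsafe);
console.log(safe); // e.innerHTML="\x3Cscript>\x3C/script>"
```

Because "\x3C" is the JS escape for "<", the string evaluates to the same value at runtime, but the HTML parser no longer sees a </script end tag inside the script element. This only helps a page author who controls the script source; it does not change how a parser handles pages already written the other way.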


P.S.: I am just a user and not associated with the HAP project nor its authors/maintainers.

JonathanMagnan commented 1 year ago

Hello @shravan2x,

Let me know if the answer from @elgonzo is good.

I also tested it on Firefox and got the same behavior that he did, which matches HAP.

Best Regards,

Jon