y21 / tl

Fast, zero-copy HTML Parser written in Rust
https://docs.rs/tl
MIT License
336 stars 19 forks source link

wrongly parses tags inside `<script>` element #62

Open jaredh159 opened 11 months ago

jaredh159 commented 11 months ago

thanks for the library! truly appreciate it!

the parser currently incorrectly finds tags inside of <script> elements, so for example:

<script>
// a comment <p>foo</p>
</script>

will produce a <p> html tag inside the script tag.

dgjustice commented 9 months ago

I ran into this myself. It appears to be eat any < as the beginning of a tag. I'll try and dig into the source when I can find time.

    <script>
        $(function() {
            var imgs = document.getElementsByTagName('img');
            var errors = 0;
            for(var i=0,j=imgs.length;i<j;i++){
                imgs[i].onerror = function(e){
                    this.src = "assets/error.gif";
                    this.parentElement.parentElement.className = "danger";
                    this.onerror = function(e){};
                };
                errors++;
            }
        });
    </script>
Tag(HTMLTag { _name: Bytes("script"), _attributes: Attributes { raw: InlineHashMap(InlineHashMap<0 items>), id: None, class: None }, _children: InlineVec(InlineVec<3 items>), _raw: Bytes("<script>") })
Raw(Bytes("\n        $(function() {\n            var imgs = document.getElementsByTagName('img');\n            var errors = 0;\n            for(var i=0,j=imgs.length;i"))
Raw(Bytes("script>\n"))
y21 commented 9 months ago

Yeah, tl currently doesn't special case the script and style tags, but they do need some special handling to make this work properly.

I initially thought that supporting <script> tags would be pretty complicated to parse, given that we couldn't "just" seek to the next </script>, since they may appear within strings or other constructs, which might not truly be the end of the script tag (or so I thought).

For instance, 1 </script>/ is valid javascript (parsed as "1 less than the regular expression /script>/"), which made me think that this should not be counted as the end of <script>, and that we would have to implement a (subset of) javascript in the html parser.

But (at least according to some testing in the browser) turns out that this isn't really needed. Even <script> "</script>" </script> is not "supported" (as in, will interpret the first </script> within a string to be the end, and doesn't try anything fancy like keeping track of string literal depth), so this might not be so hard to implement :)