y21 / tl

Fast, zero-copy HTML Parser written in Rust
https://docs.rs/tl
MIT License
336 stars 19 forks source link

tl does not properly handle "void" closing tags. #37

Closed LDSpits closed 2 years ago

LDSpits commented 2 years ago

When I was attempting to parse a html page. I received a puzzling result from the parser. where the head tag would terminate early on a child tag. And the closing html tag would be omitted.

I have found that this is due to how "void" tags are ignored. And would like to get your opinion on my deduction of the problem.

let's take the following sample html:

<html>
    <head>
        <base href='single_quoted_item'></base>
        <link rel="stylesheet" type="text/css" href="non-exising"/>
    </head>
</html>

the output of the parser currently is

<html>
    <head>
        <base href="single_quoted_item"></base></head>
        <link rel="stylesheet" href="non-exising" type="text/css"></link>
    </html>

as you can see, the head closing tag is abnormally placed after the base closing tag.

after some debugging with tests i found the issue in the parser code, the tag parser properly ignores start tags. but if you hit a "void" closing tag it just treats the previously found tag as the closed tag. causing it to terminate the parent tag early.

I have written a fix for this by parsing the closing tag, and then not pushing the result if the closing tag is one of the "void" tags. Would you accept a pull request for this bug?

y21 commented 2 years ago

Ah right, good catch. I suspect this might affect performance quite a lot if we always have to check if the closing tag is a void tag every time (although I could be wrong). I wonder if we can simply check that the last tag name in the stack matches with the current one being parsed (one string comparison as opposed to comparing it to all 15 strings that we would have to do). In your example, when the parser sees the closing tag </base> , it would notice that the last tag in the stack is <head> and not <base>, and just not pop it. I have some more ideas for making things in the parser faster so it's not the end of the world if we lose a bit of performance here because of this.

LDSpits commented 2 years ago

the only case that wouldn't work is nested "voided" tags. however, the tags are never pushed to the parser stack. So there is no chance of them becoming nested as far as I'm aware. This does make me question how they become visible in the output. Are they treated as raw tags?

I'll adjust my fix to use the method you just mentioned and I'll open a pull request.