y21 / tl

Fast, zero-copy HTML Parser written in Rust
https://docs.rs/tl
MIT License

HTMLTag::raw returns bytes excluding the last byte #31

Closed trrk closed 2 years ago

trrk commented 2 years ago

I was using tl and found some behavior that seems odd. HTMLTag::raw appears to return incomplete bytes: the last character, >, is missing.

let input = "<p>abcd</p>";

let vdom = tl::parse(input, Default::default()).unwrap();
let first_tag = vdom.children()[0]
    .get(vdom.parser())
    .unwrap()
    .as_tag()
    .unwrap();

let from_raw = first_tag.raw().try_as_utf8_str().unwrap();

println!("{:?}", from_raw);
// => "<p>abcd</p"

y21 commented 2 years ago

seems like a small logic error in the parser when reconstructing the range of bytes after reading the closing tag. skipping the final > token happens at the end of the function, after slicing: https://github.com/y21/tl/blob/1b42467eb8afd7bf89cc9636b7823c03ef0a87fe/src/parser/base.rs#L242

https://github.com/y21/tl/blob/1b42467eb8afd7bf89cc9636b7823c03ef0a87fe/src/parser/base.rs#L269

this should happen right after reading the identifier following </, before doing the slicing
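
To make the ordering problem described above concrete, here is a minimal, standard-library-only sketch. It is not tl's actual parser code: `raw_range` and its `consume_gt_before_slicing` flag are hypothetical names invented for illustration, and the scanning logic is deliberately simplified to a single element with no nesting or attributes. It only shows how slicing before stepping past the final `>` loses the last byte.

```rust
/// Hypothetical stand-in for the parser's closing-tag handling (NOT tl's real code).
/// Scans one element such as `<p>abcd</p>` and returns its raw byte range.
fn raw_range(input: &[u8], consume_gt_before_slicing: bool) -> &[u8] {
    let start = 0;

    // Locate the closing tag and move the cursor just past "</".
    let mut pos = input
        .windows(2)
        .position(|w| w == b"</")
        .expect("input should contain a closing tag")
        + 2;

    // Read the identifier that follows "</".
    while pos < input.len() && input[pos].is_ascii_alphanumeric() {
        pos += 1;
    }

    if consume_gt_before_slicing {
        // Fixed order: step past '>' first, so the slice includes it.
        if input.get(pos) == Some(&b'>') {
            pos += 1;
        }
    }

    // With the buggy order, '>' would only be skipped after this slice is
    // taken, so the cursor still points at '>' and the slice stops short.
    &input[start..pos]
}

fn main() {
    let input = b"<p>abcd</p>";

    let buggy = std::str::from_utf8(raw_range(input, false)).unwrap();
    let fixed = std::str::from_utf8(raw_range(input, true)).unwrap();

    println!("{:?}", buggy); // "<p>abcd</p"
    println!("{:?}", fixed); // "<p>abcd</p>"
}
```

The only difference between the two paths is whether the cursor advances past `>` before the slice is taken, which matches the one-byte shortfall reported in the issue.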

y21 commented 2 years ago

published 0.7.3 to crates.io with a fix for this.

trrk commented 2 years ago

Thank you, 0.7.3 works fine.