y21 / tl

Fast, zero-copy HTML Parser written in Rust
https://docs.rs/tl
MIT License

Access the index range in the original string that produced a node? #25

Closed chaoxu closed 2 years ago

chaoxu commented 2 years ago

After parsing the input string and locating a particular node, I want to know the index range of the original string that produced this node.

For example:

<div><p>haha</p></div>

The node for <p>haha</p> spans indices 5 to 15 (inclusive).

Is there a way to do this?

My use case is that I then need to do some string replacements on the original string.

y21 commented 2 years ago

There is nothing in tl that exposes the indices of nodes directly, but you can work around this fairly easily, at least as long as your DOM is immutable.

Because everything is borrowed from the input string, you can find the start of a node by taking the memory address of the node's span and subtracting the address of the input string from it. The inclusive end of the range is the starting index plus the element length, minus one. Example:

fn main() {
    let input = "<div><p>haha</p></div>";
    let dom = tl::parse(input, Default::default()).unwrap();

    // get the p tag (can mostly ignore this)
    let element: &tl::HTMLTag = dom
        .query_selector("p") // create the query selector
        .unwrap()
        .next() // get the next (first) matching node
        .unwrap()
        .get(dom.parser()) // resolve the handle
        .unwrap()
        .as_tag() // "upcast" the Node to an HTMLTag
        .unwrap();

    // get a reference to the underlying bytes that references a substring of the input string
    let raw = element.raw().as_bytes();

    // `raw.as_ptr() - input.as_ptr()` gives you the offset (starting index)
    let start = raw.as_ptr() as usize - input.as_ptr() as usize;
    // the end index is inclusive, so it is one less than start + length
    let end = start + raw.len() - 1;

    // as expected, the range is (5, 15)
    assert_eq!((start, end), (5, 15));
    assert_eq!(&input[start..=end], "<p>haha</p>");
}
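The same pointer arithmetic can be sketched without tl, using only the standard library. The helper name `inclusive_range_of` below is hypothetical (it is not part of tl), and it assumes `part` is a non-empty subslice borrowed from `whole`; anything else yields `None` instead of a bogus offset.

```rust
/// Hypothetical helper (not part of tl): returns the inclusive (start, end)
/// byte range of `part` within `whole`, or `None` if `part` is empty or is
/// not a subslice borrowed from `whole`.
fn inclusive_range_of(whole: &str, part: &str) -> Option<(usize, usize)> {
    let base = whole.as_ptr() as usize;
    let ptr = part.as_ptr() as usize;

    // A slice that starts before `whole` cannot be borrowed from it.
    let start = ptr.checked_sub(base)?;
    // One past the last byte of `part`, measured from the start of `whole`.
    let end_exclusive = start.checked_add(part.len())?;

    // Reject empty slices and slices that run past the end of `whole`.
    if part.is_empty() || end_exclusive > whole.len() {
        return None;
    }
    Some((start, end_exclusive - 1))
}
```

For the input above, passing `&input[5..16]` as `part` gives the inclusive pair `(5, 15)`. The bounds checks matter because subtracting the addresses of two unrelated strings would otherwise produce a meaningless offset.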

I think it might be a good idea to expose a method for this.

y21 commented 2 years ago

Published a new version that adds HTMLTag::boundaries, which returns the (start, end) position like in the example above.
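For the string-replacement use case from the original question, a (start, end) pair can be applied to the input roughly as follows. This is a minimal sketch that assumes the pair is inclusive, as in the (5, 15) example; `replace_inclusive` is a hypothetical helper and not part of tl.

```rust
/// Hypothetical helper: returns a copy of `input` in which the inclusive
/// byte range `(start, end)` is replaced by `replacement`.
/// Panics if the range is out of bounds or splits a UTF-8 character.
fn replace_inclusive(input: &str, (start, end): (usize, usize), replacement: &str) -> String {
    let mut out = String::with_capacity(input.len() + replacement.len());
    out.push_str(&input[..start]); // everything before the node
    out.push_str(replacement); // the new content
    out.push_str(&input[end + 1..]); // everything after the node
    out
}
```

When replacing several nodes in one string, applying the replacements from the highest start index to the lowest keeps the earlier offsets valid.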