y21 / tl

Fast, zero-copy HTML Parser written in Rust
https://docs.rs/tl
MIT License
317 stars 18 forks source link

query_selector not selecting children? #22

Open steamyalt opened 2 years ago

steamyalt commented 2 years ago

I'm pretty new to this so I might be misunderstanding how the query selector works. I have the following HTML code:

<li class="test">
  <a href="example.com></a>
</li>

I'd now expect the selector li.test a to return the inner a element. However, it returns nothing. Rust code:

fn main() {
    let input = r#"<li class="test"><a href="example.com"></a></li>"#;
    let dom = tl::parse(input, tl::ParserOptions::default()).unwrap();
    let query = dom.query_selector("li.test a").unwrap().next();
    assert!(query.is_some()); // Panics!
}

Opening the same HTML code in Firefox and running $$("li.test a") in the console returns a one-element array with the a node.

y21 commented 2 years ago

You're right, the query selector implementation is incomplete. In particular, .foo .bar exists and is parsed properly: https://github.com/y21/tl/blob/9a1538a7a0e8f95d157d7e429b1a25e57076bf98/src/queryselector/selector.rs#L18-L19 But Selector::matches doesn't handle that case (as well as .foo > .bar) and simply returns false. https://github.com/y21/tl/blob/9a1538a7a0e8f95d157d7e429b1a25e57076bf98/src/queryselector/selector.rs#L72 I believe this was skipped in the implementation because of complexity at the time, and should be mentioned in the docs. Nodes don't have a reference or a handle to the parent, so implementing .foo > .bar is difficult. Also, the way the query selector works probably needs to be changed in some way to make this work, because right now there's not really a way to return a matched subnode from within the method. I guess a workaround for now until that's implemented is to run a separate query selector on those matches (or if a query selector is overkill, one could iterate over the children and find the a tag using query.children().all().find(...) with a tag name comparison).


On a more-or-less unrelated note, I found an off-by-one bug in the query selector API for HTML tags while looking into this. We've got lots of tests for all of the different methods of the public API and I'm surprised they didn't catch this. I just published 0.6.1 with a quick fix for this (see #23).

steamyalt commented 2 years ago

As a temporary workaround, I currently have the following code:

    let items = dom.query_selector("li.example-class").unwrap();
    let links: Vec<_> = items
        .filter_map(|e| e.get(&parser).and_then(tl::Node::children))
        .filter_map(|e| e.all(&parser)[1].as_tag())
        .map(tl::HTMLTag::attributes)
        .filter_map(|e| e.get("href").flatten().and_then(tl::Bytes::try_as_utf8_str))
        .collect();

What I want to do is find the a tag inside all li tags with the specified class and get the link it's pointing to. Is there any way to write this more concisely/idiomatically?

kelko commented 2 years ago

Ran also into the same problem for a fun project of mine. So I took a look at the code, thinking about maybe I can contribute a PR. But I'm not sure the current CSS selector parsing is prepared for this. E.g. the "OR" in the current implementation seems to 'local' while (at least in my understanding) in CSS it's an OR for the whole of the selector list.

So, in my understanding it should be:

a b,c -> (a AND b) OR (c)

But I think the code would parse it as:

a b,c -> (a) AND (b OR c)

Correct me if I'm wrong.

In the end I decided to put an index on top of your tl VDom, which caches all those relationships (https://github.com/kelko/html-streaming-editor/blob/main/src/html/mod.rs) and an own CSS selector parsing (using PEG, https://github.com/kelko/html-streaming-editor/blob/main/src/parsing/mod.rs#L66) & execution (https://github.com/kelko/html-streaming-editor/blob/main/src/css/mod.rs)

I don't say it's pretty, but it seems to do the job - at least for my project. And maybe there is an idea or two in there which might help you as well.

ThatXliner commented 1 year ago

What's the status on this?