y21 / tl

Fast, zero-copy HTML Parser written in Rust
https://docs.rs/tl
MIT License
336 stars 19 forks source link

Convenience functions for traversing DOM? #18

Open dist1ll opened 2 years ago

dist1ll commented 2 years ago

Assume we have the following structure

<div class="x">
  <div class="y">
    <div class="z"></div>
    <div class="z"></div>
    <div class="z"></div>
  </div>
</div>

<div class="z"></div>
<div class="z"></div>
<div class="z"></div>

Let's say I want to get all nested divs of class z that are in x, but not the outside ones. Right now that's quite cumbersome, because query_selector() only works on VDom. Then I have to implement a DFS, using Node::children continuously.

Suggestion: NodeHandle should also have a query_selector method, so that we can more easily traverse nested divs. I think it makes sense from an API perspective, because a node essentially spawns a subtree, so it should have the same functions that a VDom has.

Alternative: An alternative solution would be to extend the syntax of the query_selector method to search for nested divs. Something like query_selector("div.x > div.y > div.z").

dist1ll commented 2 years ago

It seems like you already planned to select descendants https://github.com/y21/tl/blob/1b8fab3021cb6c8ab9645e5e3e4d44d307b618d5/src/queryselector/selector.rs#L18

The match case is just not implemented in matches. This would probably be a good first step, right?

y21 commented 2 years ago

The match case is just not implemented in matches. This would probably be a good first step, right?

Yep, though I'm not sure about the complexity of adding this to the query selector. It's probably worth mentioning somewhere in the docs that not all of the query selector API is fully implemented yet (Selector::Descendant and Selector::Parent is missing).

Suggestion: NodeHandle should also have a query_selector method,

I agree, Node (or HTMLTag) should have a query_selector function like VDom does, and that seems fairly straight forward to implement. In fact, I've been working on a minor "rewrite" that mostly changes how the parser works (internally), and I've already got the query_selector function implemented for HTMLTags along with a few other things (like being able to parse a query selector independent of a parser, so it can be reused multiple times for different VDoms or Nodes). I will try to get this done soon.

dist1ll commented 2 years ago

Yep, though I'm not sure about the complexity of adding this to the query selector. It's probably worth mentioning somewhere in the docs that not all of the query selector API is fully implemented yet (Selector::Descendant and Selector::Parent is missing).

Oh right, the matches has to decide if a given node fulfils the requirement. That's of course more difficult then (since nodes/htmltags don't have a parent reference). Also the fact we're using an iterator makes this tricky.

I agree, Node (or HTMLTag) should have a query_selector function like VDom does, and that seems fairly straight forward to implement. [...] I've already got the query_selector function implemented for HTMLTags along with a few other things

That's great! If Node/HTMLTag has that functionality, that already gives us powerful composability.

dist1ll commented 2 years ago

Have you considered putting a &VDom reference into the NodeHandler struct? Because we need to give the parser to the select_query function every time, and this makes writing chainable APIs difficult. You can find some extensions that I wrote here:

https://github.com/dist1ll/hltv-rust/blob/main/src/tl_extensions.rs

I created a RichNode struct there

#[derive(Clone, Copy)]
pub struct RichNode<'a> {
    pub d: &'a VDom<'a>,
    pub n: Option<NodeHandle>,
}

I've used this extension to create an ergonomic find function:

// Example of using .find and .get_attr
fn parse_team(h: RichNode, team_id: &str) -> Option<Team> {
    Some(Team{
        // get node as tag, and parse attribute to correct type
        id:  h.get_attr(team_id).unwrap_or(None)?,
        // use .find to find children with the given class name
        name: h.find(team_id).find("matchTeamName").inner_text()?,
    })
}
  1. What do you think about putting a VDom reference into the NodeHandler?

  2. The .find in my example only matches class strings right now (and doesn't do a proper query select like your API). I think it would be cool if you could make the module tl::queryselector public, so I could reuse your logic for this abstraction.

y21 commented 2 years ago

What do you think about putting a VDom reference into the NodeHandler?

I've thought of this before and I'm not sure if this would work (if I understood this correctly). This sounds like we would end up with self referential structs if Node or NodeHandle had a reference to the VDom. (Also, I guess we'd need to store a &Parser if anything, since a VDom does not exist at parsing/node creation time)

struct Node<'input, 'parser> {
  raw: &'input [u8]
  children: Vec<NodeHandle<'input, 'parser>>
}
struct NodeHandle<'input, 'parser> {
  idx: usize,
  parser: &'parser VDom<'input>
}
struct Parser<'input> {
  nodes: Vec<Node<'input, '???>> // can't annotate the 'parser ('self) lifetime
}
struct VDom<'input> {
  parser: Parser<'input>
}

The reason why NodeHandle exists in the first place is to avoid a situation where Node stores its children as references to other Nodes (or a reference to anything from the parser/vdom for that matter), and to make it "detached" (no lifetime to the source string or parser, so it can be freely stored without running into the problem where you can't name the lifetime). Instead, the parser has a Vec of all nodes and each node stores children as indices (NodeHandle) into the vector, which is why methods on the HTMLTag struct need a &Parser. From what I've seen it seems like a common strategy to model these kinds of "graphs" by using vector indices, and it makes it fairly trivial to implement.

struct Node<'input> {
  raw: &'input [u8]
  children: Vec<NodeHandle>
}
struct NodeHandle {
  idx: usize,
}
struct Parser<'input> {
  nodes: Vec<Node<'input>>
}
struct VDom<'input> {
  parser: Parser<'input>
}

I agree that this makes chaining rather awkward and possibly unidiomatic, but I'm not sure there is a nice and easy solution. I think a good long term solution would be to make the parser generic over a node sink (see #21), with some default sink implementations (e.g. one that remembers id and class attributes, and stores them in a map for get_element_by_id to run at constant time) and sink combinators. This would make it possible for users of the library to implement their own VDom (kind of) or extend it in any way you want (and also effectively creating their own Nodes). Users could make changes to the sink struct as the parser streams nodes and tell the parser to break at any point, which could be much more efficient in certain cases where you only need to extract one specific part of the HTML document and don't need anything else.

I think it would be cool if you could make the module tl::queryselector public

yes, that module will be public in the next release

dist1ll commented 2 years ago

Thanks for explaining your thoughts, I see now how that approach creates some problems. I really like the idea of a sink though

y21 commented 2 years ago

0.6.0 is on crates.io with some of your requested features:

"#, Default::default()).unwrap();

let x_element = dom .get_elements_by_class_name("x") .next() .unwrap() .get(dom.parser()) .unwrap() .as_tag() .unwrap();

let zs = x_element.query_selector(".z").unwrap();

assert_eq!(zs.count(), 3);


- [`queryselector`](https://docs.rs/tl/0.6.0/tl/queryselector/index.html) module is fully public now
----
I have also refactored the HTMLTag children API a bit. Instead of having `HTMLTag::children()` return an iterator over the *direct* subnodes (not nested ones, which made it a little less useful than it could be), it now returns a struct ([`Children`](https://docs.rs/tl/0.6.0/tl/struct.Children.html)) containing a few helper methods:
- `Children::all`: returns a slice containing *all* subnodes (including nested tags). Calling `::all()` on `.x` in the HTML above would return `[div.y, div.z, div.z, div.z]` (ignoring the whitespaces)
- `Children::top`: returns a slice containing only the "topmost", direct subnodes (which is what old `HTMLTag::children` did, before). Calling `::top()` on `.x` in the HTML above would return `[div.y]`  (ignoring the whitespaces), because that's the only immediate subnode 
- `Children::{start, end, boundaries}`: lower-level methods for getting the starting/ending boundary (indices) of the children. This is used by `Children::all` to build up the slice
dist1ll commented 2 years ago

wow good stuff! I think those are great improvements! I'll try out the new features, cheers!