Open dist1ll opened 2 years ago
It seems like you already planned to select descendants https://github.com/y21/tl/blob/1b8fab3021cb6c8ab9645e5e3e4d44d307b618d5/src/queryselector/selector.rs#L18
The match case is just not implemented in matches
. This would probably be a good first step, right?
The match case is just not implemented in matches. This would probably be a good first step, right?
Yep, though I'm not sure about the complexity of adding this to the query selector. It's probably worth mentioning somewhere in the docs that not all of the query selector API is fully implemented yet (Selector::Descendant
and Selector::Parent
is missing).
Suggestion: NodeHandle should also have a query_selector method,
I agree, Node
(or HTMLTag
) should have a query_selector
function like VDom
does, and that seems fairly straight forward to implement. In fact, I've been working on a minor "rewrite" that mostly changes how the parser works (internally), and I've already got the query_selector function implemented for HTMLTag
s along with a few other things (like being able to parse a query selector independent of a parser, so it can be reused multiple times for different VDom
s or Node
s). I will try to get this done soon.
Yep, though I'm not sure about the complexity of adding this to the query selector. It's probably worth mentioning somewhere in the docs that not all of the query selector API is fully implemented yet (
Selector::Descendant
andSelector::Parent
is missing).
Oh right, the matches
has to decide if a given node fulfils the requirement. That's of course more difficult then (since nodes/htmltags don't have a parent reference). Also the fact we're using an iterator makes this tricky.
I agree, Node (or HTMLTag) should have a query_selector function like VDom does, and that seems fairly straight forward to implement. [...] I've already got the query_selector function implemented for HTMLTags along with a few other things
That's great! If Node/HTMLTag has that functionality, that already gives us powerful composability.
Have you considered putting a &VDom
reference into the NodeHandler struct? Because we need to give the parser to the select_query
function every time, and this makes writing chainable APIs difficult. You can find some extensions that I wrote here:
https://github.com/dist1ll/hltv-rust/blob/main/src/tl_extensions.rs
I created a RichNode
struct there
#[derive(Clone, Copy)]
pub struct RichNode<'a> {
pub d: &'a VDom<'a>,
pub n: Option<NodeHandle>,
}
I've used this extension to create an ergonomic find function:
// Example of using .find and .get_attr
fn parse_team(h: RichNode, team_id: &str) -> Option<Team> {
Some(Team{
// get node as tag, and parse attribute to correct type
id: h.get_attr(team_id).unwrap_or(None)?,
// use .find to find children with the given class name
name: h.find(team_id).find("matchTeamName").inner_text()?,
})
}
What do you think about putting a VDom reference into the NodeHandler?
The .find
in my example only matches class strings right now (and doesn't do a proper query select like your API). I think it would be cool if you could make the module tl::queryselector
public, so I could reuse your logic for this abstraction.
What do you think about putting a VDom reference into the NodeHandler?
I've thought of this before and I'm not sure if this would work (if I understood this correctly). This sounds like we would end up with self referential structs if Node
or NodeHandle
had a reference to the VDom. (Also, I guess we'd need to store a &Parser
if anything, since a VDom
does not exist at parsing/node creation time)
struct Node<'input, 'parser> {
raw: &'input [u8]
children: Vec<NodeHandle<'input, 'parser>>
}
struct NodeHandle<'input, 'parser> {
idx: usize,
parser: &'parser VDom<'input>
}
struct Parser<'input> {
nodes: Vec<Node<'input, '???>> // can't annotate the 'parser ('self) lifetime
}
struct VDom<'input> {
parser: Parser<'input>
}
The reason why NodeHandle
exists in the first place is to avoid a situation where Node
stores its children as references to other Node
s (or a reference to anything from the parser/vdom for that matter), and to make it "detached" (no lifetime to the source string or parser, so it can be freely stored without running into the problem where you can't name the lifetime). Instead, the parser has a Vec
of all nodes and each node stores children as indices (NodeHandle
) into the vector, which is why methods on the HTMLTag
struct need a &Parser
. From what I've seen it seems like a common strategy to model these kinds of "graphs" by using vector indices, and it makes it fairly trivial to implement.
struct Node<'input> {
raw: &'input [u8]
children: Vec<NodeHandle>
}
struct NodeHandle {
idx: usize,
}
struct Parser<'input> {
nodes: Vec<Node<'input>>
}
struct VDom<'input> {
parser: Parser<'input>
}
I agree that this makes chaining rather awkward and possibly unidiomatic, but I'm not sure there is a nice and easy solution.
I think a good long term solution would be to make the parser generic over a node sink (see #21), with some default sink implementations (e.g. one that remembers id and class attributes, and stores them in a map for get_element_by_id
to run at constant time) and sink combinators. This would make it possible for users of the library to implement their own VDom (kind of) or extend it in any way you want (and also effectively creating their own Node
s). Users could make changes to the sink struct as the parser streams nodes and tell the parser to break at any point, which could be much more efficient in certain cases where you only need to extract one specific part of the HTML document and don't need anything else.
I think it would be cool if you could make the module tl::queryselector public
yes, that module will be public in the next release
Thanks for explaining your thoughts, I see now how that approach creates some problems. I really like the idea of a sink though
0.6.0 is on crates.io with some of your requested features:
HTMLTag::query_selector
Given your example HTML;
let dom = tl::parse(r#"
<div class="x">
<div class="y">
<div class="z"></div>
<div class="z"></div>
<div class="z"></div>
</div>
</div>
"#, Default::default()).unwrap();
let x_element = dom .get_elements_by_class_name("x") .next() .unwrap() .get(dom.parser()) .unwrap() .as_tag() .unwrap();
let zs = x_element.query_selector(".z").unwrap();
assert_eq!(zs.count(), 3);
- [`queryselector`](https://docs.rs/tl/0.6.0/tl/queryselector/index.html) module is fully public now
----
I have also refactored the HTMLTag children API a bit. Instead of having `HTMLTag::children()` return an iterator over the *direct* subnodes (not nested ones, which made it a little less useful than it could be), it now returns a struct ([`Children`](https://docs.rs/tl/0.6.0/tl/struct.Children.html)) containing a few helper methods:
- `Children::all`: returns a slice containing *all* subnodes (including nested tags). Calling `::all()` on `.x` in the HTML above would return `[div.y, div.z, div.z, div.z]` (ignoring the whitespaces)
- `Children::top`: returns a slice containing only the "topmost", direct subnodes (which is what old `HTMLTag::children` did, before). Calling `::top()` on `.x` in the HTML above would return `[div.y]` (ignoring the whitespaces), because that's the only immediate subnode
- `Children::{start, end, boundaries}`: lower-level methods for getting the starting/ending boundary (indices) of the children. This is used by `Children::all` to build up the slice
wow good stuff! I think those are great improvements! I'll try out the new features, cheers!
Assume we have the following structure
Let's say I want to get all nested divs of class
z
that are inx
, but not the outside ones. Right now that's quite cumbersome, becausequery_selector()
only works onVDom
. Then I have to implement a DFS, usingNode::children
continuously.Suggestion:
NodeHandle
should also have aquery_selector
method, so that we can more easily traverse nested divs. I think it makes sense from an API perspective, because a node essentially spawns a subtree, so it should have the same functions that a VDom has.Alternative: An alternative solution would be to extend the syntax of the
query_selector
method to search for nested divs. Something likequery_selector("div.x > div.y > div.z")
.