webis-de / scriptor

Plug-and-play reproducible web analysis.
MIT License

snapshot: currently visible elements #20

Open johanneskiesel opened 2 years ago

johanneskiesel commented 2 years ago

Probably as a configuration option of the node snapshot, with the possibility to turn everything off except the XPath.

johanneskiesel commented 2 years ago

It probably makes sense to add a numeric value for how much of the element is visible on screen.

But I'm not sure anymore whether such an option makes sense. The straightforward things I can come up with can actually be done afterwards, based on the information the snapshot already provides, so this would rather be a downstream feature.

There seems to be a way to get "all text currently visible" using Ranges, but that would likely require a rather sophisticated algorithm that works line by line (or we assume that there is no horizontal scrolling, though even then this is not straightforward).
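A minimal sketch of the numeric visibility value as a downstream computation, assuming the snapshot records a bounding box per node and the viewport size. The `Box` shape and the function name are hypothetical and not part of scriptor's snapshot format. The sketch only measures overlap with the viewport; it does not account for occlusion by other elements, overflow clipping, or CSS visibility.

```typescript
// Hypothetical box shape; the field names are assumptions, not scriptor's snapshot format.
interface Box {
  left: number;
  top: number;
  right: number;
  bottom: number;
}

// Fraction (0..1) of the element's area that lies inside the viewport.
function visibleFraction(element: Box, viewport: Box): number {
  const width = element.right - element.left;
  const height = element.bottom - element.top;
  if (width <= 0 || height <= 0) {
    return 0; // zero-area elements count as not visible
  }
  const overlapWidth =
    Math.min(element.right, viewport.right) - Math.max(element.left, viewport.left);
  const overlapHeight =
    Math.min(element.bottom, viewport.bottom) - Math.max(element.top, viewport.top);
  if (overlapWidth <= 0 || overlapHeight <= 0) {
    return 0; // no overlap with the viewport
  }
  return (overlapWidth * overlapHeight) / (width * height);
}
```

For example, with a 100x100 viewport at the origin, an element spanning (50, 50) to (150, 150) has one quarter of its area on screen.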

johanneskiesel commented 2 years ago

OK, getting the currently visible text probably works by using ranges and considering all text nodes. Things will get a bit simpler once caretRangeFromPoint or caretPositionFromPoint are no longer experimental. Until then, one can still experiment by creating ranges (using setStart / setEnd) and checking them with getBoundingClientRect (also marked experimental, but it seems widely supported). getClientRects may help as well: it is available on Range as well as on Element, though not on text nodes directly.
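The range-probing idea could be sketched roughly as follows. To keep the logic testable outside a browser, the rectangle lookup is passed in as a function; `Rect`, `CharRectProbe`, and `visibleText` are hypothetical names. In a browser, the probe would presumably create a range with `document.createRange()`, call `setStart(textNode, offset)` and `setEnd(textNode, offset + 1)`, and return `getBoundingClientRect()`. Probing every character is the naive variant and ignores occlusion by other elements.

```typescript
// Minimal rectangle shape, compatible with what getBoundingClientRect returns.
interface Rect {
  left: number;
  top: number;
  right: number;
  bottom: number;
}

// Returns the on-screen rectangle of the character at `offset` in one text node.
// Assumption: in a browser this wraps a Range covering [offset, offset + 1).
type CharRectProbe = (offset: number) => Rect;

// True if the rectangle has a non-zero area and overlaps the viewport.
function intersects(rect: Rect, viewport: Rect): boolean {
  return (
    rect.right > rect.left &&
    rect.bottom > rect.top &&
    rect.left < viewport.right &&
    rect.right > viewport.left &&
    rect.top < viewport.bottom &&
    rect.bottom > viewport.top
  );
}

// Collects the characters of one text node whose rectangles overlap the viewport.
function visibleText(text: string, rectOf: CharRectProbe, viewport: Rect): string {
  let result = "";
  for (let offset = 0; offset < text.length; offset++) {
    if (intersects(rectOf(offset), viewport)) {
      result += text[offset];
    }
  }
  return result;
}
```

Characters that render with a zero-area rectangle, such as collapsed whitespace, are dropped by this check, which may or may not be the desired behavior.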

johanneskiesel commented 2 years ago

Of course, one could just make a range for each character of a page and store the result (character, position, font, font size, color, node XPath, node offset, ...).
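One possible shape for such per-character records, as a rough sketch. All type and field names here are hypothetical; they simply mirror the list above. The rectangle lookup is again passed in as a function, and the style values are assumed to have been read beforehand, for example from `getComputedStyle` on the text node's parent element.

```typescript
// Position of one character on the page.
interface CharBox {
  left: number;
  top: number;
  right: number;
  bottom: number;
}

// One stored entry per character; the fields follow the list in the comment above.
interface CharacterRecord {
  character: string;
  box: CharBox;
  font: string;
  fontSize: string;
  color: string;
  nodeXPath: string;
  nodeOffset: number;
}

// What is assumed to be known about a text node before its characters are recorded.
interface TextNodeInfo {
  text: string;
  xpath: string;
  font: string;
  fontSize: string;
  color: string;
}

// Builds one record per UTF-16 code unit of the text node.
function characterRecords(
  node: TextNodeInfo,
  boxOf: (offset: number) => CharBox
): CharacterRecord[] {
  const records: CharacterRecord[] = [];
  for (let offset = 0; offset < node.text.length; offset++) {
    records.push({
      character: node.text[offset],
      box: boxOf(offset),
      font: node.font,
      fontSize: node.fontSize,
      color: node.color,
      nodeXPath: node.xpath,
      nodeOffset: offset,
    });
  }
  return records;
}
```

Storing one record per character is simple but can get large for long pages; grouping characters that share a node and a line would be a natural refinement.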