webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Specifying selectors for extracting links. #217

Open ttaomae opened 1 year ago

ttaomae commented 1 year ago

I came across a site which uses an <area> tag with an href attribute to create links with a non-standard shape. I don't know if this is the correct way to approach this, but I was able to capture these links by implementing the following custom driver.

module.exports = async ({data, page, crawler}) => {
  // Extract links from <area href> elements in addition to the usual <a href>.
  await crawler.loadPage(page, data, [
    {selector: "a[href]", extract: "href", isAttribute: false},
    {selector: "area[href]", extract: "href", isAttribute: false}
  ]);
};

However, I did not see anything in the documentation hinting at this and it required reading through the source code to even determine that the driver is what I should be looking into.

Furthermore, I've noticed that defaultDriver.js has changed significantly over time, so it is not clear to me whether this approach will remain valid in the long run. And to emphasize that point, it is worth mentioning that this driver works in 0.7.1 but breaks in 0.8.0-beta.1 (though I realize that fixing it just requires changing module.exports = to export default).

Would you consider implementing an easier way to configure the link extraction selectors? Or, if a custom driver is the recommended approach, is this documented somewhere?

ikreymer commented 1 year ago

I just recommended using a custom driver in the other issue! Yes, these are all good points! You're right, there's not an example of driver usage in the current readme, which is a bit of an oversight.

The example you have is currently the best option. However, it would be fairly easy to add custom selectors via the cmdline, perhaps --selector a[href]:href --selector area[href]:href, which would then be passed to the driver in the same way as you have there. I think that'd be a pretty simple thing to add (just want to be careful with the syntax).
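To illustrate why the syntax needs care, here is a rough sketch of how a value in the proposed selector:property form could be parsed. The --selector flag does not exist yet, and parseSelectorArg is a hypothetical helper name, not part of the crawler. The sketch splits on the last colon because CSS selectors can contain colons themselves, and it assumes the value after the colon is always a DOM property (isAttribute: false), as in the driver above.

```javascript
// Hypothetical helper: parse one value of the proposed
// "<css selector>:<property>" form into the object shape the driver uses.
function parseSelectorArg(value) {
  // Split on the LAST colon, because the selector part may contain
  // colons of its own (e.g. "a:not(.nav)[href]:href").
  const idx = value.lastIndexOf(":");
  if (idx <= 0 || idx === value.length - 1) {
    throw new Error(
      `Invalid selector "${value}", expected <css selector>:<property>`
    );
  }
  return {
    selector: value.slice(0, idx),
    extract: value.slice(idx + 1),
    // Assumption: always read the DOM property, as in the driver above.
    isAttribute: false,
  };
}

// The two values from the suggested command line.
const selectors = ["a[href]:href", "area[href]:href"].map(parseSelectorArg);
console.log(selectors);
```

Even this form stays ambiguous in places: a value such as a:hover with no property part would be read as selector "a" and property "hover", and attribute names that contain a colon cannot be expressed at all. A less common separator might be a safer choice.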

These are all good suggestions. For now you can use the driver script you have, and we'll update this ticket once we have a chance to add this!

The tool is still pre-1.0.0, so a few things are changing, like the switch to ES modules, but we hope to have a stable driver format in place soon!

benoit74 commented 2 months ago

We are impacted by this issue at Kiwix as well: we have a website we convert to ZIM that relies on <area>.

Should we also develop a custom driver or would you recommend that we make a PR to add selectors via cmdline as suggested?

tw4l commented 2 months ago

Hi @benoit74, I'd suggest that a PR to add selectors via a cmdline argument would be the better, more flexible approach here. It shouldn't be too difficult: it would just be a matter of checking whether the argument was provided (perhaps as a JSON string) and, if so, applying the settings by overriding the default selectors argument to extractLinks. It would also be worth adding some validation to ensure that the string being passed in is valid.
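As a minimal sketch of that check-and-validate step, assuming the argument arrives as a JSON string describing an array of selector objects: resolveSelectors and DEFAULT_SELECTORS are hypothetical names chosen for illustration, and the default shown is an assumption modelled on the driver earlier in this thread rather than a copy of the crawler's own code.

```javascript
// Assumed default, modelled on the driver posted earlier in this thread.
const DEFAULT_SELECTORS = [
  { selector: "a[href]", extract: "href", isAttribute: false },
];

// Hypothetical helper: return the default when no argument was given,
// otherwise parse and validate the JSON string.
function resolveSelectors(raw) {
  if (raw === undefined || raw === null || raw === "") {
    return DEFAULT_SELECTORS;
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    throw new Error(`selectors argument is not valid JSON: ${e.message}`);
  }
  if (!Array.isArray(parsed) || parsed.length === 0) {
    throw new Error("selectors argument must be a non-empty JSON array");
  }
  return parsed.map((item, i) => {
    if (item === null || typeof item !== "object" ||
        typeof item.selector !== "string" || item.selector === "") {
      throw new Error(`selectors[${i}] needs a non-empty string "selector"`);
    }
    // Fill in the same values the driver example uses when a field is omitted.
    const extract = item.extract === undefined ? "href" : item.extract;
    const isAttribute = item.isAttribute === undefined ? false : item.isAttribute;
    if (typeof extract !== "string" || extract === "") {
      throw new Error(`selectors[${i}].extract must be a non-empty string`);
    }
    if (typeof isAttribute !== "boolean") {
      throw new Error(`selectors[${i}].isAttribute must be a boolean`);
    }
    return { selector: item.selector, extract, isAttribute };
  });
}

// Example: keep <a> links and add <area> links.
const custom = resolveSelectors(
  '[{"selector":"a[href]"},{"selector":"area[href]","extract":"href"}]'
);
console.log(custom);
```

Failing loudly on malformed input, rather than silently falling back to the default, makes a mistyped argument visible before a long crawl starts.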

benoit74 commented 1 month ago

Thank you @tw4l for the detailed suggestions.

Just for the record, Kiwix's work on this has been postponed to "later". Since that might mean months, anyone who wants to contribute to this issue should feel free to do so; we will not collide on this. If we do start working on it, I will post here first.