webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://crawler.docs.browsertrix.com

Automatically add exclusion rules based on `robots.txt` #631

Open benoit74 opened 5 days ago

benoit74 commented 5 days ago

It would be nice if the crawler could automatically fetch `robots.txt` and add an exclusion rule for every `Disallow` rule present in the file.
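
A minimal sketch of the fetch-and-parse step such a feature would need (a hypothetical helper, not part of the crawler; a real implementation would also have to honor `User-agent` groups and `Allow` rules):

```ts
// Hypothetical helper: fetch a site's robots.txt and collect its Disallow
// paths, so they can later be turned into crawler exclusion rules.
// Assumes Node 18+ (global fetch); User-agent groups and Allow rules are
// deliberately ignored to keep the sketch short.
async function fetchDisallowedPaths(seedUrl: string): Promise<string[]> {
  const robotsUrl = new URL("/robots.txt", seedUrl).href;
  const resp = await fetch(robotsUrl);
  if (!resp.ok) {
    return []; // no robots.txt (or unreadable): nothing to exclude
  }
  const paths: string[] = [];
  for (const rawLine of (await resp.text()).split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    const match = line.match(/^Disallow:\s*(\S+)/i);
    if (match) {
      paths.push(match[1]);
    }
  }
  return paths;
}

// Example: fetchDisallowedPaths("https://forums.gentoo.org/").then(console.log);
```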

I think this functionality should even be turned on by default, to avoid annoying servers which have clearly expressed what they do not want "external systems" to mess with.

At Kiwix, we have lots of non-tech users configuring zimit to do a browsertrix crawl. In most cases, they have no idea what a robots.txt is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default; we can do the magic to turn it on by default in zimit ^^

rgaudin commented 5 days ago

Despite its name, robots.txt's purpose is to prevent (well, just give directions to, actually) indexing robots from exploring resources. browsertrix-crawler is a technical bot, but it acts as a user and certainly not as an indexing bot.

I don't see value in such a feature but I can imagine there are scenarios where it can be useful. @benoit74 do you have one to share?

Without further information, I'd advise against having this (not-yet-existent) feature on by default, as it changes the crawler's behavior, while I think this project favors explicit flags for that.

benoit74 commented 5 days ago

A first use case is https://forums.gentoo.org/robots.txt, where the robots.txt content indicates fairly accurately what we should exclude from a crawl of the https://forums.gentoo.org/ website.

Disallow: /cgi-bin/
Disallow: /search.php
Disallow: /admin/
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /statistics.php
Disallow: /profile.php
Disallow: /privmsg.php
Disallow: /login.php
Disallow: /posting.php
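
For the file above, a hedged sketch of how those `Disallow` paths could become exclusion regexes, assuming they would be fed to something like the crawler's existing `--exclude` option (regexes matched against page URLs); robots.txt wildcard syntax (`*`, `$`) is not handled here:

```ts
// Hypothetical translation of the Disallow paths above into URL exclusion
// regexes. Escaping is deliberately naive; robots.txt wildcards are ignored.
const origin = "https://forums.gentoo.org";
const disallowed = [
  "/cgi-bin/", "/search.php", "/admin/", "/memberlist.php", "/groupcp.php",
  "/statistics.php", "/profile.php", "/privmsg.php", "/login.php", "/posting.php",
];

// Escape regex metacharacters in a literal string.
const escapeForRegex = (s: string): string =>
  s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// robots.txt rules are prefix matches, so anchor each regex on the URL prefix.
const exclusions = disallowed.map(
  (path) => `^${escapeForRegex(origin)}${escapeForRegex(path)}`,
);

console.log(exclusions);
// e.g. "^https://forums\.gentoo\.org/search\.php", "^https://forums\.gentoo\.org/admin/", ...
```

A real implementation would need the full robots.txt matching rules rather than the prefix-only handling shown here, but it is enough to illustrate the mapping from `Disallow` lines to exclusion rules.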

The idea behind automatically using robots.txt is to help lazy or less knowledgeable users get a first version of a WARC/ZIM which is likely to contain only useful content, rather than wasting time and resources (ours and the upstream server's) building a WARC/ZIM with too many unneeded pages.

Currently, in self-service mode, users tend to simply enter the URL https://forums.gentoo.org/ and say "Zimit!". This is true for "young" Kiwix editors as well.

After that initial run, it might still prove interesting in this case to include /profile.php (user profiles) in the crawl; at the very least, such a choice probably needs to be discussed by humans. But this kind of refinement can be done in a second step, once we realize it is missing.

If we do not automate something here, the self-service approach is mostly doomed to produce only bad archives, which is a bit sad.

rgaudin commented 5 days ago

This confirms that it can be useful in zimit, via an option (that you'd turn on).