benoit74 opened this issue 5 days ago (Open)
Despite its name, robots.txt's purpose is to prevent indexing robots from exploring resources (well, to give them directions, really). browsertrix-crawler is a technical bot, but it acts as a user, certainly not as an indexing bot.

I don't see value in such a feature, but I can imagine there are scenarios where it can be useful. @benoit74, do you have one to share?

Without further information, I'd advise against having this (not yet existing) feature on by default, as it changes the crawler's behavior, while I believe this project relies on explicit flags for that.
A first use case is https://forums.gentoo.org/robots.txt, where the robots.txt content indicates fairly faithfully what we should exclude from a crawl of the https://forums.gentoo.org/ website:
```
Disallow: /cgi-bin/
Disallow: /search.php
Disallow: /admin/
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /statistics.php
Disallow: /profile.php
Disallow: /privmsg.php
Disallow: /login.php
Disallow: /posting.php
```
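For illustration, here is a minimal sketch of the conversion being discussed: fetch robots.txt, keep the Disallow rules that apply to `User-agent: *`, and translate each path pattern into a regex that could be passed to the crawler's `--exclude` option (which takes URL regexes). This is not existing browsertrix-crawler code; the helper names and the parsing shortcuts are assumptions.

```python
# Sketch only: fetch robots.txt and derive exclusion regexes from it.
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def robots_disallow_paths(site_url: str) -> list[str]:
    """Return the Disallow paths applying to 'User-agent: *'.
    Simplified: ignores grouped user-agent lines and Allow rules."""
    text = urlopen(urljoin(site_url, "/robots.txt")).read().decode("utf-8", "replace")
    paths, applies = [], False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            applies = value == "*"
        elif field == "disallow" and applies and value:
            paths.append(value)
    return paths

def path_to_exclude_regex(path: str) -> str:
    """Translate a robots.txt path pattern into a URL-matching regex:
    '*' matches any character sequence, a trailing '$' anchors the end."""
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    return ".*".join(re.escape(p) for p in path.split("*")) + ("$" if anchored else "")

if __name__ == "__main__":
    # Prints one regex per Disallow rule, e.g. "/cgi\-bin/", "/search\.php", ...
    for p in robots_disallow_paths("https://forums.gentoo.org/"):
        print(path_to_exclude_regex(p))
```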
The idea behind automatically using robots.txt is to help lazy or less knowledgeable users get a first version of a WARC/ZIM that is likely to contain only useful content, rather than wasting time and resources (ours and the upstream server's) building a WARC/ZIM with too many unneeded pages.
Currently, in self-service mode, users tend to simply input the URL https://forums.gentoo.org/ and say "Zimit!". And this is true of "young" Kiwix editors as well.
After that initial run, it might still prove interesting in this case to include /profile.php (user profiles) in the crawl. At the very least, such a choice probably needs to be discussed by humans. But this kind of refinement can be done in a second step, once we realize we are missing it.
If we do not automate something here, the self-service approach is mostly doomed to produce only bad archives, which is a bit sad.
This confirms that it can be useful in zimit, via an option (that you'd turn on).
It would be nice if the crawler could automatically fetch rules from robots.txt and add exclusion rules for every rule present in the robots.txt file.

I think this functionality should even be turned on by default, to avoid annoying servers which have clearly expressed what they do not want "external systems" to mess with.
At Kiwix, we have lots of non-tech users configuring zimit to run a browsertrix crawl. In most cases, they have no idea what a robots.txt is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default; we can do the magic to turn it on by default in zimit ^^
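For what it's worth, the zimit-side "on by default" behavior could be as small as the sketch below, reusing the helpers from the earlier sketch. `build_crawler_args` and `respect_robots` are hypothetical names, not existing zimit code.

```python
# Purely illustrative: how zimit could enable the behavior by default while
# still letting users opt out. Assumes robots_disallow_paths and
# path_to_exclude_regex from the sketch above are importable.
def build_crawler_args(url: str, respect_robots: bool = True) -> list[str]:
    args = ["crawl", "--url", url]
    if respect_robots:
        # Map each Disallow rule onto a crawler exclusion regex.
        for path in robots_disallow_paths(url):
            args += ["--exclude", path_to_exclude_regex(path)]
    return args

# e.g. build_crawler_args("https://forums.gentoo.org/") would append
# "--exclude", "/cgi\-bin/", "--exclude", "/search\.php", and so on.
```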