loxal opened this issue 6 years ago
The internal robots.txt parser is far from RFC-compliant. There are several issues around this (#256, #107) and probably more. I asked another project with a better robots.txt parser (https://github.com/crawler-commons/crawler-commons/issues/197) to split it into modules so that it can be reused in crawler4j.
In the meantime, patches are welcome.
> ... `Disallow: /*?config`. Hence `https://www.welt.de/test?config` should be allowed but it is not.
According to Google's spec this URL is not allowed (`*` is a glob character, while `?` is not). The RFC draft does not support any glob characters in allow/disallow statements (see robotstxt.org): "Note also that globbing and regular expression are not supported ... Specifically, you cannot have lines like ... `Disallow: *.gif`."
crawler-commons follows the Google spec here, which is widely used - I suppose that even the webmaster of welt.de wants to disallow `/test?config`.
According to https://en.wikipedia.org/wiki/Glob_(programming) both `*` and `?` are glob characters. RFC drafts are nice, but ultimately Google sets the standards (even if they are not RFC-compliant). And according to Google's spec, `*` is supported.
Agreed, Google sets the standard, and this means `?` has no glob meaning in the `Disallow:` statement, which "support[s] a limited form of 'wildcards' for path values".
@loxal, or did I misunderstand you: shall `/*?config` match `/test?config` or not? If yes, it's disallowed, right?
@sebastian-nagel whatever Google considers correct. But `?` does not need any glob meaning for `Disallow: /*?config` to disallow `/test?config`. In this case `*` is a glob character and is supported, whereas `?` is just a regular character.
I think the current implementation also treats `?` as a glob character and therefore has problems evaluating `*?`.
Ok, we agree. I haven't tested crawler4j but just wanted to make sure that crawler-commons behaves as expected: `/test?config` is forbidden but e.g. `/test/config` is allowed. Thanks!
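The semantics agreed on here can be sketched in a few lines. This is a hypothetical Python illustration, not code from crawler4j or crawler-commons: `*` is translated to a wildcard, every other character (including `?`) is matched literally, and a disallow pattern matches from the start of the URL path.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any sequence of characters, a trailing '$' anchors
    the pattern at the end of the path, and everything else -
    including '?' - is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_pattern: str) -> bool:
    # Disallow rules match from the beginning of the URL path.
    return pattern_to_regex(disallow_pattern).match(path) is not None


print(is_disallowed("/test?config", "/*?config"))  # True: '*' matches 'test'
print(is_disallowed("/test/config", "/*?config"))  # False: no literal '?'
print(is_disallowed("/test.xmli", "/*.xmli"))      # True
```

Under these rules `/test?config` is disallowed and `/test/config` is allowed, matching the behavior described above.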
I just checked with https://www.google.com/webmasters/tools/robots-testing-tool.
No matter what Google's documentation says, the actual implementation speaks another language: entries like `Disallow: /*?config` result in `/test?config` being ignored. One of those - not too rare - cases where the documentation is misleading.
> Entries like `Disallow: /*?config` result in `/test?config` being ignored.

What does "being ignored" mean? Is it crawled or not? If it means that Googlebot does not visit `/test?config`, this is in accordance with the documentation.
The crawler ignores it. `/test?config` is not crawled. That is what Google's robots testing tool says.
Yes, but then the behavior and the documentation agree: the wildcard `*` matches `test`, and all remaining characters (including `?`) are matched literally. It's a disallow rule, so bots shall not crawl anything matching `/*?config`, right?
Yes.
I'll keep this on the radar, and will add a unit test to crawler-commons' robots.txt parser, just to make sure that it continues to work. Thanks!
Thank you.
In https://www.welt.de/robots.txt there are `?`-containing entries like `Disallow: /*?config`. Hence `https://www.welt.de/test?config` should be allowed but it is not. Whereas entries like `Disallow: /*.xmli` work properly and disallow `https://www.welt.de/test.xmli`. After my investigation I figured out that `?` is the problematic character.
I use `RobotstxtServer#allow("https://www.welt.de/test?config")` for testing.
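For comparison, a parser that follows the original robots.txt convention (plain prefix matching, no wildcard support) reaches the opposite conclusion for this exact rule. Python's standard-library `urllib.robotparser` is such a parser, so it can be used to illustrate the RFC-style behavior the report expected; this is a stdlib demonstration, not crawler4j code:

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt mirroring the welt.de entry discussed above.
robots_txt = """\
User-agent: *
Disallow: /*?config
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
rp.modified()  # mark the rules as loaded so can_fetch() evaluates them

# The stdlib parser does plain prefix matching: '*' is a literal
# character, so the rule never matches /test?config and the URL is
# considered allowed - unlike the Google / crawler-commons wildcard
# semantics, under which it is disallowed.
print(rp.can_fetch("*", "https://www.welt.de/test?config"))  # True
```

The same robots.txt therefore yields opposite answers depending on whether the parser implements Google-style wildcards, which is exactly the discrepancy this issue is about.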