yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Problems with "?" in robots.txt #304

Open · loxal opened 6 years ago

loxal commented 6 years ago

In https://www.welt.de/robots.txt there are entries containing ?, such as Disallow: /*?config. Hence https://www.welt.de/test?config should be allowed, but it is not. Entries like Disallow: /*.xmli, on the other hand, work properly and disallow https://www.welt.de/test.xmli. After investigating, I figured out that ? is the problematic character.

I use RobotstxtServer#allow("https://www.welt.de/test?config") for testing.
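
For reference, a minimal, self-contained way to reproduce this check (a sketch assuming the crawler4j 4.x API, where the public method is RobotstxtServer#allows(WebURL)):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        CrawlConfig crawlConfig = new CrawlConfig();
        PageFetcher pageFetcher = new PageFetcher(crawlConfig);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        WebURL url = new WebURL();
        url.setURL("https://www.welt.de/test?config");

        // With Disallow: /*?config in welt.de's robots.txt, Google's spec
        // says this URL is disallowed, so the expected output is "false".
        System.out.println(robotstxtServer.allows(url));
    }
}
```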

s17t commented 6 years ago

The internal robots.txt parser is far from RFC compliant. There are several open issues around this (#256, #107) and probably more. I have asked another project (https://github.com/crawler-commons/crawler-commons/issues/197), which has a better robots.txt parser, to split it into modules so that crawler4j can reuse it.

In the meantime, patches are welcome.

sebastian-nagel commented 6 years ago

... Disallow: /*?config. Hence https://www.welt.de/test?config should be allowed but it is not.

According to Google's spec this URL is not allowed (* is a glob character, while ? is not). The RFC draft does not support any glob characters in allow/disallow statements (see robotstxt.org):

Note also that globbing and regular expression are not supported ... Specifically, you cannot have lines like ... "Disallow: *.gif".

crawler-commons follows the Google spec here, which is widely used - I suppose that even the webmaster of welt.de wants to disallow /test?config.
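
To illustrate the matching semantics under discussion: * matches any sequence of characters, while every other character, including ?, is matched literally. A simplified sketch of Google-style rule matching (not crawler-commons' actual implementation; it ignores details such as the $ end-of-URL anchor):

```java
import java.util.regex.Pattern;

public class RobotsPatternDemo {

    // Compile a Disallow rule where '*' matches any character sequence and
    // all other characters (including '?') are matched literally.
    static Pattern toPattern(String rule) {
        StringBuilder regex = new StringBuilder();
        for (char c : rule.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = toPattern("/*?config");
        // Rules match URL-path prefixes, hence lookingAt() instead of matches().
        System.out.println(p.matcher("/test?config").lookingAt()); // true  -> disallowed
        System.out.println(p.matcher("/test/config").lookingAt()); // false -> allowed
    }
}
```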

loxal commented 6 years ago

According to https://en.wikipedia.org/wiki/Glob_(programming) both * and ? are glob characters. RFC drafts are nice, but ultimately Google sets the standard (even when it is not RFC-compliant). And according to Google's spec, * is supported.

sebastian-nagel commented 6 years ago

Agreed, Google sets the standard, and this means ? has no glob meaning in the Disallow: statement, which "support[s] a limited form of 'wildcards' for path values".

sebastian-nagel commented 6 years ago

@loxal, or did I misunderstand you: shall /*?config match /test?config or not? If yes, it's disallowed, right?

loxal commented 6 years ago

@sebastian-nagel whatever Google considers correct. But ? does not need any glob meaning for Disallow: /*?config to disallow /test?config. In this case * is a glob character and is supported, whereas ? is just a regular character.

I think the current implementation also treats ? as a glob character and therefore has problems evaluating *?.

sebastian-nagel commented 6 years ago

Ok, we agree. I haven't tested crawler4j but just wanted to make sure that crawler-commons behaves as expected: /test?config is forbidden but e.g. /test/config is allowed. Thanks!
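
For anyone who wants to reproduce that check, a minimal sketch against crawler-commons (assuming its SimpleRobotRulesParser API; names and signatures may vary between versions):

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class CrawlerCommonsDemo {
    public static void main(String[] args) {
        byte[] robotsTxt = ("User-agent: *\n"
                          + "Disallow: /*?config\n").getBytes();

        // Parse the rules as they would apply to a bot named "testbot".
        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://www.welt.de/robots.txt", robotsTxt, "text/plain", "testbot");

        System.out.println(rules.isAllowed("https://www.welt.de/test?config")); // expected: false
        System.out.println(rules.isAllowed("https://www.welt.de/test/config")); // expected: true
    }
}
```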

loxal commented 6 years ago

I just checked with https://www.google.com/webmasters/tools/robots-testing-tool. No matter what Google's documentation says, the actual implementation speaks another language: entries like Disallow: /*?config result in /test?config being ignored. One of those - not too rare - cases where the documentation is misleading.

sebastian-nagel commented 6 years ago

Entries like Disallow: /*?config result in /test?config being ignored.

What does "being ignored" mean? Is it crawled or not?

sebastian-nagel commented 6 years ago

If it means that Googlebot does not visit /test?config, this is in accordance with the documentation.

loxal commented 6 years ago

The crawler ignores it. /test?config is not crawled. This is what Google's robots testing tool says.

sebastian-nagel commented 6 years ago

Yes, but then the behavior and documentation agree: the wildcard * matches test, and all remaining characters (including ?) are matched literally. It's a disallow rule, so bots shall not crawl anything matching /*?config, right?

loxal commented 6 years ago

Yes.

sebastian-nagel commented 6 years ago

I'll keep this on the radar, and will add a unit test to crawler-commons' robots.txt parser, just to make sure that it continues to work. Thanks!
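
A sketch of what such a regression test might look like (hypothetical JUnit 4 code, not the actual test that was added; it asserts the two outcomes agreed on above):

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import org.junit.Test;

public class WildcardQueryRuleTest {

    @Test
    public void questionMarkIsMatchedLiterally() {
        byte[] robotsTxt = ("User-agent: *\n"
                          + "Disallow: /*?config\n").getBytes();
        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://www.welt.de/robots.txt", robotsTxt, "text/plain", "testbot");

        // '*' is a wildcard, '?' is literal: the first URL matches the rule.
        assertFalse(rules.isAllowed("https://www.welt.de/test?config"));
        assertTrue(rules.isAllowed("https://www.welt.de/test/config"));
    }
}
```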

Chaiavi commented 6 years ago

Thank you.
