ztx1491 / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

Support matching against query parameters in robots.txt rules #18

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Apparently Googlebot will use the query portion of a URL when matching against 
rules.

See http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

It's unclear to me whether Googlebot will attempt a match without the query 
portion of the URL if no more specific match is found, e.g. whether 
Disallow: /*.htm$ matches against http://domain.com/page.htm?session=23.

I would guess yes, but it would be best to validate using the "Fetch as Google" 
tool - see 
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=158587
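For illustration, the rule in question reads roughly as the following regex 
over the path-plus-query portion of the URL (a sketch of the semantics only, 
not a suggested implementation):

```java
import java.util.regex.Pattern;

public class QueryMatchQuestion {
    public static void main(String[] args) {
        // "Disallow: /*.htm$" read as a regex: start with "/", then
        // anything, then end in ".htm".
        Pattern rule = Pattern.compile("^/.*\\.htm$");

        System.out.println(rule.matcher("/page.htm").matches());            // true -> blocked
        System.out.println(rule.matcher("/page.htm?session=23").matches()); // false as written

        // ...unless Googlebot retries against the bare path "/page.htm",
        // which is exactly the open question above.
    }
}
```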

Original issue reported on code.google.com by kkrugler...@transpac.com on 17 Mar 2013 at 6:24

GoogleCodeExporter commented 9 years ago
See also 
http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html

Original comment by kkrugler...@transpac.com on 17 Mar 2013 at 6:26

GoogleCodeExporter commented 9 years ago
Fuad also reported finding robots.txt files that contain just a query portion:

Disallow: /?field=

It's unclear whether that would disallow ANY URL that has a "field" query 
parameter at any position. Something to try with Google's tool?

Original comment by kkrugler...@transpac.com on 17 Mar 2013 at 6:32

GoogleCodeExporter commented 9 years ago
See also 
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

Original comment by kkrugler...@transpac.com on 17 Mar 2013 at 6:42

GoogleCodeExporter commented 9 years ago
Hi kkrugler_lists,

I have tested the situations you mentioned with Google Webmaster Tools, and 
I'm attaching the results as a screenshot.

As you can see, "Disallow: /*.htm$" does NOT match against 
"http://domain.com/page.htm?session=23" but matches with 
"http://domain.com/page.htm".

And "Disallow: /?field=" or "Disallow: /*?field=" does NOT disallow ANY url 
that has a query parameter of "field" at any position. "Disallow: /*?field=" 
only matches if the "field" parameter is at the beginning of all parameters. 
(Ex:"http://crawltest.blogspot.com.tr/page.htm?field=23")

Original comment by alparsla...@gmail.com on 6 Mar 2014 at 12:35


GoogleCodeExporter commented 9 years ago
I'd like to contribute to the project by implementing this feature. I will 
make the changes in the SimpleRobotRules class and upload a patch as soon as 
possible.

Do you have any comments?

Original comment by alparsla...@gmail.com on 6 Mar 2014 at 12:39

GoogleCodeExporter commented 9 years ago
Hi alparslanavci,

Thanks for working on this issue - it's important, and something I've put off 
for way too long :)

As far as comments go, just put in lots of test cases...the main challenge with 
this type of change is avoiding introducing some new problem in the URL 
matching logic. Webmasters get _very_ upset if a crawler doesn't obey one of 
their rules, even if their robots.txt isn't completely valid.

Original comment by kkrugler...@transpac.com on 8 Mar 2014 at 1:13

GoogleCodeExporter commented 9 years ago
One other comment - it would be good to include issue #17 in the scope of this 
change, since that also covers regular expressions in URL patterns.

Original comment by kkrugler...@transpac.com on 8 Mar 2014 at 1:14

GoogleCodeExporter commented 9 years ago
Hi Ken,

I am uploading a patch for both issues, #17 and #18. However, as you can see in 
"Robots.txt Directives" in 
http://googlewebmastercentral.blogspot.com.tr/2008/06/improving-on-robots-exclusion-protocol.html, 
there is only ($) and (*) wildcard support. I have prepared 
the patch according to this and added some tests using "Example path matches" 
in https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt.

Would you please check the patch and comment?

Original comment by alparsla...@gmail.com on 10 Mar 2014 at 12:18


GoogleCodeExporter commented 9 years ago
Moreover, I have also run some performance tests and found that the approach 
used in the patch is about three times faster than regular-expression matching.
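For what it's worth, a naive way to reproduce that kind of comparison looks 
like the sketch below (numbers from a loop like this are rough at best; JIT 
warm-up skews them, and a real measurement would use a proper benchmark 
harness):

```java
import java.util.regex.Pattern;

public class MatchBench {
    public static void main(String[] args) {
        // Regex form of "/*.htm$" vs. a hand-rolled check of the same rule.
        Pattern regex = Pattern.compile("^/.*\\.htm$");
        String path = "/some/deep/path/page.htm";
        int n = 1000000;
        boolean sink = false;  // keep the JIT from dropping the loops

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink ^= regex.matcher(path).matches();
        }
        long regexNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink ^= path.startsWith("/") && path.endsWith(".htm");
        }
        long handNanos = System.nanoTime() - t0;

        System.out.println("regex=" + regexNanos + "ns hand=" + handNanos + "ns " + sink);
    }
}
```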

Original comment by alparsla...@gmail.com on 10 Mar 2014 at 12:21

GoogleCodeExporter commented 9 years ago
I should be able to check this out later today, or tomorrow at the latest.

Original comment by kkrugler...@transpac.com on 10 Mar 2014 at 2:27

GoogleCodeExporter commented 9 years ago
I took a look at the patch.

The hand-rolled 'regex-ish' functionality scares me a bit :) 

What was the performance issue with using a regular regex? I note your comment 
above, where you say that your approach is 3x faster, but it also seems like 
your approach would generate a lot more object allocations (splitting & 
substringing lines of text).

Original comment by krugl...@gmail.com on 11 Mar 2014 at 10:36

GoogleCodeExporter commented 9 years ago
Actually, the pattern matching for robots rules contains only ($) and (*) 
wildcard support, not all the regex rules. So, we do not need to use regular 
regex ops. Moreover; the rule syntax is not the same as regex syntax, so a 
parsing (which requires some object allocations) has to be done firstly in 
order to use regular regex ops. And lastly; in Java7, String.split() method has 
also developed into a non-regular  expression way in one-char splitting in 
order to make it light-weight. You can see the bug report here: 
http://bugs.java.com/view_bug.do?bug_id=6840246. I have also used new split 
implementation in the patch.

IMHO, this way is faster and more stable.
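To make the split-based idea concrete, here is a rough sketch (hypothetical 
code, not the exact patch): '*' separates literal fragments that must appear 
in order, and a trailing '$' anchors the rule to the end of the text. The 
split pattern "\\*" is a single escaped character, so Java 7's regex-free 
split() fast path applies:

```java
// Sketch of a split-based wildcard matcher for robots.txt rules.
static boolean ruleMatches(String text, String pattern) {
    boolean anchored = pattern.endsWith("$");
    if (anchored) {
        pattern = pattern.substring(0, pattern.length() - 1);
    }
    String[] parts = pattern.split("\\*");
    String remaining = text;
    for (int i = 0; i < parts.length; i++) {
        String part = parts[i];
        if (i == 0) {
            // The first fragment is anchored to the start of the path.
            if (!remaining.startsWith(part)) {
                return false;
            }
            remaining = remaining.substring(part.length());        // allocates
        } else {
            // Later fragments may match anywhere after the previous one.
            int pos = remaining.indexOf(part);
            if (pos < 0) {
                return false;
            }
            remaining = remaining.substring(pos + part.length());  // allocates
        }
    }
    // A '$' anchor requires the match to consume the whole text. (A complete
    // version would also retry later occurrences of the last fragment.)
    return !anchored || remaining.isEmpty();
}
```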

Original comment by alparsla...@gmail.com on 12 Mar 2014 at 8:35

GoogleCodeExporter commented 9 years ago
OK, took another look, and I agree with what you say above - it's not a regular 
regex, so the hand-rolled version is a good idea.

The one change I'd like to see is avoiding the text.substring() call (and thus 
allocations) by using text.indexOf(card, curOffset) and walking the offset 
forward...this is in the ruleMatches() method.

I'm happy to make that change, I'm just out of time for it currently.
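Concretely, the change described above replaces the substring() calls with an 
int offset that walks forward through the text, along these lines (again a 
sketch, not the exact committed code):

```java
// Same matcher as the earlier sketch, but allocation-free in the loop.
static boolean ruleMatches(String text, String pattern) {
    boolean anchored = pattern.endsWith("$");
    if (anchored) {
        pattern = pattern.substring(0, pattern.length() - 1);
    }
    String[] parts = pattern.split("\\*");
    int offset = 0;  // walks forward instead of creating substrings
    for (int i = 0; i < parts.length; i++) {
        String part = parts[i];
        if (i == 0) {
            // First fragment must sit at the start of the text.
            if (!text.startsWith(part)) {
                return false;
            }
            offset = part.length();
        } else {
            // Later fragments are located in place, no copies made.
            int pos = text.indexOf(part, offset);
            if (pos < 0) {
                return false;
            }
            offset = pos + part.length();
        }
    }
    return !anchored || offset == text.length();
}
```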

Original comment by kkrugler...@transpac.com on 13 Mar 2014 at 3:06

GoogleCodeExporter commented 9 years ago
I've gone ahead and made the change I suggested above, and also added sorting 
to the rules (this then gets the expected result when checking URLs against the 
Amazon.com robots.txt file). Changes made in r113 and r114.
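A minimal sketch of the sorting idea, assuming a hypothetical RobotRule 
holder: order patterns longest-first so the most specific rule is checked 
(and wins) before shorter ones, with Allow winning ties:

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class RobotRule {
    final String pattern;  // path pattern from the Allow/Disallow line
    final boolean allow;   // true for Allow, false for Disallow

    RobotRule(String pattern, boolean allow) {
        this.pattern = pattern;
        this.allow = allow;
    }
}

class RuleSorter {
    // Longest (most specific) pattern first; Allow precedes Disallow on ties.
    static void sortRules(List<RobotRule> rules) {
        Collections.sort(rules, new Comparator<RobotRule>() {
            @Override
            public int compare(RobotRule a, RobotRule b) {
                if (a.pattern.length() != b.pattern.length()) {
                    return b.pattern.length() - a.pattern.length();
                }
                return Boolean.compare(b.allow, a.allow);
            }
        });
    }
}
```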

Original comment by kkrugler...@transpac.com on 13 Mar 2014 at 11:51


GoogleCodeExporter commented 9 years ago
Thanks for the comments and changes, Ken!

Original comment by alparsla...@gmail.com on 14 Mar 2014 at 7:24