xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Disable Robots not working correctly? #286

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
(This question has also been asked on StackOverflow:
http://stackoverflow.com/questions/25306704/disable-robotserver-in-crawler4j)

I have disabled the robots config as follows:

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...

I expected that the 'noindex' and 'nofollow' tags would now be ignored, but they
are not, and these pages are still skipped. I am not sure whether this is a real
issue or just a configuration problem, but from the documentation it seems that
'robotstxtConfig.setEnabled(false);' should be enough to disable it.

Tested with versions 3.5 and 3.6-SNAPSHOT.
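
For completeness, here is a minimal, self-contained sketch of the setup above, assuming the standard crawler4j 3.x API (CrawlController is the stock controller class, presumably what WebCrawlerController wraps; MyCrawler is a hypothetical crawler subclass used only for illustration):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlWithoutRobots {

    // Hypothetical crawler that just logs each visited page.
    public static class MyCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-root");      // hypothetical storage folder

        PageFetcher pageFetcher = new PageFetcher(config);

        // Disable robots.txt handling entirely, so pages are no longer
        // filtered out by the site's robots.txt rules.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("crawler4j-test");   // stands in for USER_AGENT_NAME
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://formulier.denhaag.nl/Tripleforms/formulierenoverzicht/DefaultEnvironment.aspx");
        controller.start(MyCrawler.class, 1);
    }
}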

Original issue reported on code.google.com by jorgehor...@gmail.com on 19 Aug 2014 at 7:54

GoogleCodeExporter commented 9 years ago
Thank you George.

Can you give me a specific URL you were trying to crawl where links you expected
to be crawled were ignored?

Original comment by avrah...@gmail.com on 19 Aug 2014 at 7:57

GoogleCodeExporter commented 9 years ago
Yes, of course. For example:
https://formulier.denhaag.nl/Tripleforms/formulierenoverzicht/DefaultEnvironment.aspx

It should return ~336 crawled pages, but right now they are being skipped.

Note: it is an HTTPS page; I used the fix shown in
https://code.google.com/p/crawler4j/issues/detail?id=174

Original comment by jorgehor...@gmail.com on 19 Aug 2014 at 8:05

GoogleCodeExporter commented 9 years ago
I will check it, but since I have not yet approved and/or merged the code from
issue 174, we might get different results.

Anyway, this is a good use case and I will check it out.

This issue might take longer, as I have several other issues on my plate that
are planned to be fixed first.

Original comment by avrah...@gmail.com on 19 Aug 2014 at 8:07

GoogleCodeExporter commented 9 years ago
Don't worry; as a workaround, I can store these pages myself for now.

Original comment by jorgehor...@gmail.com on 19 Aug 2014 at 8:13

GoogleCodeExporter commented 9 years ago
After some digging I have concluded the following:
1. This site needs the fix from issue 174.
2. After implementing the fix suggested in issue 174, still only a few pages
get crawled, because the crawler thinks it has already crawled those pages!

This is a weird bug.
It seems that the pageFetcher's httpClient fetches the wrong status code (302
instead of 200):
HttpResponse response = httpClient.execute(get);

All of that code needs serious upgrading, as many methods there are deprecated.
I have opened a new issue for that: Issue 302

I hope solving that issue will also resolve this bug.
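
For reference, a minimal sketch (using the non-deprecated Apache HttpClient 4.3+ builder API, not the code currently in crawler4j) of fetching such a page with redirect handling enabled, so a 302 is followed through to the final 200:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.LaxRedirectStrategy;

public class RedirectAwareFetch {
    public static void main(String[] args) throws Exception {
        String url = "https://formulier.denhaag.nl/Tripleforms/formulierenoverzicht/DefaultEnvironment.aspx";

        // LaxRedirectStrategy follows 301/302/303 redirects.
        CloseableHttpClient httpClient = HttpClients.custom()
                .setRedirectStrategy(new LaxRedirectStrategy())
                .build();

        try (CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
            // With redirects followed, this should report 200 rather than 302.
            System.out.println("Final status: " + response.getStatusLine().getStatusCode());
        }
    }
}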

Original comment by avrah...@gmail.com on 2 Sep 2014 at 12:38

GoogleCodeExporter commented 9 years ago
Thanks for the feedback. I will be waiting for this fix. 

Original comment by jorgehor...@gmail.com on 2 Sep 2014 at 1:25

GoogleCodeExporter commented 9 years ago
I have fixed issue 174 using a different approach.

But this problem still remains.

I have asked for help on issue 302; I hope someone will help us there, and maybe
that will fix this issue.

Original comment by avrah...@gmail.com on 15 Sep 2014 at 2:35

GoogleCodeExporter commented 9 years ago
Other issues have been fixed, but this one still remains.

A careful debugging session is in order.

Original comment by avrah...@gmail.com on 17 Sep 2014 at 2:15

GoogleCodeExporter commented 9 years ago
I see this issue is still present in crawler4j-4.1. An example where I get 302 
instead of 200 is http://www.kmobgyn.com, but I've found a bunch more.
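
A quick way to confirm what the server actually returns there, as a sketch using only the JDK (independent of crawler4j):

import java.net.HttpURLConnection;
import java.net.URL;

public class CheckRedirect {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://www.kmobgyn.com").openConnection();
        conn.setInstanceFollowRedirects(false);             // show the raw status instead of silently following it
        int status = conn.getResponseCode();                // 302, per the report above
        String location = conn.getHeaderField("Location");  // target of the redirect, if any
        System.out.println(status + " -> " + location);
    }
}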

Original comment by and...@olariu.org on 30 Mar 2015 at 2:35