Open GoogleCodeExporter opened 9 years ago
I made a patch for this issue.
It parses the crawl delay the same way the other settings are parsed from
robots.txt, and HostDirectives stores it in milliseconds. HostDirectives also
no longer stores the time of the previous access; it now stores the access time
it most recently handed out.
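
A minimal sketch of the parsing step, not the attached patch itself. The class
and method names are made up for illustration, and it assumes the usual
convention that the Crawl-delay value is given in seconds (possibly
fractional), which is then converted to the milliseconds HostDirectives keeps:

```java
/** Hypothetical helper; not part of crawler4j or the attached patch. */
final class CrawlDelayParser {
    private static final String PREFIX = "crawl-delay:";

    /**
     * Returns the delay in milliseconds, or -1 if the line is not a
     * usable Crawl-delay directive.
     */
    static long parseCrawlDelayMillis(String line) {
        String trimmed = line.trim();
        // Case-insensitive check for the directive name.
        if (!trimmed.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return -1L;
        }
        String value = trimmed.substring(PREFIX.length());
        // Drop a trailing robots.txt comment, if any.
        int hash = value.indexOf('#');
        if (hash >= 0) {
            value = value.substring(0, hash);
        }
        try {
            double seconds = Double.parseDouble(value.trim());
            if (Double.isNaN(seconds) || Double.isInfinite(seconds) || seconds < 0) {
                return -1L;
            }
            // Assumption: the directive is in seconds; store milliseconds.
            return Math.round(seconds * 1000.0);
        } catch (NumberFormatException e) {
            return -1L;
        }
    }
}
```

For example, `CrawlDelayParser.parseCrawlDelayMillis("Crawl-delay: 10")` would
give 10000 under these assumptions.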
It checks whether the current time is later than the previously granted access
time plus the delay. If it is, it returns the current time and stores it as the
previously granted access time. If not, it adds the delay to the previously
granted access time, stores the result and returns it.
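
The rule above can be sketched roughly like this. It is an illustration only:
`HostAccessSchedule` and `reserveNextAccess` are hypothetical names rather than
the actual HostDirectives code, and starting from `Long.MIN_VALUE` so that the
first request is granted immediately is my assumption:

```java
/** Hypothetical stand-in for the per-host state kept in HostDirectives. */
final class HostAccessSchedule {
    private final long crawlDelayMillis;
    // Assumption: start far in the past so the first request is never delayed.
    private long lastGrantedAccessMillis = Long.MIN_VALUE;

    HostAccessSchedule(long crawlDelayMillis) {
        if (crawlDelayMillis < 0) {
            throw new IllegalArgumentException("crawl delay must not be negative");
        }
        this.crawlDelayMillis = crawlDelayMillis;
    }

    /**
     * Reserves the next access slot for this host and returns its time.
     * Synchronized because several crawler threads may ask at once.
     */
    synchronized long reserveNextAccess(long nowMillis) {
        if (nowMillis > lastGrantedAccessMillis + crawlDelayMillis) {
            // The host has been idle long enough: access is allowed right now.
            lastGrantedAccessMillis = nowMillis;
        } else {
            // Too early: queue this access one delay after the last granted one.
            lastGrantedAccessMillis += crawlDelayMillis;
        }
        return lastGrantedAccessMillis;
    }
}
```

Because each call pushes the stored time forward, threads that ask at nearly
the same moment get slots spaced one delay apart instead of all receiving the
same time.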
After this, RobotstxtServer calculates how long WebCrawler should sleep and
returns that value to it in milliseconds. If the value is greater than 0,
WebCrawler sleeps for that long before fetching the new Page.
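
As a rough sketch of that last step (the names `PolitenessSleep`,
`sleepMillis` and `waitForTurn` are invented here and are not the methods in
the patch), the sleep time is just the granted access time minus the current
time, and the crawler only sleeps when that is positive:

```java
/** Hypothetical helper showing the sleep calculation; not crawler4j API. */
final class PolitenessSleep {

    /** How long to wait until the granted access time; may be zero or negative. */
    static long sleepMillis(long grantedAccessMillis, long nowMillis) {
        return grantedAccessMillis - nowMillis;
    }

    /** Blocks the calling crawler thread until its granted access time. */
    static void waitForTurn(long grantedAccessMillis) throws InterruptedException {
        long sleep = sleepMillis(grantedAccessMillis, System.currentTimeMillis());
        if (sleep > 0) {
            Thread.sleep(sleep);
        }
        // A value of zero or less means the slot is already due: fetch at once.
    }
}
```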
This isn't the most elegant solution, as I'm not exactly sure where you wanted
the call to be made from, but it works well.
The major remaining issue with this solution is efficiency. If multiple threads
all try to access a host whose crawl delay is over a minute, they will all wait
a long time instead of moving on to check URLs of other hosts.
One solution could be to make a separate WorkQueues object for each host and
cycle through them with each request. Another could be to have the crawler scan
its current list in the hope that it contains a link to a different host.
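
To make the first idea a bit more concrete, here is a small in-memory sketch.
It is only an assumption of how it might look: `PerHostFrontier` is a made-up
class using plain collections, not the real WorkQueues, and it hands out the
next URL from whichever host is allowed to be accessed again:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-host queue with round-robin polling; not crawler4j API. */
final class PerHostFrontier {
    // Insertion order doubles as the round-robin order.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    private final Map<String, Long> crawlDelayMillis = new HashMap<>();
    private final Map<String, Long> nextAllowedMillis = new HashMap<>();

    synchronized void setCrawlDelay(String host, long delayMillis) {
        crawlDelayMillis.put(host, delayMillis);
    }

    synchronized void add(String host, String url) {
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(url);
    }

    /** Returns a URL from a host that may be accessed now, or null if none. */
    synchronized String poll(long nowMillis) {
        Iterator<Map.Entry<String, Deque<String>>> it = queues.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Deque<String>> entry = it.next();
            String host = entry.getKey();
            Deque<String> queue = entry.getValue();
            long readyAt = nextAllowedMillis.getOrDefault(host, Long.MIN_VALUE);
            if (nowMillis < readyAt) {
                continue; // This host is still inside its crawl delay; try the next.
            }
            it.remove();
            String url = queue.pollFirst();
            if (!queue.isEmpty()) {
                queues.put(host, queue); // Re-insert at the tail for round-robin.
            }
            long delay = crawlDelayMillis.getOrDefault(host, 0L);
            nextAllowedMillis.put(host, nowMillis + delay);
            return url; // Returning here means the iterator is not used again.
        }
        return null;
    }
}
```

With something like this, a thread that gets null could wait briefly or look
for other work instead of sleeping for a whole crawl delay on one slow host.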
Original comment by janne.pa...@documill.com
on 20 Jul 2011 at 10:09
Attachments:
Ah, the previous version of HostDirectives.java I attached in the comment above
was missing a line. Here's a fixed, properly working copy.
Original comment by janne.pa...@documill.com
on 20 Jul 2011 at 12:23
Attachments:
Is there any chance that Crawl-delay will be handled in the near future?
Original comment by marcing...@gmail.com
on 14 Apr 2014 at 10:14
Original comment by avrah...@gmail.com
on 18 Aug 2014 at 3:07
Original comment by avrah...@gmail.com
on 18 Aug 2014 at 3:10
Original comment by avrah...@gmail.com
on 18 Aug 2014 at 3:11
https://code.google.com/r/marcingosk-crawler4j/source/list
Original comment by avrah...@gmail.com
on 23 Sep 2014 at 1:59
Original issue reported on code.google.com by
janne.pa...@documill.com
on 20 Jul 2011 at 7:10