GoogleCodeExporter closed 9 years ago
For your information, Heritrix has a Robotstxt class:
heritrix-3.1.0-src\heritrix-3.1.0\modules\src\main\java\org\archive\modules\net\Robotstxt.java
For example, it supports the "Crawl-delay" directive, which crawler4j does not.
It is under the Apache License 2.0, so you may be able to bring it into crawler4j.
(Please verify this yourself if you do.)
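To illustrate what handling "Crawl-delay" involves, here is a minimal sketch of reading the directive from robots.txt content. It is not the Heritrix Robotstxt code and not a crawler4j API: the class name CrawlDelaySketch and the method crawlDelaySeconds are hypothetical, and the user-agent matching (a case-insensitive substring test, with "*" matching everyone) is a simplifying assumption.

```java
import java.util.Locale;
import java.util.OptionalDouble;

// Hypothetical helper for illustration only; not part of crawler4j or Heritrix.
final class CrawlDelaySketch {

    // Returns the first valid Crawl-delay (in seconds) found in a group that
    // applies to the given user agent, or empty if there is none.
    static OptionalDouble crawlDelaySeconds(String robotsTxt, String userAgent) {
        String agent = userAgent.toLowerCase(Locale.ROOT);
        boolean groupApplies = false; // does the current group apply to this agent?
        boolean inAgentLines = false; // still reading consecutive User-agent lines?

        for (String rawLine : robotsTxt.split("\\r?\\n|\\r")) {
            // Drop comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                String token = value.toLowerCase(Locale.ROOT);
                // Assumption: "*" matches every agent, otherwise substring match.
                boolean matches = token.equals("*")
                        || (!token.isEmpty() && agent.contains(token));
                // Consecutive User-agent lines share one group of rules.
                groupApplies = inAgentLines ? (groupApplies || matches) : matches;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                if (groupApplies && field.equals("crawl-delay")) {
                    try {
                        double seconds = Double.parseDouble(value);
                        if (Double.isFinite(seconds) && seconds >= 0) {
                            return OptionalDouble.of(seconds);
                        }
                    } catch (NumberFormatException e) {
                        // Malformed value: ignore it and keep scanning.
                    }
                }
            }
        }
        return OptionalDouble.empty();
    }
}
```

A crawler could call crawlDelaySeconds(content, "crawler4j") after fetching robots.txt and, if a value is present, wait at least that long between requests to the host. A real implementation would also need to prefer a specific user-agent group over "*", which this sketch does not do.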
Original comment by pikote...@gmail.com
on 25 Feb 2012 at 9:32
It was my mistake, I'm sorry.
The MIME type of my robots.txt was "text/html" instead of "text/plain".
robots.txt should be served as text/plain.
Still, it might be worth having crawler4j accept other MIME types,
especially "text/*".
http://www.nextthing.org/archives/2007/03/12/robotstxt-adventure
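As a rough idea of what accepting "text/*" could look like, here is a small sketch that checks a Content-Type header value. I have not checked how crawler4j actually performs this check, so the class RobotsContentTypeSketch and the method isTextType are hypothetical names, not existing crawler4j code.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of crawler4j.
final class RobotsContentTypeSketch {

    // Returns true if the Content-Type header value names any text/* media type,
    // ignoring parameters such as "; charset=UTF-8" and letter case.
    static boolean isTextType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        int semi = contentTypeHeader.indexOf(';');
        String mediaType = (semi >= 0
                ? contentTypeHeader.substring(0, semi)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        // Require a non-empty subtype after "text/".
        return mediaType.startsWith("text/") && mediaType.length() > "text/".length();
    }
}
```

With a check like this, a robots.txt served as "text/html" would still be parsed, while something like "application/pdf" would be rejected.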
Original comment by pikote...@gmail.com
on 26 Feb 2012 at 3:32
As you mentioned, it's by design.
-Yasser
Original comment by ganjisaffar@gmail.com
on 28 Feb 2012 at 5:47
Original issue reported on code.google.com by
pikote...@gmail.com
on 25 Feb 2012 at 8:08