xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Errors during crawling (maybe regarding robots.txt) #247

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl the site www.imdb.com with the example from the site

What is the expected output? What do you see instead?
I should not see any errors. Instead, I see the following:
java.lang.NullPointerException
        at java.lang.String.<init>(String.java:556)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:98)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:73)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:341)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:220)
        at java.lang.Thread.run(Thread.java:724)

What version of the product are you using?
3.5

Please provide any additional information below.
Thanks :-)

Original issue reported on code.google.com by av...@shevo.co.il on 22 Dec 2013 at 9:35
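The trace points at a `String` constructor call inside `RobotstxtServer.fetchDirectives`, which suggests the fetched robots.txt content buffer was null when the crawler tried to turn it into a `String`. A minimal, hypothetical sketch of the kind of null guard that avoids this failure mode (the class, method, and parameter names here are illustrative, not crawler4j's actual code):

```java
import java.nio.charset.StandardCharsets;

public class RobotsContentGuard {
    // Hypothetical helper: convert a fetched robots.txt body to a String.
    // Passing a null byte[] to new String(...) throws NullPointerException,
    // which matches the reported trace; guard against it instead.
    static String contentToString(byte[] contentData) {
        if (contentData == null) {
            // No robots.txt content was fetched: treat it as an empty
            // file, i.e. no directives (allow everything).
            return "";
        }
        return new String(contentData, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(contentToString(null).length());  // 0, no NPE
        System.out.println(contentToString(
                "User-agent: *".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The guard matters because a failed fetch of /robots.txt (server error, empty body, rejected response) is routine during a crawl, so the conversion path has to tolerate null content.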

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:47

GoogleCodeExporter commented 9 years ago
I have just crawled imdb, and it gets crawled fine.

I don't get any NullPointerException.

Please try again and report back so we can work on this problem together.

Original comment by avrah...@gmail.com on 20 Aug 2014 at 12:45

GoogleCodeExporter commented 9 years ago
Closed due to inactivity and the lack of a reproducible scenario.

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:11

GoogleCodeExporter commented 9 years ago
Hi,
I'm also getting this same error while crawling a website:
java.lang.NullPointerException
        at java.lang.String.<init>(String.java:556)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:98)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:73)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:341)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:220)
        at java.lang.Thread.run(Thread.java:745)

Original comment by dkkashya...@gmail.com on 29 Sep 2014 at 12:53

GoogleCodeExporter commented 9 years ago
This is interesting, as it is the exact same stack trace.

Which version of the crawler are you using? (v3.5? Latest from trunk?)

Which site are you trying to crawl?

Original comment by avrah...@gmail.com on 29 Sep 2014 at 1:23

GoogleCodeExporter commented 9 years ago
I'm using 3.5 in a Maven project, and I'm trying to crawl songspk.name.

Original comment by dkkashya...@gmail.com on 29 Sep 2014 at 5:36

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi,
Eventually my crawler stops crawling links, and I see only this:
java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException
java.lang.NullPointerException

nothing else.

Original comment by dkkashya...@gmail.com on 30 Sep 2014 at 6:42

GoogleCodeExporter commented 9 years ago
I have checked it.

It works for me.

I have changed the code in that area over the last few months, so I probably fixed that bug.

You will need the latest code though, so please use the latest from trunk instead of the Maven jar.

We will have a release in a month or two at most, I believe.

Avi.

Original comment by avrah...@gmail.com on 2 Oct 2014 at 12:41

GoogleCodeExporter commented 9 years ago
Hi,
I used version 3.5 from trunk, but I'm still getting the same output:
java.lang.NullPointerException
        at java.lang.String.<init>(String.java:556)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:98)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:73)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:341)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:220)
        at java.lang.Thread.run(Thread.java:745)
java.lang.NullPointerException
        at java.lang.String.<init>(String.java:481)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:100)
        at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:73)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:341)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:220)
        at java.lang.Thread.run(Thread.java:745)

Original comment by dkkashya...@gmail.com on 5 Oct 2014 at 8:21

GoogleCodeExporter commented 9 years ago
Which version exactly did you use?

Did you take a fresh checkout from the repository (v3.6-SNAPSHOT) this week?
Did you use something else?

Original comment by avrah...@gmail.com on 5 Oct 2014 at 8:28

GoogleCodeExporter commented 9 years ago
I'm using 3.5 from here: https://code.google.com/p/crawler4j/downloads/list
I did not take a fresh checkout. Can you please give me the link for that?
Thanks.

Original comment by dkkashya...@gmail.com on 5 Oct 2014 at 8:40

GoogleCodeExporter commented 9 years ago
Until we have a new release, the way to get the latest code is to clone our trunk:
https://code.google.com/p/crawler4j/source/checkout

It is a bit more involved, but I have implemented many fixes, so I think it is well worth it.

Original comment by avrah...@gmail.com on 5 Oct 2014 at 8:55

GoogleCodeExporter commented 9 years ago
Hi,
I used the code from trunk, but now the crawler is really slow.

Original comment by dkkashya...@gmail.com on 6 Oct 2014 at 2:43

GoogleCodeExporter commented 9 years ago
Try commenting out the following lines from parser/Parser.java:

LanguageIdentifier languageIdentifier = new LanguageIdentifier(parseData.getText());
page.setLanguage(languageIdentifier.getLanguage());

Original comment by avrah...@gmail.com on 6 Oct 2014 at 2:57
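The suggestion above removes a per-page language-detection call, which in crawler4j 3.x uses Tika's LanguageIdentifier. A hedged, self-contained illustration of the general pattern (the class and method names below are hypothetical stand-ins, with the expensive detector stubbed out so the snippet runs without Tika):

```java
public class ParserSketch {
    // Stand-in for Tika's LanguageIdentifier, which the real parser
    // invokes once per parsed page; the real implementation is the
    // comparatively slow step being disabled.
    static String identifyLanguage(String text) {
        return text.contains("the") ? "en" : "unknown";
    }

    // Hypothetical flag mirroring the suggested edit: when false, the
    // detection step is skipped entirely, as when the two lines in
    // Parser.java are commented out.
    static String parsePage(String text, boolean detectLanguage) {
        if (detectLanguage) {
            return identifyLanguage(text);
        }
        return null; // language left unset after the edit
    }

    public static void main(String[] args) {
        System.out.println(parsePage("the quick brown fox", true));   // en
        System.out.println(parsePage("the quick brown fox", false));  // null
    }
}
```

The trade-off is that pages no longer carry a detected language, which is acceptable when the crawl does not filter or route by language.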

GoogleCodeExporter commented 9 years ago
I did as you told, but it is still slow, and I see only this:

Oct 07, 2014 9:20:45 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Cookie rejected [SETTINGS.LOCALE="en%5Fus", version:0, domain:.adobe.com, path:/cfusion/, expiry:Thu Sep 29 09:20:45 CEST 2044] Illegal path attribute "/cfusion/". Path of origin: "/robots.txt"

Oct 07, 2014 9:22:59 AM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://247wallst.com:80: The target server failed to respond

Oct 07, 2014 9:22:59 AM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {}->http://247wallst.com:80

Original comment by dkkashya...@gmail.com on 7 Oct 2014 at 11:33

GoogleCodeExporter commented 9 years ago
I resolved that, but the crawler is still slow. What can I do now?

Original comment by dkkashya...@gmail.com on 8 Oct 2014 at 7:55

GoogleCodeExporter commented 9 years ago
Hmm, I need to profile the crawler to see what I changed that made it slower, and fix it.

It will take a couple of days though...

Original comment by avrah...@gmail.com on 10 Oct 2014 at 10:21

GoogleCodeExporter commented 9 years ago
I have released v4.0.

I profiled v3.5 against v4.0, and v4.0 is faster!

Original comment by avrah...@gmail.com on 22 Jan 2015 at 11:45

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 22 Jan 2015 at 2:59