xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

PageFetcher.discardContentIfNotConsumed throws a lot of errors every time. #132

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Just start a crawl for any site.

What is the expected output? What do you see instead?
The page should be processed by WebCrawler's processPage method; instead, the
errors below are thrown.

What version of the product are you using?
3.0

Please provide any additional information below.
Whenever I start crawling any site, I see a lot of log4j errors being thrown
from PageFetcher:

java.lang.ArrayIndexOutOfBoundsException
        at java.net.SocketInputStream.read(Unknown Source)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:149)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:111)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:264)
        at org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(LoggingSessionInputBuffer.java:115)
        at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:246)
        at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:204)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:132)
        at java.util.zip.CheckedInputStream.read(Unknown Source)
        at java.util.zip.GZIPInputStream.readUByte(Unknown Source)
        at java.util.zip.GZIPInputStream.readUShort(Unknown Source)
        at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at java.util.zip.GZIPInputStream.readUShort(Unknown Source)
        at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at edu.uci.ics.crawler4j.fetcher.PageFetcher$GzipDecompressingEntity.getContent(PageFetcher.java:278)
        at org.apache.http.conn.BasicManagedEntity.getContent(BasicManagedEntity.java:88)
        at org.apache.http.util.EntityUtils.consume(EntityUtils.java:65)
        at edu.uci.ics.crawler4j.fetcher.PageFetcher.discardContentIfNotConsumed(PageFetcher.java:243)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:295)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:178)
        at java.lang.Thread.run(Unknown Source)
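
The two nested runs of GZIPInputStream constructor frames in the trace suggest
the entity's stream is being gzip-wrapped or consumed a second time. For
context, a GzipDecompressingEntity like the one named above is typically a thin
HttpEntityWrapper along the lines of this sketch (a reconstruction based on the
HttpClient 4.x examples, not crawler4j's actual source):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.http.HttpEntity;
import org.apache.http.entity.HttpEntityWrapper;

// Reconstruction of the kind of wrapper named in the trace; the real
// crawler4j 3.0 source may differ.
class GzipDecompressingEntity extends HttpEntityWrapper {

    GzipDecompressingEntity(final HttpEntity entity) {
        super(entity);
    }

    @Override
    public InputStream getContent() throws IOException {
        // Each call wraps the underlying stream in a fresh GZIPInputStream,
        // whose constructor immediately reads and validates the gzip header.
        // If the stream was already partially consumed (e.g. by
        // discardContentIfNotConsumed), that header read fails mid-stream,
        // producing errors like the one above.
        return new GZIPInputStream(wrappedEntity.getContent());
    }

    @Override
    public long getContentLength() {
        // The decompressed length is not known in advance.
        return -1;
    }
}

Calling getContent() a second time on such a wrapper re-reads the gzip header
from a stream that is no longer positioned at the start, which would match the
failure mode above.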

Original issue reported on code.google.com by svu...@gmail.com on 4 Mar 2012 at 4:53

GoogleCodeExporter commented 9 years ago
As mentioned on the homepage, this version is deprecated because of this bug. 
Please upgrade to the latest version.

-Yasser

Original comment by ganjisaffar@gmail.com on 4 Mar 2012 at 5:04

GoogleCodeExporter commented 9 years ago
Thanks Yasser, great job on this product. I have a suggestion regarding a few
more exception-handling points where you are using e.printStackTrace(). I was
wondering if you could change those to logger.error() calls. I think I saw them
in PageFetcher and WebCrawler.
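
For illustration, a minimal sketch of that change, assuming a log4j Logger
field (the class and method names below are hypothetical, not crawler4j's
actual code):

import java.io.IOException;

import org.apache.log4j.Logger;

// Hypothetical example; the actual catch blocks in crawler4j differ.
public class FetchExample {

    private static final Logger logger = Logger.getLogger(FetchExample.class);

    public void fetchPage() {
        try {
            doFetch(); // stand-in for the real page-fetch call
        } catch (IOException e) {
            // Before: e.printStackTrace() writes to stderr and bypasses
            // the log4j configuration entirely.
            // After: routing through the logger respects configured
            // appenders, levels, and formatting.
            logger.error("Could not fetch page", e);
        }
    }

    private void doFetch() throws IOException {
        throw new IOException("simulated network failure");
    }
}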

Original comment by svu...@gmail.com on 4 Mar 2012 at 4:39