xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Fatal Transport Error in New Version #108

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi Yasser,

I found some problems in version 3.0.

What steps will reproduce the problem?
1. I upgraded my crawler version from 2.6 to 3.0.

What is the expected output? What do you see instead?
At the beginning it fetches some links, and then it starts giving the error message
"ERROR [Crawler 6] Fatal transport error: Read timed out while fetching
http://..." for almost all links. If a new PageFetcher instance is created,
the links are fetched normally again. Something is wrong in version 3.0.

What version of the product are you using?
I am using 3.0. For now, I am fetching successfully with the PageFetcher from version 2.6.

Thanks in advance.

Original issue reported on code.google.com by mansur.u...@gmail.com on 18 Jan 2012 at 2:24

GoogleCodeExporter commented 9 years ago
As I mentioned on http://code.google.com/p/crawler4j/, there was a bug in 3.0
and that version should not be used. The fix is provided in 3.1.

Thanks,
Yasser

Original comment by ganjisaffar@gmail.com on 18 Jan 2012 at 6:26

GoogleCodeExporter commented 9 years ago
Thanks a lot for the information!

Original comment by mansur.u...@gmail.com on 18 Jan 2012 at 7:36

GoogleCodeExporter commented 9 years ago
Hi Yasser,

As I understood it, you fixed concurrency issues (for multiple crawlers) in
PageFetcher. However, my problem is different, and I was using only one crawler.
As I mentioned above, at the beginning it fetches some links and then starts
giving the error message
"ERROR [Crawler 6] Fatal transport error: Read timed out while fetching
http://..." for almost all links.

If you test "http://www.ics.uci.edu/" with versions 2.6 and 3.1, you will see
a significant difference. That means the issue still exists in 3.1.

Thanks & Regards,
Mansur

Original comment by mansur.u...@gmail.com on 18 Jan 2012 at 9:11

GoogleCodeExporter commented 9 years ago
This doesn't repro for me. Can you share your controller code? Note that when
you get timeouts after crawling for a while, it might be because the target
server is blocking your IP for flooding it with requests.

-Yasser

Original comment by ganjisaffar@gmail.com on 18 Jan 2012 at 10:53

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi Yasser,

Here are my controller and crawler classes: 

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setProxyHost("proxy");
        config.setProxyPort(8080);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        try {
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://www.ics.uci.edu/");
            controller.start(MyCrawler.class, numberOfCrawlers);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

and my crawler class:

public class MyCrawler extends WebCrawler {

    private static final Logger logger = Logger.getLogger(MyCrawler.class.getName());

    private final static Pattern FILTERS = Pattern
            .compile(".*(\\.(css|js|bmp|gif|jpe?g"
                    + "|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                    + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given URL
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        /*return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");*/

        // If the following two lines are deleted, the issue does not appear.
        PageFetchResult fetchResult = getMyController().getPageFetcher().fetchHeader(url);
        int statusCode = fetchResult.getStatusCode();

        // If not fetched nor moved, treat the URL as broken.
        if (statusCode != PageFetchStatus.OK && statusCode != PageFetchStatus.Moved
                && statusCode != PageFetchStatus.PageTooBig) { // && statusCode != PageFetchStatus.PageIsBinary
            logger.info("Broken URL: " + url.getURL());
            return false;
        }

        //logger.info("Fetched URL: " + url.getURL());

        return true;
    }

    /**
     * This function is called when a page is fetched and ready to be
     * processed by your program.
     */
    @Override
    public void visit(Page page) {
        logger.info("Fetched URL: " + page.getWebURL().getURL());
    }
}

If I don't call the page fetcher in shouldVisit(), everything works fine;
otherwise the "Fatal transport error ..." issue described above appears.
My assumption is that something is wrong with the "mutex" in PageFetcher and
that the next request to the target server is sent before the timeout has
elapsed. If I use the old PageFetcher, with its static methods and fields,
everything works fine.

Thanks & Regards,
Mansur

Original comment by mansur.u...@gmail.com on 19 Jan 2012 at 2:17

GoogleCodeExporter commented 9 years ago
Hi Mansur,

Even I am looking to get info about 404/301/302 errors. But based on your code,
I understand that you are re-fetching the headers to check for broken links,
which means one fetch by crawler4j and another by your code. I feel this is an
expensive operation.

I have raised a suggestion to Yasser to provide a method on WebURL for getting
the status code, along the lines of the sketch below. That makes more sense.
What do you say?
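
Just to illustrate the idea (a hypothetical callback, not an existing crawler4j
3.1 method), the crawler could hand us the status code from its own fetch, so no
second fetchHeader() call would be needed:

    // Hypothetical hook, only a sketch of the suggestion above; it does not
    // exist in crawler4j 3.1. The crawler would pass in the status code it
    // already obtained while fetching the page itself.
    protected void onPageStatusCode(WebURL url, int statusCode) {
        if (statusCode != PageFetchStatus.OK && statusCode != PageFetchStatus.Moved) {
            logger.info("Broken URL: " + url.getURL());
        }
    }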

Regs

Original comment by w3engine...@gmail.com on 19 Jan 2012 at 3:10

GoogleCodeExporter commented 9 years ago
Hi,

That would be a really good idea. It would probably solve my problem.

Tkx & Regs

Original comment by mansur.u...@gmail.com on 20 Jan 2012 at 8:23

GoogleCodeExporter commented 9 years ago
The problem happens when you repeatedly fetch through getPageFetcher() and do
not make a subsequent fetchResult.discardContentIfNotConsumed() call when you
are finished with the result. The connections keep building up until HttpClient
has no capacity left to create new connections. This causes the timeouts.
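
For example, a minimal sketch of the shouldVisit() posted above with the result
always released (same 3.x API as in that code; this is just the pattern, not a
tested drop-in):

    @Override
    public boolean shouldVisit(WebURL url) {
        PageFetchResult fetchResult = null;
        try {
            fetchResult = getMyController().getPageFetcher().fetchHeader(url);
            int statusCode = fetchResult.getStatusCode();
            if (statusCode != PageFetchStatus.OK && statusCode != PageFetchStatus.Moved
                    && statusCode != PageFetchStatus.PageTooBig) {
                logger.info("Broken URL: " + url.getURL());
                return false;
            }
            return true;
        } finally {
            if (fetchResult != null) {
                // Release the connection back to HttpClient's pool; without this
                // the pool fills up and later fetches time out.
                fetchResult.discardContentIfNotConsumed();
            }
        }
    }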

Original comment by jtgeo...@gmail.com on 16 Jan 2013 at 1:44

GoogleCodeExporter commented 9 years ago
Hi,
I am using crawler4j version 3.1 with the basic example. I crawl the following link:

controller.addSeed("http://muaban.net/ho-chi-minh/raovat/43/ban-nha/index.html");

The result is always:
INFO [Crawler 1] Exception while fetching content for:
http://muaban.net/ho-chi-minh/raovat/43/ban-nha/index.html [Unexpected end of
ZLIB input stream]

How can I fix this problem? Please help me in this situation.

Original comment by lethanht...@gmail.com on 18 Feb 2013 at 4:29

GoogleCodeExporter commented 9 years ago
I have the same problem as you. Please help me! Thanks a lot.

http://www.biframework.com.vn/page/products/id/9/phan-mem-quan-ly-doanh-nghiep.html

Original comment by phamda...@gmail.com on 13 Sep 2013 at 11:25

GoogleCodeExporter commented 9 years ago
I got it, thanks for your help.

Br,

Original comment by tt44...@gmail.com on 13 Sep 2013 at 2:58

GoogleCodeExporter commented 9 years ago
Help me!!!
Dear brother/sister: I used the import file list from the software vendor
Biframework, but after importing, the tax code (MST) column is blank, even
though the external Excel file has the MST fully filled in. Please help me
with this.

http://biframeworks.com/phan-mem/phan-mem-quan-ly-doanh-nghiep-erp/4

Original comment by nhanhpt...@gmail.com on 4 Mar 2014 at 6:25