yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Error occurred while fetching (robots) url #119

Open · IVANOPT opened this issue 8 years ago

IVANOPT commented 8 years ago

We are using crawler4j to grab some information from web pages. Following the official documentation, I put together the example below:

ArticleCrawler.java

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class ArticleCrawler extends WebCrawler
{
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page in
     * which we have discovered this new url and the second parameter is the new
     * url. You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic). In this example,
     * we are instructing the crawler to ignore urls that have css, js, gif, ...
     * extensions and to only accept urls that start with
     * "http://www.ics.uci.edu/". In this case, we didn't need the referringPage
     * parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url)
    {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page)
    {
        String url = page.getWebURL().getURL();
        // "logger" is the SLF4J logger inherited from WebCrawler.
        logger.info("ArticleCrawler: crawlers cover url {}", url);
    }
}
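
As a side note, the interplay of the FILTERS pattern and the prefix test is easy to check in isolation. A minimal standalone sketch (the FilterCheck class and the sample URLs are illustrative, not part of the original code):

public class FilterCheck
{
    public static void main(String[] args)
    {
        java.util.regex.Pattern filters = java.util.regex.Pattern.compile(
                ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");
        String[] urls = {
            "http://www.ics.uci.edu/~welling/index.html", // visited: right prefix, no filtered extension
            "http://www.ics.uci.edu/style.css",           // skipped: .css extension
            "http://example.com/page.html"                // skipped: wrong prefix
        };
        for (String url : urls) {
            String href = url.toLowerCase();
            boolean visit = !filters.matcher(href).matches()
                    && href.startsWith("http://www.ics.uci.edu/");
            System.out.println(visit + "  " + url);
        }
    }
}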

Controller.java

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller
{
    public static void main(String[] args) throws Exception {
        // Must be a writable directory; crawler4j stores its intermediate
        // crawl data here (a bare "/" will normally fail with a permission
        // error on most systems).
        String crawlStorageFolder = "/tmp/crawler4j/";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched, and then the crawler starts following links
         * found in these pages.
         */
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(ArticleCrawler.class, numberOfCrawlers);
    }
}
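
For what it's worth, the comment about start being a blocking call is accurate; the 4.x API also appears to offer a non-blocking variant if the main thread has other work to do. A small sketch, assuming the 4.x CrawlController API:

// Start the crawl without blocking the calling thread...
controller.startNonBlocking(ArticleCrawler.class, numberOfCrawlers);
// ... do other work here ...
// ...then block explicitly once you are ready to wait for completion.
controller.waitUntilFinish();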

And got these errors:

ERROR [RobotstxtServer:128] 2016-04-12 17:38:59,672 - Error occurred while fetching (robots) url: http://www.ics.uci.edu/robots.txt
org.apache.http.client.ClientProtocolException
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchPage(PageFetcher.java:237)
    at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:100)
    at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:80)
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:427)
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:381)
    at com.waijule.common.crawler.article.Controller.main(Controller.java:31)
Caused by: org.apache.http.HttpException: Unsupported cookie policy: default
    at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:150)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:132)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    ... 8 more
INFO [CrawlController:230] 2016-04-12 17:38:59,699 - Crawler 1 started
INFO [CrawlController:230] 2016-04-12 17:38:59,700 - Crawler 2 started
INFO [CrawlController:230] 2016-04-12 17:38:59,700 - Crawler 3 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 4 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 5 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 6 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 7 started
WARN [WebCrawler:412] 2016-04-12 17:38:59,864 - Unhandled exception while fetching http://www.ics.uci.edu/~welling/: null
INFO [WebCrawler:357] 2016-04-12 17:38:59,864 - Stacktrace:
org.apache.http.client.ClientProtocolException
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchPage(PageFetcher.java:237)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:323)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:274)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.HttpException: Unsupported cookie policy: default
    at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:150)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:132)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    ... 6 more

The same WARN / Stacktrace pair, with identical frames and the same "Unsupported cookie policy: default" cause, then follows at 17:39:00,071 for http://www.ics.uci.edu/~lopes/ and at 17:39:00,273 for http://www.ics.uci.edu/.

Thanks.

IVANOPT commented 8 years ago

I've solved it. The problem is caused by version 4.2 itself being unstable; switch to 4.0 or below.
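
For later readers: the decisive line in the trace is "Caused by: org.apache.http.HttpException: Unsupported cookie policy: default". The "default" cookie spec only exists in Apache HttpClient 4.4 and newer, so the error usually means an older httpclient ended up on the classpath alongside crawler4j 4.2. If you are on Maven, a minimal pom.xml sketch of both options (version numbers are illustrative, not taken from this thread):

<!-- Option 1: downgrade crawler4j, as suggested above. -->
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.0</version>
</dependency>

<!-- Option 2 (untested guess): keep 4.2 but force an httpclient that
     registers the "default" cookie spec (4.4 or newer). -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>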