yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Crawler thread seems not running #72

Open liudonghua123 opened 9 years ago

liudonghua123 commented 9 years ago

I use this excellent library to scrape some sites, but sometimes it stops unexpectedly without any exceptions or errors. My custom WebCrawler is below:

import java.net.URISyntaxException;
import java.util.regex.Pattern;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.beans.factory.annotation.Autowired;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

// (imports for my own Crawler, CrawlerRepository and Utils classes omitted)

public class YNUWebCrawler extends WebCrawler {

  private Log logger = LogFactory.getLog(YNUWebCrawler.class);

  // Note: @Autowired has no effect on a static field; it is set manually
  // before the crawl starts (see doCrawlerProcessing below).
  @Autowired
  public static CrawlerRepository crawlerRepository;

  // Skip binary and media resources; only HTML pages are of interest.
  private final static Pattern FILTERS = Pattern.compile(".*(\\.("
      + "css|js|gif|jpg|jpeg|png|mp3|mp4|avi|flv|zip|gz|apk|ipa|exe|bin|doc|docx|xls|xlsx|ppt|pptx"
      + "))$");

  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    try {
      return !FILTERS.matcher(href).matches() && Utils.getDomainName(href).contains("ynu.edu.cn");
    } catch (URISyntaxException e) {
      e.printStackTrace();
    }
    return false;
  }

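  // Keeps an MD5 digest per URL so that a later crawl can detect changed pages.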
  @Override
  public void visit(Page page) {
    String url = page.getWebURL().getURL();

    if (page.getParseData() instanceof HtmlParseData) {
      HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
      String renderedHtml = htmlParseData.getHtml();

      logger.info("process crawler data with url " + url);
      Crawler crawlerPersisted = crawlerRepository.findByUrl(url);
      if (crawlerPersisted == null) {
        Crawler crawlerCreated = new Crawler(url, DigestUtils.md5Hex(renderedHtml),
            Crawler.STATUS_CREATE);
        crawlerCreated.setModifiedFlag(true);
        logger.info("add crawler data with url " + url);
        crawlerRepository.save(crawlerCreated);
      } else {
        String newContentMD5 = DigestUtils.md5Hex(renderedHtml);
        if (!newContentMD5.equals(crawlerPersisted.getMd5())) {
          logger.info("update crawler data with url " + url);
          crawlerPersisted.setMd5(newContentMD5);
          crawlerPersisted.setStatus(Crawler.STATUS_UPDATE);
          crawlerPersisted.setModifiedFlag(true);
          logger.info("update crawler data with url " + url);
          crawlerRepository.save(crawlerPersisted);
        }
      }
    }
  }
}

The method that starts the crawl, defined in another class where environment and crawlerRepository are available, is:

  private void doCrawlerProcessing() {
    new Thread(() -> {
      logger.trace("starting doCrawlerProcessing");
      isCrawlerProcessing = true;
      String crawlStorageFolder = environment.getProperty(PROPERTY_KEY_CRAWL_STORAGE_FOLDER, "tmp");
      int numberOfCrawlers = Integer.parseInt(environment.getProperty(
          PROPERTY_KEY_NUMBER_OF_CRAWLERS, "10"));

      CrawlConfig config = new CrawlConfig();
      config.setCrawlStorageFolder(crawlStorageFolder);
      config.setPolitenessDelay(Integer.parseInt(environment.getProperty(
          PROPERTY_KEY_POLITENESS_DELAY, "50")));

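      // Standard crawler4j bootstrap: page fetcher, robots.txt handling, controller.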
      PageFetcher pageFetcher = new PageFetcher(config);
      RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
      RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
      CrawlController controller;
      try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed(environment.getProperty(PROPERTY_KEY_ROOT_SEED_URL, "http://www.ynu.edu.cn/"));
        // crawlerRepository cannot currently be injected into YNUWebCrawler,
        // so set the static field manually as a temporary workaround
        YNUWebCrawler.crawlerRepository = crawlerRepository;
        controller.start(YNUWebCrawler.class, numberOfCrawlers);
        logger.info("ended doCrawlerProcessing");

        // Find all entries flagged as modified, record their URLs in the change
        // file, and reset their status
        List<Crawler> modifiedCrawler = crawlerRepository.findByModifiedFlag(true);
        List<String> modifiedUrl = new ArrayList<>();
        modifiedCrawler.stream().forEach(crawler -> {
          modifiedUrl.add(crawler.getUrl());
          crawler.setModifiedFlag(false);
          crawler.setStatus(Crawler.STATUS_NORMAL);
        });
        logger.trace("starting update crawler data");
        crawlerRepository.save(modifiedCrawler);
        logger.trace("ended update crawler data");
        logger.trace("starting writeChangeFile");
        Utils.writeChangeFile(modifiedUrl);
        logger.trace("ended writeChangeFile");
      } catch (Exception e) {
        logger.error(e.getMessage(), e);
      } finally {
        // reset the flag even if the crawl fails
        isCrawlerProcessing = false;
      }
    }).start();
  }

The following are some log messages from where it stopped:

2015-05-27 19:46:31.597  INFO 11200 --- [Crawler 187] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://bbs.ynu.edu.cn/forum.php?action=reply&extra&fid=40&mod=post&page=1&repquote=319662&tid=27039
2015-05-27 19:46:31.626  INFO 11200 --- [Crawler 187] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://bbs.ynu.edu.cn/forum.php?action=reply&extra&fid=40&mod=post&page=1&repquote=319662&tid=27039
2015-05-27 19:46:31.698  INFO 11200 --- [Crawler 203] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.sds.ynu.edu.cn/kxyj/kyxm/6053.htm
2015-05-27 19:46:31.698  INFO 11200 --- [Crawler 422] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.swrq.ynu.edu.cn/tzgg/28978.htm
2015-05-27 19:46:31.722  INFO 11200 --- [Crawler 285] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.sds.ynu.edu.cn/kxyj/yjcg/22474.htm
2015-05-27 19:46:31.730  INFO 11200 --- [Crawler 203] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.sds.ynu.edu.cn/kxyj/kyxm/6053.htm
2015-05-27 19:46:31.730  INFO 11200 --- [Crawler 422] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.swrq.ynu.edu.cn/tzgg/28978.htm
2015-05-27 19:46:31.732  INFO 11200 --- [Crawler 435] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.sds.ynu.edu.cn/xsgz/xshd/5488.htm
2015-05-27 19:46:31.744  WARN 11200 --- [Crawler 309] e.uci.ics.crawler4j.crawler.WebCrawler   : Skipping a URL: http://www.sds.ynu.edu.cn/docs/2011-11/20111105161115180568.rar which was bigger ( 10000000 ) than max allowed size
2015-05-27 19:46:31.750  INFO 11200 --- [Crawler 285] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.sds.ynu.edu.cn/kxyj/yjcg/22474.htm
2015-05-27 19:46:31.761  INFO 11200 --- [Crawler 435] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.sds.ynu.edu.cn/xsgz/xshd/5488.htm

or

2015-05-28 16:54:05.240  INFO 40786 --- [    Crawler 280] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.ynusky.ynu.edu.cn/news/515.aspx
2015-05-28 16:54:05.283  INFO 40786 --- [    Crawler 280] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.ynusky.ynu.edu.cn/news/515.aspx
2015-05-28 16:54:06.927  INFO 40786 --- [    Crawler 562] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.ynusky.ynu.edu.cn/news/545/1.aspx
2015-05-28 16:54:06.956  INFO 40786 --- [    Crawler 562] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.ynusky.ynu.edu.cn/news/545/1.aspx
2015-05-28 16:54:08.241  INFO 40786 --- [    Crawler 881] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.ynusky.ynu.edu.cn/news/show-1228.aspx
2015-05-28 16:54:08.251  INFO 40786 --- [    Crawler 371] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.lib.ynu.edu.cn/intrduce/491
2015-05-28 16:54:08.256  INFO 40786 --- [    Crawler 382] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.lib.ynu.edu.cn/node/257
2015-05-28 16:54:08.270  INFO 40786 --- [    Crawler 881] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.ynusky.ynu.edu.cn/news/show-1228.aspx
2015-05-28 16:54:08.280  INFO 40786 --- [    Crawler 371] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.lib.ynu.edu.cn/intrduce/491
2015-05-28 16:54:08.280  WARN 40786 --- [    Crawler 827] e.uci.ics.crawler4j.crawler.WebCrawler   : Skipping URL: http://www.ynu.edu.cn/info/2011-05-27/0-2-3821.html, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
2015-05-28 16:54:08.287  INFO 40786 --- [    Crawler 382] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.lib.ynu.edu.cn/node/257
2015-05-28 16:54:08.302  INFO 40786 --- [    Crawler 309] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.dj.ynu.edu.cn/wdxz/6522.htm
2015-05-28 16:54:08.308  INFO 40786 --- [    Crawler 569] c.e.y.n.crawler.core.YNUWebCrawler       : process crawler data with url http://www.dj.ynu.edu.cn/ywdd/35459.htm
2015-05-28 16:54:08.330  INFO 40786 --- [    Crawler 309] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.dj.ynu.edu.cn/wdxz/6522.htm
2015-05-28 16:54:08.337  INFO 40786 --- [    Crawler 569] c.e.y.n.crawler.core.YNUWebCrawler       : add crawler data with url http://www.dj.ynu.edu.cn/ywdd/35459.htm

I walked through the code, but found no useful information about this strange problem.

PS: I tried setting different numbers of crawlers (5, 10, 100, 500, 1000) and ran on both Windows and Linux. The problem occurred after crawler4j had crawled about 10,000+ pages (with numberOfCrawlers set to 10), or about 60,000+ pages (with numberOfCrawlers set to 1000).

I tried crawling some smaller websites, and no such problem appeared.

albert0815 commented 9 years ago

Hi,

I am pretty sure the problem is the same one I described in issue #52. The solution for me was to write my own PageFetcher subclass, overriding the fetchPage(WebURL) method and changing the maximum-size check as follows:

          CloseableHttpResponse response = httpClient.execute(get);
          ...
            // Checking maximum size
            if (fetchResult.getEntity() != null) {
              long size = fetchResult.getEntity().getContentLength();
              if (size > config.getMaxDownloadSize()) {
                // fix for issue #52: close the response before throwing!
                response.close();
                throw new PageBiggerThanMaxSizeException(size);
              }
            }
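
To wire it in, you then construct your own fetcher instead of the stock PageFetcher and hand it to the controller. A minimal sketch, assuming a subclass as described above (ClosingPageFetcher is just an illustrative name):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;

public class ClosingPageFetcher extends PageFetcher {

  public ClosingPageFetcher(CrawlConfig config) {
    super(config);
  }

  // Override fetchPage(WebURL) here: copy the original method body and add
  // response.close() before throwing PageBiggerThanMaxSizeException, as shown above.
}

In doCrawlerProcessing above, the only change is then:

PageFetcher pageFetcher = new ClosingPageFetcher(config); // instead of new PageFetcher(config)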

Sadly nobody commented on my issue. But please give it a try and let me know if it solves your problem.

Best regards,
Albert