yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Cannot fetch content of some websites, but Python can. #469

Open ryan701212 opened 2 years ago

ryan701212 commented 2 years ago

I want to fetch the page "https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt", but it fails. I spent a day trying to solve this, but it still does not work. Can somebody help? Thanks. My code is as follows:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public void Run() throws Exception {
        // Folder where crawler4j stores its intermediate crawl data.
        String crawlStorageFolder = "h:";
        int numberOfCrawlers = 3;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        // Present the crawler as a desktop Chrome browser.
        config.setUserAgentString("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36");

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // For each crawl, you need to add some seed URLs. These are the first
        // URLs that are fetched; the crawler then follows the links
        // found in these pages.
        controller.addSeed("https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt");
        //controller.addSeed("https://github.com/hemin1003/java-spider");

        // The factory which creates instances of crawlers.
        CrawlController.WebCrawlerFactory<ArrowWebCrawler> factory = ArrowWebCrawler::new;

        // Start the crawl. This is a blocking operation, meaning that your code
        // will reach the line after this only when crawling is finished.
        controller.start(factory, numberOfCrawlers);
    }

}

The log output is as follows:

2022-05-26 10:05:50.293 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Starting Webcrawler1Application using Java 13.0.2 on DESKTOP-TJDKVUQ with PID 73168 (H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1\bin\main started by Ryan in H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1)
2022-05-26 10:05:50.295 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : No active profile set, falling back to 1 default profile: "default"
2022-05-26 10:05:50.654 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Started Webcrawler1Application in 0.587 seconds (JVM running for 1.316)
2022-05-26 10:05:50.811 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Deleted contents of: h:\frontier ( as you have configured resumable crawling to false )
2022-05-26 10:05:51.492 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : File not found: tld-names.txt
2022-05-26 10:05:51.501 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : Obtained 8433 TLD from packaged file tld-names.txt
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 1 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 2 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 3 started
2022-05-26 10:06:31.966 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:31.967 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:41.824 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : It looks like no thread is working, waiting for 10 seconds to make sure...
2022-05-26 10:06:51.827 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : All of the crawlers are stopped. Finishing the process...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : Waiting for 10 seconds before final clean up...

liukuan1 commented 2 years ago

I have received your email and will look at it as soon as possible! Thank you!