import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public void Run() throws Exception {
        String crawlStorageFolder = "h:";
        int numberOfCrawlers = 3;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setUserAgentString("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36");

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Each crawl needs some seed URLs. These are the first URLs that are
        // fetched; the crawler then follows the links found in those pages.
        controller.addSeed("https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt");
        //controller.addSeed("https://github.com/hemin1003/java-spider");

        // The factory that creates crawler instances.
        CrawlController.WebCrawlerFactory<ArrowWebCrawler> factory = ArrowWebCrawler::new;

        // Start the crawl. This is a blocking operation: the code after this
        // line runs only once crawling has finished.
        controller.start(factory, numberOfCrawlers);
    }
}
The log output is as follows:
2022-05-26 10:05:50.293 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Starting Webcrawler1Application using Java 13.0.2 on DESKTOP-TJDKVUQ with PID 73168 (H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1\bin\main started by Ryan in H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1)
2022-05-26 10:05:50.295 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : No active profile set, falling back to 1 default profile: "default"
2022-05-26 10:05:50.654 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Started Webcrawler1Application in 0.587 seconds (JVM running for 1.316)
2022-05-26 10:05:50.811 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Deleted contents of: h:\frontier ( as you have configured resumable crawling to false )
2022-05-26 10:05:51.492 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : File not found: tld-names.txt
2022-05-26 10:05:51.501 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : Obtained 8433 TLD from packaged file tld-names.txt
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 1 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 2 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 3 started
2022-05-26 10:06:31.966 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:31.967 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:41.824 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : It looks like no thread is working, waiting for 10 seconds to make sure...
2022-05-26 10:06:51.827 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : All of the crawlers are stopped. Finishing the process...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : Waiting for 10 seconds before final clean up...
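One detail that can be read off the log is how long each step took before failing. The sketch below is only an illustration: the class and method names (LogGap, timestampOf, millisBetween) are made up for this example, and it assumes every line starts with a 23-character timestamp in the format seen above. It measures the pause before the crawlers started and the pause before the first "Can't fetch content of" warning.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of crawler4j: measures gaps between log lines.
class LogGap {
    // Timestamp layout at the start of each log line, e.g. "2022-05-26 10:06:11.822".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Parses the leading 23 characters of a log line as a timestamp.
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), FORMAT);
    }

    // Milliseconds elapsed between two log lines.
    static long millisBetween(String earlier, String later) {
        return Duration.between(timestampOf(earlier), timestampOf(later)).toMillis();
    }

    public static void main(String[] args) {
        // Lines copied from the log above, shortened after the logger name.
        String tldLoaded = "2022-05-26 10:05:51.501 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList";
        String started = "2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController";
        String failed = "2022-05-26 10:06:31.966 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler";

        System.out.println("Before crawlers started: " + millisBetween(tldLoaded, started) + " ms");
        System.out.println("Before first fetch warning: " + millisBetween(started, failed) + " ms");
    }
}
```

Both gaps come out at roughly 20 seconds. That would be consistent with requests timing out rather than being rejected immediately, although the log itself does not state the cause. CrawlConfig has setSocketTimeout and setConnectionTimeout settings; whether they are what is being hit here is an assumption.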
I want to crawl the page https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt with crawler4j, but the fetch fails: the only output for that URL is the warning "Can't fetch content of". I have spent a day trying to fix this and still cannot get it to work. Can somebody help? Thanks. My Controller class and the log it produces are shown above.
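One way to narrow this down is to request the same URL outside crawler4j and see whether the site answers at all. The following is a minimal sketch using the JDK's own java.net.http.HttpClient (Java 11 or later). The class name FetchCheck, its method names, and the 10 and 20 second timeouts are assumptions chosen for illustration, not anything taken from crawler4j.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical diagnostic helper: sends one GET request without crawler4j.
class FetchCheck {
    // Builds a GET request carrying the given User-Agent header.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20)) // assumed overall response timeout
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the HTTP status code.
    // Throws an IOException (for example HttpTimeoutException) if no response arrives.
    static int fetchStatus(String url, String userAgent)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // assumed connect timeout
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<Void> response =
                client.send(buildRequest(url, userAgent), HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }
}
```

Calling FetchCheck.fetchStatus with the seed URL and the user agent string from the Controller class shows whether a plain request gets a status code back or also times out. If it times out as well, the problem is more likely on the site's side (for example bot filtering) than in the crawler configuration, though this check alone cannot prove that.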