yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Selenium basic integration #444

Open dgoiko opened 4 years ago

dgoiko commented 4 years ago

Very basic Selenium integration. This is not intended to be a full Selenium crawler like Nutch; the main goal is to provide a simple way to crawl full-JS pages without directly calling the REST APIs. If you're trying to navigate simple HTML pages with, let's say, a POST form, I'd recommend #419.

For those who don't know Selenium, it is a browser automation tool designed for developers to automatically test their websites on different browsers. Some crawlers like Nutch integrate Selenium in order to provide full JS rendering capabilities to the crawler. This PR implements very naive Selenium crawling using JBrowserDriver as a headless browser. Please note that JBrowserDriver does NOT support Java versions newer than 8. HtmlUnitDriver provides Java 11 compatibility, but it is too picky about JS that is not perfectly formed (while normal browsers accept it and manage to render it). Even google.com failed to load with HtmlUnitDriver because there's a catch without brackets somewhere.
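
For anyone who hasn't used JBrowserDriver before, a standalone fetch looks roughly like this; the Settings values shown are just examples, not something this PR hard-codes:

```java
import com.machinepublishers.jbrowserdriver.JBrowserDriver;
import com.machinepublishers.jbrowserdriver.Settings;
import com.machinepublishers.jbrowserdriver.Timezone;

public class JBrowserDriverExample {
    public static void main(String[] args) {
        // JBrowserDriver is always headless; Settings control timezone, timeouts, etc.
        JBrowserDriver driver = new JBrowserDriver(
                Settings.builder()
                        .timezone(Timezone.AMERICA_NEWYORK)
                        .build());
        try {
            driver.get("https://www.example.com");
            // Unlike most WebDriver implementations, JBrowserDriver exposes the HTTP status code
            System.out.println("Status: " + driver.getStatusCode());
            // Fully rendered DOM after JavaScript execution
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml.length() + " chars of rendered HTML");
        } finally {
            driver.quit();
        }
    }
}
```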

Connections established through Selenium are not counted in the same pool as those opened with HttpClient, so the crawler's connection limits are not taken into consideration. Further commits will attempt to resolve this issue, but it is not straightforward.

Selenium requests will NOT be intercepted by the credentials interceptors. FormLogin should work, though.
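
For reference, form login in crawler4j is configured through FormAuthInfo, roughly like this (the credentials, URL and form field names below are placeholders for your own login form):

```java
import java.net.MalformedURLException;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

public class FormLoginConfigExample {
    public static CrawlConfig buildConfig() throws MalformedURLException {
        CrawlConfig config = new CrawlConfig();
        // username, password, login page URL, and the names of the user/password form fields
        AuthInfo formAuth = new FormAuthInfo("myUser", "myPassword",
                "https://example.com/login", "username", "password");
        config.addAuthInfo(formAuth);
        return config;
    }
}
```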

You can define inclusions / exclusions on the new SeleniumCrawlConfig class to determine which URLs will be visited using Selenium and which won't. Starting with a Selenium seed is not possible at the moment (although it would be possible using the new functions created in my POST CAPABILITIES MR, which allow passing WebURLs to the addSeed methods).
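
I haven't settled on the final method names yet, so take this as a sketch of the idea rather than the exact API in the commits (addSeleniumInclusion / addSeleniumExclusion are illustrative names):

```java
import java.util.regex.Pattern;

// Sketch only: method names are illustrative, the real API in the commits may differ.
SeleniumCrawlConfig config = new SeleniumCrawlConfig();
// URLs matching an inclusion pattern (and no exclusion pattern) are fetched through Selenium;
// everything else keeps using the regular HttpClient-based fetcher.
config.addSeleniumInclusion(Pattern.compile("https://app\\.example\\.com/.*"));
config.addSeleniumExclusion(Pattern.compile(".*\\.(css|js|png|jpg)$"));
```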

Please note that the Selenium API does NOT provide header information, so headers won't be available in the Page class. Selenium is also best-guessing the content encoding, which prevents us from directly extracting the original byte array. For compatibility, the output String is converted to a UTF-8 byte array; however, if Selenium did not properly detect the encoding, the charset will be messed up.
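
In code, the re-encoding boils down to something like the sketch below, which is why a wrong charset guess on Selenium's side cannot be repaired afterwards:

```java
import java.nio.charset.StandardCharsets;

import org.openqa.selenium.WebDriver;

public final class RenderedContent {
    // Selenium only exposes the rendered page source as a Java String; the original
    // response bytes and headers are not reachable, so we re-encode as UTF-8.
    public static byte[] asBytes(WebDriver driver) {
        String renderedHtml = driver.getPageSource();
        // If Selenium decoded the response with the wrong charset, the damage
        // already happened before this point and cannot be undone here.
        return renderedHtml.getBytes(StandardCharsets.UTF_8);
    }
}
```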

A full Selenium integration would require modifying the crawler too deeply, but right now the active Selenium headless browser window is available through page#getFetchedResult. This is a bit inconvenient, as it forces you to perform instanceof checks in order to access it.
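
Until proper interfaces are in place, using it from a crawler subclass looks roughly like this (SeleniumFetchResult and getDriver() are placeholder names, not necessarily what the commits call them):

```java
import org.openqa.selenium.WebDriver;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class SeleniumAwareCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        Object fetchResult = page.getFetchedResult();
        // Only pages fetched through Selenium carry the live browser window,
        // hence the instanceof check mentioned above.
        if (fetchResult instanceof SeleniumFetchResult) {
            WebDriver driver = ((SeleniumFetchResult) fetchResult).getDriver();
            // interact with the rendered page here (click, scroll, run JS, ...)
        }
    }
}
```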

I'd recommend providing this as an optional artifact, so users who will never use this feature don't need to include the Selenium dependencies in their projects. The only things that need to be in the main package are the "selenium" flag for WebURLs and the extracted interfaces for Parser, Fetcher and FetchResult, which would allow plugging in the custom ones created for Selenium.
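
To make the separation concrete, the core would only need to see something along these lines (names are illustrative, not the exact interfaces in the commits):

```java
import edu.uci.ics.crawler4j.url.WebURL;

// Illustrative sketch: minimal abstractions the core could depend on, so the
// Selenium-backed implementations can live in a separate, optional artifact.
public interface PageFetcherInterface {

    // Common supertype for HttpClient-based and Selenium-based fetch results.
    interface FetchResultInterface {
        int getStatusCode();
    }

    FetchResultInterface fetchPage(WebURL url) throws Exception;

    void shutDown();
}
```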

I'll be adding more features and configurations for Selenium in further commits.

Some of the interfaces extracted in these commits are not really necessary, but I needed the Parser and PageFetcher interfaces, so I decided to start from my existing branch of separated interfaces.

dgoiko commented 4 years ago

This is just a starting point. If someone really wanted to integrate Selenium with crawler4j, studying the way Nutch actually does it would be a good place to begin. The relevant Nutch plugins are:

- protocol-selenium
- lib-selenium
- protocol-interactive-selenium

I'll get into this as soon as I have some spare time for it. Right now the current PR is a quick and dirty solution I urgently needed to implement for a project, and I thought someone might find it useful, so I extracted it from my codebase and prepared this PR.