yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Configurable immediate redirection #433

Open dgoiko opened 4 years ago

dgoiko commented 4 years ago

WebURL and WebCrawler now support following individual URLs right away, even if they were already visited. These URLs are not scheduled; they are processed immediately.

I needed to implement this because I found a site that used a common URL for redirections and based the content on its internal session, or on something else I couldn't figure out.

Even when I managed to schedule visited URLs again, all of them showed the same content once they were processed: the content referenced by the last "previous page" visited. After allowing the crawler to visit those URLs immediately, the problem was solved.

Since this can produce an unwanted infinite redirection loop, there is a maximum automatic redirection depth that can be configured on each WebURL: maxInmediateRedirects.

By default, this behaviour is disabled. The creator of the WebURL is responsible for enabling it on a per-URL basis.
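
To illustrate the depth limit, here is a minimal, self-contained sketch of bounded immediate redirection. It is not the code in this PR and does not use the crawler4j API: the class `ImmediateRedirectSketch`, the method `follow`, and the in-memory redirect map are all hypothetical, and treating a limit of 0 as "disabled" is an assumption made for the example. Only the parameter name `maxInmediateRedirects` is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of crawler4j.
public class ImmediateRedirectSketch {

    /**
     * Follows redirects right away (no scheduling, no "already visited" check),
     * stopping after maxInmediateRedirects hops.
     *
     * @param startUrl              the first URL to process
     * @param redirects             stand-in for the site: maps a URL to its redirect target;
     *                              URLs absent from the map do not redirect
     * @param maxInmediateRedirects maximum number of redirects to follow immediately;
     *                              0 is assumed here to mean the behaviour is disabled
     * @return every URL that was processed, in order
     */
    static List<String> follow(String startUrl,
                               Map<String, String> redirects,
                               int maxInmediateRedirects) {
        List<String> processed = new ArrayList<>();
        String current = startUrl;
        int hops = 0;
        while (current != null) {
            processed.add(current); // process the page, even if seen before
            String target = redirects.get(current);
            if (target == null || hops >= maxInmediateRedirects) {
                break; // no redirect, or the depth limit was reached
            }
            current = target;
            hops++;
        }
        return processed;
    }

    public static void main(String[] args) {
        // Two URLs that redirect to each other would loop forever without a limit.
        Map<String, String> loop = Map.of("/a", "/b", "/b", "/a");
        System.out.println(follow("/a", loop, 3));
        System.out.println(follow("/a", loop, 0));
    }
}
```

With a limit of 3, the looping pair is processed four times in total (the start URL plus three redirects) and then the crawl stops; with a limit of 0, only the start URL is processed.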