Open ikreymer opened 7 months ago
When do you expect to implement this feature? At our site it is necessary for all paywall-restricted sites, such as news sites and similar IP-restricted web platforms. We currently try to archive these with Heritrix.
Hi @tuehlarsen, it's something we hope to get to before too long, but we don't currently have it roadmapped for a particular release. We will update the issue when it is being worked on.
@tuehlarsen You'll be happy to know that @vnznznz is now working on our first implementation of crawling through SOCKS5 proxies. In the first stage it will likely just be country-specific, but later on we should be able to support local proxies to crawl from specific IPs.
Context
There are several scenarios where it may be beneficial to crawl through a more distributed network of nodes, beyond the nodes where the crawl itself is running. Distributing K8s infrastructure is tricky; however, we can easily have the crawler crawl through designated SOCKS5 proxies.
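As a rough illustration of what routing a browser-based crawler through a designated SOCKS5 proxy could look like, the sketch below turns a proxy URL into the Chromium `--proxy-server` command-line flag. This is not the planned implementation: the `PROXIES` table, its hostnames, and both function names are hypothetical, and credentials are rejected because this sketch does not handle proxy authentication.

```python
from urllib.parse import urlsplit

# Hypothetical preconfigured proxies keyed by country code (example hosts only).
PROXIES = {
    "us": "socks5://proxy-us.example.com:1080",
    "de": "socks5://proxy-de.example.com:1080",
}


def chromium_proxy_args(proxy_url: str) -> list[str]:
    """Build the Chromium launch arguments for a single SOCKS5 proxy URL."""
    parts = urlsplit(proxy_url)
    if parts.scheme != "socks5":
        raise ValueError(f"expected a socks5:// URL, got {proxy_url!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy URL needs a host and a port: {proxy_url!r}")
    if parts.username or parts.password:
        # Assumption of this sketch: authenticated proxies are out of scope.
        raise ValueError("credentials in the proxy URL are not handled here")
    return [f"--proxy-server=socks5://{parts.hostname}:{parts.port}"]


def args_for_country(country: str) -> list[str]:
    """Look up the (hypothetical) proxy for a country and build its flag."""
    return chromium_proxy_args(PROXIES[country.lower()])
```

Calling `args_for_country("de")` would yield a one-element list that can be appended to the browser's launch arguments. Keeping the lookup separate from the flag construction means a later custom-proxy option could reuse `chromium_proxy_args` unchanged.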
What change would you like to see?
While there are many possible use cases, the initial goal is to support a user that wants to:
Additionally, we may also want to support a user that wants to:
Requirements
The core requirements are for:
Additional requirements will be added for use cases 2-4 as needed.
Use case 3) will require options to configure custom proxy settings. Use case 4), if implemented, may require using Tor support in Brave and/or running a separate Tor proxy. Use case 2) may involve additional rate-limiting detection or other failover mechanisms, and will require some R&D.
Todo
The initial focus will be to support use case 1), crawling through a list of preconfigured proxies. This will involve: