webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

[Feature]: Support crawling through pre-configured SOCKS5 proxies #1354

Open ikreymer opened 7 months ago

ikreymer commented 7 months ago

Context

There are several scenarios where it may be beneficial to crawl through a more distributed network of nodes, in addition to the ones where the crawl is running. Distributing K8s infrastructure is tricky, but we can easily have the crawler crawl through designated SOCKS5 proxies.

What change would you like to see?

While there are many possible use cases, the initial goal is to support a user that wants to:

  1. Crawl content as seen from a particular geographic region, due to geolocation restrictions or other requirements.

Additionally, we may also want to support a user that wants to:

  2. Get around rate-limiting, which may happen from extensive crawling from a single IP.
  3. Crawl through a designated IP on their own infrastructure with provided credentials / ssh key.
  4. Crawl through a randomly assigned IP via the Tor network and/or crawl Tor content.
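To make the geographic and rate-limiting use cases more concrete, here is a minimal sketch of a preconfigured proxy pool that picks a proxy by region and rotates between the proxies of that region. It is only an illustration: the `Proxy` and `ProxyPool` names, the region codes, and the `example.org` hosts are all assumptions and not part of Browsertrix.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Proxy:
    """One preconfigured SOCKS5 proxy (hypothetical shape)."""
    proxy_id: str
    region: str  # assumed to be a two-letter country code
    url: str


# Hypothetical proxy list; hosts and ports are placeholders.
PROXIES = [
    Proxy("us-a", "US", "socks5://proxy-us-a.example.org:1080"),
    Proxy("us-b", "US", "socks5://proxy-us-b.example.org:1080"),
    Proxy("de-a", "DE", "socks5://proxy-de-a.example.org:1080"),
]


class ProxyPool:
    """Groups proxies by region and hands them out round-robin."""

    def __init__(self, proxies: Iterable[Proxy]) -> None:
        by_region: Dict[str, List[Proxy]] = {}
        for proxy in proxies:
            by_region.setdefault(proxy.region, []).append(proxy)
        self._by_region = by_region
        # One endless iterator per region, so repeated calls spread
        # crawls across every proxy configured for that region.
        self._cursors: Dict[str, Iterator[Proxy]] = {
            region: cycle(items) for region, items in by_region.items()
        }

    def regions(self) -> List[str]:
        """Regions a user could choose from, sorted for stable display."""
        return sorted(self._by_region)

    def next_for_region(self, region: str) -> Proxy:
        """Return the next proxy for a region, rotating on each call."""
        try:
            return next(self._cursors[region.upper()])
        except KeyError:
            raise LookupError(f"no proxy configured for region {region!r}") from None


pool = ProxyPool(PROXIES)
first = pool.next_for_region("us")
second = pool.next_for_region("us")
```

Here `first` and `second` are different US proxies, so consecutive crawls for the same region leave from different IPs. Simple rotation like this does not detect rate limiting by itself; that would need separate logic.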

Requirements

The core requirements are for:

Additional requirements will be added for use cases 2-4 as needed.

3) will require options to configure custom proxy settings. 4), if implemented, may require using Tor support in Brave and/or running a separate Tor proxy. 2) may involve additional rate-limiting detection or other failover mechanisms, and will require some R&D.
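As a rough illustration of what custom proxy settings could translate to, the sketch below turns a SOCKS5 proxy URL into the `--proxy-server` switch understood by Chromium-based browsers, and builds an `ssh -D` command that opens a local SOCKS proxy through a user's own host with their key. The function names, hosts, and key path are hypothetical, and nothing here reflects how Browsertrix will actually implement this. The Tor address assumes a separately running Tor daemon on its default SOCKS port.

```python
from typing import List
from urllib.parse import urlsplit

# Default SOCKS address of a locally running Tor daemon (assumes Tor
# runs as a separate process with its default SocksPort).
TOR_LOCAL_PROXY = "socks5://127.0.0.1:9050"


def chromium_proxy_flag(proxy_url: str) -> str:
    """Build a Chromium --proxy-server switch from a socks5:// URL.

    Any credentials in the URL are left out of the switch; this sketch
    assumes authentication is handled elsewhere, e.g. by an SSH tunnel.
    Only hostnames and IPv4 addresses are handled here.
    """
    parts = urlsplit(proxy_url)
    if parts.scheme != "socks5":
        raise ValueError(f"expected a socks5:// URL, got {proxy_url!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy URL needs a host and port: {proxy_url!r}")
    return f"--proxy-server=socks5://{parts.hostname}:{parts.port}"


def ssh_tunnel_command(
    user: str, host: str, key_path: str, local_port: int = 1080
) -> List[str]:
    """Build an ssh command that exposes a local SOCKS proxy via `host`.

    -N: do not run a remote command, only forward traffic
    -D: dynamic (SOCKS) forwarding on the given local address
    -i: private key file to authenticate with
    """
    return [
        "ssh", "-N",
        "-D", f"127.0.0.1:{local_port}",
        "-i", key_path,
        f"{user}@{host}",
    ]


# Hypothetical usage: tunnel through a user's own server, then point
# the browser at the local end of the tunnel.
tunnel = ssh_tunnel_command("crawler", "gw.example.org", "/keys/id_ed25519")
flag = chromium_proxy_flag("socks5://127.0.0.1:1080")
```

The command list in `tunnel` could be passed to `subprocess.Popen` to keep the tunnel open for the duration of a crawl, with `flag` added to the browser's launch arguments.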

Todo

The initial focus of this will be to support use case 1), crawling through a list of preconfigured proxies. This will involve:

tuehlarsen commented 2 weeks ago

When do you expect to implement this feature? At our site it's necessary for all paywall-restricted sites, such as news sites and similar IP-restricted web platforms. We try to archive these with Heritrix today.

tw4l commented 2 weeks ago

Hi @tuehlarsen, it's something we hope to get to before too long, but we don't currently have it roadmapped for a particular release. We will update the issue when it is being worked on.

tw4l commented 1 week ago

@tuehlarsen You'll be happy to know that @vnznznz is working on our first implementation of crawling through SOCKS5 proxies now. In the first stage it will likely just be country-specific, but later on we should be able to support local proxies to crawl from specific IPs.