webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

[Feature]: Improved usability / efficiency for large URL lists #1107

Open ikreymer opened 1 year ago

ikreymer commented 1 year ago

Context

The URL list crawl type works well for a small number of URLs (tens or hundreds), but there may be potential issues when entering thousands of URLs.

User Story Sentence

As a user, I'd like to crawl 30,000 URLs, or even 100,000 URLs, in a single URL list crawl.

Requirements

  1. User should have a way to upload that many URLs (either as a text file or by pasting into the text box), OR the user should be told that there is a limit on how many URLs they can add to one crawl config.
  2. User should be able to validate a large list of URLs and receive clear error messages indicating which URLs in the list are invalid (see the sketch after this list).
  3. Large URL seed lists shouldn't be too taxing on the DB.
  4. Should have some way to search the seed list (eventually) even if not in the DB.
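
A minimal validation sketch in Python, assuming a newline-separated seed list and a hypothetical `MAX_SEEDS` limit (neither the function name nor the limit is part of Browsertrix); the point is reporting line numbers so invalid entries are easy to find in a list of thousands:

```python
from urllib.parse import urlparse

MAX_SEEDS = 100_000  # hypothetical limit; any real limit would be decided separately


def validate_seed_list(text: str) -> tuple[list[str], list[str]]:
    """Validate a newline-separated seed list.

    Returns (valid_urls, errors); each error names the offending line
    number so the user can fix or delete that entry.
    """
    valid, errors = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        url = raw.strip()
        if not url:  # skip blank lines
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"line {lineno}: invalid URL: {url!r}")
        else:
            valid.append(url)
    if len(valid) > MAX_SEEDS:
        errors.append(f"seed list has {len(valid)} URLs, limit is {MAX_SEEDS}")
    return valid, errors


# Example: two of the three lines are rejected, each with its line number.
urls, errors = validate_seed_list("https://example.com/\nnot a url\nftp://x.org/file")
print(errors)
```

Collecting all errors in one pass (rather than stopping at the first invalid URL) keeps validation cheap for very large lists and gives the user a complete fix-up list.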

Questions

  1. Decide if we want to limit URL list sizes as a first step
  2. Decide on priority of possible improvements around URL list crawls, perhaps based on additional user feedback.

Tasks

ricardobasiliopt commented 1 year ago

As a user, I don't mind crawling in parts, for example 10,000 URLs at a time. In that case, for 80,000 seeds I would only have to run 8 crawls, which I would then add to a "collection" (one of the interesting features you've created). Perfect.

But the feature I miss the most is the one you mention above "2) User should be able to validate a large list of URLs, and receive good error messages on which URLs in a large list are invalid."

If I know precisely which URLs are considered invalid, I can correct them manually or delete them from the list. But in a list with thousands of lines it's very difficult for me, as an ordinary user, to find the error.

About the list limit: I ran a crawl of 4,000 URLs (auto-scroll, block ads) and got a 44 GB WACZ. If I did the same with 10,000 URLs I'd get about 100 GB, and if I also followed "Any link on the page" I'd get an even bigger WACZ. As a user, I find it useful to have a limit.
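
For reference, a rough linear extrapolation of those reported numbers (a back-of-the-envelope sketch only; actual sizes depend heavily on the pages crawled and the crawl settings):

```python
# Reported data point: 44 GB WACZ for a 4,000-URL crawl (auto-scroll, block ads).
observed_gb, observed_urls = 44, 4_000
gb_per_url = observed_gb / observed_urls  # ~0.011 GB per seed URL

for n in (10_000, 30_000, 80_000):
    print(f"{n:>6,} URLs -> ~{n * gb_per_url:.0f} GB")
# 10,000 -> ~110 GB (close to the ~100 GB estimate above); 30,000 -> ~330 GB; 80,000 -> ~880 GB
```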


ikreymer commented 1 year ago

Will track changes/tasks related to URL list handling here: