webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

[Feature]: Improved usability / efficiency for large URL lists #1107

Open ikreymer opened 1 year ago

ikreymer commented 1 year ago

Context

The URL list crawl type works well for a small number of URLs (tens or hundreds), but there may be potential issues when entering thousands of URLs.

User Story Sentence

As a user, I'd like to crawl 30,000 URLs, or even 100,000 URLs, in a single URL list crawl.

Requirements

  1. User should have a way to upload that many URLs (either as a text file or by pasting into the text box), OR the user should be told that there is a limit on how many URLs they can add to one crawl config.
  2. User should be able to validate a large list of URLs and receive clear error messages indicating which URLs in the list are invalid (see the sketch after this list).
  3. Large URL seed lists shouldn't be too taxing on the DB.
  4. Should have some way to search the seed list (eventually) even if not in the DB.
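
A minimal validation sketch in Python, assuming a newline-separated seed list and a hypothetical `MAX_SEEDS` limit (neither the function name nor the limit is part of Browsertrix); the point is reporting line numbers so invalid entries are easy to find in a list of thousands:

```python
from urllib.parse import urlparse

MAX_SEEDS = 100_000  # hypothetical limit; any real limit would be decided separately


def validate_seed_list(text: str) -> tuple[list[str], list[str]]:
    """Validate a newline-separated seed list.

    Returns (valid_urls, errors); each error names the offending line
    number so the user can fix or delete that entry.
    """
    valid, errors = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        url = raw.strip()
        if not url:  # skip blank lines
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"line {lineno}: invalid URL: {url!r}")
        else:
            valid.append(url)
    if len(valid) > MAX_SEEDS:
        errors.append(f"seed list has {len(valid)} URLs, limit is {MAX_SEEDS}")
    return valid, errors


# Example: two of the three lines are rejected, each with its line number.
urls, errors = validate_seed_list("https://example.com/\nnot a url\nftp://x.org/file")
print(errors)
```

Collecting all errors in one pass (rather than stopping at the first invalid URL) keeps validation cheap for very large lists and gives the user a complete fix-up list.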

Questions

  1. Decide if we want to limit URL list sizes as a first step
  2. Decide on priority of possible improvements around URL list crawls, perhaps based on additional user feedback.

Tasks

ricardobasiliopt commented 1 year ago

As a user, I don't mind crawling in parts, for example 10,000 URLs at a time. In that case, for 80,000 seeds I would only have to run 8 crawls, which I would then add to a "collection" (one of the interesting features you've created). Perfect.

But the feature I miss the most is the one you mention above "2) User should be able to validate a large list of URLs, and receive good error messages on which URLs in a large list are invalid."

If I know precisely which URLs are considered invalid, I can correct them manually or delete them from the list. But in a list with thousands of lines it's very difficult for me, as an ordinary user, to find the error.

About the list limit: I ran a crawl of 4,000 URLs (auto-scroll, block ads) and got a 44 GB WACZ. If I did the same with 10,000 URLs I'd get about 100 GB, and if I also followed "Any link on the page" I'd get an even bigger WACZ. As a user, I find it useful to have a limit.
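
For reference, a rough linear extrapolation of those reported numbers (a back-of-the-envelope sketch only; actual sizes depend heavily on the pages crawled and the crawl settings):

```python
# Reported data point: 44 GB WACZ for a 4,000-URL crawl (auto-scroll, block ads).
observed_gb, observed_urls = 44, 4_000
gb_per_url = observed_gb / observed_urls  # ~0.011 GB per seed URL

for n in (10_000, 30_000, 80_000):
    print(f"{n:>6,} URLs -> ~{n * gb_per_url:.0f} GB")
# 10,000 -> ~110 GB (close to the ~100 GB estimate above); 30,000 -> ~330 GB; 80,000 -> ~880 GB
```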


ikreymer commented 1 year ago

Will track changes/tasks related to URL list handling here: