webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

[Feature]: Only Archive New URLs #1372

Open Shrinks99 opened 7 months ago

Shrinks99 commented 7 months ago

Context

Prior reading: https://anjackson.net/2023/06/09/what-makes-a-large-website-large/

The simplest way to deal with this risk of temporal incoherence is to have two crawls. A shallow and frequent crawl to get the most recent material, with a longer-running deeper crawl to gather the rest.

Large websites are difficult to crawl completely on a regular schedule. Some websites are simply too large to capture in their entirety every time they are crawled. Large websites also contain a lot of content that doesn't change, and when that is predictably the case, the value of re-capturing a given page many times may be quite low.

Users have a limited amount of disk space and execution minutes, yet their crawl workflows often capture the same content multiple times; broad crawls waste both resources to achieve the narrow goal of capturing updated content.
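
For illustration, the two-crawl strategy described in the prior reading might look like the following pair of scheduled workflows. This is only a sketch; the field names (`depth`, `schedule`, `timeLimitMinutes`, etc.) are hypothetical and not Browsertrix's actual workflow schema.

```ts
// Hypothetical pair of workflows illustrating the two-crawl strategy:
// a shallow, frequent crawl for the newest material, plus a deep,
// infrequent crawl to gather the rest.
interface WorkflowSketch {
  name: string;
  seeds: string[];
  depth: number;          // max link hops to follow from the seeds
  schedule: string;       // cron expression
  timeLimitMinutes: number;
}

const shallowDaily: WorkflowSketch = {
  name: "news-frontpage-daily",
  seeds: ["https://example.com/"],
  depth: 1,                  // just the index pages and what they link to
  schedule: "0 6 * * *",     // every day at 06:00
  timeLimitMinutes: 60,
};

const deepMonthly: WorkflowSketch = {
  name: "news-full-monthly",
  seeds: ["https://example.com/"],
  depth: 10,                 // long-running, broad capture
  schedule: "0 6 1 * *",     // first day of each month
  timeLimitMinutes: 24 * 60,
};
```

Even with this split, the shallow crawl still re-captures everything it touches; the feature requested below is what would let it skip already-archived pages.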

What change would you like to see?

As a user, I want to be able to only capture new URLs so my scheduled crawl workflows don't run for longer than they need to re-capturing content I have already archived completely.

As a user, when I need to edit a crawl workflow to add an additional page prefix to its scope (for sites that might have example.com and cdn.example.com), I don't want to have to re-crawl all of example.com just to get the links I missed the first time, when I know the rest of those pages were captured correctly.

User stories

  1. One of our customers wants to capture a news website daily to keep a record of new stories of the day. They currently have recurring crawls set up, however these crawls take a lot of execution time to complete and consume a lot of disk space with duplicate content. News websites (this one included) typically feature multiple index pages, sorted by topic, where new stories are listed. Existing stories are linked to as references in stories or as recommendations below them. These existing stories are what our customer doesn't want to crawl more than once.
    • This comes with the caveat that this system is imprecise and has no understanding that page content might be updated with corrections or additional info. For this use case to be properly served, we would need to implement support for RSS / Atom feeds, which do include this data; this customer has also requested that feature.

Requirements

  1. An option for "Only Archive New URLs" is available for both seeded and URL list workflows (when Include any linked page is toggled on for seeded workflows)
  2. When crawling, if "Only Archive New URLs" is toggled on, the crawler should not archive pages it comes across if they have been marked as previously visited (see the sketch after this list).
  3. URLs in the List of URLs and Crawl Start URL fields should always be saved and archived in order to give the user agency over which URLs should be saved to disk, even if they were visited in previous runs of the workflow.
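
A minimal sketch of requirements 2 and 3, assuming the crawler keeps a persistent record of URLs archived by earlier runs of the same workflow. All names here (`shouldArchive`, `previouslyVisited`, `isListedUrl`) are illustrative, not existing Browsertrix code.

```ts
// Minimal sketch, assuming a persistent set of URLs archived by previous
// runs of the same workflow is available to the crawler.
interface QueuedPage {
  url: string;
  isListedUrl: boolean; // true for "List of URLs" / "Crawl Start URL" entries
}

function shouldArchive(
  page: QueuedPage,
  onlyArchiveNewUrls: boolean,
  previouslyVisited: Set<string>,
): boolean {
  // Requirement 3: explicitly listed URLs are always archived,
  // even if earlier runs of the workflow already captured them.
  if (page.isListedUrl) return true;

  // Requirement 2: with the option on, skip anything already archived.
  if (onlyArchiveNewUrls && previouslyVisited.has(page.url)) return false;

  return true;
}
```

Because the explicitly listed URLs (e.g. a news site's index pages) are re-crawled on every run, newly linked pages are still discovered there and, not being in the previously visited set, get archived.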

BONUS: Optimize the crawling algorithm to track which previous URLs often have new URLs present on them, and prioritize them in the crawl queue to give time-limited workflows a better chance at crawling new content
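
The bonus optimization could be sketched as follows, assuming the crawler records how many new URLs each page yielded on its previous visit; the yield store and its persistence between runs are assumptions, not existing behavior.

```ts
// Hedged sketch: remember how many new URLs each page yielded last time and
// crawl historically productive pages first, so time-limited workflows reach
// new content sooner.
type NewUrlYield = Map<string, number>; // url -> new URLs found there last run

function prioritizeQueue(queue: string[], yields: NewUrlYield): string[] {
  return [...queue].sort(
    (a, b) => (yields.get(b) ?? 0) - (yields.get(a) ?? 0),
  );
}

// After each run, record the yields so the next run can use them.
function recordYield(
  yields: NewUrlYield,
  pageUrl: string,
  newLinks: string[],
): void {
  yields.set(pageUrl, newLinks.length);
}
```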

Todo

Shrinks99 commented 7 months ago

From today's call:

Ilya: Instead of manually specifying URLs that should be crawled every time, could this be accomplished with an extra hops function instead?

Hank: This seems like it would have the tradeoff of only capturing new content if it is not very deep within the site, OR having to re-crawl a lot of pages to obtain that result. Part of the goal here is enabling the crawling of "large websites" without wasting execution time visiting lots of already-captured content that the user doesn't wish to re-capture. The upside to this method might be that it's simpler than having to figure out all the pages that are "index pages" on a site where new content appears, but I think that added complexity is worth it.
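
To make the tradeoff concrete, here is a rough illustration only, not Browsertrix crawler code or its real extra-hops semantics: an extra-hops-only approach re-captures everything within N hops of the seeds on every run, and never reaches anything deeper than N hops.

```ts
// Rough illustration of the tradeoff being discussed.
function crawledByExtraHops(hopsFromSeed: number, extraHops: number): boolean {
  return hopsFromSeed <= extraHops;
}

// With extraHops = 1, every page one hop from the seed is re-archived on each
// run, while a new article linked only from a topic index two hops away is
// never captured at all.
console.log(crawledByExtraHops(1, 1)); // true  -> re-crawled every run
console.log(crawledByExtraHops(2, 1)); // false -> new content at depth 2 is missed
```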


Emma: Might it be useful to attach an expiration date, after which previously visited pages that are found again will be re-crawled?

Hank: I like this a lot!
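
A sketch of that expiration idea, assuming each archived URL stores the timestamp of its last successful capture; the names and storage model are assumptions for illustration only.

```ts
// Sketch of expiration-based re-crawling: a previously visited URL becomes
// eligible for capture again once its last capture is older than maxAgeDays.
function shouldRecrawl(
  url: string,
  lastCaptured: Map<string, Date>, // url -> time of last capture
  maxAgeDays: number,
  now: Date = new Date(),
): boolean {
  const last = lastCaptured.get(url);
  if (last === undefined) return true; // never captured: treat as new
  const ageDays = (now.getTime() - last.getTime()) / (1000 * 60 * 60 * 24);
  return ageDays > maxAgeDays; // stale captures become eligible again
}
```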

tw4l commented 7 months ago

I think it would be advantageous to enable both options:

Shrinks99 commented 5 months ago

Blocked by https://github.com/webrecorder/browsertrix-cloud/issues/1502

Shrinks99 commented 2 months ago

@anjackson Pinging you here for (personal) thoughts on this method of enabling continuous crawling of sites that change content often with index pages that list new content.

dla-kramski commented 2 months ago

I would like to expressly support this feature request.

As a literature archive, we will regularly crawl blogs and journal-type web sources 2-4 times a year, and limiting capture to new or updated pages would really be a great help for large websites.

As I understand it, this is also about timestamps, whereas #1753 would only require a large number of URLs to be distributed across multiple crawls.