webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Crawl resumed from saved state revisits already done pages #491

Closed: ato closed this issue 6 months ago

ato commented 6 months ago

If I run a crawl and then send the node.js crawl process a SIGINT, it writes a state YAML file into the crawls/ directory. The README states that:

The idea is that this crawl state YAML file can then be used as --config option to restart the crawl from where it was left of previously.

However, the state file only seems to contain the list of queued URLs and does not include any that are already done:

state:
  done: 2
  queued:
    - '{"added":"2024-03-11T05:36:42.324Z","url":"https://site.example/page3","seedId":0,"depth":1}'
    - '{"added":"2024-03-11T05:36:42.324Z","url":"https://site.example/page4","seedId":0,"depth":1}'
  pending: []
  failed: []
  errors: []

When passing a state file to the --config option, browsertrix-crawler seems to recrawl the entire site in a slightly different order. As far as I can tell, the second run doesn't know which pages were done in the first run, so it just queues them up again as soon as it encounters a link to them.

My expectation was that stopping and resuming from a state file should be roughly equivalent in terms of captured data to a crawl that was just never stopped.

Example:

$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/

In a different terminal, stop the crawl by sending SIGINT to the crawl process. (Pressing CTRL+C doesn't exit gracefully as it kills the browser. Maybe that's a podman/docker difference.)

$ pkill -INT -f /bin/crawl

Resume from the state file:

$ cat collections/test/crawls/crawl-20240311062844-test.yaml
state:
  done: 3
  queued:
    - '{"added":"2024-03-11T06:28:34.511Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
  pending: []
  failed: []
  errors: []
$ podman run -it --rm -v $PWD:/crawls/ webrecorder/browsertrix-crawler:1.0.0-beta.7 crawl --id test -c test --combinewarc --generatecdx --url https://www.meshy.org/ --config collections/test/crawls/crawl-20240311062844-test.yaml

Confirm that the same page was captured twice, once in each run:

$ grep '^org,meshy)/blog/outbackcdx-replication ' collections/test/indexes/index.cdxj
org,meshy)/blog/outbackcdx-replication 20240311062915 {"url": "https://www.meshy.org/blog/outbackcdx-replication/", "mime": "text/html", "status": "200", "digest": "sha256:3c91aad9db5f772528c32ffae302fc059d3e31de78c9235ce149d93bceac3c38", "length": "2155", "offset": "28193", "filename": "rec-b8254df4d6ed-20240311062902253-0.warc.gz"}
org,meshy)/blog/outbackcdx-replication 20240311062840 {"url": "https://www.meshy.org/blog/outbackcdx-replication/", "mime": "text/html", "status": "200", "digest": "sha256:3c91aad9db5f772528c32ffae302fc059d3e31de78c9235ce149d93bceac3c38", "length": "2160", "offset": "7558", "filename": "rec-dc8c7a02b0f6-20240311062832904-0.warc.gz"}
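
As a quick check, a small script along these lines can list every URL captured more than once in the combined index. This is just a sketch, not part of browsertrix-crawler; it assumes each CDXJ line is a SURT key, a timestamp, and a JSON record, as in the output above.

#!/usr/bin/python3
# Sketch: report URLs that appear more than once in a CDXJ index.
# Assumes each line looks like "<surt key> <timestamp> {json...}".
import json, sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        record = json.loads(line[line.index('{'):])  # JSON record starts at the first '{'
        counts[record['url']] += 1

for url, n in sorted(counts.items()):
    if n > 1:
        print(n, url)
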
ato commented 6 months ago

It seems the state parser will accept a list of URLs in the done section instead of a count. So the following workaround seems to allow resuming a crawl without revisiting already done pages.

We just take the done URLs from pages.jsonl and add them to the done: section in the state file, like so:

state:
  done: 
    - '{"url":"https://www.meshy.org/"}'
    - '{"url":"https://www.meshy.org/blog/oracle-unicode/"}'
  queued:
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/outbackcdx-replication/","seedId":0,"depth":1,"extraHops":0}'
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
  pending: []
  failed: []
  errors: []

Example script:

#!/usr/bin/python3
import json, sys, yaml

if len(sys.argv) < 3:
    sys.exit("Usage: workaround.py pages.jsonl crawl-state.yaml > crawl-state-done.yaml")

pages_file = sys.argv[1]
state_file = sys.argv[2]

# Collect the URLs of already-crawled pages from pages.jsonl
urls = set()
with open(pages_file) as f:
    f.readline()  # skip header line
    for line in f:
        urls.add(json.loads(line)['url'])

# Replace the done: count in the state file with a list of those URLs
with open(state_file) as f:
    data = yaml.safe_load(f)
    data['state']['done'] = [json.dumps({'url': url}) for url in urls]
    yaml.safe_dump(data, default_flow_style=False, stream=sys.stdout)
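
To apply the workaround, run the script on the collection's pages.jsonl and the saved state file (as in the usage line above: workaround.py pages.jsonl crawl-state.yaml > crawl-state-done.yaml), then pass the resulting YAML to --config instead of the original state file.
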
ikreymer commented 6 months ago

Thanks for reporting; indeed, this was an oversight in a previous refactor. The done array was no longer being kept, in order to save memory, but of course we still need the set of successfully finished / done / seen URLs to avoid recrawling previously captured pages.

#495 fixes this by recomputing the finished list of page URLs (taking the seen set and subtracting the queued and failed URLs).
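
Conceptually, the recomputation is just set arithmetic; a minimal sketch with hypothetical names, not the actual crawler code:

def recompute_done(seen, queued, failed):
    # Pages that have been seen but are no longer queued and did not fail
    # are treated as already done.
    return set(seen) - set(queued) - set(failed)
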

ikreymer commented 6 months ago

@ato is it ok for this to be in the 1.0.0 release? Which version are you using?

ato commented 6 months ago

We're on 0.12.4, but I already implemented the workaround I described above and that's working well enough for now. So yeah, if the fix is in 1.0.0 that's fine; I'll just delete the workaround when we upgrade. :-)

Really nice that it adds extra seeds on redirects too. That's actually something I was wondering how we'd have to handle when switching more of our crawls over from Heritrix.

ikreymer commented 6 months ago

@ato great! If you have a chance to test 1.0.0, we would welcome additional feedback! In 1.0.0, we use CDP entirely for capture, and it includes various other fixes (it should generally work better!)