It seems the state parser will accept a list of URLs in the done section instead of a count, so the following workaround seems to allow resuming a crawl without revisiting already-done pages. We just take the done URLs from pages.jsonl and add them to the done: section in the state file, like so:
state:
  done:
    - '{"url":"https://www.meshy.org/"}'
    - '{"url":"https://www.meshy.org/blog/oracle-unicode/"}'
  queued:
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/outbackcdx-replication/","seedId":0,"depth":1,"extraHops":0}'
    - '{"added":"2024-03-11T07:25:07.665Z","url":"https://www.meshy.org/blog/pywb-migration/","seedId":0,"depth":1,"extraHops":0}'
  pending: []
  failed: []
  errors: []
Example script:
#!/usr/bin/python3
import json, sys, yaml

if len(sys.argv) < 3:
    sys.exit("Usage: workaround.py pages.jsonl crawl-state.yaml > crawl-state-done.yaml")

pages_file = sys.argv[1]
state_file = sys.argv[2]

# Collect the URLs of all pages that were already crawled.
urls = set()
with open(pages_file) as f:
    f.readline()  # skip the header record
    for line in f:
        urls.add(json.loads(line)['url'])

# Rewrite the done: section of the saved state with those URLs.
with open(state_file) as f:
    data = yaml.safe_load(f)
data['state']['done'] = [json.dumps({'url': url}) for url in urls]
yaml.safe_dump(data, default_flow_style=False, stream=sys.stdout)
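The resulting crawl-state-done.yaml can then be passed back to the crawler with --config in place of the original state file, so the resumed crawl treats the listed URLs as already done.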
Thanks for reporting. Indeed, this was an oversight in a previous refactor: the done array was no longer being kept in order to save memory, but of course we still need the set of successfully finished / done / seen URLs to avoid recrawling previous URLs.
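In other words (a minimal sketch only, not browsertrix-crawler's actual code), on resume the crawler has to rebuild its seen set from the done entries so that links to already-crawled pages are not queued again:

import json

# Sketch assuming a state dict shaped like the YAML above; illustrative only.
def load_seen(state):
    """Rebuild the seen set from the done and queued entries of a saved state."""
    seen = set()
    for entry in state.get('done', []) + state.get('queued', []):
        seen.add(json.loads(entry)['url'])
    return seen

def maybe_queue(url, seen, queue):
    """Queue a newly discovered link only if it has not been seen before."""
    if url not in seen:
        seen.add(url)
        queue.append(url)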
@ato is it ok for this to be in the 1.0.0 release? Which version are you using?
We're on 0.12.4, but I already implemented the workaround described above and it's working well enough for now. So yeah, if the fix is in 1.0.0 that's fine; I'll just delete the workaround when we upgrade. :-)
Really nice that it adds extra seeds on redirects too. That's something I'd been wondering how we'd handle when switching more of our crawls over from Heritrix.
@ato great! If you have a chance to test 1.0.0, we'd welcome additional feedback! In 1.0.0 we use CDP entirely for capture, and it includes various other fixes (generally it should work better!)
If I run a crawl and then send the node.js crawl process a SIGINT, it writes a state YAML file into the crawls/ directory. The README states that this state file can be used to resume the crawl. However, the state file only seems to contain the list of queued URLs and does not include any that are already done.
When passing a state file to the --config option, browsertrix-crawler seems to recrawl the entire site in a slightly different order. As far as I can tell, the second run doesn't know which pages were done in the first run, so it just queues them up again as soon as it encounters a link to them.
My expectation was that stopping and resuming from a state file should be roughly equivalent in terms of captured data to a crawl that was just never stopped.
Example:
In a different terminal, stop the crawl by sending SIGINT to the crawl process. (Pressing CTRL+C doesn't exit gracefully as it kills the browser; maybe that's a podman/docker difference.)
Resume from the state file.
Confirm that the same page was captured twice, once in each run:
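For example, a small script along these lines (the paths are placeholders for the two runs' output) can count how often each URL appears across the combined pages.jsonl files:

import json
from collections import Counter

# Count how many times each URL appears across both runs' page lists.
# The paths below are placeholders for the two collections' pages.jsonl files.
counts = Counter()
for path in ['collections/run1/pages/pages.jsonl', 'collections/run2/pages/pages.jsonl']:
    with open(path) as f:
        f.readline()  # skip the header record
        for line in f:
            counts[json.loads(line)['url']] += 1

for url, n in sorted(counts.items()):
    if n > 1:
        print(n, url)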