webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0
653 stars 83 forks source link

Better indicate the interruption reason #584

Open benoit74 opened 5 months ago

benoit74 commented 5 months ago

We have three things which can stop the crawler in the middle of a run:

As can be seen in the flag names, the disk one is not named Limit and this shows that it's different.

We understand the size and time limits as requests by the user to stop (crawling) when reaching that point.

We understand the diskUtilization one as a technical safety net.

Currently, all these two limits + technical safety net + the browser disconnection leads to an exit code 11, which makes it hard to diagnose / automate for users (especially zimit ^^)

Would it make sense from your PoV to implement different return code for each limit / technical safety net / browser disconnection?

I can work on this issue if ok for you.

tw4l commented 5 months ago

Hi @benoit74 , will follow up further tomorrow but some of the rationale for the 11 exit code is here: https://github.com/webrecorder/browsertrix-crawler/issues/549.

Essentially, it's useful to have exit codes that Browsertrix can pick up on to know whether or not to restart crawler pods. Of course, this could be done through looking for several exit codes and in general we could use a better rationalization of what exit code is given when, so I think you're right that there is room for improvement here!

benoit74 commented 5 months ago

Yep, using the exit code for zimit is also our goal, but we realize we need more fine-grained details than only one "general" 11 exit code. Especially since exit code 11 is now returned for far more than the original --timeLimit and --sizeLimit. I'm not sure this was totally intentional, or at least this is was cause us some trouble (we shouldn't try to create a ZIM when the --diskUtilization is already above limit or when the browser connection has been lost).

Issue #549 makes me realize that this part of the documentation seems to have been lost when transitioning to MkDocs, this issue should probably also add this back somewhere.

All that been said, no rush, better to well define the plan than rushing into something which will not make it in the end.

benoit74 commented 4 days ago

After some thought, I propose that:

Proposed new stats format:

{
  "crawled": xx,
  "total": xx,
  "pending": xx,
  "failed": xx,
  "limit": {
    "max": xx,
    "hit": true/false
  },
  "sizeLimit": {
    "max": xx,
    "hit": true/false
  },
  “timeLimit": {
    "max": xx,
    "hit": true/false
  },
  "diskUtilization": {
    "max": xx,
    "hit": true/false
  },
  "browser_disconnected": true/false,
  "final_status": "done"/"canceled"/"interrupted"/"failed",
  "pendingPages": [
    ...
  ]
}

Are you OK with this idea? May I propose a PR?