Open benoit74 opened 5 months ago
Hi @benoit74, I will follow up further tomorrow, but some of the rationale for exit code 11 is here: https://github.com/webrecorder/browsertrix-crawler/issues/549.
Essentially, it's useful to have exit codes that Browsertrix can pick up on to know whether or not to restart crawler pods. Of course, this could be done through looking for several exit codes and in general we could use a better rationalization of what exit code is given when, so I think you're right that there is room for improvement here!
Yep, using the exit code for zimit is also our goal, but we realize we need more fine-grained details than only one "general" exit code 11, especially since exit code 11 is now returned for far more than the original --timeLimit and --sizeLimit. I'm not sure this was totally intentional, and in any case this is what causes us some trouble (we shouldn't try to create a ZIM when --diskUtilization is already above its limit or when the browser connection has been lost).
Issue #549 makes me realize that this part of the documentation seems to have been lost when transitioning to MkDocs; this issue should probably also add it back somewhere.
All that being said, no rush: better to define the plan well than to rush into something which will not make it in the end.
After some thought, I propose that:
--limit details) so that anyone can make whatever decision they want, with fine details on what happened
Proposed new stats format:
{
  "crawled": xx,
  "total": xx,
  "pending": xx,
  "failed": xx,
  "limit": {
    "max": xx,
    "hit": true/false
  },
  "sizeLimit": {
    "max": xx,
    "hit": true/false
  },
  "timeLimit": {
    "max": xx,
    "hit": true/false
  },
  "diskUtilization": {
    "max": xx,
    "hit": true/false
  },
  "browser_disconnected": true/false,
  "final_status": "done"/"canceled"/"interrupted"/"failed",
  "pendingPages": [
    ...
  ]
}
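To make the intent concrete, here is a rough sketch of how a consumer such as zimit might read this proposed format. The key names come from the proposal above; the helper names (`hit_user_limits`, `should_create_zim`), the sample values, and the decision policy itself are assumptions made for illustration only, not existing zimit or crawler code.

```python
import json

# Keys taken from the proposed stats format; which of them count as
# "user-requested" limits is an assumption of this sketch.
USER_LIMITS = ("limit", "sizeLimit", "timeLimit")


def hit_user_limits(stats: dict) -> list:
    """Return the names of the user-requested limits that were hit."""
    return [name for name in USER_LIMITS if stats.get(name, {}).get("hit", False)]


def should_create_zim(stats: dict) -> bool:
    """Hypothetical policy: refuse to package after a technical failure."""
    # A lost browser connection means the crawl output cannot be trusted.
    if stats.get("browser_disconnected", False):
        return False
    # The disk safety net tripped: there may be no room left to build a ZIM.
    if stats.get("diskUtilization", {}).get("hit", False):
        return False
    # Hitting a user-requested limit is treated as a normal way to stop.
    return stats.get("final_status") != "failed"


# Sample document with made-up values, shaped like the proposal.
SAMPLE = """
{
  "crawled": 120, "total": 500, "pending": 0, "failed": 3,
  "limit": {"max": 1000, "hit": false},
  "sizeLimit": {"max": 1000000000, "hit": false},
  "timeLimit": {"max": 3600, "hit": true},
  "diskUtilization": {"max": 90, "hit": false},
  "browser_disconnected": false,
  "final_status": "interrupted",
  "pendingPages": []
}
"""

stats = json.loads(SAMPLE)
print(hit_user_limits(stats), should_create_zim(stats))
```

With something like this, the exit code would only need to say "stopped early", and the stats file would carry the reason.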
Are you OK with this idea? May I propose a PR?
We have three things which can stop the crawler in the middle of a run:
--sizeLimit: the maximum WARC size
--timeLimit: the maximum duration of the crawl
--diskUtilization: the maximum disk usage (as a percentage); the crawler stops if the threshold is reached OR is expected to be reached
As can be seen in the flag names, the disk one is not named Limit, which shows that it is different.
We understand the size and time limits as requests by the user to stop (crawling) when reaching that point.
We understand the diskUtilization one as a technical safety net.
Currently, the two limits, the technical safety net, and the browser disconnection all lead to exit code 11, which makes it hard for users (especially zimit ^^) to diagnose what happened and to automate around it.
Would it make sense from your PoV to implement a different return code for each limit, for the technical safety net, and for the browser disconnection?
I can work on this issue if ok for you.
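As an illustration of what distinct return codes would allow on the caller side, here is a minimal sketch. Only exit code 11 comes from this discussion; the numeric values in `ProposedExit` are arbitrary placeholders (no numbers have been proposed here), and `action_for` with its action strings is a hypothetical helper, not part of zimit or the crawler.

```python
from enum import IntEnum

# From this discussion: today's catch-all exit code.
CURRENT_CATCH_ALL = 11


class ProposedExit(IntEnum):
    """Placeholder values for illustration only; nothing here is decided."""
    SIZE_LIMIT = 101
    TIME_LIMIT = 102
    DISK_UTILIZATION = 103
    BROWSER_DISCONNECTED = 104


# User-requested stops: the crawl ended early on purpose.
USER_REQUESTED = {ProposedExit.SIZE_LIMIT, ProposedExit.TIME_LIMIT}
# Technical stops: the output should not be packaged.
TECHNICAL = {ProposedExit.DISK_UTILIZATION, ProposedExit.BROWSER_DISCONNECTED}


def action_for(exit_code: int) -> str:
    """Map a crawler exit code to what the caller should do next."""
    if exit_code == 0 or exit_code in USER_REQUESTED:
        return "create_zim"
    if exit_code in TECHNICAL:
        return "abort"
    if exit_code == CURRENT_CATCH_ALL:
        # Today's situation: the caller cannot tell which case occurred.
        return "ambiguous"
    return "abort"


for code in (0, 11, ProposedExit.TIME_LIMIT, ProposedExit.DISK_UTILIZATION):
    print(int(code), action_for(code))
```

The "ambiguous" branch is the problem described above: with a single code 11, the caller has to guess between a normal early stop and a technical failure.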