woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Inconsistent state after woodpecker container restart during ongoing build #178

Closed: pboguslawski closed this 2 years ago

pboguslawski commented 3 years ago

When a long build is in progress, restarting the woodpecker containers (agent, then server) using a docker-compose setup similar to

https://woodpecker.laszlo.cloud/server-setup/

forces the docker daemon to kill the agent container (the server is then stopped as well, but the build container is not). After the woodpecker containers start again, the build task still has Running status and one cannot see the build container's output (the build container finishes its work in the background but its status is not updated in woodpecker; pipeline service containers are left running, orphaned, until a host/docker restart).

After the restart is initiated, the agent logs show only

ctrl+c received, terminating process

and the agent does not cancel the running task.

Checked in woodpecker compiled from b52e404f93ccea05dc783aa929770c4a0fad2e74.

On receiving a termination signal (e.g. on host reboot), the agent process should cancel all ongoing tasks and terminate itself ASAP. This should leave the task database in a consistent state after the next start.

Regards, Paweł

laszlocph commented 3 years ago

Thanks for the report. While I haven't reproduced the issue from your report, I've seen similar behavior before.

In my large-scale Woodpecker operation, I did restarts only when no build was running. To make that manageable, you can stop the work queue from picking up new tasks and wait for the running tasks to finish. See these API endpoints: https://github.com/laszlocph/woodpecker/blob/7e1c81c25c8556950faeb9cfb4212f6a8312d688/router/router.go#L143

pboguslawski commented 3 years ago

Graceful restarts (the agent waiting for its jobs to finish) should be an option (the admin must do additional configuration, such as queue tricks and ensuring the agent won't be killed by docker on shutdown), but the default in the agent should be to cancel all running tasks and exit ASAP. Woodpecker's state should be consistent after a restart in both scenarios.

laszlocph commented 3 years ago

I agree in principle. The fix is needed in the core of Woodpecker, so I don't think it will be fast.

In the meantime you can try using the queue operations that are available today.

6543 commented 2 years ago

It could be that we should register more signals:

https://github.com/woodpecker-ci/woodpecker/blob/b3d40024a99a0a0f3f2420817db17342ee7a5b60/cmd/agent/signal.go#L30

The agent itself looks like it shuts down gracefully ... but we should do proper testing.

6543 commented 2 years ago

This will be resolved by #536.