web3-storage / backup

A tool to backup all data for disaster recovery.

feat: add healthcheck #9

Closed: olizilla closed this 1 year ago

olizilla commented 1 year ago

Adds an HTTP server to the backup process that reports a healthcheck status. We're seeing the backup process lock up in prod with no logging, and with no process failure or other error to help debug it. The plan is to wire this into ECS so that it restarts the container when the healthcheck fails.

If it's been more than 2 minutes since the last log, assume we are stalled and return 500.
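A minimal sketch of that stall check, not the code in this PR. The names `LogClock`, `healthStatus`, and `createHealthServer` are hypothetical, as are the response bodies; only the 2-minute threshold and the 500 status come from the description above. It assumes the backup code calls `touch()` each time it writes a log line.

```typescript
import { createServer } from "node:http";

// Threshold from the description: more than 2 minutes without a log line.
const STALL_THRESHOLD_MS = 2 * 60 * 1000;

// Hypothetical tracker: the backup code would call touch() whenever it logs.
class LogClock {
  lastLogAt: number;

  constructor(now: number = Date.now()) {
    this.lastLogAt = now;
  }

  touch(now: number = Date.now()): void {
    this.lastLogAt = now;
  }
}

// Returns the HTTP status the healthcheck should report at time `now`.
function healthStatus(clock: LogClock, now: number = Date.now()): number {
  return now - clock.lastLogAt > STALL_THRESHOLD_MS ? 500 : 200;
}

// Builds (but does not start) an HTTP server that answers every request
// with the current health status. The caller picks the port via listen().
function createHealthServer(clock: LogClock) {
  return createServer((_req, res) => {
    const status = healthStatus(clock);
    res.writeHead(status, { "Content-Type": "text/plain" });
    res.end(status === 200 ? "ok" : "stalled");
  });
}
```

Keeping the status decision in a pure function that takes `now` as a parameter makes the threshold easy to test without waiting two real minutes.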

Update the Dockerfile to use curl to test the healthcheck URL.

License: MIT

olizilla commented 1 year ago

Added a check to set health ok permanently once completed, otherwise we'd force completed containers to keep restarting.
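Roughly, the completion check could look like the sketch below. `BackupState`, `markCompleted`, and `isHealthy` are hypothetical names, not taken from the PR; the idea is only that a completed backup stops logging, so the stall rule has to be bypassed once completion is recorded.

```typescript
// Hypothetical shape of the state the healthcheck reads.
type BackupState = {
  lastLogAt: number;  // epoch ms of the most recent log line
  completed: boolean; // set once, never cleared
};

// Assumed stall threshold: 2 minutes without a log line.
const STALL_AFTER_MS = 2 * 60 * 1000;

// Latch the state as completed. There is deliberately no way to undo this.
function markCompleted(state: BackupState): void {
  state.completed = true;
}

// A completed backup is always healthy, however long it has been silent.
// Otherwise the normal rule applies: unhealthy after 2 minutes of silence.
function isHealthy(state: BackupState, now: number = Date.now()): boolean {
  if (state.completed) return true;
  return now - state.lastLogAt <= STALL_AFTER_MS;
}
```

Without the latch, a finished task would go quiet, fail the check after 2 minutes, and be restarted by the orchestrator.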

I don't know why the existing completed tasks aren't stopping once completed, though. We see them log "backup complete 🎉", but they persist as live tasks in the backup service. This is kinda handy, since we can see what's going on, but I think ideally we'd be shutting them down as they complete.

This PR adds an HTTP server, so they are definitely not gonna shut down automatically now, but that's easy to fix if we want it. The bind here is that until I figure out why they are not cleanly shutting down, I have to leave the healthcheck HTTP server running; otherwise they will be marked as unhealthy once completed and restarted because the healthcheck fails.

So, as the tasks currently don't shut down, I am making it so they are healthy once completed and getting this rolled out.