Closed by alfre2v 8 years ago
Instead of "500 Internal Server Error", I would recommend using "503 Service Unavailable". This way it's easier to distinguish an unhealthy system from an unexpected exception.
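A minimal sketch of the distinction (not the actual zmon-worker code): raise an explicit 503 when the system reports itself unhealthy, and let unexpected exceptions surface as CherryPy's default 500. `system_is_healthy` is a hypothetical placeholder.

```python
import cherrypy

def system_is_healthy():
    """Hypothetical placeholder for the real aggregated health check."""
    return True

class Root(object):
    @cherrypy.expose
    def health(self):
        # Known-unhealthy state: answer with an explicit 503.
        if not system_is_healthy():
            raise cherrypy.HTTPError(503, 'Service Unavailable')
        # Any unexpected exception raised here would still be reported
        # by CherryPy as a 500 Internal Server Error.
        return 'OK'
```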
1) Processes being killed due to timeout is fine; I would not consider this a reason for being unhealthy. This is quite often a problem, e.g. with security groups on AWS, which do not result in a quick timeout/connection refused but mostly in silent packet drops.
2) Basically it is about making sure that a large enough portion of the task/child processes is executing the main loop, polling Redis for tasks (see the sketch after this list).
3) Actually, a first step would be to just get any response at all, as right now we only have the EC2 health check, which makes close to no sense.
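A rough sketch of the criterion in point 2, with made-up names and thresholds (`last_poll_ts`, `MIN_HEALTHY_FRACTION` and `MAX_REPORT_AGE` are assumptions, not part of the existing code): the system counts as healthy when a large enough fraction of child processes has recently reported that it is in the main loop polling Redis.

```python
import time

MIN_HEALTHY_FRACTION = 0.5   # assumed threshold: at least half of the workers
MAX_REPORT_AGE = 60          # seconds since the last "I am polling" report

def fraction_polling(last_poll_ts, now=None):
    """last_poll_ts: assumed mapping of worker pid -> timestamp of its last poll report."""
    now = now if now is not None else time.time()
    if not last_poll_ts:
        return 0.0
    recent = sum(1 for ts in last_poll_ts.values() if now - ts <= MAX_REPORT_AGE)
    return recent / float(len(last_poll_ts))

def workers_polling_ok(last_poll_ts):
    return fraction_polling(last_poll_ts) >= MIN_HEALTHY_FRACTION
```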
Thanks @harti2006 , changed to 503 Service Unavailable :)
@Jan-M, regarding your comment 1: I would say an unusually high rate of processes being killed and/or restarted could be an indication that something is so wrong that the system itself is unhealthy... but how much is too high is kind of subjective... We may not implement it, but I wanted to mention it and see what you think.
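To make the idea concrete, here is a hypothetical sliding-window version of that criterion; the window size and threshold are arbitrary placeholders, and nothing like this exists in the code yet.

```python
import time
from collections import deque

RESTART_WINDOW = 300          # seconds, placeholder
MAX_RESTARTS_IN_WINDOW = 10   # placeholder for "too high"

_restart_times = deque()

def record_restart(now=None):
    """Call whenever the master kills/restarts a child worker."""
    _restart_times.append(now if now is not None else time.time())

def restart_rate_too_high(now=None):
    """True if more than MAX_RESTARTS_IN_WINDOW restarts happened within the window."""
    now = now if now is not None else time.time()
    while _restart_times and now - _restart_times[0] > RESTART_WINDOW:
        _restart_times.popleft()
    return len(_restart_times) > MAX_RESTARTS_IN_WINDOW
```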
Branch feature/health_endpoint has a first version of the feature; it runs the limited logic defined in the description above (it needs more testing, though).
I completely refactored the process controller logic to clean it up and make it easier to track information about each process. I also added other small improvements; for example, I got rid of the annoying non-deterministic result of tests/test_worker.py.
Done and rolled out.
We want to create a /health/ endpoint in our master cherrypy process that reflects the status of the system.
Background: The master process, which spawns all the workers, contains a cherrypy HTTP server and an RPC server for internal communication with its child processes. Each child worker process has a Main thread, which runs the ZMON checks, and a Reactor thread, which reacts to special circumstances and reports them to the master via RPC calls. Currently the only functionality the Reactor thread has is detecting when the Main thread is stuck in a long check and triggering an RPC call so that the master process terminates this child worker. We want to expand the Reactor thread to periodically report its health status to the master process. The master process will aggregate the health feedback it receives from all child workers so that it can be presented in an HTTP endpoint that reflects when the whole system is malfunctioning.
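To illustrate the Reactor-thread side, a minimal sketch under the assumption that the RPC client used between worker and master exposes a generic `call(method, *args)` interface; the method name `report_health`, the interval and the payload are hypothetical, not the actual zmon-worker API.

```python
import os
import time

HEALTH_REPORT_INTERVAL = 30  # seconds, assumed

def health_reporter_loop(rpc_client, stop_event):
    """Intended to run inside the Reactor thread: periodically tell the
    master process that this child worker is alive and in its main loop.
    stop_event is a threading.Event used to shut the loop down."""
    while not stop_event.is_set():
        try:
            # Hypothetical RPC method; the master records pid -> timestamp.
            rpc_client.call('report_health', os.getpid(), time.time())
        except Exception:
            # A missed report simply shows up as a stale worker on the
            # master side, so we do not crash the Reactor thread here.
            pass
        stop_event.wait(HEALTH_REPORT_INTERVAL)
```

The master then only needs to keep the latest timestamp per worker, which is what the /health aggregation sketched at the end works from.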
Proposed specs:
endpoint: /health
return:
Criteria for unhealthy system:
what else...?
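As a concrete starting point for the discussion, a sketch of what the master-side /health handler could look like, returning 503 Service Unavailable when the system is unhealthy. The `last_seen` mapping, the thresholds and the response body are illustrative assumptions, not part of the spec.

```python
import time
import cherrypy

MAX_REPORT_AGE = 60          # seconds without a report before a worker counts as stale
MIN_HEALTHY_FRACTION = 0.5   # minimum fraction of recently-reporting workers

class HealthEndpoint(object):
    def __init__(self, last_seen):
        # last_seen: assumed dict of worker pid -> timestamp of its last
        # health report, updated by the master's RPC server.
        self.last_seen = last_seen

    @cherrypy.expose
    def health(self):
        now = time.time()
        total = len(self.last_seen)
        healthy = sum(1 for ts in self.last_seen.values()
                      if now - ts <= MAX_REPORT_AGE)
        if total == 0 or healthy < MIN_HEALTHY_FRACTION * total:
            raise cherrypy.HTTPError(503, 'Service Unavailable')
        return 'OK: {0}/{1} workers healthy'.format(healthy, total)
```

Any of the criteria discussed above (fraction of polling workers, rate of kills/restarts) could be plugged into the condition that triggers the 503.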