What does this change?
This change adds an HTTP healthcheck to the service's target group. The aim is to avoid downtime during deploys: at present the TCP healthcheck passes as soon as nginx can accept a connection, even though the underlying service may not yet have started.
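To illustrate the gap between the two kinds of check, here is a minimal Python sketch. It is not code from this PR: `StartingUpProxy` is a hypothetical stand-in for nginx that accepts connections but answers 502 because nothing healthy sits behind it, and the helper names are made up for the example.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StartingUpProxy(BaseHTTPRequestHandler):
    """Hypothetical stand-in for nginx while the service behind it is still starting."""

    def do_GET(self):
        # The proxy is listening, but has no healthy upstream to forward to.
        self.send_response(502)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def tcp_check(host, port, timeout=2.0):
    """Pass as soon as anything accepts a connection on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host, port, path="/management/healthcheck", timeout=2.0):
    """Pass only if a GET on the path returns HTTP 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def compare_checks():
    """Run both checks against the stand-in proxy; returns (tcp_ok, http_ok)."""
    server = HTTPServer(("127.0.0.1", 0), StartingUpProxy)  # port 0: pick any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return tcp_check("127.0.0.1", port), http_check("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `compare_checks()` runs both checks against the same listener. The TCP check passes and the HTTP check fails, which is the situation this change is meant to catch during a deploy.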
[!NOTE]
The healthcheck endpoint does not detect whether the elasticsearch client can successfully make connections, so it does not fully report service health. We should add this in a future PR.
A request to `/management/healthcheck` should result in an HTTP 200 response with the body:
The `/management/healthcheck` path was chosen to match other services that already provide healthcheck endpoints. The response also surfaces config, which makes the endpoint a quick way to check the setup.
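For readers unfamiliar with this kind of endpoint, a rough Python sketch of the idea follows. It is not the service's actual handler: the response shape, the `EXPOSED_CONFIG` values and the `healthcheck_response` helper are all hypothetical, chosen only to show a 200 response that carries some config.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical values; the real service's config keys are not shown in this PR.
EXPOSED_CONFIG = {"environment": "staging", "elasticsearch_index": "example-index"}

HEALTHCHECK_PATH = "/management/healthcheck"


def healthcheck_response(path, config):
    """Return (status, JSON body) for a request path. The body shape is assumed."""
    if path == HEALTHCHECK_PATH:
        return 200, json.dumps({"status": "ok", "config": config})
    return 404, json.dumps({"error": "not found"})


class HealthcheckHandler(BaseHTTPRequestHandler):
    """Minimal handler that serves the healthcheck and nothing else."""

    def do_GET(self):
        status, body = healthcheck_response(self.path, EXPOSED_CONFIG)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

Keeping the status and body logic in a plain function makes it easy to test without starting a server. Whatever config the real endpoint surfaces should be limited to non-secret values.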
This change requires running `./run_terraform.sh apply` from the infrastructure directory:
How to test
[x] Run locally: can you reach the healthcheck endpoint?
[x] Deploy to the staging environment and test whether the healthchecks are respected.
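One way to exercise the staging step is to poll the endpoint the way a load balancer would. The Python sketch below assumes the target is only treated as healthy after several consecutive successes, as AWS target groups do with their healthy threshold. The function name and default values are illustrative and are not taken from this PR's Terraform.

```python
import time


def wait_until_healthy(check, healthy_threshold=3, max_attempts=30,
                       interval_seconds=1.0, sleep=time.sleep):
    """Poll `check` until it passes `healthy_threshold` times in a row.

    Returns the attempt number on which the target became healthy,
    or None if it never did within `max_attempts`.
    """
    consecutive_passes = 0
    for attempt in range(1, max_attempts + 1):
        if check():
            consecutive_passes += 1
            if consecutive_passes >= healthy_threshold:
                return attempt
        else:
            # Any failure resets the streak, as a consecutive-success threshold would.
            consecutive_passes = 0
        if attempt < max_attempts:
            sleep(interval_seconds)
    return None
```

`check` can be any zero-argument callable that returns a boolean, for example a function that requests `/management/healthcheck` on the staging host and returns whether the status was 200. The `sleep` parameter exists so the loop can be tested without real delays.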
How can we measure success?
No false alarms during deployment, and the healthcheck correctly reports the state of the service.
Have we considered potential risks?
Changing the healthchecks changes the failure modes for the API, so we should test thoroughly in staging before deploying to prod. We should also consider and document the impact of extending the healthcheck to fail in other situations (e.g. when elasticsearch is unavailable).
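To make that trade-off concrete, here is a hedged Python sketch of a healthcheck that runs several named dependency checks but only fails on the ones marked critical. It is an illustration, not a proposal for the actual implementation: `evaluate_health`, the check names and the 503 status are assumptions.

```python
def evaluate_health(checks, critical):
    """Run each named check and return (http_status, per-check results).

    `checks` maps a name to a zero-argument callable returning a boolean.
    Only failures of checks named in `critical` turn the status into 503.
    """
    results = {}
    status = 200
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            # A check that raises is treated the same as a check that fails.
            passed = False
        results[name] = "ok" if passed else "failing"
        if not passed and name in critical:
            status = 503
    return status, results
```

With `critical={"service"}`, an elasticsearch outage is reported in the body but the instance stays in service. With `critical={"service", "elasticsearch"}`, the same outage fails the healthcheck on every instance at once, which is the kind of failure mode that should be tested in staging first.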