wellcomecollection / catalogue-api

:crystal_ball: The API for searching the Wellcome Collection catalogue.
https://developers.wellcomecollection.org
MIT License

Adds a simple healthcheck endpoint for the items service #739

Closed kenoir closed 7 months ago

kenoir commented 7 months ago

What does this change?

This change follows https://github.com/wellcomecollection/catalogue-api/pull/736 and adds an HTTP healthcheck to the items API, ensuring the Scala service has started before it is registered as healthy at the NLB and starts serving requests.

How to test?

How can we measure success?

No downtime during deployments, resulting in a better experience for visitors to the site and fewer errors in the alerts channel that we cannot effectively respond to.

Currently, the only errors we see during deployment of the catalogue API are items errors. See this from #wc-platform-alerts in Slack, during a deployment following an update to the search:

[Image: Screenshot 2024-01-14 at 15 05 34]

service healthcheck:

Have we considered potential risks?

Changing the healthchecks changes the failure modes for the API. However, we have tested this principle successfully in https://github.com/wellcomecollection/catalogue-api/pull/736, so the risk is reduced.

agnesgaroux commented 7 months ago

Not sure how this works. Does it check /management/healthcheck before hitting /works every time? And if the healthcheck fails (for whatever reason, which could be something other than the instance currently being deployed), what does the LB do? Try another instance?

kenoir commented 7 months ago

This change only provides a new endpoint at /management/healthcheck that serves the following JSON and doesn't do anything else:

{
  "message": "ok"
}
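
For illustration only, an endpoint of this shape can be sketched as below. This is not the code from the PR: the real service presumably uses its own HTTP framework and routing, whereas this sketch assumes the JDK's built-in `com.sun.net.httpserver.HttpServer` so that it needs no dependencies. The names `HealthcheckSketch` and `start` are hypothetical; only the path and the response body come from the comment above.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets

// Hypothetical sketch, not the items service implementation.
object HealthcheckSketch {
  // Response body as described in the comment above.
  val body: String = """{"message": "ok"}"""

  // Starts a server with a single context at /management/healthcheck.
  // Port 0 asks the OS for any free port. The handler answers every
  // HTTP method with 200 and the fixed body; it checks nothing else.
  def start(port: Int = 0): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext(
      "/management/healthcheck",
      new HttpHandler {
        override def handle(exchange: HttpExchange): Unit = {
          val bytes = body.getBytes(StandardCharsets.UTF_8)
          exchange.getResponseHeaders.add("Content-Type", "application/json")
          exchange.sendResponseHeaders(200, bytes.length.toLong)
          val out = exchange.getResponseBody
          try out.write(bytes)
          finally out.close()
        }
      }
    )
    server.start()
    server
  }
}
```

The point of the sketch is that the response can only be produced once the application process itself is up and handling requests, which is the property the healthcheck relies on.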

The load balancer (NLB) uses this endpoint to determine whether the instance is healthy and allowed to serve requests. At present the NLB uses a TCP healthcheck, which only requires the nginx sidecar that proxies requests to the app to be available. Nginx comes up very quickly while the slowpoke Scala app is still yawning and blinking itself awake.
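
The difference between the two kinds of check can be sketched roughly as follows. This is a client-side approximation, not how the NLB is implemented or configured: `HealthProbes`, `tcpCheck` and `httpCheck` are hypothetical names, the timeouts are arbitrary, and treating only 2xx responses as healthy is an assumption (a real load balancer's success codes are configurable).

```scala
import java.net.{HttpURLConnection, InetSocketAddress, Socket, URI}
import scala.util.Try

// Hypothetical probes approximating the two healthcheck styles.
object HealthProbes {
  // TCP-style check: passes as soon as anything accepts a connection
  // on the port, for example a proxy sidecar with no app behind it yet.
  def tcpCheck(host: String, port: Int, timeoutMs: Int = 1000): Boolean =
    Try {
      val socket = new Socket()
      try {
        socket.connect(new InetSocketAddress(host, port), timeoutMs)
      } finally {
        socket.close()
      }
      true
    }.getOrElse(false)

  // HTTP-style check: passes only if the path returns a 2xx status
  // within the timeout, so something must actually serve the request.
  def httpCheck(
    host: String,
    port: Int,
    path: String = "/management/healthcheck",
    timeoutMs: Int = 1000
  ): Boolean =
    Try {
      val conn = new URI(s"http://$host:$port$path").toURL
        .openConnection()
        .asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(timeoutMs)
      conn.setReadTimeout(timeoutMs)
      try {
        val code = conn.getResponseCode
        code >= 200 && code < 300
      } finally {
        conn.disconnect()
      }
    }.getOrElse(false)
}
```

Against a port that accepts connections but never answers, `tcpCheck` succeeds while `httpCheck` fails, which is analogous to the window where nginx is listening but the app is not yet ready.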

This change makes sure the Scala app is up by forcing it to serve a request before the load balancer considers it healthy. It doesn't do any more sophisticated checks as to whether it can actually serve works; there are some musings about that in Slack.