wellcomecollection / content-api

📖 The API + ETL pipeline for searching the Wellcome Collection Prismic Repository.
MIT License

Adds a simple healthcheck endpoint for content-api #94

kenoir closed 8 months ago

kenoir commented 8 months ago

What does this change?

This change adds an HTTP healthcheck for the service's target group, to avoid downtime during deploys. At present the TCP healthcheck only waits until nginx can accept a connection, even though the underlying service may not have started yet.

[!NOTE] The healthcheck endpoint does not detect whether the Elasticsearch client can successfully make connections, so it does not fully report service health. We should add this in a future PR.

A request to /management/healthcheck should result in an HTTP 200 response with the body:

{
    "status": "ok",
    "config": {
        "pipelineDate": "2023-03-24",
        "articlesIndex": "articles",
        "eventsIndex": "events",
        "publicRootUrl": "https://api.wellcomecollection.org/content/v0"
    }
}

The /management/healthcheck path was chosen to match other services that already provide healthcheck endpoints. The config is surfaced to make the endpoint a little more useful, since it can be used to check the setup quickly.
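As a rough illustration of the endpoint described above, here is a minimal, framework-agnostic sketch of building that response. It is not the code in this PR: the `Config` type, the `healthcheckResponse` function and the `exampleConfig` value are hypothetical names, and the field values are copied from the example body.

```typescript
// Hypothetical config shape, mirroring the fields in the example response body.
type Config = {
  pipelineDate: string;
  articlesIndex: string;
  eventsIndex: string;
  publicRootUrl: string;
};

type HealthcheckResponse = {
  statusCode: number;
  body: { status: "ok"; config: Config };
};

// Builds the response for GET /management/healthcheck.
// It only reports that the process is up and which config it loaded;
// it does not check Elasticsearch connectivity.
function healthcheckResponse(config: Config): HealthcheckResponse {
  return { statusCode: 200, body: { status: "ok", config } };
}

// Example values taken from the response body shown in this description.
const exampleConfig: Config = {
  pipelineDate: "2023-03-24",
  articlesIndex: "articles",
  eventsIndex: "events",
  publicRootUrl: "https://api.wellcomecollection.org/content/v0",
};

console.log(JSON.stringify(healthcheckResponse(exampleConfig).body, null, 4));
```

Keeping the response builder as a pure function means it can be wired into whichever HTTP framework the service uses and unit-tested without starting a server.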

This change requires a ./run_terraform.sh apply from the infrastructure directory:

Terraform will perform the following actions:

  # module.content_api_prod.aws_lb_target_group.content_api will be updated in-place
  ~ resource "aws_lb_target_group" "content_api" {
        id                                 = "arn:aws:elasticloadbalancing:eu-west-1:756629837203:targetgroup/content-api-prod/59790434c8023181"
        name                               = "content-api-prod"
        tags                               = {}
        # (15 unchanged attributes hidden)

      ~ health_check {
          + matcher             = "200"
          + path                = "/management/healthcheck"
          ~ protocol            = "TCP" -> "HTTP"
            # (6 unchanged attributes hidden)
        }

        # (2 unchanged blocks hidden)
    }

  # module.content_api_stage.aws_lb_target_group.content_api will be updated in-place
  ~ resource "aws_lb_target_group" "content_api" {
        id                                 = "arn:aws:elasticloadbalancing:eu-west-1:756629837203:targetgroup/content-api-stage/c38a4e6ae4625348"
        name                               = "content-api-stage"
        tags                               = {}
        # (15 unchanged attributes hidden)

      ~ health_check {
          + matcher             = "200"
          + path                = "/management/healthcheck"
          ~ protocol            = "TCP" -> "HTTP"
            # (6 unchanged attributes hidden)
        }

        # (2 unchanged blocks hidden)
    }

  # module.pipeline.aws_scheduler_schedule.windows will be updated in-place
  ~ resource "aws_scheduler_schedule" "windows" {
        id                           = "default/content-pipeline-windows-2023-03-24"
        name                         = "content-pipeline-windows-2023-03-24"
        # (5 unchanged attributes hidden)

      ~ target {
          ~ input    = jsonencode(
              ~ {
                  - contentType = "all"
                    # (2 unchanged attributes hidden)
                }
            )
            # (2 unchanged attributes hidden)

            # (1 unchanged block hidden)
        }

        # (1 unchanged block hidden)
    }

Plan: 0 to add, 3 to change, 0 to destroy.

How to test

How can we measure success?

No false alarms during deployment, and the healthcheck properly reports the state of the service.

Have we considered potential risks?

Changing the healthchecks changes the failure modes for the API. We should test thoroughly in stage before deploying to prod, and consider and document the impact of extending the healthcheck to fail in other situations (e.g. when Elasticsearch is unavailable).
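When testing in stage, a small check along these lines could confirm that a response from /management/healthcheck looks as expected. This is only a sketch: `checkHealthcheck` is a hypothetical helper, and it assumes the caller has already made the request and has the status code and raw body to hand.

```typescript
// Returns a list of problems found in a healthcheck response.
// An empty list means the response looks healthy.
function checkHealthcheck(statusCode: number, rawBody: string): string[] {
  const problems: string[] = [];

  if (statusCode !== 200) {
    problems.push(`expected HTTP 200, got ${statusCode}`);
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    problems.push("body is not valid JSON");
    return problems;
  }

  if (typeof parsed !== "object" || parsed === null) {
    problems.push("body is not a JSON object");
    return problems;
  }

  const body = parsed as Record<string, unknown>;
  if (body.status !== "ok") {
    problems.push(`expected status "ok", got ${JSON.stringify(body.status)}`);
  }
  if (typeof body.config !== "object" || body.config === null) {
    problems.push("missing config object");
  }
  return problems;
}
```

The target group's matcher in the plan above only looks at the status code, so the body checks here are an extra sanity check for humans or smoke tests rather than something the load balancer enforces.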