I just tried to find dinosaur pictures on the Wellcome Collection site: https://wellcomecollection.org/works?page=1&query=dinosaur
500 error. Sadness.
Inspecting the API Gateway metrics in CloudWatch (namespace AWS/ApiGateway, dimensions ApiName and Stage, region eu-west-1, over the last day), I see a whole stack of 5XX errors that started being thrown just after 2pm today.
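If anyone wants to re-plot this without clicking through the console, here's a rough boto3 sketch of the same metrics query – the ApiName and Stage values are guesses, so substitute whatever the real API is called:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApiGateway",
    MetricName="5XXError",
    Dimensions=[
        {"Name": "ApiName", "Value": "catalogue-api"},  # assumed name
        {"Name": "Stage", "Value": "prod"},             # assumed stage
    ],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)

# Datapoints come back unordered, so sort by timestamp before printing.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```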
If you look at the sidecar logs for the API, you can see the queries that were causing the issue. Based on limited experiments, the errors are deterministic – a given query either always succeeds or always fails; they're not flaky (see the probe sketch below).
e.g. I can search for “dinosaur” and always get a 500, whereas searching for “dinosaurs” always succeeds.
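A quick way to check the "deterministic, not flaky" claim is to hammer both queries a few times and compare status codes. This is just a sketch – the URL is my guess at the public works endpoint, so adjust if the API is served elsewhere:

```python
import requests

# Assumed endpoint for the works API – adjust if it lives elsewhere.
API_URL = "https://api.wellcomecollection.org/catalogue/v2/works"

for query in ["dinosaur", "dinosaurs"]:
    # Repeat each query a few times; a deterministic failure
    # should produce exactly one distinct status code.
    statuses = {
        requests.get(API_URL, params={"query": query}).status_code
        for _ in range(5)
    }
    print(f"{query!r}: {sorted(statuses)}")
```

If each query prints a single status code (500 for “dinosaur”, 200 for “dinosaurs”), that confirms the failures are deterministic rather than flaky.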
The event log for the ECS task shows the API was redeployed at 2pm today, around when the errors started cropping up.
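The same events are visible via boto3 if you don't want to dig through the console – the cluster and service names here are placeholders, not the real ones:

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# "catalogue-api" / "api" are placeholder names – use the real cluster/service.
resp = ecs.describe_services(cluster="catalogue-api", services=["api"])

# Events come back most-recent-first; the 2pm deployment should show up here.
for event in resp["services"][0]["events"][:20]:
    print(event["createdAt"], event["message"])
```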
The release ID for the new API is https://github.com/wellcometrust/catalogue/commit/6992489c34e2713eaf4f2075322d41154de47a6e, which is a merge commit for https://github.com/wellcometrust/catalogue/pull/338. I can’t work out what code was running prior to this change (the old tasks have been cleaned up by ECS, and I can’t be bothered to dig through the releases table).
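One possible shortcut: old task definition revisions usually stick around even after the tasks themselves are cleaned up, and each revision records the container image it ran. A sketch, assuming the task definition family is named something like the service:

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# "catalogue-api" is a guess at the task definition family name.
arns = ecs.list_task_definitions(
    familyPrefix="catalogue-api", sort="DESC", maxResults=5
)["taskDefinitionArns"]

# Print the image for each recent revision; the revision before the
# 2pm deploy should tell us what was running previously.
for arn in arns:
    td = ecs.describe_task_definition(taskDefinition=arn)["taskDefinition"]
    for container in td["containerDefinitions"]:
        print(arn.rsplit("/", 1)[-1], container["image"])
```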
Two late-night thoughts:
Were we getting alerts for these errors? If so, where? I had a look in Slack but couldn't see any. (And if we did get alerts, why wasn't the change fixed or rolled back?)
We definitely used to get Slack alerts about 500 errors from the API, but maybe that was lost when it moved to the catalogue account?
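One way to settle this without trawling Slack history: list the CloudWatch alarms in the catalogue account and see whether anything is actually watching the 5XX metric. If this prints nothing, the alarm probably didn't survive the account move:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Look for any alarm wired up to the API Gateway 5XXError metric.
paginator = cloudwatch.get_paginator("describe_alarms")
for page in paginator.paginate():
    for alarm in page["MetricAlarms"]:
        if alarm.get("MetricName") == "5XXError":
            print(alarm["AlarmName"], alarm["StateValue"], alarm["AlarmActions"])
```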
Where are the API logs? I tried to find them in CloudWatch – nothing. I tried to log into the logging cluster in Kibana, and got a JSON error:
{"statusCode":401,"error":"Unauthorized","message":"[security_exception] unable to authenticate user [<OIDC Token>] for action [cluster:admin/xpack/security/oidc/authenticate], with { header={ WWW-Authenticate={ 0=\"Bearer realm=\\\"security\\\"\" & 1=\"ApiKey\" & 2=\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\" } } }"}
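That 401 is the OIDC flow failing, not necessarily the cluster being down – the error lists ApiKey and Basic as alternative auth schemes. A quick sanity check is to hit the Elasticsearch authenticate endpoint with basic auth and see whether that path still works; the URL and credentials below are placeholders:

```python
import requests

# Placeholder cluster URL and credentials – substitute the real ones.
resp = requests.get(
    "https://logging-cluster.example.com:9243/_security/_authenticate",
    auth=("elastic", "changeme"),
)
print(resp.status_code, resp.text)
```

A 200 here would mean the cluster is fine and only the Kibana OIDC integration is broken, which narrows down where to look next.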