wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/

Persistent 500 errors coming from the catalogue API #4145

Closed: alexwlchan closed this issue 4 years ago

alexwlchan commented 4 years ago

I just tried to find dinosaur pictures on the Wellcome Collection site: https://wellcomecollection.org/works?page=1&query=dinosaur

500 error. Sadness.

Inspecting the API Gateway metrics (eu-west-1, last 24 hours), I see a whole stack of 5XX errors that started being thrown just after 2pm today:

Screenshot 2020-01-08 at 23 36 57
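
For anyone who wants to pull the same numbers without the console, here is a rough sketch of the equivalent CloudWatch query. The `AWS/ApiGateway` namespace, the `5XXError` metric and the `ApiName`/`Stage` dimensions are standard; the function names are made up for illustration, and the API name and stage you pass in are placeholders, not the real values for this deployment.

```python
from datetime import datetime, timedelta, timezone


def build_5xx_query(api_name, stage, now=None, period_seconds=300):
    """Build keyword arguments for CloudWatch's get_metric_statistics.

    api_name and stage are supplied by the caller; nothing here knows
    the real values for the catalogue API.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=1)  # same one-day window as the console view
    return {
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,  # must be a multiple of 60
        "Statistics": ["Sum"],
    }


def fetch_5xx_datapoints(cloudwatch, api_name, stage, now=None):
    """Run the query with any client exposing get_metric_statistics."""
    response = cloudwatch.get_metric_statistics(
        **build_5xx_query(api_name, stage, now)
    )
    return response["Datapoints"]


def first_error_time(datapoints):
    """Return the earliest timestamp with a non-zero 5XX count, or None."""
    failing = [p["Timestamp"] for p in datapoints if p["Sum"] > 0]
    return min(failing) if failing else None
```

With boto3 installed and credentials configured, the client would be `boto3.client("cloudwatch", region_name="eu-west-1")`, and `first_error_time(fetch_5xx_datapoints(client, api_name, stage))` gives a timestamp to compare against the deployment time.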

If you look at the sidecar logs for the API, you can see the queries that were causing the issue. Based on limited experiments, the errors are deterministic rather than flaky: a given query either always succeeds or always fails.

e.g. I can search for “dinosaur” and always get a 500, whereas searching for “dinosaurs” always succeeds
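
A quick way to check the "always fails vs. flaky" claim for a given search term is to repeat the request a few times and compare status codes. This is only a sketch: the endpoint URL below is my assumption about where the public catalogue API lives, and `get_status` and `classify` are hypothetical helper names, not anything in the platform repo.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the catalogue API; adjust if the real path differs.
API_URL = "https://api.wellcomecollection.org/catalogue/v2/works"


def get_status(query):
    """Return the HTTP status code for a single works search."""
    url = f"{API_URL}?{urlencode({'query': query})}"
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is on the exception
        return err.code


def classify(query, attempts=5, fetch=get_status):
    """Repeat a query and label it 'always ok', 'always failing' or 'flaky'."""
    statuses = [fetch(query) for _ in range(attempts)]
    failures = [s for s in statuses if s >= 500]
    if not failures:
        return "always ok"
    if len(failures) == attempts:
        return "always failing"
    return "flaky"
```

Running `classify("dinosaur")` and `classify("dinosaurs")` side by side should reproduce the behaviour described above if the errors really are per-query. The `fetch` parameter is there so the logic can be exercised without hitting the live API.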

The event log for the ECS task shows the API was redeployed at 2pm today, around when the errors started cropping up.

The release ID for the new API is https://github.com/wellcometrust/catalogue/commit/6992489c34e2713eaf4f2075322d41154de47a6e, which is a merge commit for https://github.com/wellcometrust/catalogue/pull/338. I can’t work out what code was running prior to this change (the old tasks have been cleaned up by ECS, and I cba to dig through the releases table).

Two late-night thoughts:

alexwlchan commented 4 years ago

Pretty sure this is now fixed.