world-federation-of-advertisers / cross-media-measurement


Exchanges deletion cronjob exhausts DB connections #1659

Open mariolamassaavedra opened 3 weeks ago

mariolamassaavedra commented 3 weeks ago

Describe the bug

On Monday June 10th, all calls to the Kingdom’s DB halted, triggering the error:

DEADLINE_EXCEEDED ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=/10.X.X.X:8443]]

This impacted all calls to the Kingdom, including those from the reporting server, panel exchange, herald and direct requisitions.

The deployment didn’t show any memory or CPU constraints.


However, when analysing the Kingdom DB (Spanner), it is evident that something occurred around 7:40 AM UK time: processing stopped and latency went up.


At that time (7:40 AM) the exchanges-deletion-cronjob triggered on its usual schedule. All other cron jobs then failed because they could not connect to the DB.


The issue persisted until the Data Server Deployment Pods were deleted and recreated. Afterwards, all calls worked fine.

Note: the Kingdom was only receiving PX traffic at this time, and this is the only time the problem has happened, so the issue seems to be sporadic.
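To illustrate the failure mode the issue title suggests, here is a minimal sketch of how one job that holds every pooled session can make unrelated callers hit their deadline. It is not the Kingdom's code: `SessionPool`, `simulate` and the pool size are hypothetical names and values, and whether the cronjob actually exhausted the pool this way is an assumption, not something confirmed above.

```python
import queue


class SessionPool:
    """Hypothetical fixed-size pool standing in for a DB client's session pool."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"session-{i}")

    def acquire(self, deadline_seconds):
        # Wait for a free session, but give up once the deadline passes.
        try:
            return self._free.get(timeout=deadline_seconds)
        except queue.Empty:
            raise TimeoutError("DEADLINE_EXCEEDED: no free session") from None

    def release(self, session):
        self._free.put(session)


def simulate(pool_size=4):
    pool = SessionPool(pool_size)

    # A misbehaving job takes every session and never returns them.
    held = [pool.acquire(0.01) for _ in range(pool_size)]

    # Any other caller now waits until its deadline and fails.
    try:
        pool.release(pool.acquire(0.01))
        other_caller = "ok"
    except TimeoutError:
        other_caller = "deadline exceeded"

    # A fresh pool (loosely analogous to recreating the pods) serves callers again.
    fresh_pool = SessionPool(pool_size)
    fresh_pool.release(fresh_pool.acquire(0.01))
    after_restart = "ok"

    return len(held), other_caller, after_restart
```

In this sketch the second caller fails only because nothing is ever released, which would be consistent with the observation that the outage lasted until the pods were recreated.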

Steps to reproduce

  1. Trigger the PX clean-up cronjob.

Component(s) affected

Kingdom

Version

0.4.4

Environment

Origin PRD

Additional context

Spanner config is set to 500 PUs.

Various PX-related queries can be seen scanning 11K rows and returning 0 rows.


SanjayVas commented 3 weeks ago

Temporary workaround is to just disable the cronjob, as Exchange metadata is unlikely to be a retention concern.

Things to look at:

  1. Reducing the batch size.
  2. Investigating the performance of the queries/statements run by the internal service.
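As a rough illustration of the first item, the sketch below deletes expired records in small, bounded batches, so each call does a limited amount of work before returning. It assumes nothing about the actual internal service: `delete_in_batches`, `InMemoryExchangeStore`, `fetch_expired` and `delete` are hypothetical names, and the in-memory store only stands in for the database.

```python
from typing import Callable, List


def delete_in_batches(
    fetch_expired: Callable[[int], List[str]],
    delete: Callable[[List[str]], None],
    batch_size: int,
    max_batches: int,
) -> int:
    """Deletes expired items at most batch_size at a time; returns the count deleted."""
    deleted = 0
    for _ in range(max_batches):
        batch = fetch_expired(batch_size)  # short read, bounded by batch_size
        if not batch:
            break  # nothing left to delete
        delete(batch)  # short write covering only this batch
        deleted += len(batch)
    return deleted


class InMemoryExchangeStore:
    """Hypothetical stand-in for the database, used only to exercise the loop."""

    def __init__(self, expired_ids):
        self._expired = list(expired_ids)
        self.batch_sizes = []  # records how large each delete call was

    def fetch_expired(self, limit):
        return self._expired[:limit]

    def delete(self, ids):
        self.batch_sizes.append(len(ids))
        doomed = set(ids)
        self._expired = [i for i in self._expired if i not in doomed]


store = InMemoryExchangeStore(f"exchange-{n}" for n in range(25))
total = delete_in_batches(store.fetch_expired, store.delete, batch_size=10, max_batches=100)
```

The `max_batches` cap is a design choice in this sketch: it bounds how long a single cronjob run can keep working, leaving the remainder for the next scheduled run.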

Considering this low priority as the Kingdomless exchange architecture will render this cronjob moot.