Open mariolamassaavedra opened 5 months ago
Temporary workaround is to just disable the cronjob, as Exchange metadata is unlikely to be a retention concern.
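As a rough sketch of the workaround, a Kubernetes CronJob can be paused by setting `spec.suspend` to `true`. The helper below only builds the `kubectl patch` command; it does not run it. The function name and the `kingdom` namespace are assumptions for illustration, and the cronjob name is the one mentioned in this issue.

```python
import json


def suspend_cronjob_command(name, namespace, suspend=True):
    """Build a kubectl command that sets spec.suspend on a CronJob."""
    # CronJob.spec.suspend stops new runs from being scheduled;
    # it does not touch Jobs that are already running.
    patch = json.dumps({"spec": {"suspend": suspend}})
    return [
        "kubectl", "patch", "cronjob", name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", patch,
    ]


if __name__ == "__main__":
    # "kingdom" is a hypothetical namespace; substitute the real one.
    print(" ".join(suspend_cronjob_command("exchanges-deletion-cronjob", "kingdom")))
```

Passing `suspend=False` builds the command that re-enables the schedule once the underlying issue is resolved.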
Things to look at:
Considering this low priority as the Kingdomless exchange architecture will render this cronjob moot.
Describe the bug
On Monday, June 10th, all calls to the Kingdom’s DB halted with the error: DEADLINE_EXCEEDED ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=/10.X.X.X:8443]]
This impacted all calls to the Kingdom, including the reporting server, panel exchange, herald, and direct requisitions.
The deployment didn’t show any memory or CPU constraints
However, when analysing the Kingdom DB (Spanner), it is evident that “something” occurred around 7:40 AM UK time, when processing stopped and latency went up.
At that time (7:40 AM) the exchanges-deletion-cronjob triggered on its usual schedule. All other cron jobs failed because they could not connect to the DB.
The issue persisted until the Data Server Deployment Pods were deleted and recreated. Afterwards, all calls worked fine.
Note: the Kingdom was only receiving PX traffic at this time, and this has only happened once, so the issue seems to be sporadic.
Steps to reproduce
Component(s) affected
Kingdom
Version
0.4.4
Environment
Origin PRD
Additional context
Spanner is configured with 500 processing units (PUs).
Various PX-related queries can be seen scanning 11K rows and returning 0 rows.
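One way to follow up on the scan-heavy queries is to pull Spanner's built-in query statistics and flag statements that scan many rows but return almost none. The sketch below is an assumption about how that triage could look, not something taken from the Kingdom code: `QueryStat`, `find_wasteful_queries`, and both thresholds are hypothetical. The SQL reads the `SPANNER_SYS.QUERY_STATS_TOP_HOUR` statistics table and would need to be run with whatever Spanner client is at hand.

```python
from dataclasses import dataclass

# Query against Spanner's built-in statistics table. Run it with any
# Spanner client and map each result row onto a QueryStat.
QUERY_STATS_SQL = """
SELECT TEXT, EXECUTION_COUNT, AVG_ROWS, AVG_ROWS_SCANNED
FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
ORDER BY AVG_ROWS_SCANNED DESC
"""


@dataclass(frozen=True)
class QueryStat:
    text: str
    execution_count: int
    avg_rows: float          # average rows returned per execution
    avg_rows_scanned: float  # average rows scanned per execution


def find_wasteful_queries(stats, min_scanned=1000.0, max_ratio=0.01):
    """Return queries that scan many rows but return very few.

    A query is flagged when it scans at least `min_scanned` rows on
    average and its returned/scanned ratio is at most `max_ratio`.
    Both thresholds are illustrative defaults.
    """
    flagged = [
        s for s in stats
        if s.avg_rows_scanned >= min_scanned
        and s.avg_rows / s.avg_rows_scanned <= max_ratio
    ]
    # Rank by total rows scanned so the costliest statements come first.
    return sorted(
        flagged,
        key=lambda s: s.avg_rows_scanned * s.execution_count,
        reverse=True,
    )
```

Queries flagged this way are candidates for a missing or unused index, which would be worth checking against the query plans before drawing conclusions.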