Publish and subscribe stops working after some time without restarts

snormore commented 1 year ago

It's looking like publish/subscribe stops forwarding messaging after some time, or after some event, it's not clear why yet and needs to be investigated. Query seems to continue working when this happens, but no messages are being published/subscribed.

We're seeing CPU spike on the nodes pretty soon after deployment, and then at some point later of no restarts (days or weeks), we're seeing publish/subscribe no longer work. Around the same time (or before) we see CPU drop back down to almost 0%. We can see this happening in the e2e tests too, they start to fail. Manually testing also shows non functioning publish/subscribe.

We've been seeing this most often on the dev network, but it also happened last night on prod for the first time. In order to remedy until we are back from holidays, I restarted the containers manually via AWS console, but this really needs to be investigated properly and fixed.

We can see in the e2e tests that production was in this state between midnight and 4:30am ET on January 2, so effectively down from a user's perspective, unable to publish or subscribe:

Screenshot 2023-01-02 at 4 45 02 AM

Looking back at the node CPU usage metrics, it seems like it may have started happening around November 24:

Screenshot 2023-01-02 at 11 31 25 AM

snormore commented 1 year ago

I looked back at some of the PRs merged and deployed around Nov 24, and did some testing around those changes. It looks like it's related to the subscribe-all update.

I'm able to reproduce the increase and non-reclaiming of CPU usage with the following:

dev/up
Watch CPU usage via docker stats
curl -sS -XPOST localhost:8080/message/v1/subscribe-all
Ctrl-C out of the curl
Notice that CPU usage has increased significantly (> 100%) after closing the subscribe-all request, and doesn't ever seem to decrease from there

Note that you don't even need to have e2e tests running for this to happen, no messages flowing through the system at all, and we still see the behaviour.

After a bit of time in this state, I can see publish/subscribe start to fail via e2e tests.

neekolas commented 1 year ago

Great find.

I don't believe that we call relaySub.Unsubscribe() anywhere else in our code or go-waku (other than as part of shutdown), so maybe there is a bug somewhere with that?

I've gotta take off in a bit, but maybe some sort of lock contention on the subscription.Lock()?

neekolas commented 1 year ago

One other piece of context that might explain why this has only been happening in dev: we only run the notification-server in dev right now, and that is the only code that uses SubscribeAll. But the code is open source, so maybe someone spun up an instance connected to the prod network last night.

snormore commented 1 year ago

I think I figured out what's happening: https://github.com/xmtp/go-waku/pull/34

snormore commented 1 year ago

That waku PR breaks our code, after the first unsubscribe from subscribe-all no more subscriptions work. I think it's because we're subscribing to the same pubsub topic multiple times and go-waku relay is using a map to track it by topic name so it's just overwriting the previous; the go-waku code is a bit of a mess tbh. I'm going to close it and just update our code to reuse the existing single waku relay subscription, so we're not subscribing to the same pubsub topic multiple times.

snormore commented 1 year ago

https://github.com/xmtp/xmtp-node-go/pull/199

snormore commented 1 year ago

Noticable increase in e2e rate since the fix was deployed (test duration decreased, faster response times):

Screenshot 2023-01-03 at 10 32 08 AM

snormore commented 1 year ago

Also very noticable decrease in query/publish request duration metrics since:

Screenshot 2023-01-03 at 12 22 41 PM

We should maybe even have an alert on these, 5-7s queries went unnoticed for a couple months.

neekolas commented 1 year ago

➕ on the alert

xmtp / xmtp-node-go

Publish and subscribe stops working after some time without restarts #198