yukimochi / Activity-Relay

Yet another powerful customizable ActivityPub relay server written in Go.
https://relay.toot.yukimochi.jp/
GNU Affero General Public License v3.0

Worker suddenly stops after many network errors. #32

Closed. shleeable closed this issue 3 years ago.

shleeable commented 4 years ago

CONTAINER ID   NAME                      CPU %    MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O   PIDS
ce998e2fa975   activity-relay_redis_1    89.93%   6.137GiB / 7.789GiB   78.79%   3.86kB / 2.59kB   0B / 0B     4
6a19df2a4d29   activity-relay_worker_1   0.00%    4.816MiB / 7.789GiB   0.06%    2.35kB / 1.25kB   0B / 0B     13
fd959228ee4e   activity-relay_server_1   0.00%    4.77MiB / 7.789GiB    0.06%    2.4kB / 1.86kB    0B / 0B     13

Activity-Relay is killing my instance. I'm running it in Docker, and shortly after booting it fills up all of the memory.

shleeable commented 4 years ago

root@localhost:/home/shlee/Activity-Relay/redisdata# ls -lh
total 2.1G
-rw-r--r-- 1 999 shlee 1.5G Apr 11 10:15 dump.rdb
-rw-r--r-- 1 999 shlee 592M Apr 11 10:20 temp-1.rdb

The Redis dump seems to be massive.

shleeable commented 4 years ago

How should I recover from this?

yukimochi commented 4 years ago

Sorry for the inconvenience.

For now, here is how to recover from the oversized Redis database.

Please delete the relay key from Redis:

  1. Start Redis on a machine with plenty of memory, using your dump.rdb.
  2. Using redis-cli, execute del relay.
  3. Replace dump.rdb on your server with the cleaned one.

Note: the relay key contains all incoming job information. Deleting it does not affect subscriptions.
Note: please take a backup of dump.rdb before doing this.
Note: matching Redis versions is important. Please check them.
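
The recovery steps above could be scripted roughly as follows. This is only a sketch under assumptions: the key name relay comes from this thread, while the INFO server version check and the explicit SAVE are additions made here and were not listed by the maintainer. The helper recoverySteps is a hypothetical name. The program only prints the redis-cli commands unless it is started with the argument run, in which case it executes them against whatever server redis-cli connects to by default.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// recoverySteps returns the redis-cli argument lists for the cleanup, in order.
// Hypothetical helper: INFO and SAVE are assumptions added for this sketch.
func recoverySteps(key string) [][]string {
	return [][]string{
		{"INFO", "server"}, // output includes redis_version, for the version check
		{"DEL", key},       // removes the queued jobs; per the thread, subscriptions are unaffected
		{"SAVE"},           // writes the cleaned dataset back to the dump file
	}
}

func main() {
	// Dry run by default. Pass "run" to actually execute each command.
	run := len(os.Args) > 1 && os.Args[1] == "run"
	for _, args := range recoverySteps("relay") {
		fmt.Println("redis-cli " + strings.Join(args, " "))
		if !run {
			continue
		}
		out, err := exec.Command("redis-cli", args...).CombinedOutput()
		if err != nil {
			fmt.Fprintln(os.Stderr, "step failed:", err)
			os.Exit(1)
		}
		fmt.Print(string(out))
	}
}
```

Run it on the large-memory machine from step 1, against a copy of dump.rdb, so the original file stays available as the backup mentioned in the notes.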

shleeable commented 4 years ago

I apologise for the lack of any error messages. If there were any errors, they were not recorded recently.

Update: Thank you, my relay is back online.

yukimochi commented 4 years ago

In my investigation, the worker sometimes stops working after many network errors. Jobs then pile up in Redis, which causes the database size to explode.

shleeable commented 4 years ago

> In my investigation, the worker sometimes stops working after many network errors. This causes the Redis database size to explode.

I believe that might be the cause, my relay has been active for a while, and there are lots of network errors.

Would you consider dropping dead instances automatically? They could rejoin the relay once they are back online.
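
One way the suggestion above might look is a counter of consecutive delivery failures per subscriber, with a drop once a threshold is reached. This is a minimal sketch, not Activity-Relay's actual behavior: FailureTracker, NewFailureTracker, Record, errUnreachable, and the threshold of 3 are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnreachable stands in for any network error returned by a delivery attempt.
var errUnreachable = errors.New("dial tcp: connection refused")

// FailureTracker counts consecutive delivery failures per subscriber domain.
// Hypothetical type; it is not part of Activity-Relay.
type FailureTracker struct {
	mu       sync.Mutex
	limit    int
	failures map[string]int
}

// NewFailureTracker creates a tracker that flags a domain after limit
// consecutive failures.
func NewFailureTracker(limit int) *FailureTracker {
	return &FailureTracker{limit: limit, failures: make(map[string]int)}
}

// Record notes the outcome of one delivery attempt and reports whether the
// subscriber has now reached the failure limit and should be dropped.
func (t *FailureTracker) Record(domain string, err error) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if err == nil {
		delete(t.failures, domain) // any success resets the streak
		return false
	}
	t.failures[domain]++
	return t.failures[domain] >= t.limit
}

func main() {
	tracker := NewFailureTracker(3)
	for i := 1; i <= 3; i++ {
		drop := tracker.Record("dead.example", errUnreachable)
		fmt.Printf("attempt %d: drop=%v\n", i, drop)
	}
}
```

Counting consecutive failures rather than total failures means an instance with occasional network hiccups is never dropped, while one that stays down is removed after a bounded number of attempts and can subscribe again later.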

shleeable commented 3 years ago

I've updated to v0.2.8, but this issue has returned again. Boo.

yukimochi commented 3 years ago

v0.2.9 and v1.0.0rc1 use the latest machinery and new parameters. It is now running on my relay server, where I am testing its performance. So far it performs well.

yukimochi commented 3 years ago

I have confirmed that v1.0.0rc2 is stable enough for production.