zeromq / malamute

The ZeroMQ Enterprise Messaging Broker
Mozilla Public License 2.0

Drop the settling state #308

Closed: michal42 closed this 6 years ago

michal42 commented 6 years ago

We have proper memory management since commit ee93e46f4bed ("Properly handle pending stream traffic after client disconnect"), so there is no need for the settle period.

bluca commented 6 years ago

There's a fairly big backtrace in the Valgrind test, as if things were not cleaned up properly at exit. Could you please have a look?

michal42 commented 6 years ago

Yes, I will have a look. It seems to be a race that I'm unable to trigger locally.

bluca commented 6 years ago

If it happened before, it doesn't need to block this PR; only if it's new.

michal42 commented 6 years ago

It didn't happen before I rebased with the zproject update. But since I'm removing a hack that was supposed to paper over race conditions, the bug is quite possibly linked to my change.

michal42 commented 6 years ago

My guess is that the leak is caused by mlm_msg object pointers in flight either from the server engine to the stream engine or back, when the server is shut down. But I still haven't reproduced it.

michal42 commented 6 years ago

I finally had time to look into it. I can reliably trigger the leak with this test:

diff --git a/src/mlm_server.c b/src/mlm_server.c
index 7a5f323a3045..22273e58f542 100644
--- a/src/mlm_server.c
+++ b/src/mlm_server.c
@@ -1212,6 +1212,34 @@ mlm_server_test (bool verbose)
         zactor_destroy (&server);
     }

+    {
+        const char *endpoint = "inproc://mlm_server_memleak_with_pending_stream_traffic";
+        zactor_t *server = zactor_new (mlm_server, "mlm_server_memleak_with_pending_stream_traffic");
+        if (verbose) {
+            zstr_send (server, "VERBOSE");
+            printf("Regression test for a memleak with pending stream traffic after server destroy\n");
+        }
+        zstr_sendx (server, "BIND", endpoint, NULL);
+
+        mlm_client_t *producer = mlm_client_new ();
+        assert (mlm_client_connect (producer, endpoint, 1000, "producer") >= 0);
+        assert (mlm_client_set_producer (producer, "STREAM_TEST") >= 0);
+
+        zstr_sendx (server, "SLOW_TEST_MODE", NULL);
+
+        mlm_client_t *consumer = mlm_client_new ();
+        assert (mlm_client_connect (consumer, endpoint, 1000, "consumer") >= 0);
+        assert (mlm_client_set_consumer (consumer, "STREAM_TEST", ".*") >= 0);
+
+        zmsg_t *msg = zmsg_new ();
+        zmsg_addstr (msg, "world");
+        assert (mlm_client_send (producer, "hello", &msg) >= 0);
+
+        mlm_client_destroy (&consumer);
+        mlm_client_destroy (&producer);
+        zactor_destroy (&server);
+    }
+
     //  @end
     printf ("OK\n");
 }

And I can trigger it both with and without this PR. The fix for the leak would be to synchronize with the stream engine instances during server shutdown and free any messages not yet dispatched. I can look into that, but could you please trigger a new Travis run for this PR and, if it is green, merge it?

michal42 commented 6 years ago

Hrm, it's failing in zmq shutdown this time? I'm lost. Does master pass Travis?

bluca commented 6 years ago

It does not, so I'll merge this.

bluca commented 6 years ago

You can ignore the shutdown error. I'll have a look at it; it's probably a regression in libzmq.

michal42 commented 6 years ago

Thank you! I will look into the race condition during stream shutdown when I have time again.

bluca commented 6 years ago

@michal42 the regression has now been fixed in CZMQ