Client cannot reconnect after missing few DWA

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Make a valid connection between two peers (one acts as a client and one as a 
server)
2. Simulate a network problem without disconnecting, e.g. add a breakpoint in the 
sendDwaMessage method on the server side
3. After at least two DWAs have timed out, remove the breakpoint on the server side

What is the expected output? What do you see instead?
The expected behaviour of the client peer is to reconnect without any problem, but 
instead it falls into a vicious circle between the REOPEN and INITIAL states. After 
the first missed DWA, the client goes to the SUSPECT state (waiting for a DWA). In 
the SUSPECT state we get the next timeout and the client is switched to the REOPEN 
state. In the REOPEN state the client unsuccessfully tries to switch to the OKAY 
state: it sends a CER, which reaches the server, and the server replies with a CEA 
message. Unfortunately, the client doesn't read the CEA from the TCP socket and 
gets a timeout (the connection is reset on the server side). The client then goes 
to the REOPEN state again (this repeats until we send a kill signal to the 
client). I attached logs of this situation, and we can clearly see that the 
TCPReader thread does nothing after the CER message is sent, so the client doesn't 
know about the server's responses.
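
The loop described above can be modelled roughly as follows. This is only a simplified sketch, not JDiameter's actual state machine: the class, enum and method names are hypothetical, the intermediate INITIAL state is folded into REOPEN, and the transitions are taken from the description in this report.

```java
import java.util.Arrays;
import java.util.List;

public class WatchdogLoopSketch {
    // State names come from the report; INITIAL is folded into REOPEN (assumption).
    enum State { OKAY, SUSPECT, REOPEN }

    // Hypothetical event names used only for this illustration.
    enum Event { DWA_RECEIVED, CEA_RECEIVED, TIMEOUT }

    // One transition step of the simplified model.
    static State next(State state, Event event) {
        switch (state) {
            case OKAY:
                return event == Event.TIMEOUT ? State.SUSPECT : State.OKAY;
            case SUSPECT:
                if (event == Event.DWA_RECEIVED) {
                    return State.OKAY;
                }
                return event == Event.TIMEOUT ? State.REOPEN : State.SUSPECT;
            case REOPEN:
                // Only a CEA that is actually read gets the peer back to OKAY.
                return event == Event.CEA_RECEIVED ? State.OKAY : State.REOPEN;
            default:
                throw new IllegalStateException("unknown state " + state);
        }
    }

    // Feeds a sequence of events through the model and returns the final state.
    static State run(State start, List<Event> events) {
        State state = start;
        for (Event event : events) {
            state = next(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // The reader thread never delivers the CEA, so the client only sees timeouts.
        List<Event> ceaNeverRead = Arrays.asList(
                Event.TIMEOUT, Event.TIMEOUT, Event.TIMEOUT, Event.TIMEOUT);
        System.out.println(run(State.OKAY, ceaNeverRead));
    }
}
```

In this model the client can leave REOPEN only when a CEA is delivered to it, which matches the observation that a stuck reader thread keeps the peer cycling.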

What version of the product are you using? On what operating system?
JDiameter 1.6.0-SNAPSHOT
Linux 3.2.0-60-generic x86_64
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.4 LTS
Release:    12.04
Codename:   precise

Original issue reported on code.google.com by markiewiczr89 on 31 Mar 2014 at 9:36

Attachments:

jdiameter-logs.txt

GoogleCodeExporter commented 9 years ago

Can be fixed by adding an interrupt call:

--- a/core/jdiameter/impl/src/main/java/org/jdiameter/client/impl/transport/tcp/TCPTransportClient.java
+++ b/core/jdiameter/impl/src/main/java/org/jdiameter/client/impl/transport/tcp/TCPTransportClient.java
@@ -231,6 +231,7 @@ public class TCPTransportClient implements Runnable {
       socketChannel.close();
     }
     if (selfThread != null) {
+      selfThread.interrupt();
       selfThread.join(100);
     }
     clearBuffer();
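
As a standalone illustration of why the interrupt helps (not JDiameter code; the class and method names are hypothetical): a thread blocked in `Selector.select()` returns as soon as it is interrupted, so a following `join` no longer waits on a thread that never wakes up. A minimal sketch, assuming the reader thread is blocked in a plain `select()`:

```java
import java.io.IOException;
import java.nio.channels.Selector;

public class InterruptSelectDemo {
    // Returns true if a thread blocked in select() terminates after interrupt().
    static boolean interruptWakesSelect() throws IOException, InterruptedException {
        try (Selector selector = Selector.open()) {
            Thread reader = new Thread(() -> {
                try {
                    // No channels are registered, so this blocks until woken up.
                    selector.select();
                } catch (IOException e) {
                    // Not expected in this demo; the thread simply ends.
                }
            });
            reader.start();
            Thread.sleep(200);   // give the reader time to block in select()
            reader.interrupt();  // wakes the blocked select()
            reader.join(2000);   // should return well before the 2 s limit
            return !reader.isAlive();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader terminated: " + interruptWakesSelect());
    }
}
```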

Original comment by koza...@gmail.com on 5 Jun 2014 at 5:47

GoogleCodeExporter commented 9 years ago

I suffer from the same issue.

I noticed that the TCPReader thread gets stuck when I disconnect from a peer. It 
seems to work fine on Windows, but for some reason on CentOS it keeps waiting 
most of the time.

I also confirm the proposed fix works like a charm.

Thanks !

Original comment by nicolas....@gmail.com on 10 Jun 2014 at 1:43

GoogleCodeExporter commented 9 years ago

Forgot to mention, I'm using jdiameter 1.5.10.0-build639

Original comment by nicolas....@gmail.com on 10 Jun 2014 at 1:44

GoogleCodeExporter commented 9 years ago

Adding the interrupt call causes other issues: because of the interruption, the 
DISCONNECT_EVENT is not put into the queue, and that causes side effects when 
trying to reconnect. The problem can be seen in the 
org.mobicents.diameter.stack.StackReConnectionTest.testClientReconnectOnServerRestart() 
test.
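
The thread does not say how exactly the event gets lost. One possible mechanism, shown here purely as an assumption and not taken from the JDiameter sources, is that a blocking queue operation refuses to run on a thread whose interrupt flag is set. The class name, method name and queue below are hypothetical:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InterruptedPutDemo {
    // Returns how many events ended up in the queue when put() is called
    // on a thread whose interrupt flag is already set.
    static int eventsQueuedWhileInterrupted() {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        Thread.currentThread().interrupt(); // simulate the interrupted reader thread
        try {
            // put() acquires its lock interruptibly, so it throws here
            // instead of enqueuing the element.
            events.put("DISCONNECT_EVENT");
        } catch (InterruptedException e) {
            // The event is silently lost unless the caller retries.
        } finally {
            Thread.interrupted(); // clear the flag so the caller is unaffected
        }
        return events.size();
    }

    public static void main(String[] args) {
        System.out.println("events queued: " + eventsQueuedWhileInterrupted());
    }
}
```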

Original comment by brainslog on 4 Jul 2014 at 3:07

Changed state: Accepted
Added labels: Priority-High, OpSys-All, Component-DIAMETER-Stack, DIAMETER-1.6.0.FINAL
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Just for confirmation, we ran into the same problem recently. I also have a network 
trace showing that the server replies with a successful CEA, but the client never 
reads it from the socket, which leads to a TIMEOUT event.

Original comment by atouloupis on 4 Jul 2014 at 12:08

GoogleCodeExporter commented 9 years ago

Hi,

So is this issue fixed or not? If yes, where can I find the sources?

thank you.

Original comment by sirius....@gmail.com on 22 Jul 2014 at 10:46

GoogleCodeExporter commented 9 years ago

Good day

I guess that, to avoid the issue with DISCONNECT_EVENT, you should not add 
selfThread.interrupt(); to the stop() method of 
org.jdiameter.client.impl.transport.tcp.TCPTransportClient.

Instead, in the run() method (class TCPTransportClient), in the main loop:

 while (!stop) {
        selector.select();
        ...

use the select method with a timeout, like this:
        selector.select(1000);

In my tests, after this fix the connection is re-initialized normally after 
missing DWAs, and putting DISCONNECT_EVENT into the queue seems to work normally.
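
A minimal standalone sketch of this idea (not the actual TCPTransportClient code; the class, method and flag names are hypothetical): with `select(timeout)` the loop re-checks its stop flag at least once per timeout, so the thread can finish without being interrupted. Note that the timeout must be positive, since `select(0)` blocks indefinitely.

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicBoolean;

public class SelectTimeoutDemo {
    // Returns true if the reader loop exits after the stop flag is set,
    // without any interrupt() or wakeup() call. timeoutMs must be > 0.
    static boolean stopsWithoutInterrupt(long timeoutMs)
            throws IOException, InterruptedException {
        AtomicBoolean stop = new AtomicBoolean(false);
        try (Selector selector = Selector.open()) {
            Thread reader = new Thread(() -> {
                try {
                    while (!stop.get()) {
                        // Returns after at most timeoutMs even if nothing is ready.
                        selector.select(timeoutMs);
                        // A real reader would process the ready keys here.
                        selector.selectedKeys().clear();
                    }
                } catch (IOException e) {
                    // Not expected in this demo; the thread simply ends.
                }
            });
            reader.start();
            Thread.sleep(100);  // let the reader enter select()
            stop.set(true);     // request shutdown through the flag only
            reader.join(timeoutMs + 2000);
            return !reader.isAlive();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader stopped: " + stopsWithoutInterrupt(500));
    }
}
```

The trade-off is latency: in the worst case the thread notices the stop request only after one full timeout, which is why a shorter value shortens shutdown at the cost of more frequent wake-ups.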

Please comment.

Thank you

Original comment by sirius....@gmail.com on 30 Jul 2014 at 12:54

GoogleCodeExporter commented 9 years ago

I tried the solution with a timeout set on the Selector and... the peers 
reconnect without any side effects. I mean the number of TCP sockets that were 
created in the REOPEN<->INITIAL loop went back to normal and messages were 
exchanged properly. Thanks for this :)

In my opinion, it is an acceptable answer, but I will wait for a response from 
brainslog.

Original comment by markiewiczr89 on 5 Aug 2014 at 1:40

GoogleCodeExporter commented 9 years ago

I agree that select should be used with a timeout; it's a much better 
solution. But I'd use a smaller value: 1 second is a long time, so I'm using 
500 ms instead. The tests also confirm it is OK, so I will commit this as the fix.

Would any of you who were able to replicate the problem be able to write a 
regression test (based on the ones in 
org.mobicents.diameter.stack.StackReConnectionTest)? It would be greatly 
appreciated.

Original comment by brainslog on 21 Aug 2014 at 1:10

GoogleCodeExporter commented 9 years ago

This issue was closed by revision 604e8e9c9184.

Original comment by brainslog on 21 Aug 2014 at 1:16

Changed state: Fixed

xinbc / jdiameter

Client cannot reconnect after missing few DWA #56