xinbc / jdiameter

Automatically exported from code.google.com/p/jdiameter
0 stars 0 forks source link

Diameter Stack: Concurrency Problem by handling CER #18

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Reported by angel.t.angelov, Jul 11, 2012:

Hi, I used Diameter stack on slow solaris server without problem.
I migrated to more powerful Linux server and I got problems with the CER-CEA.
That mean, we have obviously concurrency problem.

To find the problem I put some new debug lines into 
org.jdiameter.client.impl.fsm.PeerFSMImpl (see the file attached)

See attached the full server log (server.zip) and the extract of this log 
regarding the event handling and executor thread (trace.txt)

Let take one event from trace.txt
For example: Event{name:CER_EVENT, key:aaa://172.23.86.7:50588, 
object:MessageImpl{commandCode=257, flags=128}}

2012-07-11 14:00:39,605 DEBUG [NetworkGuard-0] [PeerFSMImpl] AA: Put Event: 
Event{name:CER_EVENT, key:aaa://172.23.86.7:50588, 
object:MessageImpl{commandCode=257, flags=128}}
2012-07-11 14:00:39,605 DEBUG [NetworkGuard-0] [PeerFSMImpl] AA: Put Event: 
Event{name:CER_EVENT, key:aaa://172.23.86.7:50588, 
object:MessageImpl{commandCode=257, flags=128}}

The first question: why this Event is put twice into the queue!?

After this 2 Events are put to the queue, other Event is extracted and the 
thread executor is stopped:
2012-07-11 14:00:39,605 DEBUG [FSM-SPeer{Uri=aaa://gbg41.um.internal:50578; 
State=DOWN; 
con=org.jdiameter.client.impl.transport.tcp.TCPClientConnection@9aa402; 
incCon{aaa://172.23.86.7:50588=org.jdiameter.client.impl.transport.tcp.TCPClient
Connection@82f110, 
aaa://172.23.86.7:50578=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@110f528, 
aaa://172.23.86.7:50577=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@1b4a4b6} }-14] [PeerFSMImpl] AA: Got Event: Event{name:CER_EVENT, 
key:aaa://172.23.86.7:50583, object:MessageImpl{commandCode=257, flags=128}}
2012-07-11 14:00:39,706 DEBUG [FSM-SPeer{Uri=aaa://gbg41.um.internal:50578; 
State=DOWN; 
con=org.jdiameter.client.impl.transport.tcp.TCPClientConnection@9aa402; 
incCon{aaa://172.23.86.7:50588=org.jdiameter.client.impl.transport.tcp.TCPClient
Connection@82f110, 
aaa://172.23.86.7:50578=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@110f528, 
aaa://172.23.86.7:50577=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@1b4a4b6} }-14] [PeerFSMImpl] AA: Executor finished

After 2 seconds new executor is created. That mean 2 seconds the initial Event 
is not handled

2012-07-11 14:00:41,602 DEBUG [TCPReader-13] [PeerFSMImpl] AA: Executor 
created: FSM-SPeer{Uri=aaa://gbg41.um.internal:50578; State=DOWN; con=null; 
incCon{aaa://172.23.86.7:50588=org.jdiameter.client.impl.transport.tcp.TCPClient
Connection@82f110, 
aaa://172.23.86.7:50578=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@110f528, 
aaa://172.23.86.7:50577=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@1b4a4b6} }-15

And than the event is got for handling:
2012-07-11 14:00:41,604 DEBUG [FSM-SPeer{Uri=aaa://gbg41.um.internal:50578; 
State=DOWN; con=null; 
incCon{aaa://172.23.86.7:50588=org.jdiameter.client.impl.transport.tcp.TCPClient
Connection@82f110, 
aaa://172.23.86.7:50578=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@110f528, 
aaa://172.23.86.7:50577=org.jdiameter.client.impl.transport.tcp.TCPClientConnect
ion@1b4a4b6} }-15] [PeerFSMImpl] AA: Got Event: Event{name:CER_EVENT, 
key:aaa://172.23.86.7:50588, object:MessageImpl{commandCode=257, flags=128}}

But after 2 seconds the client has dropped the connection (timed out) and the 
sending of CEA fails because the connection is closed by the client.

And this repeats to infinite.

This happens with every Event you can see in the logs.
And this behaviour is reproducible any time on this server.

I found (after 3 days of debugging :-( ) a simple work around for this:
I just disabled the executor thread destroying and recreating.
I disabled this by commenting the following line:
executor = null;
in
org.jdiameter.client.impl.fsm.PeerFSMImpl
and
org.jdiameter.server.impl.fsm.PeerFSMImpl

Actually, this is very strange way to stop an executor thread.

Please, some developer of jdiameter analyse this problem and give me feedback. 
Is my work around OK? Will be provided some general fix of the issue?

I got a lot of experience last time with jdiameter and I'm ready to help if you 
like it.

P.S. The source code of the jdiameter is written in very strange way and it is 
very difficult to debug and find problems like this. I think this code should 
be optimized.

Regards,
Angel

Original issue reported on code.google.com by brainslog on 28 Aug 2012 at 2:29

Attachments:

GoogleCodeExporter commented 9 years ago
Comment 1 by angel.t.angelov, Jul 11, 2012:

I forgot to tell you:
I'm using jainslee-2.7.0.FINAL-jboss-5.1.0.GA with 
mobicents-diameter-mux-1.4.0.FINAL

I have 2 clients on remote machine that are trying every 2 seconds to get 
connection (CER) to the diameter server until they get a valid CEA.
In my case they never get valid CEA because the diameter stack sends the CEA 
when the connection is timed out by the client.

Regards,
Angel

Original comment by brainslog on 28 Aug 2012 at 2:29

GoogleCodeExporter commented 9 years ago
Angel, can you try with current master? I think the fix to Issue 4 is related 
and may help in this situation. Thanks!

Original comment by brainslog on 12 Nov 2012 at 11:17

GoogleCodeExporter commented 9 years ago
This has been fixed with other issues changes. If someone is still facing, 
please reopen.

Original comment by brainslog on 9 Jan 2013 at 2:13