zzyugl / xmemcached

Automatically exported from code.google.com/p/xmemcached
Apache License 2.0
0 stars 0 forks source link

in standby mode xmemcached doesn't switch to backup node if primary memcached host really fails (power off, not kill -9) #179

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Steps which reproduce the problem:
----------------------------------
There are two hosts, host1 10.50.3.1 and host2 10.50.3.2.
Standby mode is used.

1. run repcached on host1:
./memcached -m 1000 -p 11211 -x 10.50.3.2
2. run repcached on host2:
./memcached -m 1000 -p 11211 -x 10.50.3.1
3. run Java client with xmemcached on host1:
MemcachedClient builder=new 
XmemcachedClientBuilder(AddrUtil.getAddressMap("10.50.3.2:11211,10.50.3.1:11212"
));
This Java client only performs GET operations from repcached
4. Crash host2 (hold power button)
GET operations begin to fail with timeout:
2012-02-20 12:17:21,342 61825  [        r-0 ] ERROR       test  - Timed 
out(1000) waiting for operation
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
        at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2528)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:574)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:971)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:929)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:940)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:962)
        at com.test.perffw.XmemcachedClient$XmemcachedMap.get(XmemcachedClient.java:69)
        at com.test.perffw.MemcachedPerf$ReadThread.doOperation(MemcachedPerf.java:56)
        at com.test.perffw.ThreadBase.run(ThreadBase.java:175)
I expect these timeouts to stop in a few seconds and xmemcached to switch to 
another repcached (10.50.3.1:11212).
Instead these timeouts never stop. The app is getting them all the time the 
host 10.50.3.2 is down.

version of the product:
-----------------------
memcached 1.4.10
http://mdounin.ru/files/repcached-2.3.1-1.4.10.patch.gz
xmemcached-1.3.5
OS: RHEL 5.7
Java 1.6u24

Additional information:
-----------------------
If instead of "power off" I do "kill -9", xmemcached switches to backup just 
perfectly. 

Original issue reported on code.google.com by anton.mi...@gmail.com on 20 Feb 2012 at 3:17

GoogleCodeExporter commented 9 years ago
Further feedback?

Original comment by killme2...@gmail.com on 23 Feb 2012 at 2:07

GoogleCodeExporter commented 9 years ago
Could you please run my very simple class from the attached file 
xmemcached-standby-mode.zip. This test is doing GET and SET operations 
infinitely with 1 second interval.
You will easily reproduce my problem:
1) start 2 memcached, one on port 12000, second on port 12001:
/usr/local/memcached/bin/memcached -p 12000 -d
/usr/local/memcached/bin/memcached -p 12001 -d
2) run my test: bin/run.sh 127.0.0.1:12000,127.0.0.1:12001
3) kill -STOP [memcached-on-port-12000]

After that the application gets TimeoutException every 2 seconds. Forever:

2012-03-01 11:22:15,987 5216   [  main ] INFO  tandbyMode  - key:5, 
oldValue:value-5, newValue:value-5
2012-03-01 11:22:16,990 6219   [  main ] INFO  tandbyMode  - key:6, 
oldValue:value-6, newValue:value-6
2012-03-01 11:22:18,996 8225   [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
        at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2528)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:574)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:971)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:929)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:940)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:962)
        at TestStandbyMode.main(TestStandbyMode.java:50)
2012-03-01 11:22:21,001 10230  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
        at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2528)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:574)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:971)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:929)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:940)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:962)
        at TestStandbyMode.main(TestStandbyMode.java:50)

1) I noticed that heart beat ("version" command) works only on idle connection, 
when there is no traffic between xmemcached and memcached. If there is a 
traffic with at least 1 TPS (like in my test), then heart beats are not sent. I 
see it under debug and looking at -vv output of memcached. But this is wrong, 
heart beats shall be sent even under traffic.
2) Ok, I paused the traffic (by pressing p<ENTER> in my test), and finally I 
see these three log messages: "heartbeat fail", "Remove a session", "Trying to 
connect":

2012-03-01 11:09:07,591 19643  [ std-monitor ] INFO  tandbyMode  - pausing...
2012-03-01 11:09:08,166 20218  [  main ] INFO  tandbyMode  - pausing main thread
2012-03-01 11:09:26,164 38216  [ pool-1-thread-9 ] WARN  hedHandler  - 
Session(127.0.0.1:11201) heartbeat fail 10 times,close session and try to heal 
it
2012-03-01 11:09:26,168 38220  [ pool-1-thread-9 ] WARN  Controller  - Remove a 
session: 127.0.0.1:11201
2012-03-01 11:09:28,172 40224  [ Xmemcached-Reactor-0 ] WARN  Controller  - Add 
a session: 127.0.0.1:11201

But just after that I see the "Trying to connect" log message:

2012-03-01 11:09:28,172 40224  [ Heal-Session-Thread ] WARN  Controller  - 
Trying to connect to 127.0.0.1:11201 for 1 times

Which means, that xmemcached again connected to the frozen memcached. This is 
also wrong.

And if after that I resume the traffic, I still see TimeoutException:

2012-03-01 11:09:37,943 49995  [ std-monitor ] INFO  tandbyMode  - resuming...
2012-03-01 11:09:37,943 49995  [  main ] INFO  tandbyMode  - resuming main 
thread
2012-03-01 11:09:38,946 50998  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
        at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2528)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:574)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:971)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:929)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:940)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:962)
        at TestStandbyMode.main(TestStandbyMode.java:50)

So could you please fix two things:
1) Send heart beat even under the traffic towards memcached, not only when 
there is no traffic (the connection is idle).
2) Don't declare a connection open until first successful heart beat on it.

Original comment by anton.mi...@gmail.com on 1 Mar 2012 at 8:33

Attachments:

GoogleCodeExporter commented 9 years ago
Yes,it's a problem in this situation.i will try to fix it.

Original comment by killme2...@gmail.com on 2 Mar 2012 at 3:44

GoogleCodeExporter commented 9 years ago
Fixed by 
https://github.com/killme2008/xmemcached/commit/e154442cb26e42993c715c44257e9b49
571b55c4

Original comment by killme2...@gmail.com on 17 Mar 2012 at 12:38

GoogleCodeExporter commented 9 years ago
xmemcached 1.3.6 released.

Original comment by killme2...@gmail.com on 19 Mar 2012 at 2:13

GoogleCodeExporter commented 9 years ago
xmemcached 1.3.6 doesn't fix the issue.
With "Failure Mode" off:
  when memcached 1 is killed, the client doesn't switch to memcached 2, exception on each operation,
  when memcached 1 is frozen, the client doesn't switch to memcached 2, exception on each operation.

With "Failure Mode" on:
  when memcached 1 is killed, the client switches to memcached 2, successfully performs operations,
  when memcached 1 is frozen, the client switches to memcached 2, but from 11 operations only 2 are successful, other 9 throw exceptions.

Here are more details. Also see the logs and modified sample code in the 
attachment.
================================================================================
====

1) failure mode off
===================

./memcached -p 12000 -X 13000 -vv 1 > /tmp/repcached-1.log 2>&1
./memcached -p 12001 -X 13001 -vv 1 > /tmp/repcached-2.log 2>&1
./run.sh 127.0.0.1:12000,127.0.0.1:12001

1.1) kill memcached 1 => doesn't switch to memcached 2, exception on each 
operation

2012-03-23 17:53:47,139 104456 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
net.rubyeye.xmemcached.exception.MemcachedException: There is no available 
connection at this moment
        at net.rubyeye.xmemcached.impl.MemcachedConnector.send(MemcachedConnector.java:474)
        at net.rubyeye.xmemcached.XMemcachedClient.sendCommand(XMemcachedClient.java:268)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:575)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
        at TestStandbyMode.main(TestStandbyMode.java:56)

1.2) start memcached 1 => switches after some time to memcached 2, successfully 
performs operations

1.3) freeze memcached 1 => doesn't switch to memcached 2, exception on each 
operation

2012-03-23 17:57:22,845 320162 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
        at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
        at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 17:57:24,814 322131 [ pool-2-thread-2 ] WARN  hedHandler  - 
Session(127.0.0.1:12000) heartbeat fail 10 times,close session and try to heal 
it
2012-03-23 17:57:24,814 322131 [ pool-2-thread-2 ] WARN  Controller  - Remove a 
session: 127.0.0.1:12000
2012-03-23 17:57:24,815 322132 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
net.rubyeye.xmemcached.exception.MemcachedException: Session has been closed
        at net.rubyeye.xmemcached.impl.MemcachedTCPSession.destroy(MemcachedTCPSession.java:115)
        at net.rubyeye.xmemcached.impl.MemcachedHandler.onSessionClosed(MemcachedHandler.java:138)
        at com.google.code.yanf4j.core.impl.AbstractSession.onClosed(AbstractSession.java:305)
        at com.google.code.yanf4j.core.impl.AbstractSession.close(AbstractSession.java:293)
        at net.rubyeye.xmemcached.impl.MemcachedHandler$CheckHeartResultThread.run(MemcachedHandler.java:223)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-03-23 17:57:25,818 323135 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
net.rubyeye.xmemcached.exception.MemcachedException: There is no available 
connection at this moment
        at net.rubyeye.xmemcached.impl.MemcachedConnector.send(MemcachedConnector.java:474)
        at net.rubyeye.xmemcached.XMemcachedClient.sendCommand(XMemcachedClient.java:268)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:575)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
        at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 17:57:26,817 324134 [ Heal-Session-Thread ] WARN  Controller  - 
Trying to connect to 127.0.0.1:12000 for 1 times
2012-03-23 17:57:26,818 324135 [ Xmemcached-Reactor-0 ] WARN  Controller  - Add 
a session: 127.0.0.1:12000
2012-03-23 17:57:27,823 325140 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
        at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
        at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
        at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
        at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
        at TestStandbyMode.main(TestStandbyMode.java:56)

1.4) unfreeze memcached 1 => immediately switches to memcached 1 and 
successfully performs operations

================================================================================
======================

2) failure mode on
==================

./memcached -p 12000 -X 13000 -vv 1 > /tmp/repcached-1.log 2>&1
./memcached -p 12001 -X 13001 -vv 1 > /tmp/repcached-2.log 2>&1
Put -Dfailure-mode=true into run.sh
./run.sh 127.0.0.1:12000,127.0.0.1:12001

2.1) kill memcached 1 => switches to memcached 2, successfully performs 
operations

2.2) start memcached 1 => switches after some time to memcached 1, successfully 
performs operations

2.3) freeze memcached 1 => switches to memcached 2, but from 11 operations only 
2 are successful, other 9 throw exceptions:

2012-03-23 18:07:29,107 152619 [  main ] INFO  tandbyMode  - key:152, 
oldValue:null, newValue:value-152
2012-03-23 18:07:30,110 153622 [  main ] INFO  tandbyMode  - key:153, 
oldValue:null, newValue:value-153
2012-03-23 18:07:31,113 154625 [  main ] INFO  tandbyMode  - key:154, 
oldValue:null, newValue:value-154
2012-03-23 18:07:32,116 155628 [  main ] INFO  tandbyMode  - key:155, 
oldValue:null, newValue:value-155
2012-03-23 18:07:34,121 157633 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:36,125 159637 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:38,130 161642 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:40,135 163647 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:42,139 165651 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:44,143 167655 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:46,148 169660 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:48,153 171665 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:48,624 172136 [ pool-2-thread-2 ] WARN  hedHandler  - 
Session(127.0.0.1:12000) heartbeat fail 10 times,close session and try to heal 
it
2012-03-23 18:07:48,625 172137 [ pool-2-thread-2 ] WARN  Controller  - Client 
in failure mode,we don't remove session 127.0.0.1:12000
2012-03-23 18:07:49,156 172668 [  main ] INFO  tandbyMode  - key:164, 
oldValue:null, newValue:value-164
2012-03-23 18:07:50,159 173671 [  main ] INFO  tandbyMode  - key:165, 
oldValue:null, newValue:value-165
2012-03-23 18:07:50,627 174139 [ Heal-Session-Thread ] WARN  Controller  - 
Trying to connect to 127.0.0.1:12000 for 1 times
2012-03-23 18:07:50,628 174140 [ Xmemcached-Reactor-0 ] WARN  Controller  - Add 
a session: 127.0.0.1:12000
2012-03-23 18:07:52,163 175675 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-23 18:07:54,168 177680 [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)

2.4) unfreeze memcached 1 => immediately switches to memcached 1 and 
successfully performs operations

Original comment by anton.mi...@gmail.com on 23 Mar 2012 at 3:35

Attachments:

GoogleCodeExporter commented 9 years ago
With "Failure Mode" off,the behavior is all right,standby mode is only valid 
with failure mode.

With "failure mode" on,because heal session thread could reconnect to 
memcached1 even if it was frozen,so some requests sent to memcached1 would be 
failed,all requests before we found that memcached1 was frozen would be failed 
too.
But if you test this with power off rather than local frozen process,the heal 
session thread could not reconnect to memcached1,and the behavior will be right.

I don't want to fixed the issue with frozen memcached,i think this situation is 
so rare that i don't want to do heartbeat before adding a session to valid 
session list.

Original comment by killme2...@gmail.com on 23 Mar 2012 at 4:22

GoogleCodeExporter commented 9 years ago
Indeed, with real power cable unplug instead of process freezing everything 
works fine! Thanks!

I agree with you about non fixing process freezing use case. I used process 
freezing just to emulate real power outage, but it turns out that there is a 
difference between them.

More details:

3) failure mode on, real power cable unplug
===========================================

host2: ./memcached -p 12000 -X 13000 -vv 1 > /tmp/repcached-1.log 2>&1
host1: ./memcached -p 12001 -X 13001 -vv 1 > /tmp/repcached-2.log 2>&1
Put -Dfailure-mode=true into run.sh
host1: ./run.sh host2:12000,127.0.0.1:12001

3.1) unplug power cable from host2 => the client switches to memcached on host1 
in 12 seconds, and after that successfully performs operations:

2012-03-26 10:38:54,076 34491  [  main ] INFO  tandbyMode  - key:34, 
oldValue:null, newValue:value-34
2012-03-26 10:38:56,082 36497  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-26 10:38:58,087 38502  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-26 10:39:00,092 40507  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-26 10:39:02,097 42512  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-26 10:39:04,102 44517  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-26 10:39:06,106 46521  [  main ] ERROR tandbyMode  - something happened 
between xmemcached and memcached
java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2516)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:576)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:957)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:913)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:925)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:948)
    at TestStandbyMode.main(TestStandbyMode.java:56)
2012-03-26 10:39:06,724 47139  [ pool-2-thread-1 ] WARN  hedHandler  - 
Session(10.50.30.155:12000) heartbeat fail 10 times,close session and try to 
heal it
2012-03-26 10:39:06,731 47146  [ pool-2-thread-1 ] WARN  Controller  - Client 
in failure mode,we don't remove session 10.50.30.155:12000
2012-03-26 10:39:07,110 47525  [  main ] INFO  tandbyMode  - key:41, 
oldValue:null, newValue:value-41
2012-03-26 10:39:08,114 48529  [  main ] INFO  tandbyMode  - key:42, 
oldValue:null, newValue:value-42
2012-03-26 10:39:08,735 49150  [ Heal-Session-Thread ] WARN  Controller  - 
Trying to connect to 10.50.30.155:12000 for 1 times
2012-03-26 10:39:09,118 49533  [  main ] INFO  tandbyMode  - key:43, 
oldValue:null, newValue:value-43

Original comment by anton.mi...@gmail.com on 26 Mar 2012 at 8:23

GoogleCodeExporter commented 9 years ago
Hi all, thanks for the fix. 

I am wondering why, in case of power off, the failure detection take so long 
(12 sec). Is it possible to reduce such delay and eventually make it 
configurable ?

regards
  john

Original comment by john.boy...@gmail.com on 26 Mar 2012 at 10:00

GoogleCodeExporter commented 9 years ago
@john @anton

The failure detection take 10+ secs because xmemcached will try 5 times to do 
heartbeating,and each time it will wait 2 seconds for server response.I am 
sorry this parameter could not be modified right now,it is hard coded in 
source.If you wants,you can modify the contant variable 
MAX_HEART_BEAT_FAIL_COUNT in net.rubyeye.xmemcached.impl.MemcachedHandler and 
build xmemcached by yourself.

Original comment by killme2...@gmail.com on 26 Mar 2012 at 10:05

GoogleCodeExporter commented 9 years ago
I opened an issue about the MAX_HEART_BEAT_FAIL_COUNT 
http://code.google.com/p/xmemcached/issues/detail?id=188

Original comment by killme2...@gmail.com on 26 Mar 2012 at 10:08

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
 Issue 188  was fixed by https://github.com/killme2008/xmemcached/commit/6a5be1773f0ff9c299f5610b3e1f2010fbc6caea

I added a environment variable "xmemcached.heartbeat.max.fail.times" to modify 
the max heartbeating fail times.You can change it by adds 
"-Dxmemcached.heartbeat.max.fail.times=2" to your jvm options.

Original comment by killme2...@gmail.com on 26 Mar 2012 at 10:48