Closed pedigree closed 10 years ago
Leaving it running for a while results in incomplete datasets and further timeouts that I dont see on redis-server slaves
[97718] 04-27 11:03:14,279 WARN Master connection timeout.
[97718] 04-27 11:03:15,281 INFO [Slave]Remote master is a Redis 2.8.7 instance, support partial sync:1
[97718] 04-27 11:03:15,281 INFO Send PSYNC 57b1d385ee78fd25afa5eba471834d4e30c7f6f3 35180800
[97718] 04-27 11:03:15,281 INFO Recv psync reply:CONTINUE
[97718] 04-27 11:05:14,328 INFO now = 1398596714, ping_recved_time=1398596654
[97718] 04-27 11:05:14,328 WARN Master connection timeout.
[97718] 04-27 11:05:15,330 INFO [Slave]Remote master is a Redis 2.8.7 instance, support partial sync:1
[97718] 04-27 11:05:15,330 INFO Send PSYNC 57b1d385ee78fd25afa5eba471834d4e30c7f6f3 35337105
[97718] 04-27 11:05:15,330 INFO Recv psync reply:CONTINUE
[97718] 04-27 11:07:14,368 INFO now = 1398596834, ping_recved_time=1398596774
[97718] 04-27 11:07:14,368 WARN Master connection timeout.
[97718] 04-27 11:07:15,369 INFO [Slave]Remote master is a Redis 2.8.7 instance, support partial sync:1
[97718] 04-27 11:07:15,370 INFO Send PSYNC 57b1d385ee78fd25afa5eba471834d4e30c7f6f3 35830301
[97718] 04-27 11:07:15,370 INFO Recv psync reply:CONTINUE
It seems there is a bug when ardb's slave connection closed when doing full resync with redis server.
And about the timeout error, what's the value of config 'repl-ping-slave-period' and 'repl-timeout' in redis instance?
Commit a fix for the 'incomplete sync' bug.
Replication isnt working yet but the numbers repl-ping-slave-period 60 repl-timeout 120
ardb-server logs this when starting the process (with slaveof configured)
[135915] 04-27 16:04:47,951 INFO Init storage engine success.
[135915] 04-27 16:04:47,952 WARN No zookeeper servers specified, zookeeper agent would not start.
[135915] 04-27 16:04:47,952 INFO Server started, Ardb version 0.7.1
[135915] 04-27 16:04:47,952 INFO The server is now ready to accept connections on port 2222
[135915] 04-27 16:04:47,955 INFO [Slave]Remote master is a Redis 2.8.7 instance, support partial sync:1
[135915] 04-27 16:04:47,955 INFO Send PSYNC 7a28f80b06f98ae619676a9c7ffee5ae02d5b2fa 1
[135915] 04-27 16:04:47,956 INFO Recv psync reply:FULLRESYNC 57b1d385ee78fd25afa5eba471834d4e30c7f6f3 56336513
redis master log shows this from the connection
[1040] 27 Apr 16:09:47.804 * Slave asks for synchronization
[1040] 27 Apr 16:09:47.804 * Partial resynchronization not accepted: Runid mismatch (Client asked for '7a28f80b06f98ae619676a9c7ffee5ae02d5b2fa', I'm '57b1d385ee78fd25afa5eba471834d4e30c7f6f3')
[1040] 27 Apr 16:09:47.804 * Starting BGSAVE for SYNC
[1040] 27 Apr 16:09:47.826 * Background saving started by pid 220447
[220447] 27 Apr 16:10:07.820 * DB saved on disk
[220447] 27 Apr 16:10:07.838 * RDB: 31 MB of memory used by copy-on-write
[1040] 27 Apr 16:10:07.911 * Background saving terminated with success
ngrep logs this and only a small amount of replication happens but only after the process has segfaults. I see about 120kb of compress data before the replication stops.. The ardb-server process, it segfaults and core dumps before it sees the REDIS0006 tcp stream
# ngrep -d eth1 host 10.0.1.50 and port 6379
interface: eth1 (10.0.1.0/255.255.255.0)
filter: (ip or ip6) and ( host 10.0.1.50 and port 6379 )
#
T 10.0.1.50:53245 -> 10.0.1.120:6379 [AR]
......
###
T 10.0.1.50:53737 -> 10.0.1.120:6379 [A]
......
#
T 10.0.1.50:53737 -> 10.0.1.120:6379 [AP]
info Server..
##
T 10.0.1.120:6379 -> 10.0.1.50:53737 [AP]
$412..# Server..redis_version:2.8.7..redis_git_sha1:00000000..redis_git_dirty:0..redis_build_id:4e504c
85df90d66d..redis_mode:standalone..os:Linux 3.14.0-031400-generic x86_64..arch_bits:64..multiplexing_a
pi:epoll..gcc_version:4.8.1..process_id:1040..run_id:57b1d385ee78fd25afa5eba471834d4e30c7f6f3..tcp_por
t:6379..uptime_in_seconds:410579..uptime_in_days:4..hz:10..lru_clock:1449466..config_file:/etc/redis/6
379.conf....
#
T 10.0.1.50:53737 -> 10.0.1.120:6379 [A]
......
#
T 10.0.1.50:53737 -> 10.0.1.120:6379 [AP]
replconf listening-port 2222..
#
T 10.0.1.120:6379 -> 10.0.1.50:53737 [AP]
+OK..
#
T 10.0.1.50:53737 -> 10.0.1.120:6379 [AP]
psync 7a28f80b06f98ae619676a9c7ffee5ae02d5b2fa 1..
#
T 10.0.1.120:6379 -> 10.0.1.50:53737 [AP]
+FULLRESYNC 57b1d385ee78fd25afa5eba471834d4e30c7f6f3 56987551..
#
T 10.0.1.50:53737 -> 10.0.1.120:6379 [A]
......
#
T 10.0.1.120:6379 -> 10.0.1.50:53737 [AP]
.
#
T 10.0.1.50:53737 -> 10.0.1.120:6379 [A]
......
#
T 10.0.1.120:6379 -> 10.0.1.50:53737 [AP]
$1175338732..
#
T 10.0.1.120:6379 -> 10.0.1.50:53737 [A]
REDIS0006..........D....K...K... ..d........v..Fx6y.E..ZG9S...........K.;]..c&..o.[S..........h......H
....SNS. /.|r.4Wj.tcz..\%...T.R. ..?y........... .....R. ..B*d.M.`...f.dz../.fR. ......%"d.l...^...E..
R. ...).JZa...0lPl...](.R. ..L,...s^....&U*..S..R. ...a..6#.....8.....f.R. ...#Cbc....Y..W......`.....
K0...W.^.d....4.Q@..I......U..R......uz@......K....b@m....-.....1R. c...4..VO...Duw...m..@3..y.4.....R
.ua.......Q@..O1.f}.G~....[F..:.MS@.....\.qo.2.d... ..G.`G..,.....x..)..-....?a;..GMA.A.5.O_........`_
..........bB.......w`G..K'.....Ty. ......x`....r$.".5........Vfy`...Ss.@.~>.6d.l........p9./k.Nt...(.:
....a#....).>!A.F.0..]...#{`G..A........./.0...Y...I..'....&.|(t /...|`/.a...*{.....ka...%.}`..2C.+W(.
.H..:k........I..#.=..........$.}a..^..(.`...}.&.......Q. .......|..f..&....e..QB[....g.....T.C.......
`w.`.T"t(..F..Ae....6.`......m....3.#%b..........L*...O.>.....r....#..'a....Pj.l....l.`G...9{jH.%..r .
....a.Q. .........~6.............x.Cf...H.d}. ..O....4\...`....1.x G.e.......a.W.\.Ca..U.../.`w...$.M.
@....*.#K..E...4.*...&.q..-.. ....`/..8M...r[.?.........`......;F..a..N.$#....QD#.....h......e:#...s.`
/..Y5l^>..................XV...f....i........N...G3.D16..?"....`G...|)...].X2..$...n...Ug...S..eZ.%X..
..f.`/...i..{.S....j............;....~'.^......`/. S~...... `../...;.`.....7.F....\.@K /.T...pA0r. U.+
....M...}...T....[x..%k.,......a..0...I._.#...O...8..`_./.G4...E.....1"......zI...Mb...@..J.......B'.'
..b.y.9..9#....`G.....#.....O.......`...u........
Continues on for 13 x 9kb packets, then stops
The segfault happens after the process starts
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[137412] 04-27 16:10:28,119 INFO Start init storage engine.
[New Thread 0x7fff373ff700 (LWP 137413)]
[137412] 04-27 16:10:28,123 INFO Init storage engine success.
[New Thread 0x7fff353fe700 (LWP 137414)]
[137412] 04-27 16:10:28,125 WARN No zookeeper servers specified, zookeeper agent would not start.
[137412] 04-27 16:10:28,125 INFO Server started, Ardb version 0.7.1
[137412] 04-27 16:10:28,125 INFO The server is now ready to accept connections on port 2222
[New Thread 0x7fff343ff700 (LWP 137415)]
[New Thread 0x7fff33bfe700 (LWP 137416)]
[137412] 04-27 16:10:28,128 INFO [Slave]Remote master is a Redis 2.8.7 instance, support partial sync:1
[137412] 04-27 16:10:28,128 INFO Send PSYNC 7a28f80b06f98ae619676a9c7ffee5ae02d5b2fa 1
[137412] 04-27 16:10:28,128 INFO Recv psync reply:FULLRESYNC 57b1d385ee78fd25afa5eba471834d4e30c7f6f3 56987551
[New Thread 0x7fff327ff700 (LWP 137417)]
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff791464d in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0 0x00007ffff791464d in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00007ffff79741f9 in std::string::assign(std::string const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00000000004cff3d in operator= (__str=..., this=0x7fffffffe418) at /usr/include/c++/4.8/bits/basic_string.h:547
#3 ardb::Slave::HandleRedisReply (this=0x7fffffffe2c0, ch=0x7ffff67c8dc0, reply=...) at replication/slave.cpp:302
#4 0x00000000004a5c73 in HandleStreamEvent (e=..., ctx=..., this=<optimised out>)
at ./channel/channel_upstream_handler.hpp:133
#5 ardb::ChannelPipeline::SendUpstream<ardb::codec::RedisMessage> (ctx=<optimised out>, e=..., this=<optimised out>)
at channel/all_includes.hpp:89
#6 0x00000000004b2f3d in SendUpstream<ardb::MessageEvent<ardb::codec::RedisMessage> > (this=<optimised out>,
this=<optimised out>, e=...) at channel/all_includes.hpp:164
#7 fire_message_received<ardb::codec::RedisMessage> (destructor=0x0, message=0x7fffffffd7e0, ctx=...)
at ./channel/channel_helper.hpp:91
#8 ardb::codec::StackFrameDecoder<ardb::codec::RedisMessage>::CallDecode (this=0x7fffffffe300, context=...,
channel=0x7ffff67c8dc0, cumulation=...) at ./channel/codec/stack_frame_decoder.hpp:102
#9 0x00000000004b353f in ardb::codec::StackFrameDecoder<ardb::codec::RedisMessage>::MessageReceived (this=0x7fffffffe300,
ctx=..., e=...) at ./channel/codec/stack_frame_decoder.hpp:157
#10 0x0000000000505103 in HandleStreamEvent (e=..., ctx=..., this=<optimised out>)
at ./channel/channel_upstream_handler.hpp:133
#11 ardb::ChannelPipeline::SendUpstream<ardb::Buffer> (ctx=<optimised out>, e=..., this=<optimised out>)
at ./channel/all_includes.hpp:89
#12 0x0000000000506550 in SendUpstream<ardb::MessageEvent<ardb::Buffer> > (event=..., this=0x7ffff67c8df8)
at ./channel/all_includes.hpp:128
#13 fire_message_received<ardb::Buffer> (destructor=0x0, message=0x7ffff67c8e28, channel=0x7ffff67c8dc0)
at ./channel/channel_helper.hpp:83
#14 ardb::Channel::OnRead (this=0x7ffff67c8dc0) at channel/channel.cpp:500
#15 0x0000000000504f2d in ardb::Channel::IOEventCallback (eventLoop=<optimised out>, fd=<optimised out>,
clientData=0x7ffff67c8dc0, mask=1) at channel/channel.cpp:50
#16 0x0000000000521870 in aeProcessEvents (eventLoop=eventLoop@entry=0x7ffff6439158, flags=flags@entry=3)
at channel/redis/ae.c:429
#17 0x0000000000521b5b in aeMain (eventLoop=0x7ffff6439158) at channel/redis/ae.c:485
#18 0x00000000004a93de in ardb::ArdbServer::Start (this=this@entry=0x7fffffffde20, props=...) at ardb_server.cpp:843
#19 0x0000000000407fd8 in main (argc=<optimised out>, argv=<optimised out>) at main.cpp:110
Have you 'make clean' after update src from github?
For the timeout error, you should also change the 'repl-timeout' value to 120 in ardb.conf, it has a default value 60 which would generate an error if no 'ping' received from master in 60 secs.
it looks like a rm of the ardb folder and a complete pull from git has fixed the segfaulting. I'm testing the replication now. Is there a way to prioritize then loading of the rdb file into the storage_engine or does that not matter?
You mean a redis command to load rdb file? There is a 'import' command to load rdb file. redis> import /tmp/data/dump.ardb OK
the rbd (1.1gb) is taking a very long time to load, almost an hour and I was just wondering about key availability during that time. If thre is no problem and replication works as well as testing is showing, then it looks like I'll be putting it into production on my website :)
Replication has been working well but I'm concerned about the time it takes to submit keys to file storage. My preferred lmdb can take hours to commit after initial replication. Is there a way to speed the commit to disk up and are keys available immediately after the RDB has started disk commit?
It could be faster , but can only 10% or 20% faster,which is still very slow for huge data set. That depends on the random write performance of the storage engine. The total write operation number executed in replication is "keys num + elements num in set/list/hash/zset".
While its dumping to Lmdb/rocksdb (etc), are the keys from the RDB available to query?
On 11/08/2014 04:51, yinqiwen wrote:
It could be faster , but can only 10% or 20% faster,which is still very slow for huge data set. That depends on the random write performance of the storage engine. The total write operation number executed in replication is "keys num + elements num in set/list/hash/zset".
— Reply to this email directly or view it on GitHub https://github.com/yinqiwen/ardb/issues/42#issuecomment-51739843.
Each key is available to query after it's been loaded from RDB, not after the whole RDB loaded.
I've just done some testing with blank databases (rm ./data and ./repl folders) and slaveof 10.0.1.120:6379 in ardb.conf. when I start ardb-server, i would expect to see a full sync with the master (in my case, the master is a read only slave to an upstream master) but ardb issues a psync and only gets a very small set of data from the master, which is 2gb
ardb console shows
network listening shows the following (10.0.1.120 is the readonly master, 10.0.1.50 is the ardb server)