zhaoyang0501 / memcached

Automatically exported from code.google.com/p/memcached
0 stars 0 forks source link

Accept4 on FreeBSD 10 - sockets stuck in CLOSE #375

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
We encountered that our memcached instances on FreeBSD 10 were rejecting 
connections after being under heavy load.

We reproduced this behavior by writing a script that gets/sets random data as 
well as rapidly opens and closes a lot of connections. The script will reach 
the connection limit of our memcached instance in more or less regular 
intervals.
While running this script we're monitoring netstat and memcached stats.

After running this script (sometimes for several hours) the server will stop 
accepting connections and a high number of sockets (close to max connections) 
is seen to be stuck in the CLOSE state. After closing the memcached process 
these numbers (of course) disappear.
See the screenshot for the data we could gather.

Screenshot explained:

- The server has been started with "memcached -u nobody -c 8192 -m 512 -l 
192.168.2.254 -p 11215 -vvv > & cache.log"
- Right window shows "tail -n 1000 cache.log"
- Top left shows "netstat -an | grep <state> | wc -l"
- Middle left shows the last "stats" command returned by the memcached instance
- Bottom left shows errors returned by the test script

After looking through the memcached source we suspect the error to be in the 
FreeBSD 10 kernel, but we would like to have a second opinion on that.

Original issue reported on code.google.com by arne.cl...@gmail.com on 21 Aug 2014 at 10:19

Attachments:

GoogleCodeExporter commented 9 years ago
A few more information on the topic:
Memcached version: 1.4.20
FreeBSD version:  10.0-STABLE #0 r269376

I could not yet reproduce the problem on FreeBSD 9 (which uses accept()).
Test to reproduce the problem on FreeBSD 10 using accept() instead of accept4() 
are still running but did not show the problem, yet.

I also attached the test scripts.
My tests typically run with '/test_memcached <server> <port> 1000'.
You might need to increase the number of allowed processes for this though (at 
least on OSX)

Original comment by arne.cl...@gmail.com on 21 Aug 2014 at 1:52

Attachments:

GoogleCodeExporter commented 9 years ago
Related FreeBSD ticket: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=192889

Original comment by arne.cl...@gmail.com on 21 Aug 2014 at 2:40

GoogleCodeExporter commented 9 years ago
Tests with accept() instead of accept4() showed the same Problem.

Original comment by arne.cl...@gmail.com on 22 Aug 2014 at 6:01

Attachments:

GoogleCodeExporter commented 9 years ago
Can you get the output of "stats conns" while it's hung up like this? pre-open 
a connection to run stats from and such.

Original comment by dorma...@rydia.net on 22 Aug 2014 at 6:20

GoogleCodeExporter commented 9 years ago
and you say that accept() has the same problem, but only under freebsd 10, or 
both 9 and 10?

Original comment by dorma...@rydia.net on 22 Aug 2014 at 6:20

GoogleCodeExporter commented 9 years ago
Looks something like this for a couple of rows:

STAT 7767:addr tcp:192.168.6.210:53765
STAT 7767:state conn_read
STAT 7767:secs_since_last_cmd 386

This time it's stuck in CLOSE_WAIT. But as the admins told me they observed 
this, too with CLOSE_WAIT slowly converting to CLOSED. I'll wait some minutes 
if a can see that happen here, too.
The other tests were run over night, so I could not directly observe that until 
now.

Original comment by arne.cl...@gmail.com on 22 Aug 2014 at 6:42

Attachments:

GoogleCodeExporter commented 9 years ago
Our admins did not observe this behavior on FreeBSD 9, only on FreeBSD 10 
machines.
However - I did not run my tests against a FreeBSD 9 machine, yet, but I will 
today.

Original comment by arne.cl...@gmail.com on 22 Aug 2014 at 6:49

GoogleCodeExporter commented 9 years ago
FreeBSD 9 test is now running (memcached 1.4.13).
The netstat query shows a totally different picture here.

While on FreeBSD 10 the CLOSE_WAIT and CLOSED numbers go up and down all the 
time during the test, the numbers stay pretty much stable on the FreeBSD 9 
machine.
I do see single CLOSE_WAITS coming in but I don't see any changes in the CLOSED 
number at all.

Original comment by arne.cl...@gmail.com on 22 Aug 2014 at 7:08

GoogleCodeExporter commented 9 years ago
I had a closer look a the "stats conns" output on the FreeBSD 10 machine:

7 Connections are in conn_new_cmd:

STAT 3845:addr tcp:192.168.6.210:57992
STAT 3845:state conn_new_cmd
STAT 3845:secs_since_last_cmd 2595

1 Connection is in this conn_listening:

STAT 16:addr tcp:192.168.2.254:11215
STAT 16:state conn_listening
STAT 16:secs_since_last_cmd 2595

Everything else sits idle at state conn_read.
Another interesting thing:
I just closed the telnet session which reduced CLOSE_WAIT by 3 but it is still 
stuck.
I tried a reconnect but this as not possible. That's strange because a 
connection should have been freed up ...?
The load test had long been killed before I closed the session so there were no 
new connects.

Original comment by arne.cl...@gmail.com on 22 Aug 2014 at 7:31

GoogleCodeExporter commented 9 years ago
Two updates:
On the stuck FreeBSD 10 system all connections just went from CLOSE_WAIT to 
CLOSED and are now stuck there.
We updated the FreeBSD 9 server to memcached 1.4.20 and are running the test 
script there.
The new version shows the same behavior as before.

Original comment by arne.cl...@gmail.com on 22 Aug 2014 at 8:43

GoogleCodeExporter commented 9 years ago
The test on FreeBSD 9 caused a stuck, too but this one looks different.
- The log complains about "accept(): Too many open files" but it's far away 
from the limits.
- There are "only" 192 connections permanently stuck in CLOSED.
- I cannot connect via telnet to this server anymore (a still open telnet 
connection works though).
- The stats command shows a low number of active connections

I have to note that memcached 1.4.20 is linked against libevent 1.4.14b-stable 
on this system which might be a source of this issue (the FreeBSD 10 memcached 
is linked against libevent 2.0.21-stable).
We will run another test with an updated libevent on monday.

FreeBSD 9 values:
stats
STAT curr_connections 5
STAT total_connections 17626537
STAT connection_structures 8176

stats conns
STAT 16:addr tcp:192.168.1.53:11222
STAT 16:state conn_listening
STAT 16:secs_since_last_cmd 17825
STAT 17:addr udp:192.168.1.53:11222
STAT 17:state conn_read
STAT 17:secs_since_last_cmd 17825
STAT 18:addr udp:192.168.1.53:11222
STAT 18:state conn_read
STAT 18:secs_since_last_cmd 17825
STAT 19:addr udp:192.168.1.53:11222
STAT 19:state conn_read
STAT 19:secs_since_last_cmd 17825
STAT 20:addr udp:192.168.1.53:11222
STAT 20:state conn_read
STAT 20:secs_since_last_cmd 17825
STAT 22:addr tcp:192.168.6.210:54704
STAT 22:state conn_parse_cmd
STAT 22:secs_since_last_cmd 0

sysctl -a | grep files
kern.maxfiles: 12328
kern.maxfilesperproc: 11095
kern.openfiles: 813

fstat | grep memcached
[... skipped about 20 lines of pipes and regular files ...]
16* internet stream tcp fffffe006599a000
17* internet dgram udp fffffe021126b498
18* internet dgram udp fffffe021126b498
19* internet dgram udp fffffe021126b498
20* internet dgram udp fffffe021126b498
22* internet stream tcp fffffe02c796a3d

Original comment by arne.cl...@gmail.com on 22 Aug 2014 at 1:56

GoogleCodeExporter commented 9 years ago
Sorry if I'm misreading, but the total number of connections shown in "stats 
conns" is how many?

Are you using -o maxconns_fast ?

Please attach text files rather than screenshots. I can't grep a picture 
without OCR'ing it first, and getting a subset of data instead of the whole 
thing isn't useful.

thanks.

Original comment by dorma...@rydia.net on 23 Aug 2014 at 11:36

GoogleCodeExporter commented 9 years ago
I'm not using a different configuration than shown in my initial post, so no -o 
maxconns_fast.

Regarding the high connection numbers - the test tries to simulate heavy loads 
with a lot of connections going on/off as well as occasionally reaching the 
connection limit. This test normally runs for a long time (but not always) 
that's why there are so many connections.

I attached a part of the log that got stuck - the last 1.000.000 lines. The 
whole log from that test is 24 GB large.
I'll try to reproduce it once more so I can give you stats and stats conns as a 
file, too.

Original comment by arne.cl...@gmail.com on 25 Aug 2014 at 7:09

Attachments:

GoogleCodeExporter commented 9 years ago
I mean I'm curious about the full "stats conns" output? I want to see if 
memcached thinks those connections are actually still there, or if it has the 
sockets open but has forgotten about them.

Original comment by dorma...@rydia.net on 25 Aug 2014 at 7:50

GoogleCodeExporter commented 9 years ago
The current test was still running when I went home.
I already reproduced it once today but the open telnet connection did not react 
to any of my commands, so I had to start another test.
I will send you the output of stats and stats conns once I have a working 
telnet connection with a stuck test.

Original comment by arne.cl...@gmail.com on 25 Aug 2014 at 7:59

GoogleCodeExporter commented 9 years ago
I tried the test with a smaller amout of max connections - this one got stuck 
after just a few seconds:
/usr/local/bin/memcached -u nobody -c 2048 -m 512 -l 192.168.2.157 -p 11215 
-vvv > & cache3.log

The connections are still stuck in CLOSE_WAIT at the moment.
I'll attach another logfile when the connections switched to CLOSED

Original comment by arne.cl...@gmail.com on 26 Aug 2014 at 12:04

Attachments:

GoogleCodeExporter commented 9 years ago
Here are the remaining stats commands + stats conns after all connections 
switched to closed

Original comment by arne.cl...@gmail.com on 26 Aug 2014 at 2:45

Attachments:

GoogleCodeExporter commented 9 years ago
I see in the stats output that curr_connections is always 455, despite maxconns 
being set to 2048.

I recall freebsd having more process limits than linux typically does. Can you 
set -c to 256 and re-run your test? Just in case.

Otherwise, I'm not sure offhand. I'll take a look at this in more detail when I 
get some time (could be a week or two though).

You said 1.4.17 definitionly doesn't do this? Even when built under freebsd 10?

Original comment by dorma...@rydia.net on 26 Aug 2014 at 7:36

GoogleCodeExporter commented 9 years ago
We tested it on 1.4.13 on FreeBSD 9 -> no problem.
Same system, 1.4.20 + lib event 1.x -> stuck
Same system, 1.4.20 + lib event 2.x -> still running after a couple of hours

What I did see today (FreeBSD 9, 1.4.20, libevent 2) is the following behavior:

- CLOSE_WAIT is 0, then suddenly rising to high numbers in a short time
- In that time the client shows a lot of timeouts
- The client pauses for a short time -> CLOSE_WAIT cleans up and everything is 
running as before

On FreeBSD 10 the CLOSE_WAIT goes up and down all the time and the client never 
stops for such a long time. So there IS a difference in the way the two systems 
are handling connections.

I did not run the test on a -c 256 server as the FreeBSD 9 test is still 
running.

Original comment by arne.cl...@gmail.com on 27 Aug 2014 at 6:13

GoogleCodeExporter commented 9 years ago
Stuck with -c 256, too (FreeBSD 10). Did take a lot longer though.
The logfile is 40 GB total so I added "just" the last 1 million rows.

The file c256.txt contains the result of all stats commands.
The file cache4_tail.log.gz contains the log file.

Original comment by arne.cl...@gmail.com on 2 Sep 2014 at 12:15

Attachments:

GoogleCodeExporter commented 9 years ago
I have same problem......
Freebsd 10
Memcached 1.4.20
libevent 1.x

Original comment by gamem...@gmail.com on 28 Sep 2014 at 4:17

GoogleCodeExporter commented 9 years ago
Same problem :(

When it stuck - in kernel log 

kernel: sonewconn: pcb 0xfffff80328c68930: Listen queue overflow: 1 already in 
queue awaiting acceptance (14114 occurrences)

and this line disappears from netstat -Lan
tcp4  0/0/1024       192.168.0.32.11211

Original comment by mlav...@gmail.com on 23 Oct 2014 at 8:56

GoogleCodeExporter commented 9 years ago
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=192889 some more info.

Original comment by mlav...@gmail.com on 25 Oct 2014 at 7:23

GoogleCodeExporter commented 9 years ago
Any new information on this?

I still haven't set up a freebsd VM to test with, but I'm pretty sure the code 
isn't doing anything wrong :( If certain versions of fbsd are broken in some 
way that's testable we can disable accept4 for them...

Original comment by dorma...@rydia.net on 1 Jan 2015 at 6:18

GoogleCodeExporter commented 9 years ago
I plan to do a test run on FreeBSD 10.1 this week - we'll see if this makes any 
difference.

Original comment by arne.cl...@gmail.com on 5 Jan 2015 at 8:17

GoogleCodeExporter commented 9 years ago
The current test is running:

memcached -u nobody -c 2048 -m 512 -vvv [...]

FreeBSD 10.1-RELEASE (GENERIC) #0 r274401
memcached 1.4.22
libevent 2.0.21-stable

At the moment it behaves a lot more like FreeBSD 9 (CLOSE is low, CLOSE_WAIT 
stays zero) but I'll let it run overnight to be sure.

Original comment by arne.cl...@gmail.com on 5 Jan 2015 at 3:42

GoogleCodeExporter commented 9 years ago
The test config mentioned above did not show any error (yet) on a VM and on a 
physical server. So the situation has at least improved but as we don't have a 
100% reproduction rule I will continue testing with different settings now.

As the bug showed many signs of a race condition maybe the thread related fixes 
in 1.4.22 fixed the problem. I did not go through the FreeBSD 10.1 changelog, 
yet, but maybe something happened there, too.

Original comment by arne.cl...@gmail.com on 7 Jan 2015 at 7:55

GoogleCodeExporter commented 9 years ago
Further test did not reproduce the error so the chance is high that either 
FreeBSD 10.1 or Memcached 1.4.22 fixed it. As we don't know what caused the 
problem it is not 100% sure though.

We've upgraded our servers showing the problem to see if the problem disappears 
in production,  too.

Original comment by arne.cl...@gmail.com on 13 Jan 2015 at 8:39

GoogleCodeExporter commented 9 years ago
You can't run .21 under your test on freebsd 10.1? I really highly doubt 
anything we changed in .22 would cause a bug like this.

Original comment by dorma...@rydia.net on 13 Jan 2015 at 5:58

GoogleCodeExporter commented 9 years ago
I will try .20 as this was the version where we originally encountered the 
problem.

As I already said this bug shows signs of a race condition and/or overflow 
(hard to reproduce, no fixed reproduction rule, happens under heavy load).

The following fixes introduced in .22 may have had an impact:
- Fix potential corruption for incr/decr of 0b items
- slab re-balance is not thread-safe in function do_item_get
- Fix potential corruption in hash table expansion
- use item lock instead of global lock when hash expanding

That is of course if the bug is really on memcached's side.
On the FreeBSD side there were not many changes that are related (see 
https://www.freebsd.org/releases/10.1R/relnotes.html):

- SA-14:19.tcp  16 September 2014    Denial of Service in TCP packet processing
- SA-14:08.tcp  30 April 2014 Fix TCP reassembly vulnerability
- A bug in sctp(4) that would allow two listening sockets bound to the same 
port has been fixed.

These are all not directly related but changes in that area might have affected 
the connection code.

Original comment by arne.cl...@gmail.com on 14 Jan 2015 at 9:40

GoogleCodeExporter commented 9 years ago
The tests with memcached 1.4.20 did not hang on FreeBSD 10.1 either.
So it looks like one of the FreeBSD fixes removed the problem but that's again 
just guessing.

The live servers we updated to 10.1 / 1.4.22 are still running without a hitch 
so it seems this setup is stable again.

Original comment by arne.cl...@gmail.com on 15 Jan 2015 at 1:42

GoogleCodeExporter commented 9 years ago
The FreeBSD team explained the issue: 
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=192889#c15, so I guess we can 
set this to fixed now.

Original comment by arne.cl...@gmail.com on 23 Feb 2015 at 8:47

GoogleCodeExporter commented 9 years ago
fixed, even though it was an upstream problem :)

Original comment by dorma...@rydia.net on 5 Jul 2015 at 1:08