xiongxu / s3fs

Automatically exported from code.google.com/p/s3fs

s3fs freezes when many files are transferred #136

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
After extensive activity and file transfer using rsync, cp or a custom 
application, the mount freezes and the only way to get it working again is to 
force an unmount with 'umount -l' and re-mount the drive. It happens after 
about 290 files are transferred (approximate figure).

What steps will reproduce the problem?
1. Any time I do a large transfer of files, it happens

What is the expected output? What do you see instead?
I managed to mount an S3 repo, but after a few minutes of rsyncing files the 
mount becomes no longer available and I start receiving the following error if 
I try to 'cd' to it: ls: /mnt/myreponame/: Transport endpoint is not connected

The only way to get it working again is to unmount and re-run s3fs. Is anyone 
else experiencing this problem?

What version of the product are you using? On what operating system?
System running CentOS 5.5 - 2.6.18-194.26.1.el5xen

Please provide any additional information below.

All dependent applications were updated, so I am working with the latest 
versions of the relevant packages.

Workaround: maybe add some sort of timeout to re-mount automatically.

Original issue reported on code.google.com by jve...@gmail.com on 10 Dec 2010 at 10:39

GoogleCodeExporter commented 9 years ago
More information:
/var/log/messages shows:
Dec 10 18:30:32 rt init: Id "x" respawning too fast: disabled for 5 minutes
Dec 10 18:31:41 rt s3fs: init $Rev: 274 $
Dec 10 18:35:33 rt init: Id "x" respawning too fast: disabled for 5 minutes
Dec 10 18:40:34 rt init: Id "x" respawning too fast: disabled for 5 minutes
Dec 10 18:45:35 rt init: Id "x" respawning too fast: disabled for 5 minutes
Dec 10 18:50:36 rt init: Id "x" respawning too fast: disabled for 5 minutes
Dec 10 18:55:37 rt init: Id "x" respawning too fast: disabled for 5 minutes
Dec 10 19:00:38 rt init: Id "x" respawning too fast: disabled for 5 minutes
Dec 10 19:04:34 rt s3fs: ###curlCode: 52  msg: server returned nothing (no headers, no data)
Dec 10 19:05:40 rt init: Id "x" respawning too fast: disabled for 5 minutes

using curl-7.15.5-9.el5.i386

Original comment by jve...@gmail.com on 11 Dec 2010 at 12:33

GoogleCodeExporter commented 9 years ago
The latest SVN revision addresses the curlCode 52 error -- please retrieve the 
most recent version and try again.

Original comment by dmoore4...@gmail.com on 11 Dec 2010 at 12:46

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Latest revision R279 now freezes after many file copies.

Also, each file copy is generating the following errors in /var/log/messages, 
even though the file saves fine to our bucket:
Dec 10 20:12:20 rt s3fs: ### CURLE_HTTP_RETURNED_ERROR
Dec 10 20:12:20 rt s3fs: ###response=404
Dec 10 20:12:24 rt s3fs: ### CURLE_HTTP_RETURNED_ERROR
Dec 10 20:12:24 rt s3fs: ###response=404

BTW: using curl.i386 version 7.15.5

Original comment by jve...@gmail.com on 11 Dec 2010 at 1:27

GoogleCodeExporter commented 9 years ago
The 404 error is somewhat concerning; I've never seen it reported before. 
Unfortunately, according to the AWS documentation it can mean several things:

http://docs.amazonwebservices.com/AmazonS3/latest/API/

I'll have to put in a hook to capture the message that is reported along with 
the code so that we can get a better idea of what is going on. 

Original comment by dmoore4...@gmail.com on 11 Dec 2010 at 1:51

GoogleCodeExporter commented 9 years ago
What was I thinking -- the 404 is a normal response for when an object is not 
found. Think of it this way: you're writing a file to a fresh, clean bucket, 
and you first want to check whether it already exists. You query for it and it 
returns "Not Found".

These showing up in the log looks bad, but although it IS an HTTP error, it 
isn't necessarily an s3fs error.

This boils down to setting the curl option FAILONERROR to true. What really 
needs to happen is to turn that option off, parse the HTTP return code, and 
act accordingly.

CURLE_HTTP_RETURNED_ERROR is only returned when FAILONERROR is set. If 
CURLE_OK is returned, then the HTTP return code needs to be examined.
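
For illustration, a minimal sketch of that approach against the libcurl C API 
(this is not the actual s3fs wrapper; the function name and the URL argument 
are placeholders -- in practice the URL would be a signed S3 request):

#include <curl/curl.h>

/* Probe an object; treat HTTP >= 400 as data to inspect, not a curl failure. */
static long probe_object(const char *url) {
  CURL *curl = curl_easy_init();
  long http_code = 0;
  curl_easy_setopt(curl, CURLOPT_URL, url);
  curl_easy_setopt(curl, CURLOPT_FAILONERROR, 0L);  /* option turned off */
  if (curl_easy_perform(curl) == CURLE_OK) {
    /* no transport-level error: examine the HTTP return code ourselves */
    curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &http_code);
  }
  curl_easy_cleanup(curl);
  /* 404 here just means "object absent" (normal when checking a fresh
     bucket); other codes >= 400 are genuine errors to act on */
  return http_code;
}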

What does your rsync command look like?  ...try --verbose --progress 
--whole-file as options

Original comment by dmoore4...@gmail.com on 11 Dec 2010 at 4:07

GoogleCodeExporter commented 9 years ago
svn rev 280 gets rid of the HTTP 404 errors from syslog

Original comment by dmoore4...@gmail.com on 11 Dec 2010 at 4:44

GoogleCodeExporter commented 9 years ago
The 404 error messages have disappeared, but s3fs is still freezing after lots 
of continuous activity, whether it's using cp, rsync, or a custom copying 
program. This really makes it unusable. Any way to further debug this?

The frozen s3fs appears as follows ('ps aux | grep -i s3' shows):

root     15419  0.0  0.2  51228  5024 ?        Ssl  04:05   0:00 s3fs -d myrepo876 /mnt/repo

Original comment by jve...@gmail.com on 11 Dec 2010 at 9:20

GoogleCodeExporter commented 9 years ago
I'm seeing issues too when using rsync. See issue #130

I think that I have correlated the issue with the transfer of large files 
(> 500 MB), as an rsync of thousands of smaller files works.

Another user has wrapped his transfer routine in a script to automatically 
restart the transfer.

I have done this manually and rsync picks up where it left off; however, since 
s3fs is (currently) file based (as opposed to block or partial-file based), 
the whole file needs to be transferred again.

Less than perfect, but it appears to be a workaround. Admittedly, I'm at a loss 
as to why this is happening. I've put in more debugging messages and so far 
there isn't a smoking gun. 

Original comment by dmoore4...@gmail.com on 12 Dec 2010 at 5:06

GoogleCodeExporter commented 9 years ago
jvence, the number of files probably isn't the issue; it's probably the size 
of the individual files. When rsync seems to hang, what is the size of the 
last (or current) file being rsync'ed?

Also, what is your internet connection's upload speed? If the file is large 
(> 500 MB) and your upload speed is slow (a typical ~1 Mb/s), then the 
seeming hang may just be how long the transfer takes.

In my experiments (a 2 GB file and an upload speed of 1.5 Mb/s), if I am 
lucky, the rsync works. If I'm unlucky, some sort of connection timeout 
occurs.
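
To put a number on that (rough arithmetic under those assumptions): 2 GB is 
roughly 16,000 megabits, so at 1.5 Mb/s the upload alone takes about 
16,000 / 1.5 ≈ 10,700 seconds, or around three hours -- plenty of time for a 
connection timeout to fire.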

The answer here may be to implement AWS's new multipart upload, which allows 
chopping large files into smaller pieces and uploading them individually.
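
For reference, AWS multipart upload is a three-step REST flow (a simplified 
sketch: authentication headers and XML bodies are omitted, and every part 
except the last must be at least 5 MB):

POST /<object>?uploads                        (returns an UploadId)
PUT  /<object>?partNumber=N&uploadId=<id>     (repeated, one call per part)
POST /<object>?uploadId=<id>                  (completes the upload with the part ETags)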

Original comment by dmoore4...@gmail.com on 13 Dec 2010 at 9:49

GoogleCodeExporter commented 9 years ago
In my case it has nothing to do with large file sizes but with the number of 
files being transferred. All the files are less than 20 KB except for one that 
is about 4 MB. The internet connection is a full T3 line between our servers 
and Amazon's AWS. And again, this happens not only with rsync but also with cp 
or our own custom Java file copier.
Any way to debug this?

Original comment by jve...@gmail.com on 14 Dec 2010 at 10:11

GoogleCodeExporter commented 9 years ago
OK, sorry, I mis-speculated.

I haven't been able to reproduce this issue (which doesn't mean it doesn't 
exist), so it'll be tough for me to debug.

Here are some steps that you can do to try to debug this:

- run s3fs in foreground and debug modes while capturing stdout/stderr into a 
log file:

s3fs -d -f <bucket> <mountpoint> -o retries=20 2>&1 | tee /tmp/s3fs.log

- run rsync in extremely verbose mode

rsync -avvv --progress --stats --whole-file <source> <mountpoint> | tee /tmp/rsync.log

- wait for error to occur, and gather up the messages in syslog

grep s3fs /var/log/syslog > /tmp/s3fs-syslog.log

These log files may provide some clues as to what/where the issue might be.

You can also add more debug statements into the s3fs.cpp code itself and 
recompile, to try to narrow your search down to which function may be the 
culprit (or a contributing factor).
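
For example, an extra trace in the same style as the existing debug output (a 
hypothetical sketch, not a quote from s3fs.cpp; the real rename logic is 
elided):

#include <syslog.h>

static int s3fs_rename(const char *from, const char *to) {
  /* mirrors the rename[from=...][to=...] lines seen in the debug log */
  syslog(LOG_DEBUG, "rename[from=%s][to=%s]", from, to);
  /* ... existing rename logic ... */
  return 0;
}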

Original comment by dmoore4...@gmail.com on 14 Dec 2010 at 4:15

GoogleCodeExporter commented 9 years ago
OK, managed to crash it again while it was processing a file that is 4 KB on 
disk (2,304 bytes in size). So far my freeze rate is probably around 90%. Here 
are the last few debug lines:

access[path=/]
access[path=/p]
access[path=/p/projects]
access[path=/p/projects/502]
access[path=/p/projects/502/tiles]
access[path=/p/projects/502/tiles/50094_561615121_1172865_n.jpg]
open[path=/p/projects/502/tiles/50094_561615121_1172865_n.jpg][flags=32768]
downloading[path=/p/projects/502/tiles/50094_561615121_1172865_n.jpg][fd=285]
read[path=/p/projects/502/tiles/50094_561615121_1172865_n.jpg]
flush[path=/p/projects/502/tiles/50094_561615121_1172865_n.jpg][fd=285]
release[path=/p/projects/502/tiles/50094_561615121_1172865_n.jpg][fd=285]
flush[path=/p/projects/502/tiles/41649_561280596_6157_n.jpg][fd=284]
release[path=/p/projects/502/tiles/41649_561280596_6157_n.jpg][fd=284]
getattr[path=/p]
getattr[path=/p/projects]
getattr[path=/p/projects/502]
getattr[path=/p/projects/502/tiles]
getattr[path=/p/projects/502/tiles/41659_561690764_5017_n.jpg]
getattr[path=/p/projects/502/tiles/.fuse_hidden00000186000000bb]
rename[from=/p/projects/502/tiles/41659_561690764_5017_n.jpg][to=/p/projects/502/tiles/.fuse_hidden00000186000000bb]

Original comment by jve...@gmail.com on 14 Dec 2010 at 6:06

GoogleCodeExporter commented 9 years ago
How about the last 20 or so s3fs messages in /var/log/syslog?

grep s3fs /var/log/syslog | tail -n 20

Can you run it a few more times and see if the freeze occurs consistently on 
the rename function?

Original comment by dmoore4...@gmail.com on 14 Dec 2010 at 6:20

GoogleCodeExporter commented 9 years ago
Ok it froze again:
write[path=/p/502/generated.jpg]
write[path=/p/502/generated.jpg]
flush[path=/p/502/generated.jpg][fd=273]
uploading[path=/p/502/generated.jpg][fd=273][size=2110352]
release[path=/p/502/generated.jpg][fd=273]

previously in the log file [rename] worked fine:
...
getattr[path=/p/502/tiles/.fuse_hidden0000026800000001]
rename[from=/p/502/tiles/HsTZSDw4avx.gif][to=/p/canvases/502/tiles/.fuse_hidden0000026800000001]
copying[path=/p/502/tiles/.fuse_hidden0000026800000001]
...

Original comment by jve...@gmail.com on 14 Dec 2010 at 7:00

GoogleCodeExporter commented 9 years ago
I'm using s3fs on CentOS 5.5 and cannot find syslog. /var/log/messages is not 
showing anything relevant.

Original comment by jve...@gmail.com on 14 Dec 2010 at 7:01

GoogleCodeExporter commented 9 years ago
It appears that the standard installation of CentOS does not log DEBUG-level 
messages. To change this, you'll have to edit /etc/syslog.conf and change this 
line:

*.info;mail.none;authpriv.none;cron.none               /var/log/messages

to

*.debug;mail.none;authpriv.none;cron.none               /var/log/messages

...and restart the syslog daemon

% sudo /etc/init.d/syslog restart

s3fs debug messages will now appear in /var/log/messages when the -d option is 
used.

Original comment by dmoore4...@gmail.com on 15 Dec 2010 at 1:26

GoogleCodeExporter commented 9 years ago
Well, this might be reassuring: I have duplicated your problem (at least I 
think so) on my CentOS 5.5 virtual machine. It hung during the third file of 
an rsync.

How the heck were you able to recover your system from this? Killing the 
process (kill -9 PID), ctrl-c, etc. seem to have no effect.

I'm somewhat suspicious of the version of FUSE CentOS uses vs. what's on my 
Debian/Ubuntu machines: 2.7.4 vs. 2.8.4

The last s3fs debug message in /var/log/messages is:
Dec 14 18:51:15 centos s3fs: HTTP response code 404

This might give me something to work with.

Original comment by dmoore4...@gmail.com on 15 Dec 2010 at 1:59

GoogleCodeExporter commented 9 years ago
OK, turned on logging; will report after the next crash. To recover from a 
crash, simply 'umount -l' the mount point, kill the s3fs process, and restart 
s3fs.
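
For example, using the bucket and mount point from the 'ps' output earlier in 
this thread:

umount -l /mnt/repo
kill $(pidof s3fs)
s3fs myrepo876 /mnt/repo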

Original comment by jve...@gmail.com on 15 Dec 2010 at 12:38

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Latest crash:
access[path=/]
access[path=/p]
access[path=/p/projects]
access[path=/p/projects/505]
access[path=/p/projects/505/tl]
access[path=/p/projects/505/tl/41661_628172_8289_n.jpg]
open[path=/p/projects/505/tl/41661_628172_8289_n.jpg][flags=32768]
downloading[path=/p/projects/505/tl/41661_628172_8289_n.jpg][fd=564]
read[path=/p/projects/505/tl/41661_628172_8289_n.jpg]
getattr[path=/p]
getattr[path=/p/projects]
getattr[path=/p/projects/505]
getattr[path=/p/projects/505/tl]
getattr[path=/p/projects/505/tl/48822_913226_8374372_n.jpg]
getattr[path=/p/projects/505/tl/.fuse_hidden000000560000002c]
rename[from=/p/projects/505/tl/48822_913226_8374372_n.jpg][to=/p/projects/505/tl/.fuse_hidden000000560000002c]

/var/log/messages:
Dec 15 08:09:54 web2 s3fs: HTTP response code 200
Dec 15 08:09:54 web2 s3fs: URL is http://s3.amazonaws.com/myrepo123/p/projects/505/tl/48822_913226_8374372_n.jpg
Dec 15 08:09:54 web2 s3fs: URL changed is http://myrepo123.s3.amazonaws.com/p/projects/505/tl/48822_913226_8374372_n.jpg
Dec 15 08:09:54 web2 s3fs: connecting to URL http://myrepo123.s3.amazonaws.com/p/projects/505/tl/48822_913226_8374372_n.jpg
Dec 15 08:09:54 web2 s3fs: HTTP response code 200
Dec 15 08:09:54 web2 s3fs: URL is http://s3.amazonaws.com/myrepo123/p/projects/505/tl/.fuse_hidden000000560000002c
Dec 15 08:09:54 web2 s3fs: URL changed is http://myrepo123.s3.amazonaws.com/p/projects/505/tl/.fuse_hidden000000560000002c
Dec 15 08:09:54 web2 s3fs: connecting to URL http://myrepo123.s3.amazonaws.com/p/projects/505/tl/.fuse_hidden000000560000002c
Dec 15 08:09:54 web2 s3fs: HTTP response code 404

Original comment by jve...@gmail.com on 15 Dec 2010 at 1:32

GoogleCodeExporter commented 9 years ago
There's a wrapper around the curl library function curl_easy_perform; that is 
where the HTTP response code debug message comes from. This tells me that the 
call to libcurl's function returned "successfully" (i.e. it's not where the 
hang is).

Upon getting the HTTP response, our wrapper function "my_curl_easy_perform" 
returns a code back to the calling function. Several functions call 
my_curl_easy_perform, so I'll put in some more debug code to trace the call 
and return.

Original comment by dmoore4...@gmail.com on 15 Dec 2010 at 8:37

GoogleCodeExporter commented 9 years ago
This is looking more and more like a bug in FUSE itself. This bug has been 
previously identified in other FUSE-based file systems and has since been 
fixed.

I still can't recover my system reliably from the hang, even with the hint.  
Many times I need a restart.

I added more debug into my working copy, and the last hang occurred upon a 
system call to stat(). This appears to correlate with these bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=493565
https://bugs.launchpad.net/ubuntu/+source/fuse/+bug/634554

Fixed in Ubuntu 10.10 and Debian (squeeze, sid) by installing FUSE 2.8.4, but 
not in Fedora or CentOS -- that also correlates with my observation that I 
could not reproduce this on my main systems.

One way to gain confidence that this is the FUSE bug would be to compile and 
install 2.8.4 on a Red Hat-based system (I don't think there is an .rpm 
available yet; besides, I'm not yum/rpm savvy).

Original comment by dmoore4...@gmail.com on 16 Dec 2010 at 4:45

GoogleCodeExporter commented 9 years ago
I've upgraded FUSE to 2.8.4 and it seems to be working fine now. No crashes so 
far, and I've already done some extensive testing. Will report back if it 
freezes again, but the problem seems to have disappeared with FUSE 2.8.4.

Original comment by jve...@gmail.com on 16 Dec 2010 at 9:23

GoogleCodeExporter commented 9 years ago
Did you find an RPM, or did you have to download the tarball and 
compile/install it yourself? I plan on changing the minimum version of FUSE to 
2.8.4, but would like to provide some guidance in the Installation Notes on 
what Red Hat-based distros need to do to get FUSE 2.8.4 installed. Thanks.

I'll give this a try myself on my CentOS and Fedora virtual machines

Original comment by dmoore4...@gmail.com on 16 Dec 2010 at 11:26

GoogleCodeExporter commented 9 years ago
I couldn't get FUSE 2.8.4 installed on my CentOS 5.5 virtual machine to test, 
and I couldn't find an RPM for it. I'm going to assume that jvence's 
confirmation of the resolution is good enough, along with the fact that I 
wasn't able to duplicate this on Fedora 14 (another Red Hat OS) with FUSE 
2.8.5 installed.

Calling this one fixed and moving the minimum FUSE version needed to 2.8.4.

Original comment by dmoore4...@gmail.com on 17 Dec 2010 at 4:26

GoogleCodeExporter commented 9 years ago
Closed with r281

Original comment by dmoore4...@gmail.com on 17 Dec 2010 at 4:41

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Upgrading to FUSE 2.8.4:
wget http://downloads.sourceforge.net/project/fuse/fuse-2.X/2.8.4/fuse-2.8.4.tar.gz
tar zxf fuse-2.8.4.tar.gz
cd fuse-2.8.4 
./configure 
make
make install 
echo /usr/local/lib >> /etc/ld.so.conf 
ldconfig 
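
To confirm the new version is the one being picked up (as shown in the 
follow-up comment below):

fusermount --version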

Original comment by jve...@gmail.com on 17 Dec 2010 at 6:40

GoogleCodeExporter commented 9 years ago
Just a follow-up: installed FUSE 2.8.4 on my CentOS virtual machine, and all 
looks to be good:

% rsync .....

Number of files: 1323
Number of files transferred: 1309
Total file size: 1665096709 bytes
Total transferred file size: 1665096709 bytes
Literal data: 1665096709 bytes
Matched data: 0 bytes
File list size: 23521
File list generation time: 0.096 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1665381152
Total bytes received: 28902

sent 1665381152 bytes  received 28902 bytes  108034.77 bytes/sec
total size is 1665096709  speedup is 1.00
[mooredan@centos ~]$ cat /etc/issue
CentOS release 5.5 (Final)
Kernel \r on an \m

[mooredan@centos ~]$ fusermount --version
fusermount version: 2.8.4
[mooredan@centos ~]$ s3fs --version
Amazon Simple Storage Service File System 1.26
Copyright (C) 2010 Randy Rizun <rrizun@gmail.com>
License GPL2: GNU GPL version 2 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Original comment by dmoore4...@gmail.com on 19 Dec 2010 at 6:53

GoogleCodeExporter commented 9 years ago
I am using FUSE 2.8.4 and am still seeing the freeze. I am using an FTP server 
to upload to an s3fs mount point. My observation is that the issue is 
reproducible when large files are uploaded.

Original comment by ravindra...@gmail.com on 15 Jan 2011 at 5:40

GoogleCodeExporter commented 9 years ago
Having the same thing here. It seems to freeze. Should I install the SVN 
version, or is the latest release supposed to fix this?

Original comment by marc.tru...@gmail.com on 20 Jan 2011 at 6:18

GoogleCodeExporter commented 9 years ago
Sorry, using FUSE 2.8.5 - is this a problem?

Original comment by marc.tru...@gmail.com on 20 Jan 2011 at 6:19

GoogleCodeExporter commented 9 years ago
The latest tarball is supposed to have fixed this issue -- it did fix the 
original issue that was reported and duplicated. This seems to be a new issue, 
although the symptoms may appear similar. The FTP thing is a new wrinkle.

I don't see any reason why Fuse 2.8.5 would be a problem, but I haven't 
personally used it (yet).

I recommend opening a new issue concerning your problem with details, details, 
details, VERY SPECIFIC DETAILS. If we cannot reproduce the issue, there's 
little we can do to try to fix it.

Original comment by dmoore4...@gmail.com on 20 Jan 2011 at 4:32

GoogleCodeExporter commented 9 years ago
I just found that FUSE 2.8.5 will not work for me with s3fs 1.57. Going back 
to FUSE 2.8.4 fixed the problem instantly. The problem seems to occur 
when/after handing control over to FUSE with the call to fuse_main(...) in 
s3fs.cpp. As far as I could tell from debug output statements in the code, 
everything else works fine up to that point. With FUSE 2.8.5, s3fs terminated 
instantly after calling fuse_main. Unfortunately I can't provide an easily 
reproducible test scenario, since this was on a custom-built system loosely 
based on Debian with lots of manual updates from wheezy/sid.

Original comment by i...@readynasxtras.com on 10 Jul 2011 at 11:37

GoogleCodeExporter commented 9 years ago
how do you unmount?

Original comment by mike54...@gmail.com on 26 Nov 2011 at 4:05