Closed by GoogleCodeExporter 9 years ago
If you need any additional information, or any debugging steps I could
undertake: please let me know.
Original comment by ptrm...@gmail.com
on 28 Jan 2011 at 10:07
Hmm, I just noticed that the wgets actually seem to have succeeded. All the
files seem to be there. The processes were running overnight, and this morning
I noticed the "Transport endpoint is not connected" error.
Original comment by ptrm...@gmail.com
on 28 Jan 2011 at 10:11
Ok, able to reproduce now!
1. Do one of the wgets (alternatively, this probably works with any s3fs-mounted
directory that contains enough files)
2. cd to the directory
3. ls (waits ~2 seconds, then says "ls: reading directory .: Software caused
connection abort")
4. ls (says "ls: cannot open directory .: Transport endpoint is not connected")
So the problem seems to be listing the contents of a directory with a large
number of files (> 10,000 in this case). Could it be that s3fs does not deal
well with directories containing more than 1,000 files? (That is the default for
max-keys in a GET Bucket request, according to
http://docs.amazonwebservices.com/AmazonS3/latest/API/.)
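To illustrate what I mean about max-keys: the listing API is paginated, so a
client has to loop with a marker until the response is no longer truncated. A
rough C++ sketch of that loop (http_get_bucket and parse_keys are hypothetical
stand-ins here, not actual s3fs functions):

#include <string>
#include <vector>

// Hypothetical stand-ins for the real HTTP GET and XML-parsing code.
std::string http_get_bucket(const std::string& bucket,
                            const std::string& marker, int max_keys);
bool parse_keys(const std::string& xml, std::vector<std::string>* keys,
                std::string* last_key);  // returns true if <IsTruncated> was true

std::vector<std::string> list_all_keys(const std::string& bucket) {
  std::vector<std::string> keys;
  std::string marker;            // empty marker = start at the beginning
  bool truncated = true;
  while (truncated) {
    // S3 returns at most max-keys (<= 1000) entries per GET Bucket request.
    std::string xml = http_get_bucket(bucket, marker, 1000);
    std::string last_key;
    truncated = parse_keys(xml, &keys, &last_key);
    marker = last_key;           // next page starts after the last key seen
  }
  return keys;
}

So a directory with more than 1,000 entries necessarily takes several listing
requests, if s3fs pages through them at all.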
Original comment by ptrm...@gmail.com
on 28 Jan 2011 at 10:19
I have seen something similar too.
Initially, changing the time-outs fixed this, but when more files were
synced to S3 the same thing started to happen again, consistently after 25-30 s.
As a way of testing, try creating 3 or 4 thousand directories in an S3 bucket
and then mounting the filesystem; do an ls on the mounted dir, or try to rsync
it back, and it will error.
It seems that there's a timeout somewhere when listing large numbers of files
or folders from S3 which isn't overridable by an option, so ls- and rsync-type
operations fail.
Original comment by chrisjoh...@gmail.com
on 31 Jan 2011 at 11:42
Also seeing this behavior under Ubuntu 10.04 with fuse 2.8.4 and s3fs 1.35 built
from source.
This problem seemed to start around s3fs 1.25 for us.
If there's anything I can do to help further diagnose the problem please let me
know.
Original comment by Sean.B.O...@gmail.com
on 2 Feb 2011 at 7:37
For your information, I had the same issue, and rolling back to s3fs 1.19
(following Sean's comment) fixes it.
Original comment by yeoha...@gmail.com
on 2 Feb 2011 at 10:19
Same behavior on Debian Lenny with s3fs 1.35/fuse 2.8.5 built from source.
Original comment by ben.lema...@gmail.com
on 3 Feb 2011 at 4:06
syslog debug output:
Original comment by ben.lema...@gmail.com
on 3 Feb 2011 at 6:25
Attachments:
Thanks for the debug_output -- I have a good guess at a change that will
mitigate the issue -- since I haven't tried to reproduce the issue yet (I don't
have a bucket with thousands of files), I'm looking for a volunteer to test the
patch
...any takers?
I don't think the patch addresses the underlying issue, though, which is how
directory listings are done. s3fs_readdir is probably the most complex piece of
this code and probably needs some tuning.
Original comment by dmoore4...@gmail.com
on 4 Feb 2011 at 12:52
I'll definitely test, I've got a few buckets with ~30K files in them.
Original comment by ben.lema...@gmail.com
on 4 Feb 2011 at 1:16
Give this patch a try.
Original comment by dmoore4...@gmail.com
on 4 Feb 2011 at 3:39
Attachments:
I think that the patch resolves the "transport endpoint not connected" issue,
but you'll still get input/output errors when listing a directory with A LOT of
files
...can someone confirm?
Original comment by dmoore4...@gmail.com
on 4 Feb 2011 at 4:55
I now get:
ls: reading directory .: Input/output error
Subsequently, ls'ing or cd'ing to another directory times out after about 20
seconds:
cd: <directory name>: Input/output error
Original comment by ptrm...@gmail.com
on 4 Feb 2011 at 9:34
It appears that the main contributing factor to this issue is the number of
files in a directory. Having a large number of files in a single directory (I
can't quantify "large" just yet, but it seems to be >= 1,000) isn't illegal, but
the HTTP traffic that a directory listing generates appears to cripple the file
system.
I personally have never seen this issue with my buckets, but apparently the
practices that I use are not everyone's practices.
I created a bogus (to me) test case to try to duplicate the issue. The patch
above resolves one of the initial failure points, but just pushes the issue
back further.
An understanding of how directory listings are done in s3fs is necessary to
implement a fix. Briefly, a query is made to S3 asking for a listing of
objects that match a pattern (reminder: there is no native concept of
directories in S3; that is not how things are stored). For each of the
matching objects, another query is then made to retrieve its attributes. So a
simple "ls" of an s3fs directory can generate A LOT of HTTP traffic.
It appears that Randy (the original author) attempted to address performance
issues with this by using advanced methods in the CURL API. Things are
pointing to that area.
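For anyone following along, the "advanced methods" are libcurl's multi
interface, which drives many easy handles concurrently. A minimal sketch of
that pattern (my own illustration, not code lifted from s3fs.cpp; it assumes
curl_global_init() has already been called):

#include <curl/curl.h>
#include <string>
#include <vector>

// Issue a HEAD request for every URL concurrently via the multi interface.
void head_in_parallel(const std::vector<std::string>& urls) {
  CURLM* multi = curl_multi_init();
  std::vector<CURL*> handles;

  for (const std::string& url : urls) {
    CURL* easy = curl_easy_init();
    curl_easy_setopt(easy, CURLOPT_URL, url.c_str());
    curl_easy_setopt(easy, CURLOPT_NOBODY, 1L);   // HEAD: headers only
    curl_multi_add_handle(multi, easy);
    handles.push_back(easy);
  }

  int still_running = 0;
  curl_multi_perform(multi, &still_running);
  while (still_running) {
    curl_multi_wait(multi, NULL, 0, 1000, NULL);  // wait for socket activity
    curl_multi_perform(multi, &still_running);
  }

  for (CURL* easy : handles) {
    curl_multi_remove_handle(multi, easy);
    curl_easy_cleanup(easy);
  }
  curl_multi_cleanup(multi);
}

A real implementation would also capture each response's headers (e.g. via
CURLOPT_HEADERFUNCTION) to recover size and mtime for each object.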
Fixing this may be an easy fix or a major rewrite, I do not know. As I
mentioned, this section of code is one of the more complex sections in s3fs.cpp.
One thought that I have is to scrap the multicurl stuff and replace it with a
simpler brute-force algorithm. That may fix the issue, but the trade-off will
probably be performance.
Original comment by dmoore4...@gmail.com
on 4 Feb 2011 at 5:09
Is it possible we're running into HTTP KeepAlive issues?
After enabling CURLOPT_VERBOSE (s3fs.cpp+586), the output at failure includes a
line for each HTTP request: "Connection #0 to host example.s3.amazonaws.com
left intact".
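(For anyone who wants to reproduce this, verbose tracing is a one-liner on the
easy handle, assuming a CURL* handle named curl as in s3fs.cpp:)

curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);  // libcurl logs request/connection details to stderr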
It seems to make sense: modifying the 'max-keys' query parameter from 50 to
1000 does allow more objects to be returned per request; however, the amount of
time before a failure remains the same: ~25s
$# cd /mnt/cloudfront/images
$# time ls
ls: reading directory .: Input/output error
real 0m25.357s
user 0m0.000s
sys 0m0.000s
$# cd /mnt/cloudfront/images
$# time ls
ls: reading directory .: Input/output error
real 0m26.869s
user 0m0.000s
sys 0m0.000s
$# cd /mnt/cloudfront/images
$# time ls
ls: reading directory .: Input/output error
real 0m26.274s
user 0m0.000s
sys 0m0.000s
Original comment by ben.lema...@gmail.com
on 4 Feb 2011 at 6:31
Attachments:
It looks like KeepAlive may actually be the issue; forcing the connection to
close after each request does fix the problem. However, I'm not convinced it's
a solid solution, as it's quite slow :\
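The change boils down to forbidding connection reuse on the easy handle,
roughly this (the attached patch has the actual change):

// Disable keep-alive: libcurl closes the connection when the request finishes.
curl_easy_setopt(curl, CURLOPT_FORBID_REUSE, 1L);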
Attached is a patch for testing.
Original comment by ben.lema...@gmail.com
on 4 Feb 2011 at 7:01
Attachments:
Ben,
Great find. I tested this on an EC2 instance (well connected to S3) by doing a
directory listing of a bucket
that contains 10500 files. No I/O error -- it was dog slow, but it worked:
% time ls -l
...
real 1m42.467s
user 0m0.160s
sys 0m0.670s
I was able to switch the issue on and off by removing the CURLOPT_FORBID_REUSE option:
$ time ll
ls: reading directory .: Input/output error
total 0
Looks like a good fix to me.
...more testing:
On my home machine (not so well connected to the internet) I tried the same fix
and did a directory listing of
the same 10500 file bucket. Again, no I/O error and the listing completed, but
it took nearly half an hour:
% date ; /bin/ls -l /mnt/s3/misc.suncup.org/ | wc -l ; date
Fri Feb 4 18:06:04 MST 2011
10503
Fri Feb 4 18:31:14 MST 2011
I'll do a subversion commit shortly. Thanks so much for your contribution.
Original comment by moore...@gmail.com
on 5 Feb 2011 at 1:33
Pieter, please test r308 and report back. Thanks.
Original comment by dmoore4...@gmail.com
on 5 Feb 2011 at 2:05
I'm currently out of the country and have limited internet access. I will be
able to test next week!
Original comment by ptrm...@gmail.com
on 9 Feb 2011 at 12:45
Ok, r308 seems to solve the problem.
Original comment by ptrm...@gmail.com
on 14 Feb 2011 at 12:23
Original comment by dmoore4...@gmail.com
on 14 Feb 2011 at 5:20
I still have this problem on r368,
running on Amazon EC2 (Amazon Linux AMI).
Original comment by K.Quiatk...@mytaxi.net
on 9 Jul 2012 at 6:59
I'm having the same issue with version 1.6.1 (r368) on CentOS 5.
Original comment by johnog...@gmail.com
on 12 Jul 2012 at 10:24
We are having the same issue with version 1.6.1 (r368) on Debian Squeeze.
Reverting back to 1.6.0 and trying again; will post back with results.
Original comment by chris_r...@someones.com
on 17 Oct 2012 at 4:36
I'm having the same issue with CentOS 5.8 32-bit. I have tried s3fs 1.61 and
1.35 with the same outcome.
Original comment by gmason.x...@gmail.com
on 30 Oct 2012 at 12:12
We are also having this issue with 1.6.1 on an Amazon EC2 AMI.
Our directory doesn't have too many files, though (200 or so).
Original comment by ferran.m...@mmip.es
on 16 Jan 2013 at 3:07
I'm having the same issue on the latest s3fs when trying to upload to an S3
bucket; it appears to happen only with folders that contain a lot of files and
folders.
Original comment by nickz...@gmail.com
on 18 Jun 2013 at 4:46
Hi nickzoid,
I could not reproduce this problem on s3fs (r449).
(I tested copying many files (over 5000) to S3.)
If you can, please try r449 and test it with the "multireq_max" and "nodnscache"
options.
** If you still have the same problem, please post a NEW ISSUE with more
information.
Thanks in advance for your assistance.
Original comment by ggta...@gmail.com
on 20 Jun 2013 at 1:27
Original issue reported on code.google.com by
ptrm...@gmail.com
on 28 Jan 2011 at 10:07