yostzach / s3fs

Automatically exported from code.google.com/p/s3fs
GNU General Public License v2.0

Performance feedback: relatively slow read performance detected #24

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
After installing quite a few dev libraries I managed to get s3fs compiled and working on an EC2 Ubuntu 7.10 instance.  I've been having a lot of fun with it, thanks for taking the effort to write this nice tool.

I'm testing s3fs to see if I can use it to read data (a CSV file in this
case) in parallel across multiple EC2 servers.  

See also: 
- http://www.ibridge.be/?p=101
- http://kettle.pentaho.org

Here is feedback on the write and read performance of s3fs when dealing
with large files like the one I used. (single instance for the time being)

WRITE PERFORMANCE:
-------------------

root@domU-12-31-35-00-2A-52:~# time cp /tmp/customers-25M.txt /s3/kettle/

real    6m27.266s
user    0m0.260s
sys     0m9.210s

root@domU-12-31-35-00-2A-52:~# ls -lrta /tmp/customers-25M.txt
-rw-r--r-- 1 matt matt 2614561970 Apr  4 19:53 /tmp/customers-25M.txt

2614561970 bytes / 387.266 s ≈ 6.4 MB/s write throughput

READ PERFORMANCE:
-------------------

root@domU-12-31-35-00-2A-52:~# time wc -l /s3/kettle/customers-25M.txt
25000001 /s3/kettle/customers-25M.txt

real    4m36.054s
user    0m0.810s
sys     0m0.950s

2614561970 bytes / 276.054 s ≈ 9.0 MB/s read throughput
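
For reference, both figures can be reproduced directly from the byte count and the "real" times above (a small Python check, nothing more):

file_bytes = 2614561970      # size of customers-25M.txt
write_secs = 387.266         # "real" time of the cp into /s3/kettle/
read_secs  = 276.054         # "real" time of the wc -l

MIB = 1024 * 1024
print(file_bytes / write_secs / MIB)   # ~6.4 MB/s write
print(file_bytes / read_secs / MIB)    # ~9.0 MB/s read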

I couldn't care less about the write performance, but I would have expected
the read capacity to be higher and so I did a little investigation. 
Apparently, there is intensive "caching" going on on the local disk.  This
happens at around 10MB/s.  When that is done, the actual reading takes
place at 60+MB/s. (see below)

It would be nice if you could find a way to disable this disk-based caching
system altogether.  I tried to create a small ram disk fs and used option
-use_cache /tmp/ramdisk0 but the error I got was:

wc: /s3/mattcasters/customers-25M.txt: Bad file descriptor
0 /s3/mattcasters/customers-25M.txt

The /tmp/ramdisk0/ file system was small, very likely too small to hold the 2.4GB file. (it was at 100% usage after the test and contained part of the file)

I believe that S3 charges per transfer request, not just per data volume transferred, so perhaps you are doing the right thing cost-wise.  However, perhaps it should be possible for users like me to set some kind of max block size parameter.  With this you could allow the creation of a memory-based cache (say a few hundred MB) that doesn't have a file-writing I/O penalty.

That would perhaps also help in the case where you don't want to read the
complete file, but only a portion of it. This can be interesting in our
parallel read case. (where each of the EC2 nodes is going to do a seek in
the file and read 1/Nth of the total file)
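
To make the block-size idea concrete, here is a rough sketch of such a memory-cached, block-based reader (purely illustrative, and in Python rather than the C++ s3fs is written in; fetch_range() is a hypothetical stand-in for an HTTP range GET against the object):

from collections import OrderedDict

BLOCK_SIZE   = 8 * 1024 * 1024    # the proposed "max block size" parameter
CACHE_BLOCKS = 32                 # ~256MB of memory-based cache

def fetch_range(offset, length):
    # Placeholder for an HTTP range GET ("Range: bytes=<offset>-<offset+length-1>")
    raise NotImplementedError

class BlockCachedReader:
    def __init__(self):
        self.blocks = OrderedDict()               # block index -> bytes, in LRU order

    def _block(self, index):
        if index in self.blocks:
            self.blocks.move_to_end(index)        # mark as most recently used
        else:
            self.blocks[index] = fetch_range(index * BLOCK_SIZE, BLOCK_SIZE)
            if len(self.blocks) > CACHE_BLOCKS:
                self.blocks.popitem(last=False)   # evict the least recently used block
        return self.blocks[index]

    def read(self, offset, length):
        # Serve a read() by touching only the blocks that cover [offset, offset+length)
        out = bytearray()
        while length > 0:
            index, skip = divmod(offset, BLOCK_SIZE)
            chunk = self._block(index)[skip:skip + length]
            if not chunk:                         # past the end of the file
                break
            out += chunk
            offset += len(chunk)
            length -= len(chunk)
        return bytes(out)

With something like this, a slave that only needs its 1/Nth slice of the file would keep at most a few hundred MB in memory and never touch the local disk.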

Matt

"iostat -k 5" during cache creation:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.40    0.00    1.20   22.51    0.40   75.50

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda1            424.90         0.80      8916.33          4      44760
sda2              0.00         0.00         0.00          0          0
sda3              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.60   33.60    0.40   65.40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda1            444.20         0.00      9087.20          0      45436
sda2              0.00         0.00         0.00          0          0
sda3              0.00         0.00         0.00          0          0

"iostat -k 5" during cache read-out:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.00    0.00    9.98   87.82    0.20    0.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda1           1319.96     55572.06         2.40     278416         12
sda2              0.00         0.00         0.00          0          0
sda3              0.20         0.00         0.80          0          4

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.80    0.00   10.00   86.80    1.40    0.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda1           1245.40     52591.20         2.40     262956         12
sda2              0.00         0.00         0.00          0          0
sda3              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00   10.38   88.42    0.00    0.20

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda1           1191.82     50286.63         2.40     251936         12
sda2              0.00         0.00         0.00          0          0
sda3              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.40    0.00   10.38   86.03    0.80    0.40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda1           1370.26     57829.94         2.40     289728         12
sda2              0.00         0.00         0.00          0          0
sda3              0.00         0.00         0.00          0          0

Original issue reported on code.google.com by mattcast...@gmail.com on 5 Apr 2008 at 12:20

GoogleCodeExporter commented 9 years ago
Hi there- thanks for the feedback!

>>> Apparently, there is intensive "caching" going on on the local disk.  This
happens at around 10MB/s.  When that is done, the actual reading takes
place at 60+MB/s.

Indeed- when you do "wc -l /s3/kettle/customers-25M.txt" this is what s3fs does:

 * open() is called
  * s3fs unconditionally downloads entire file to local tmp file
  * open() returns back to caller
 * one or more read() calls in "chunks"/ranges with a start/offset
  * s3fs fields each read() call from that local tmp file

"Apparently, there is intensive "caching" going on on the local disk."

That is the part where s3fs is downloading the entire file to local tmp file in
open() before returning to caller

"When that is done, the actual reading takes place at 60+MB/s."

That is because s3fs is fielding all read() requests from local tmp file

So, ya, that all makes sense...
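
In rough pseudocode, the flow described above looks something like this (an illustrative sketch only, not the actual s3fs source; download_object() stands in for the full-object HTTP GET):

import tempfile

def download_object(bucket, key, dest):
    # Stand-in for the full-object GET that s3fs issues inside open()
    raise NotImplementedError

def s3fs_open(bucket, key):
    # open(): the whole object is pulled down to a local tmp file first;
    # this is the ~10MB/s "caching" phase visible in the iostat output above
    tmp = tempfile.NamedTemporaryFile(delete=False)
    download_object(bucket, key, tmp)
    tmp.close()
    return open(tmp.name, "rb")

def s3fs_read(handle, offset, length):
    # read(): every chunk is served from the local tmp file, which is why the
    # second phase runs at local-disk speed (60+MB/s)
    handle.seek(offset)
    return handle.read(length)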

"It would be nice if you could find a way to disable this disk-based caching
system altogether."

[that is *not* a reference to "use_cache"]

Yes- I think the use of HTTP partial/range GET would work here; I was actually just thinking about adding this yesterday!

This would allow a client that is only interested in a small portion of the file to *not* have to pay the penalty of waiting to download the entire file just to read a few bytes (e.g., say a header/signature or something)

Also, note that this is largely unrelated to "use_cache" option.

Let me put some thought into how to implement this "use of HTTP partial/range GET"...
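
For reference, a range GET is just an ordinary GET with a Range header; S3 answers a satisfiable range request with "206 Partial Content" and returns only the requested bytes. A minimal sketch using Python's standard library (the object URL is a made-up example; a real request would need authentication or a public object):

import urllib.request

url = "https://kettle.s3.amazonaws.com/customers-25M.txt"   # hypothetical object URL

req = urllib.request.Request(url, headers={"Range": "bytes=0-1048575"})
with urllib.request.urlopen(req) as resp:
    print(resp.status)                          # 206 Partial Content
    print(resp.headers.get("Content-Range"))    # e.g. "bytes 0-1048575/2614561970"
    first_mib = resp.read()                     # only the first 1MiB crosses the wire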

Thanks!

Original comment by rri...@gmail.com on 5 Apr 2008 at 4:54

GoogleCodeExporter commented 9 years ago
rrizun, you understood me perfectly, it is indeed not a reference to the "use_cache" option.  I would add one option to enable/disable local "mirroring/copying/caching" of the files (enabled by default) and one to specify the size of the cache to use.

If you do happen to make that option available, ping me and I'll test it straight away :-)

Original comment by mattcast...@gmail.com on 5 Apr 2008 at 6:18

GoogleCodeExporter commented 9 years ago
More testing and benchmarking fun...

Parallel read test : 2 slave servers, 2 steps reading

On local filesystem: /tmp/
Slave1 :  (0-1307280984)
  - start : 12:26:39,085  (CSV Starting to run)
  - end   : 12:31:29,358  (Dummy finished)
Slave2 : (1307280985-2614561969)
  - start : 12:26:39,157  (CSV Starting to run)
  - end   : 12:32:45,503  (Dummy finished)

Transformation runtime : 12:26:39.085 - 12:32:45.503 = 6 minutes 6.418 seconds = 366.418 seconds

On s3fs filesystem: /bigone/
Slave1 :  (0-1307280984)
  - start : 12:37:19,333 (CSV Starting to run)
  - header: 12:41:28,288 (Header row skipped)
  - end   : 12:44:08,159 (Dummy finished)
Slave2 : (1307280985-2614561969)
  - start : 12:37:19,437  (CSV Starting to run)
  - header: 12:42:27,335  (CSV first feedback)
  - end   : 12:45:15,133  (Dummy finished)

Transformation runtime : 12:37:19,333 - 12:45:15,133 = 7 minutes 55.800 seconds = 475.800 seconds
However, copying the file to local disk takes : 12:37:19,333 - 12:41:28,288 = 4 minutes 9 seconds.
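
For reference, those runtimes can be recomputed directly from the logged timestamps (a small Python check):

from datetime import datetime

def seconds(start, end):
    fmt = "%H:%M:%S,%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

print(seconds("12:26:39,085", "12:32:45,503"))   # local run:          ~366.4 s
print(seconds("12:37:19,333", "12:45:15,133"))   # s3fs run:           ~475.8 s
print(seconds("12:37:19,333", "12:41:28,288"))   # copy to local disk: ~249.0 s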

As such, there is a huge performance gain to be had if we could read and position directly on the S3 data instead of using locally copied data.
There is also a speedup to be had by adding more copies to the cluster, so we could conceivably launch 10 EC2 systems.
Let's see where that would bring us.

The copy would most likely still take around 4 minutes, because that copy operation is limited by the speed of the local disk.

Copying a 2.4GB file in 4 minutes equates to 10MB/s write speed.

The slaves read the block of data back in around 165 seconds, or 7.5MB/s each and 15MB/s in total.

IF the copy operation scales, we can reach 10x7.5MB/s or 75MB/s with 10 slaves.
That would mean we would read the data in about 33 seconds + 4 minutes 9 seconds for the copy of the data...
Let's say we do it in 5 minutes.
That would mean : 250M rows / 300 seconds = 833k rows/s (only a little bit faster than my laptop)

OK, now suppose we would read directly from S3 at the same leisurely speed of 7.5MB/s. (most likely feasible since we now do it at 10MB/s)
That would mean we would process the file in 33 seconds. (perhaps it's possible to go faster)
--> 7.5M rows/s 

At 10MB/s per slave we would hit 100MB/s throughput and we would process the file in 25 seconds at 10M rows/s. (I would be happy with half that speed)
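
Spelled out, the back-of-the-envelope numbers above look like this (a small Python check using the figures quoted in this comment, including the 250M-row total):

MIB = 1024 * 1024
file_bytes = 2614561970
rows = 250_000_000                            # row count used in the projection above
copy_secs = 249                               # ~4 minutes 9 seconds to copy locally

# Copy first, then read locally, with 10 slaves at 7.5MB/s each:
read_secs = file_bytes / (10 * 7.5 * MIB)     # ~33 s
print(copy_secs + read_secs)                  # ~282 s; "let's say 5 minutes"
print(rows / 300)                             # ~833k rows/s

# Read directly from S3 at the same 7.5MB/s per slave:
print(rows / read_secs)                       # ~7.5M rows/s

# At 10MB/s per slave (100MB/s aggregate):
print(rows / (file_bytes / (100 * MIB)))      # ~10M rows/s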

Conclusion: allowing the user to bypass the local file copying would allow us to gain *really* interesting speeds.  I would make you famous for it too :-)

Matt <mcasters (at) pentaho dot org>

Original comment by mattcast...@gmail.com on 5 Apr 2008 at 9:11

GoogleCodeExporter commented 9 years ago
Did the "use of HTTP partial/range GET" ever get implemented? We need to be 
able to read the header of a file and maybe do some small further reads after 
that. I am trying to figure out if I can use s3fs to do this, or if I would 
have to write my own custom system.

Thanks!

Original comment by zdrumm...@gmail.com on 18 Jul 2011 at 4:18

GoogleCodeExporter commented 9 years ago
Hi,

s3fs has been updated to v1.72. This version supports parallel upload/download and on-demand download, among other performance improvements.

Please check the new version, and if you find a bug, please open a new issue for it.

I am closing this issue because it is old; if you have feedback on the new version's performance, please report it in a new issue.

Thanks in advance.
Regards,

Original comment by ggta...@gmail.com on 10 Aug 2013 at 5:17