xiongxu / s3fs

Automatically exported from code.google.com/p/s3fs
GNU General Public License v2.0

Performance optimization for static s3 data files #168

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
In my application, I'm using s3fs to access a static data set.  So once 
something is in the cache, I don't need to check s3 to see if it has changed.

I made the following changes to the source tree to optimize this case.  Don't 
really care if you mainline this, but I figured you might want it, since it 
must be a fairly common use-case:
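The idea in the patch below can be sketched as a standalone function (names and the cache-path scheme here are illustrative stand-ins, not the actual s3fs internals): when the data set is known to be static, an existing cached copy is returned immediately, skipping the HEAD request to S3 entirely.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Hypothetical stand-in for s3fs's cache-file naming scheme.
static std::string cache_path_for(const std::string& cache_dir,
                                  const std::string& object_path) {
  return cache_dir + object_path;
}

// Returns an open fd for the cached copy, or -1 if there is no usable
// cache entry (in which case the caller falls through to the normal
// fetch-and-verify path, as the unpatched code always does).
int open_cached_if_static(const std::string& cache_dir,
                          const std::string& object_path,
                          bool assume_static) {
  if (!assume_static || cache_dir.empty())
    return -1;  // static mode off or no cache dir: always re-validate with S3
  return open(cache_path_for(cache_dir, object_path).c_str(), O_RDWR);
}
```

If the cached file is missing, `open` returns -1 and the normal consistency-checking path runs, so the fast path is purely an optimization.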

*************** int get_local_fd(const char* path) {
*** 1102,1107 ****
--- 1102,1113 ----

    headers_t responseHeaders;

+   if (assume_s3_is_static && (use_cache.size() > 0)) {
+     fd = open(cache_path.c_str(), O_RDWR);
+     if (fd != -1)
+         return fd;
+   }
+ 
    if (use_cache.size() > 0) {
      result = get_headers(path, responseHeaders);
      if(result != 0) {
*************** static void show_help (void) {
*** 4392,4397 ****
--- 4398,4406 ----
      "   use_cache (default=\"\" which means disabled)\n"
      "      - local folder to use for local file cache\n"
      "\n"
+     "   assume_s3_is_static (default=\"\" which means false)\n"
+     "      - assume that files in the local file cache are up to date when set to 1\n"
+     "\n"
      "   use_rrs (default=\"\" which means diabled)\n"
      "      - use Amazon's Reduced Redundancy Storage when set to 1\n"
      "\n"
*************** static int my_fuse_opt_proc(void *data, 
*** 4533,4538 ****
--- 4542,4557 ----
        use_cache = strchr(arg, '=') + 1;
        return 0;
      }
+     if (strstr(arg, "assume_s3_is_static=") != 0) {
+       if ((assume_s3_is_static = (strcmp(strchr(arg, '='), "=1") == 0)) || 
+           strcmp(strchr(arg, '='), "=")  == 0 ) {
+         return 0;
+       } else {
+          fprintf(stderr, "%s: poorly formed argument to option: assume_s3_is_static\n", 
+                  program_name.c_str());
+          exit(1);
+       }
+     }
      if (strstr(arg, "use_rrs=") != 0) {
        use_rrs = strchr(arg, '=') + 1;
        if (strcmp(use_rrs.c_str(), "1") == 0 || 
diff -p /Other/download/s3fs-1.35/src/s3fs.h src/s3fs.h
*** /Other/download/s3fs-1.35/src/s3fs.h    2010-12-29 18:02:30.000000000 -0500
--- src/s3fs.h  2011-03-07 12:59:40.000000000 -0500
*************** static string use_rrs;
*** 56,61 ****
--- 56,63 ----
  static string ssl_verify_hostname = "1";
  static string public_bucket;

+ static bool assume_s3_is_static = 0;
+ 
  // TODO(apetresc): make this an enum
  // private, public-read, public-read-write, authenticated-read
  static string default_acl("private");

Original issue reported on code.google.com by kaon.int...@gmail.com on 7 Mar 2011 at 6:58

GoogleCodeExporter commented 9 years ago
Might this be considered something like a "read-only mode"? If so, I would make 
it more restrictive.

I'm looking at this from a support point of view.

This mode may already be available: what happens if the "ro" option is given?  
In that case, s3fs would just need to detect it and forgo the consistency 
checking.

Original comment by dmoore4...@gmail.com on 7 Mar 2011 at 7:46

GoogleCodeExporter commented 9 years ago
It's different from read-only, because it is assuming nobody *ELSE* is writing, 
which isn't the same as saying *I* am not writing.

Even if I am in read-only mode, checking for the underlying s3 datastore 
changing is the correct behavior.

The behavior I implemented is only correct if you know that the s3 datastore 
cannot itself be changed.  For example, the content in a CDN would typically 
have this attribute, because CDN's use aggressive downstream caching.  So 
whenever they want to change a file, they actually make a brand new file.

-Joshua

Original comment by kaon.int...@gmail.com on 7 Mar 2011 at 8:09

GoogleCodeExporter commented 9 years ago
Update: I found that the stat_cache is being cleared every time it is read (why 
is that?), and it isn't being filled except when you do a directory listing.  
I've addressed both of these in the following patch, so that reading a file 
which is locally cached now requires NO round trips to S3, resulting in a 
several-thousand-fold increase in speed:
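The erase-on-read behavior described above, versus retaining entries in static mode, can be sketched roughly like this (a simplified stand-in: a plain value type instead of `struct stat`, `std::mutex` instead of the pthread mutex the real code uses):

```cpp
#include <map>
#include <mutex>
#include <string>

struct StatCache {
  bool assume_static = false;
  std::map<std::string, long> entries;  // value stands in for struct stat
  std::mutex lock;

  // Returns true on a hit. In the stock behavior the entry is erased after
  // one use, so the next getattr costs another round trip to S3; in static
  // mode the entry is retained.
  bool get(const std::string& path, long* out) {
    std::lock_guard<std::mutex> g(lock);
    auto it = entries.find(path);
    if (it == entries.end()) return false;
    *out = it->second;
    if (!assume_static) entries.erase(it);  // stock: one hit per fill
    return true;
  }

  void put(const std::string& path, long value) {
    std::lock_guard<std::mutex> g(lock);
    entries[path] = value;
  }
};
```

With `assume_static` set, a path that has been stat'ed once keeps hitting the cache, which is what eliminates the per-read round trips.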

***************
*** 46,52 ****

  using namespace std;

- 
  class auto_curl_slist {
   public:
    auto_curl_slist() : slist(0) { }
--- 46,51 ----
*************** static int my_curl_easy_perform(CURL* cu
*** 593,598 ****
--- 592,600 ----
    } 

+   if (foreground)
+     cout << "    fetching " << ptr_url << endl;
+ 
    long responseCode;

    // 1 attempt + retries...
*************** int get_local_fd(const char* path) {
*** 1102,1107 ****
--- 1104,1115 ----

    headers_t responseHeaders;

+   if (assume_s3_is_static && (use_cache.size() > 0)) {
+     fd = open(cache_path.c_str(), O_RDWR);
+     if (fd != -1)
+         return fd;
+   }
+ 
    if (use_cache.size() > 0) {
      result = get_headers(path, responseHeaders);
      if(result != 0) {
*************** static int s3fs_getattr(const char *path
*** 2023,2029 ****
      stat_cache_t::iterator iter = stat_cache.find(path);
      if (iter != stat_cache.end()) {
        *stbuf = (*iter).second;
!       stat_cache.erase(path);
        pthread_mutex_unlock( &stat_cache_lock );
        return 0;
      }
--- 2031,2040 ----
      stat_cache_t::iterator iter = stat_cache.find(path);
      if (iter != stat_cache.end()) {
        *stbuf = (*iter).second;
!       if(foreground) 
!         cout << "   stat_cache hit for path " << path << endl;
!       if (!assume_s3_is_static)
!         stat_cache.erase(path);
        pthread_mutex_unlock( &stat_cache_lock );
        return 0;
      }
*************** static int s3fs_getattr(const char *path
*** 2067,2073 ****
        free(body.text);
      }
      destroy_curl_handle(curl);
!     return result; \
    }

    stbuf->st_nlink = 1; // see fuse faq
--- 2078,2084 ----
        free(body.text);
      }
      destroy_curl_handle(curl);
!     return result;
    }

    stbuf->st_nlink = 1; // see fuse faq
*************** static int s3fs_getattr(const char *path
*** 2096,2101 ****
--- 2107,2120 ----
    stbuf->st_uid = strtoul(responseHeaders["x-amz-meta-uid"].c_str(), (char **)NULL, 10);
    stbuf->st_gid = strtoul(responseHeaders["x-amz-meta-gid"].c_str(), (char **)NULL, 10);

+   if (assume_s3_is_static) {
+     if(foreground) 
+       cout << "   stat_cache put for path " << path << endl;
+     pthread_mutex_lock( &stat_cache_lock );
+     stat_cache[path] = *stbuf;
+     pthread_mutex_unlock( &stat_cache_lock );
+   }
+ 
    if(body.text) {
      free(body.text);
    }
*************** static void show_help (void) {
*** 4392,4397 ****
--- 4411,4419 ----
      "   use_cache (default=\"\" which means disabled)\n"
      "      - local folder to use for local file cache\n"
      "\n"
+     "   assume_s3_is_static (default=\"\" which means false)\n"
+     "      - assume that files in the local file cache are up to date when set to 1\n"
+     "\n"
      "   use_rrs (default=\"\" which means diabled)\n"
      "      - use Amazon's Reduced Redundancy Storage when set to 1\n"
      "\n"
*************** static int my_fuse_opt_proc(void *data, 
*** 4533,4538 ****
--- 4555,4570 ----
        use_cache = strchr(arg, '=') + 1;
        return 0;
      }
+     if (strstr(arg, "assume_s3_is_static=") != 0) {
+       if ((assume_s3_is_static = (strcmp(strchr(arg, '='), "=1") == 0)) || 
+           strcmp(strchr(arg, '='), "=")  == 0 ) {
+         return 0;
+       } else {
+          fprintf(stderr, "%s: poorly formed argument to option: assume_s3_is_static\n", 
+                  program_name.c_str());
+          exit(1);
+       }
+     }
      if (strstr(arg, "use_rrs=") != 0) {
        use_rrs = strchr(arg, '=') + 1;
        if (strcmp(use_rrs.c_str(), "1") == 0 || 
diff -p /Other/download/s3fs-1.35/src/s3fs.h src/s3fs.h
*** /Other/download/s3fs-1.35/src/s3fs.h    2010-12-29 18:02:30.000000000 -0500
--- src/s3fs.h  2011-03-07 12:59:40.000000000 -0500
*************** static string use_rrs;
*** 56,61 ****
--- 56,63 ----
  static string ssl_verify_hostname = "1";
  static string public_bucket;

+ static bool assume_s3_is_static = 0;
+ 
  // TODO(apetresc): make this an enum
  // private, public-read, public-read-write, authenticated-read
  static string default_acl("private");

Original comment by kaon.int...@gmail.com on 8 Mar 2011 at 7:16

GoogleCodeExporter commented 9 years ago
Hey Joshua,

What revision of s3fs are you working against? Your comment regarding the stat 
cache has me curious, but I'm unable to apply your patch to a working copy of 
trunk (r333).

Original comment by ben.lema...@gmail.com on 9 Mar 2011 at 5:05

GoogleCodeExporter commented 9 years ago
I'm against 1.35, the most recently released tarball.

Original comment by kaon.int...@gmail.com on 9 Mar 2011 at 5:30

GoogleCodeExporter commented 9 years ago
A lot has changed since 1.35 (trunk is currently at 1.52), especially in 
regards to the stat cache. 

Original comment by ben.lema...@gmail.com on 9 Mar 2011 at 5:46

GoogleCodeExporter commented 9 years ago
1.35 is the latest available here: 
http://code.google.com/p/s3fs/wiki/FuseOverAmazon

It isn't clear from your site whether the trunk is stable.  I figured 1.35 is 
the latest stable, since that's the most recent tarball.

Original comment by kaon.int...@gmail.com on 9 Mar 2011 at 5:57

GoogleCodeExporter commented 9 years ago

Original comment by ben.lema...@gmail.com on 30 Aug 2011 at 10:13