yocontra / node-gdal-next

Node.js bindings for GDAL (Geospatial Data Abstraction Library) [Fork]
https://contra.io/node-gdal-next/
Apache License 2.0

Universal async framework #36

Open mmomtchev opened 3 years ago

mmomtchev commented 3 years ago

I have started working on a universal async call framework so that all the bindings can easily be converted to async with minimal code modifications.

Here is the current version:

The NAN macro defining the method is changed like this:

From NAN_METHOD(Driver::open) it becomes GDAL_ASYNCABLE_DEFINE(Driver::open). This will automatically create two methods: Driver::open, the sync version, and Driver::openAsync, the async version, plus a third hidden method, Driver::open_do, which contains all the code and is called with async=true|false.
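
Roughly, the macro can be thought of as expanding to something like this (a simplified sketch - the wrapper bodies and the exact signature of the hidden method are assumptions, not the actual macro):

// Hypothetical sketch of what GDAL_ASYNCABLE_DEFINE(Driver::open) could expand to
NAN_METHOD(Driver::open) { Driver::open_do(info, false); }      // sync entry point
NAN_METHOD(Driver::openAsync) { Driver::open_do(info, true); }  // async entry point
// the code written after the macro becomes the body of the shared method
void Driver::open_do(const Nan::FunctionCallbackInfo<v8::Value> &info, bool async) {
  // ... argument parsing, MAIN/RVAL lambdas, GDAL_ASYNCABLE_EXECUTE ...
}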

Then only the end part of the function has to be modified, as follows.

The GDAL call is encapsulated in an asyncable lambda (one with no automatic variables - everything it needs is captured by value, since it may run on a worker thread after the calling frame has returned), for example:

GDAL_ASYNCABLE_MAIN(GDALDataset *) = [raw, filename, x_size, y_size, n_bands, type, options]() {
  GDALDataset *ds = raw->Create(filename.c_str(), x_size, y_size, n_bands, type, options->get());
  delete options;
  if (ds == nullptr)
    throw "Error creating dataset";
  return ds;
};

The return value generation is encapsulated too

GDAL_ASYNCABLE_RVAL(GDALDataset *) = [](GDALDataset *ds, GDAL_ASYNCABLE_OBJS) { return Dataset::New(ds); };

And finally everything is automagically executed, either synchronously or asynchronously (3 is the position of the callback argument in the async version):

GDAL_ASYNCABLE_EXECUTE(3, GDALDataset*);

If the function needs to protect some objects from the GC, a persist interface is provided

GDAL_ASYNCABLE_PERSIST(passed_array, band->handle());

The RVAL lambda can access the persisted objects - sync/async transformations are automatic

GDAL_ASYNCABLE_RVAL(CPLErr) = [](CPLErr err, GDAL_ASYNCABLE_OBJS o) { return o[0]; };
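
One plausible way this could work under the hood (an assumption about the mechanism, not the framework's actual code): in sync mode the persisted objects are simply the Locals that were passed in, while in async mode each one is kept alive in a Nan::Persistent for the lifetime of the worker thread and re-materialized as a Local on the main thread before the RVAL lambda runs:

// Hypothetical sketch of the async-mode persist mechanism
Nan::Persistent<v8::Object> persistent[2];
persistent[0].Reset(passed_array);   // protected from the GC while the worker runs
persistent[1].Reset(band->handle());

// ... later, back on the main V8 thread, before calling the RVAL lambda:
v8::Local<v8::Object> objs[2];       // this becomes GDAL_ASYNCABLE_OBJS
for (int i = 0; i < 2; i++) objs[i] = Nan::New(persistent[i]);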

Here is the full example for the bottom of Driver::create

Before (only sync)

GDALDataset *ds = raw->Create(filename.c_str(), x_size, y_size, n_bands, type, options->get());
delete options;
if (ds == nullptr) {
  Nan::ThrowError("Error creating dataset");
  return;
}
info.GetReturnValue().Set(Dataset::New(ds));

After (sync and async)

// Main execution block
GDAL_ASYNCABLE_MAIN(GDALDataset *) = [raw, filename, x_size, y_size, n_bands, type, options]() {
  GDALDataset *ds = raw->Create(filename.c_str(), x_size, y_size, n_bands, type, options->get());
  delete options;
  if (ds == nullptr) throw "Error creating dataset";
  return ds;
};

// Generate return value
GDAL_ASYNCABLE_RVAL(GDALDataset *) = [](GDALDataset *ds, GDAL_ASYNCABLE_OBJS) { return Dataset::New(ds); };

// Go
GDAL_ASYNCABLE_EXECUTE(6, GDALDataset *);

Here is the full example for the bottom of RasterBandPixels::read

Before (only sync)

CPLErr err = gdal_band->RasterIO(GF_Write, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
if (err != CE_None) {
  Nan::ThrowError(CPLGetLastErrorMsg());
  return;
}
info.GetReturnValue().Set(obj);

After (sync and async)

// Copy these to local pointers for the asyncable lambda
uv_mutex_t *async_lock = band->async_lock;
GDALRasterBand *gdal_band = band->get();

// These objects must be protected from the GC while the thread runs
GDAL_ASYNCABLE_PERSIST(obj, band->handle());

// This is the main execution block
GDAL_ASYNCABLE_MAIN(CPLErr) =
  [gdal_band, async_lock, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space]() {
    uv_mutex_lock(async_lock);
    CPLErr err = gdal_band->RasterIO(GF_Write, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
    uv_mutex_unlock(async_lock);
    if (err != CE_None)
      throw CPLGetLastErrorMsg();
    return err;
  };

// Generate the return value from the persistent objects array -> o[0] is obj from GDAL_ASYNCABLE_PERSIST
// The persistent objects array is automatically transformed from sync to async mode (through Nan::Persistent)
GDAL_ASYNCABLE_RVAL(CPLErr) = [](CPLErr err, GDAL_ASYNCABLE_OBJS o) { return o[0]; };

// Go
GDAL_ASYNCABLE_EXECUTE(9, CPLErr);

Any comments, suggestions or volunteers?

mmomtchev commented 3 years ago

Just one remark: I don't see any way we could reduce everything to a single lambda, because the lambdas need to run in different execution contexts - there are three separate execution contexts here: the argument parsing, which runs synchronously on the main V8 thread; the GDAL call itself, which runs on a worker thread and cannot touch V8; and the construction of the JS return value, which must run back on the main V8 thread.

mmomtchev commented 3 years ago

UPDATE: I have reduced the lambdas; if anyone has any ideas on how to further simplify this, feel free to comment.

mmomtchev commented 3 years ago

I am actually hesitating between these two. The current one:

GDAL_ASYNCABLE_MAIN(CPLErr) =
  [gdal_band, async_lock, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space]() {
    uv_mutex_lock(async_lock);
    CPLErr err = gdal_band->RasterIO(GF_Write, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
    uv_mutex_unlock(async_lock);
    if (err != CE_None) throw CPLGetLastErrorMsg();
    return err;
  };
GDAL_ASYNCABLE_RVAL(CPLErr) = [](CPLErr err, GDAL_ASYNCABLE_OBJS o) { return o[0]; };

And the alternative:

GDAL_ASYNCABLE_DO(gdal_band, async_lock, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space) {
  uv_mutex_lock(async_lock);
  CPLErr err = gdal_band->RasterIO(GF_Read, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
  uv_mutex_unlock(async_lock);
  if (err != CE_None) throw CPLGetLastErrorMsg();
  ret(persistent[0]);
};

I think the second one is better?

UPDATE: this doesn't work in all cases

mmomtchev commented 3 years ago

No, I am unable to make all the cases work with only one lambda; if anyone has any ideas, I am listening. It is very important that this is perfect before starting to transform the code.

yocontra commented 3 years ago

@mmomtchev I like it and think the code ultimately comes out to be pretty clean without reducing to a single lambda.

mmomtchev commented 3 years ago

Yes, I think I will leave it like this, unless someone comes up with a brilliant idea. In fact all the solutions revolve around some form of a second lambda, because there is that final part with JSObject::New that must be done at the end, in the context of the main V8 thread, with the JS world stopped. I will start slowly transforming the code - the vector API is absolutely huge - and unlike the raster API, where the bottleneck is purely I/O, the vector API also has some CPU-intensive functions which will benefit from multi-threading.
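
For reference, this is exactly the split that NAN's stock Nan::AsyncWorker imposes; a minimal sketch of the generic pattern the two lambdas map onto (not the framework's actual implementation):

#include <functional>
#include <nan.h>
#include <gdal_priv.h>

// Minimal sketch: the MAIN lambda runs in Execute(), the RVAL lambda in HandleOKCallback()
class GDALAsyncWorker : public Nan::AsyncWorker {
 public:
  GDALAsyncWorker(Nan::Callback *cb, std::function<GDALDataset *()> main,
                  std::function<v8::Local<v8::Value>(GDALDataset *)> rval)
    : Nan::AsyncWorker(cb), main_(main), rval_(rval), result_(nullptr) {}

  // runs on a libuv worker thread - no V8 calls allowed here,
  // so this is where the MAIN lambda (the blocking GDAL call) belongs
  void Execute() override {
    try {
      result_ = main_();
    } catch (const char *err) { SetErrorMessage(err); }
  }

  // runs back on the main V8 thread with the JS world stopped - this is
  // where the RVAL lambda (JSObject::New and friends) must run
  void HandleOKCallback() override {
    Nan::HandleScope scope;
    v8::Local<v8::Value> argv[] = {Nan::Null(), rval_(result_)};
    callback->Call(2, argv, async_resource);
  }

 private:
  std::function<GDALDataset *()> main_;
  std::function<v8::Local<v8::Value>(GDALDataset *)> rval_;
  GDALDataset *result_;
};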

yocontra commented 3 years ago

@mmomtchev My primary use case is https://www.npmjs.com/package/verrazzano so I'm excited to start benchmarking and see how much of a difference this makes

mmomtchev commented 3 years ago

I tried asyncing gdal.LayerFeatures.get for test purposes (a supposedly difficult function); it turned out quite clean:

diff --git a/src/collections/layer_features.cpp b/src/collections/layer_features.cpp
index 852f61d3..83b805a1 100644
--- a/src/collections/layer_features.cpp
+++ b/src/collections/layer_features.cpp
@@ -18,6 +18,7 @@ void LayerFeatures::Initialize(Local<Object> target) {
   Nan::SetPrototypeMethod(lcons, "count", count);
   Nan::SetPrototypeMethod(lcons, "add", add);
   Nan::SetPrototypeMethod(lcons, "get", get);
+  Nan::SetPrototypeMethod(lcons, "getAsync", getAsync);
   Nan::SetPrototypeMethod(lcons, "set", set);
   Nan::SetPrototypeMethod(lcons, "first", first);
   Nan::SetPrototypeMethod(lcons, "next", next);
@@ -91,7 +92,7 @@ NAN_METHOD(LayerFeatures::toString) {
  * @param {Integer} id The feature ID of the feature to read.
  * @return {gdal.Feature}
  */
-NAN_METHOD(LayerFeatures::get) {
+GDAL_ASYNCABLE_DEFINE(LayerFeatures::get) {
   Nan::HandleScope scope;

   Local<Object> parent =
@@ -104,9 +105,17 @@ NAN_METHOD(LayerFeatures::get) {

   int feature_id;
   NODE_ARG_INT(0, "feature id", feature_id);
-  OGRFeature *feature = layer->get()->GetFeature(feature_id);
-
-  info.GetReturnValue().Set(Feature::New(feature));
+  OGRLayer *gdal_layer = layer->get();
+  uv_mutex_t *async_lock = layer->async_lock;
+  GDAL_ASYNCABLE_PERSIST(parent);
+  GDAL_ASYNCABLE_MAIN(OGRFeature*) = [async_lock, gdal_layer, feature_id]() {
+    uv_mutex_lock(async_lock);
+    OGRFeature *feature = gdal_layer->GetFeature(feature_id);
+    uv_mutex_unlock(async_lock);
+    return feature;
+  };
+  GDAL_ASYNCABLE_RVAL(OGRFeature*) = [](OGRFeature *feature, GDAL_ASYNCABLE_OBJS) { return Feature::New(feature); };
+  GDAL_ASYNCABLE_EXECUTE(1, OGRFeature*);
 }

 /**
mmomtchev commented 3 years ago

Then I tried these two simple gists (they are really simple - 90% instrumentation code): https://gist.github.com/mmomtchev/14c3428255fd9c88ed20517572d8efdc

sync case

open: 2.359s
count: 0.083ms
get: 5.369s
eventLoop didn't start (100% sync code)
gets 662 per-get 8.0672ms

async case

open: 2.261s
count: 0.256ms
get: 5.471s
eventLoopUtilization 0.022852257300014747
gets 662 per-get 4047.8268ms

So, no direct performance gain - this is impossible as long as the operations on the same dataset run sequentially. Hopefully a future version of GDAL will allow this; they have been talking about it and they even have an RFC about removing the big dataset lock. But while the sync case runs at 100% CPU and doesn't even start the event loop, the async case runs at 2.3% CPU utilization :smile: If you were to read another dataset in parallel, it would be basically a free lunch. How is this possible? It is because GDAL runs on a secondary thread / secondary core of the CPU.

mmomtchev commented 3 years ago

See the per-get duration - it really explodes in the async case, because the GDAL stubs are running in parallel, but they are waiting in line for the async_lock.

mmomtchev commented 3 years ago

The dataset is a 25MB GeoJSON with all the European administrative borders (from my weather site https://www.meteo.guru)

yocontra commented 3 years ago

@mmomtchev Yeah, having a second thread is great for the case of running this on a webserver (we do). If I'm understanding you correctly though, multiple datasets still share a single secondary thread, so parsing multiple files at the same time will still block with each other - or does each dataset receive its own thread in GDAL?

I think the library I linked will see a major speedup if we're able to thread-pool within a dataset (probably the RFC you're referring to?), if the coordinate transformation (transformTo) and the reading of features/feature properties are async. The goal is to parse the file and stream features out of it; right now it's all happening synchronously (the stream needs 16 items to fill backpressure, so go loop the dataset and emit them) and only one feature is being parsed/transformed at any given time. Theoretically we should be able to kick off all 16 items (or however many are needed to fill backpressure) at the same time, if we have a proper pool and the file format supports reads like that. Do you have a link to that RFC?

mmomtchev commented 3 years ago

I added a modified benchmark to the same gist that opens 4 datasets on the same file to be able to read with 4 threads. One must pay 4x the open cost to get a marginal improvement in speed - on the order of 20%:

open-0: 2.771s
count-0: 0.079ms
open-1: 2.848s
count-1: 0.056ms
open-2: 2.883s
count-2: 0.052ms
open-3: 2.888s
count-3: 0.037ms
get-0: 4.688s
get-1: 4.600s
get-2: 4.565s
get-3: 4.560s
eventLoopUtilization 0.03760332186211492
gets 662 per-get 3719.6174ms

There is also one severe problem with this approach that is general in Node and cannot be easily solved: the maximum number of threads is limited to UV_THREADPOOL_SIZE, 4 by default. This means that when launching a test that creates 1000 async contexts, Node/libuv will randomly choose 4 of those async contexts to run simultaneously. If they happen to be independent, that's good. If they are waiting on each other, well, they will just run sequentially, starving out the others, leaving you with no other option than to increase UV_THREADPOOL_SIZE.
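
To make the limit concrete, here is a self-contained libuv sketch of the mechanism the async bindings sit on top of (uv_queue_work on the shared default thread pool); the durations in the comments assume a roughly idle machine:

#include <uv.h>
#include <cstdio>

// Every queued request shares one libuv thread pool, sized once at
// startup from UV_THREADPOOL_SIZE (default 4)
static void work(uv_work_t *req) {
  std::printf("work %d starts\n", *static_cast<int *>(req->data));
  uv_sleep(1000);  // stand-in for a blocking GDAL call
}

static void done(uv_work_t *req, int status) {
  std::printf("work %d done\n", *static_cast<int *>(req->data));
}

int main() {
  uv_work_t reqs[8];
  int ids[8];
  for (int i = 0; i < 8; i++) {
    ids[i] = i;
    reqs[i].data = &ids[i];
    uv_queue_work(uv_default_loop(), &reqs[i], work, done);
  }
  // with UV_THREADPOOL_SIZE=4 this takes ~2s (two batches of four),
  // with UV_THREADPOOL_SIZE=8 it takes ~1s
  uv_run(uv_default_loop(), UV_RUN_DEFAULT);
  return 0;
}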

I will start pushing the first async vector functions over the weekend to node-gdal-async, if anyone is interested in playing with them.

yocontra commented 3 years ago

@mmomtchev I think we should definitely document and recommend raising UV_THREADPOOL_SIZE in the README where we document the async functions.

mmomtchev commented 3 years ago

@contra Multiple datasets should be completely independent, aside from the UV_THREADPOOL_SIZE issue. The problem is that if you launch 16 contexts on 4 datasets and Node/libuv picks 4 of them, all on the same dataset, to run, then you have 100% sequential execution. UV_THREADPOOL_SIZE goes up to 1024, with the price being about 1MB per thread. The default of 4 is quite low; these days most CPUs have more cores.

mmomtchev commented 3 years ago

Another solution is to always await, so that you never launch more than one operation per dataset - knowing that a second one would eat a thread pool slot without doing anything until the first one finishes.

mmomtchev commented 3 years ago

https://gdal.org/development/rfc/rfc47_dataset_caching.html

mmomtchev commented 3 years ago

I think it is stalled (it is quite an undertaking), but I think they are still considering options. Some drivers individually support internal multi-threading - something that works independently of Node/libuv async - which can be controlled via creation/opening options. For example, GeoTIFF supports multi-threaded compression/decompression.
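
As an illustration, GeoTIFF's driver-internal parallelism can be enabled through its documented NUM_THREADS creation option; a small sketch in plain GDAL C++ (the file name and raster dimensions are arbitrary):

#include <gdal_priv.h>
#include <cpl_string.h>

// Sketch: GeoTIFF's multi-threaded compression is internal to the driver
// and works regardless of how the binding is called (sync or async)
int main() {
  GDALAllRegister();
  GDALDriver *drv = GetGDALDriverManager()->GetDriverByName("GTiff");
  if (drv == nullptr) return 1;
  char **opts = nullptr;
  opts = CSLSetNameValue(opts, "COMPRESS", "DEFLATE");
  opts = CSLSetNameValue(opts, "NUM_THREADS", "ALL_CPUS");  // documented GTiff option
  GDALDataset *ds = drv->Create("out.tif", 4096, 4096, 1, GDT_Byte, opts);
  CSLDestroy(opts);
  if (ds != nullptr) GDALClose(ds);
  return 0;
}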

mmomtchev commented 3 years ago

For everyone who is interested, a big chunk of the async vector API has reached a usable state and is available at https://github.com/mmomtchev/node-gdal-async. It is compatible with Express-like Node frameworks and can also be used for multi-threading purposes, as it contains lots of CPU-intensive primitives.

mmomtchev commented 3 years ago

@contra, this will be a huge PR (>5000 lines) with lots of new code for the synchronous API too. I see that lots of people have checked out my branch - feel free to report any problems with it. I wonder how this should be handled; maybe release a beta version before landing everything?

yocontra commented 3 years ago

@mmomtchev Yeah, I think we can push/prebuild it as a beta release, then move it to a major bump once it's been tested in the ecosystem for a week. I'll use it in our production ETL system and put it through its paces.

mmomtchev commented 3 years ago

@contra If this is going to be a 3.0, then I should probably also rewrite the Geometry classes and convert to node-addon-api - these are the two most pressing changes that will also require a major version bump.

I have been reassured by the node-addon-api team that the conversion will be straightforward

N-API should also come before the switch to GitHub releases, as it will impact the release process - after the switch there will be a single binary per platform that is compatible with all Node versions.

yocontra commented 3 years ago

@mmomtchev I'm fine to do multiple major releases instead of putting it all into a 3.0 - up to you how you think it should be ordered and released though.

mmomtchev commented 3 years ago

I ran the node-addon-api conversion tool and there are a few days of work left before it works again. It is not a very complex change - it is mostly regexping and a little bit of refactoring - but it is very invasive: it modifies 1 line out of 4. I don't want to merge it with the async branch, which is already huge by itself, so if you are ok, you should probably create a 2.x branch, and then I will submit the PR to master so that I can base the N-API migration on it. Then maybe the releases can be switched over to GitHub releases. But this means that there won't be any releases on the master branch for at least 2 weeks, until it is stable enough again.