whatwg / fs

File System Standard
https://fs.spec.whatwg.org/
Other
238 stars 20 forks source link

FileSystemSyncAccessHandle contains async methods #7

Closed janvarga closed 2 years ago

janvarga commented 2 years ago

Besides read and write which are sync methods, there are async methods truncate, getSize, flush and close, returning promises.

Does it make sense for the dedicated interface available only in dedicated workers to have async methods ?

janvarga commented 2 years ago

This is already being discussed here: WICG/file-system-access#323

People are actually requesting fully sync OPFS surface.

ankoh commented 2 years ago

After chatting with @tomayac and @hannes, I'd also like to summarize a few points here that we find noteworthy regarding the asynchronous methods in the synchronous access handles.

We recently entered WebAssembly territory with a browser version of the analytical database DuckDB (available at https://github.com/duckdb/duckdb-wasm). Our initial focus was accelerating analytical processing in the browser and the current state of WebAssembly served us well for that purpose. Now that we have a WebAssembly version of DuckDB running, we realized that this API proposal here holds a tremendeous (!) potential in the form of locally persistent fast embedded databases in browsers. In its current state, however, the API won't really work for us as the mixed-in asynchronous methods will effectively prevent us from using the synchronous access handles.

The problem is the following:

1) Languages like C++ are still very weak when it comes to asynchronous I/O. We have coroutines in C++ now, but enough large C++ projects still model I/O through blocking I/O calls and multithreading which becomes seriously problematic with WebAssembly. If we leave threads aside for a moment (see 5)), we are forced to map blocking I/O in WebAssembly to synchronous browser APIs. Introducing asynchronous I/O in a system that is built around blocking I/O calls is very hard as it quickly requires all surrounding code to become a conceptual or explicit coroutine. 2) Our database DuckDB is built on top of a virtual filesystem abstraction and we use it today already to map WebAssembly I/O to, for example, synchronous XmlHttpRequests or the FileReaderSync. Feel free to check this yourself at https://shell.duckdb.org/ with a simple SQL query such as: SELECT count(*) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'; 3) The proposed synchronous access handles have blocking read and write methods which is already great! However, open, getSize, truncate, flush and close are asynchronous which defeats the purpose of providing an API for a blocking context. We are calling all of these methods while sitting in a C++ call stack which is just incompatible to Javascript Promises.

I've seen 2 workarounds for this problem which are both insufficient today:

4) Automatic coroutines through Asyncify. This is an interesting approach but just not possible at a certain project size. These large-scale rewrite passes are not feasible when projects grow in complexity. 5) Dedicated I/O webworker that can use asynchronous APIs and communicates with Wasm through a Spin-Lock using SharedArrayBuffer Atomics. Even when ignoring the significant performance impact and scheduler disaster of modeling this through a spinlock, the even larger problem is the use of SharedArrayBuffers. SharedArrayBuffers have become very painful since Spectre and Meltdown as they force every user into cross-origin isolation. For some use-cases this might be acceptable, but there are enough cases where these headers are just not available. (E.g. just consider analytical tools that want to process user-provided external data - a dilemma with CORS) We therefore run DuckDB-Wasm by default only within a single WebWorker and without SharedArrayBuffers.

The vision: We would love to make DuckDB-Wasm fully persistent in the browser. This would give us: A) IndexedDB on steroids, as we can build a full-fledged relational database with near-native analytical processing power. B) Out-of-core processing, as we can spool to temporary files in the browser filesystem if we hit the heap limits of 2 (or recently 4) GB.

We think our points here are very similar to the arguments raised for the WebAssembly version of SQLite. (Ref: https://github.com/WICG/file-system-access/issues/323, @jlongster) They are hitting a very similar barrier and would certainly profit from a fully synchronous API surface (for web workers) as well.

tomayac commented 2 years ago

@fivedots FYI.

a-sully commented 2 years ago

Thank you for the feedback! At this point it's clear to me that the split sync/async interface was a mistake. I'm in agreement that I'd like to see an all-sync* interface. I'm happy to put up a PR once #21 lands.

Note that changing methods from async->sync is a web-exposed change with potential breakage, since although awaiting sync methods is a no-op, any use of promise.then will break. I suspect this shouldn't be an issue given the overwhelming support we've seen from developers and other browsers (I know this will make @jesup very happy :) )

This change would affect at least #3 and #28

Chromium implementation bug: https://crbug.com/1338340

*all-sync meaning that all methods on the SyncAccessHandle (getSize, truncate, flush, and close) will be sync, but createSyncAccessHandle() (aka open) will still be async. I don't see us being able to change that

jesup commented 2 years ago

Good. It really never made sense to me that the API specifically to provide sync access was (mostly) async.

So... why can't createSyncAccessHandle (aka open()) be sync? This would require spinning event loops to wait for the IPCs needed to complete. However, that's not fundamentally different than what happens if the preallocated quota is exceeded by a write(), the content process will need to request more quota from the main process, and freeze the content in the meantime, since write() is sync. createSyncAccessHandle can only happen on a worker, so if can be frozen/spinning.

a-sully commented 2 years ago

That's a great question, and after doing a bit more digging I realized that some of the assumptions I'd held weren't as solid as I thought they were. The primary argument (and we can debate its strength) is that keeping this async retains some flexibility to put UI, such as permission prompts, on createSyncAccessHandle() (which can still be shown from a worker). In general, things that can show UI should never be sync, so making this sync would basically preclude us from ever being able to add UI here (such as if we ever want to allow SyncAccessHandles outside of the OPFS).

A counter-argument is that by the time you have a file handle to call createSyncAccessHandle() on, it's likely that any kind of "is this site/frame/worker allowed to access storage" checks that might involve UI in some browsers would have already been passed, or it could just error out if you don't already have write access to the file.

rektide commented 2 years ago

I know webassembly today is quite troublesome to use with async, but I do want to call out that webassembly folks are working hard on a variety of fronts. They're working to make component-model have good async interop (and bonus wasi meeting notes link). There's ideas to further significantly optimize wasm async performance, such as with https://github.com/WebAssembly/js-promise-integration/blob/main/proposals/js-promise-integration/Overview.md.

I know relief would not come right away even if this wasm work was all done, that there's still challenges. And I don't have specific use cases for why we'd want to be able to sequence these calls asynchronously. Yet, I feel like the proposal here is to make a less capable, less flexible system, with a more legacy style/feel, to help a target that might not necessarily need to exist for a long time. I'd prefer we wait & see or make a sync alternative, before making a generally less capable system, that hogs threads during not-fast i/o operations. Targeting some specific users ease of use at the cost of general flexibility feels like a violation of Extensible Web Manifesto.

As a web developer, having a bunch of sync apis show up that I cannot track & reason about normally seems intimidating, and doesn't feel like a pleasurable or modern experience for webdev. Many of these operations seemingly must block- flush, getSize probably. What are non-wasm webdevelopers expected to do? I don't see transferable appear anywhere in the accesshandles pr, so I don't believe I could do some of the work in the app, then send a handle to a DedicatedWorker to go handle a potentially running task. I could well be missing something, but, how, as a JS developer, am I supposed to safely call flush() without locking up the app/how am I supposed to use access handles?

Can we talk with some of the other wasm groups & see what we expect the future performance will be? There were a lot of very high visibility folks asking to prioritize wasm, and I do want to give them a no-compromise option. Maybe we should go back to where we were & have both sync & async interfaces available. In general, my inclination is that everything ought be async if at all possible. I'm hoping above all the js-promise-integration folks might have some cause to make us pause, make us less intent on spending effort warping ourselves into a odd & specific shape/form.

fgmccabe commented 2 years ago

@a-sully: As an outsider to this discussion, I need to take issue with:

Thank you for the feedback! At this point it's clear to me that the split sync/async interface was a mistake.

The web as a whole, and many other OSes, are 'moving' to an async world. The reason is that that fits real world requirements (responsive apps) much better. Sticking with a sync interface is not compatible with the Web.

There are wasm technologies coming that would make an async interface feasible. In particular, I am involved with JS-Promise integration. That is a standards-track activity; which makes it slower. However, our preliminary results seem pretty positive.

a-sully commented 2 years ago

Maybe we should go back to where we were & have both sync & async interfaces available. In general, my inclination is that everything ought be async if at all possible.

I agree with the overall sentiment here. From my perspective as a Chromium engineer, we are currently focused on making SyncAccessHandles as performant as possible to support C(++) applications being ported to the web via WASM, but we are very interested to hear developer feedback about whether an asynchronous alternative to SyncAccessHandles is worth pursuing. One could imagine an async alternative could be more webby: available from Window contexts etc, using a streams-based interface with built-in queueing, etc. To this point we haven't seen much developer demand for this async alternative. If you have compelling use cases for an async alternative, we'd love to hear your feedback :)

I just want to clarify one big point before continuing...

I don't see transferable appear anywhere in the accesshandles pr, so I don't believe I could do some of the work in the app, then send a handle to a DedicatedWorker to go handle a potentially running task.

SyncAccessHandles can only be created from a DedicatedWorker (and are intentionally not transferrable) precisely because we want to ensure these sync methods do not cause jank.

Okay, so here's the reality of the current landscape:

I'm excitedly following the developments to improve async support in WASM. My hope is that this will allow an async alternative to SyncAccessHandles to have comparable performance characteristics, at which point any web developer looking for fast storage will have no reason to choose the more restrictive SyncAccessHandles.

That being said, it's also unclear to me whether C(++) applications being ported to WASM would be able to use or benefit from this async alternative as easily as SyncAccessHandles. There may always be a need for a synchronous file API to support porting these applications to the web.

The current split interface is the worst of both worlds. We'll never be able to expose SyncAccessHandles to anything other than DedicatedWorkers due to the existence of sync methods. However, the split sync/async interface forces the application to write effectively sync code anyways, since you can't queue operations due to the sync methods. The result is a significantly slower and more complicated interface than if all the methods were sync. I'd like to see SyncAccessHandles be the MOST useful for WASM-compiled applications, while keeping an ear out for WASM ecosystem async support + developer demand for an async alternative.

jesup commented 2 years ago

Since making FileSystemSyncAccessHandles fully sync is something of a breaking change, how should we go about it? While a few apps are using this API already (Photoshop), they're still in beta. I imagine they could check if say seek to 0 returns an integer or a promise, which would be pretty easy, and if it's the old promise version wrap it in something like syncseek(...) { return await seek(...); } The other question is the creation of FileSystemSyncAccessHandles (eg. open() in c++ land). Should that API be sync or async? How much does that matter to WASM performance or ease of programming? Any downsides to it? It can only be done on a worker already.

a-sully commented 2 years ago

While the changes we're making are technically not backwards-compatible, my hope is that most apps use the async methods of this API with await (which would be a no-op) rather than Promise.then (which would break). I'm not too concerned about this breaking much, since:

The big question, in my opinion, is whether we want to make createSyncAccessHandle() sync. At first I thought this wasn't feasible, but if we're truly trying to make SyncAccessHandles "the MOST useful for WASM-compiled applications" then we may want to consider making it sync, as well. The argument seems to be:

Pro-sync:

Pro-async:

I'm not particularly convinced by the pro-async arguments. I think my preference (and I suspect @jesup will be on board with this, since the non-OPFS portion of the API is not part of this spec) would be to make createSyncAccessHandle() sync to optimize SyncAccessHandles for the WASM use case, while ensuring that a future async alternative mentioned in https://github.com/whatwg/fs/issues/7#issuecomment-1201811600 is flexible enough to support files outside of the OPFS

Thoughts?

a-sully commented 2 years ago

... assuming the WASM performance improvements are as significant as I expect. @tlively can probably speak more to this

ankoh commented 2 years ago

more performant, especially for WASM

I think for us, the performance of creating a synchronous file handle isn't that big of a deal. What's much more important though is the capability to open the file while sitting in a C++ call stack. E.g. imagine running a SQL statement that scans a previously unknown file or a query that hits the memory limit and wants to spool to disk. Ideally, we wouldn't need to abort the query, create the file asynchronously and rerun it afterwards.

That having said, I think that an asynchronous creation of filehandles is something that we could work around ourselves by sacrificing convenience. It's not raising an insurmountable barrier like with the other functions being async.

tlively commented 2 years ago

What's much more important though is the capability to open the file while sitting in a C++ call stack. E.g. imagine running a SQL statement that scans a previously unknown file or a query that hits the memory limit and wants to spool to disk. Ideally we wouldn't need to abort the query, create the file asynchronously and rerun it afterwards.

Usually opening this file would require some path traversal first, and that requires directory operations. All directory operations are already async, and I don't think anyone has proposed making them synchronous yet. Given that, opening a file will require working around asynchrony and will be a slow operation no matter whether createSyncAccessHandle is sync or not.

That being said, the more operations we can make synchronous, the better for performance. But in this case it will be an incremental improvement rather than a panacea.

a-sully commented 2 years ago

Thank you for the input! It's useful to see that WASM developers seem mostly indifferent about whether createSyncAccessHandle is sync or async.

To summarize, it seems like there's a very strong argument for making all the methods of the SyncAccessHandle sync, but the argument for making createSyncAccessHandle sync is less clear-cut, especially considering that applications typically have to deal with async methods when opening a file anyways. To me, the slight performance improvement is not worth the greater risk of site breakage and the sacrifice of future flexibility.

At this point, I plan to move forward with my initial proposal (https://github.com/whatwg/fs/issues/7#issuecomment-1162501488) of making only the methods of the SyncAccessHandle sync (as soon as #21 lands). If anyone else can provide a strong case to make createSyncAccessHandle sync and would like to drive that work, then of course they are welcome to :)

tomayac commented 2 years ago

It looks like we're all set for launching this in Chrome 108:

a-sully commented 2 years ago

The spec was updated in #55 so this issue can now be closed.

Thank you all for the great discussion and for helping to make the web better!