Rethinking storage proxy map

annevk commented 4 years ago

One thing I noticed while working on https://github.com/whatwg/html/pull/5560 is that we don't have a nice formalized way to deal with bottle/proxy map operations failing. And I think in principle all can fail for a variety of reasons.

asutherland commented 4 years ago

The LocalStorage case being dealt with in https://github.com/whatwg/html/pull/5560 isn't synchronously dealing with the authoritative map; it's dealing with a replicated copy of the map, but that's largely hand-waved away via the "multiprocess" disclaimer. Perhaps the hand-waving should be reduced and that will help clear up the error handling[1]?

I think the inescapable implementation reality is that there are always going to be at least 3 event loops involved for any storage endpoint and it could be worth specifying this:

1. The event loop hosting the authoritative storage bottle map for the endpoint for the given bucket. (Which may be different than the event loop for buckets on the same shelf or on different shelves, etc.)
2. One or more event loops processing I/O operations for the storage bottle map. (Or put another way, for performance reasons, implementations will not/cannot be required to make storage API decisions by blocking on disk I/O.)
3. The event loop for the agent where the API calls are happening.

(There might also be separate event loops for the authoritative storage bucket map and higher levels, but those don't matter for bottle map errors unless they are fatal.)
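
As a rough illustration of how work might hop between these three loops, here is a minimal sketch. The `EventLoop` class, the loop names, and the trace strings are all hypothetical and exist only for this example; nothing here is taken from the Storage or HTML specs.

```typescript
// Hypothetical model: an event loop is just a FIFO queue of tasks.
type Task = () => void;

class EventLoop {
  private queue: Task[] = [];

  queueTask(task: Task): void {
    this.queue.push(task);
  }

  // Run tasks in order until the queue is empty.
  drain(): void {
    let task = this.queue.shift();
    while (task !== undefined) {
      task();
      task = this.queue.shift();
    }
  }
}

const agentLoop = new EventLoop();          // where API calls happen
const authoritativeLoop = new EventLoop();  // hosts the authoritative bottle map
const ioLoop = new EventLoop();             // processes disk I/O
const trace: string[] = [];

agentLoop.queueTask(() => {
  trace.push("agent: API call");
  authoritativeLoop.queueTask(() => {
    trace.push("authoritative: apply change to bottle map");
    ioLoop.queueTask(() => {
      // Unexpected failures surface here, not on the agent loop.
      trace.push("io: write failed");
      authoritativeLoop.queueTask(() => {
        trace.push("authoritative: notified of I/O failure");
      });
    });
  });
});

// Drain the loops in the order the work flows for this example.
agentLoop.drain();
authoritativeLoop.drain();
ioLoop.drain();
authoritativeLoop.drain();
```

The point of the sketch is only that the failure is observed on the I/O loop and has to be relayed back to the authoritative loop in a separate task; the agent loop has long since moved on.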

Although there will always be policy checks that can happen in the agent event loop that are synchronous, the reality is that most unexpected failures will happen in the I/O event loops and these will then want to notify the authoritative storage bottle map.

Especially given that there's interest in the Storage Corruption Reporting use-case (explainer issue in this repo), this async processing would make sense, as any corruption handlers would want to be involved in the middle of the process.

One might create the following mechanisms:

- report a broken bottle: Used by endpoints to report something is wrong with the endpoint's storage bottle.
- process a broken bottle report: On the authoritative bucket event loop, consult the bucket metadata which determines what action to take. In the future this would allow for a storage bucket corruption handler to get involved. For now the decision would always be to "wipe". In the future this action would then be handed off to the corruption reporting mechanism, however that would work.
- perform a bottle inventory: Future work: Exposed by endpoints so that corruption handlers could get an idea of the damage. This might take the form of returning an object with sets of map keys corresponding to: known fully retained map entries, known partially retained map entries, known fully lost map entries. It would also have a boolean that indicates if there are map entries for which the keys were lost. I suppose there could also be a set for map entries where the name was lost but some/all of the data was retained and a synthetic name was created.
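
To make the shapes above a bit more concrete, here is a minimal sketch. Every name in it (`StorageBucket`, `BrokenBottleReport`, `BottleInventory`, the field names) is a hypothetical chosen for illustration, and the queue of pending reports stands in for "queue a task on the authoritative bucket event loop".

```typescript
type Endpoint =
  | "localStorage" | "sessionStorage"
  | "indexedDB" | "caches" | "serviceWorkerRegistrations";

// For now the only possible decision is to wipe the bottle.
type BrokenBottleAction = "wipe";

interface BrokenBottleReport {
  endpoint: Endpoint;
  reason: string; // hypothetical free-form diagnostic
}

// Future work: what "perform a bottle inventory" might return.
interface BottleInventory {
  fullyRetained: Set<string>;
  partiallyRetained: Set<string>;
  fullyLost: Set<string>;
  hasEntriesWithLostKeys: boolean;
}

const exampleInventory: BottleInventory = {
  fullyRetained: new Set(["a"]),
  partiallyRetained: new Set<string>(),
  fullyLost: new Set(["b"]),
  hasEntriesWithLostKeys: false,
};

class StorageBucket {
  private bottles = new Map<Endpoint, Map<string, string>>();
  private pendingReports: BrokenBottleReport[] = [];

  bottle(endpoint: Endpoint): Map<string, string> {
    let map = this.bottles.get(endpoint);
    if (map === undefined) {
      map = new Map<string, string>();
      this.bottles.set(endpoint, map);
    }
    return map;
  }

  // "report a broken bottle": called by an endpoint, possibly from another loop.
  reportBrokenBottle(report: BrokenBottleReport): void {
    this.pendingReports.push(report);
  }

  // "process a broken bottle report": runs on the authoritative bucket loop.
  processBrokenBottleReports(): BrokenBottleAction[] {
    const actions: BrokenBottleAction[] = [];
    for (const report of this.pendingReports) {
      const action: BrokenBottleAction = "wipe"; // would consult bucket metadata
      this.bottle(report.endpoint).clear();
      actions.push(action);
    }
    this.pendingReports = [];
    return actions;
  }
}
```

A corruption handler, if one existed, would slot in where the action is decided, before the bottle is cleared.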

For all Storage endpoints, the question whenever any error occurs on the I/O loop or when ingesting data provided by the I/O loop is: does this break the bottle? For the "indexedDB", "caches", and "serviceWorkerRegistrations" endpoints there are already in-band API means of relaying I/O failures (respectively: fire an UnknownError or more specific error, reject the promise, reject the promise) and there's no need to break the bottle. For "localStorage" and "sessionStorage" there's no good in-band way to signal the problem, but any transient inability to persist changes to disk can be mitigated by buffering, and when the transient inability becomes permanent, the bottle can be said to be broken.
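
The decision described above can be sketched as a small function. The `IoFailure` shape, the `permanent` flag, and the outcome strings are assumptions made up for this example; how an implementation decides that a transient failure has become permanent is left open.

```typescript
type Endpoint =
  | "localStorage" | "sessionStorage"
  | "indexedDB" | "caches" | "serviceWorkerRegistrations";

interface IoFailure {
  endpoint: Endpoint;
  permanent: boolean; // hypothetical: has the failure been judged unrecoverable?
}

type Outcome = "report in-band" | "buffer and retry" | "break the bottle";

function handleIoFailure(failure: IoFailure): Outcome {
  // These endpoints can already surface failures through their own APIs
  // (error events or rejected promises), so the bottle stays intact.
  const hasInBandErrors =
    failure.endpoint === "indexedDB" ||
    failure.endpoint === "caches" ||
    failure.endpoint === "serviceWorkerRegistrations";
  if (hasInBandErrors) {
    return "report in-band";
  }
  // localStorage and sessionStorage have no in-band signal: buffer while the
  // failure is transient, and break the bottle once it is permanent.
  return failure.permanent ? "break the bottle" : "buffer and retry";
}
```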

[1]: From a spec perspective (ignoring optimizations), Firefox's LocalStorage NextGen overhaul can be said to synchronously queue a task to make a snapshot of the authoritative bottle map on the authoritative bottle map event loop the first time the LocalStorage API is used in a given task on the agent event loop. The snapshot is retained until the task and its micro-task checkpoint complete, at which point any changes made are sent to the authoritative bottle map in a task where they are applied. This maintains run-to-completion consistency (but does not provide magical global consistency). There are other possible implementations like "snapshot at first use and broadcast changes" which could also be posed in terms of the event loops/task sources.
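
A minimal sketch of that snapshot model, with the cross-loop hops collapsed into direct method calls. The class and method names (`AuthoritativeBottleMap`, `AgentTaskView`, `endOfTask`) are hypothetical and are not Firefox's actual implementation.

```typescript
class AuthoritativeBottleMap {
  private entries = new Map<string, string>();

  // Copy the current state; stands in for the synchronously queued snapshot task.
  snapshot(): Map<string, string> {
    return new Map(this.entries);
  }

  // Apply a batch of changes; null means "remove this key".
  apply(changes: Map<string, string | null>): void {
    changes.forEach((value, key) => {
      if (value === null) {
        this.entries.delete(key);
      } else {
        this.entries.set(key, value);
      }
    });
  }
}

class AgentTaskView {
  private snapshot: Map<string, string> | null = null;
  private changes = new Map<string, string | null>();

  constructor(private readonly authoritative: AuthoritativeBottleMap) {}

  // Take the snapshot on first use within the current task.
  private ensureSnapshot(): Map<string, string> {
    if (this.snapshot === null) {
      this.snapshot = this.authoritative.snapshot();
    }
    return this.snapshot;
  }

  getItem(key: string): string | null {
    const value = this.ensureSnapshot().get(key);
    return value === undefined ? null : value;
  }

  setItem(key: string, value: string): void {
    this.ensureSnapshot().set(key, value);
    this.changes.set(key, value);
  }

  removeItem(key: string): void {
    this.ensureSnapshot().delete(key);
    this.changes.set(key, null);
  }

  // Call once the task and its microtask checkpoint have completed.
  endOfTask(): void {
    this.authoritative.apply(this.changes);
    this.changes = new Map<string, string | null>();
    this.snapshot = null;
  }
}
```

With two `AgentTaskView` instances sharing one `AuthoritativeBottleMap`, a write flushed by one agent is not visible to the other until that agent's current task ends and it takes a fresh snapshot, which is the run-to-completion behavior described above.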

annevk commented 4 years ago

There's also "does this fit in the bottle?" I suppose, which does happen to fail synchronously for localStorage and sessionStorage (though as specified only for a single method), but presumably based on a thread-local understanding of the status quo.

asutherland commented 4 years ago

Yeah, I was lumping the LocalStorage/SessionStorage quota checks into agent-local policy decisions along with structured serialization refusing to serialize things (for other storage endpoints). For LocalStorage/SessionStorage the quota check needs to happen synchronously (and structured serialization is not involved for them).

Impl-specific notes: For Firefox's LSNG the agent can be said to hold a quota pre-authorization like those used for credit/debit cards. If a call needs more space than was pre-allocated, a task is synchronously dispatched from the agent event loop to the authoritative bottle map's event loop in order to secure the added quota.
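
The pre-authorization idea can be sketched roughly as follows. `QuotaAuthority`, `AgentQuota`, the byte counts, and the `syncRequests` counter are hypothetical names for this example, and the synchronous cross-loop dispatch is collapsed into a plain method call.

```typescript
// Lives with the authoritative bottle map; tracks the quota still available.
class QuotaAuthority {
  constructor(private remaining: number) {}

  // Grant the requested bytes if they are available.
  request(bytes: number): boolean {
    if (bytes > this.remaining) {
      return false;
    }
    this.remaining -= bytes;
    return true;
  }
}

// Lives on the agent event loop; holds a locally spendable pre-authorization.
class AgentQuota {
  private preAuthorized = 0;
  syncRequests = 0; // how often the agent had to go back to the authority

  constructor(private readonly authority: QuotaAuthority, initial: number) {
    if (authority.request(initial)) {
      this.preAuthorized = initial;
    }
  }

  // Returns true if the write fits; a real API would throw a quota error instead.
  reserve(bytes: number): boolean {
    if (bytes <= this.preAuthorized) {
      // Fast path: decided purely from agent-local state.
      this.preAuthorized -= bytes;
      return true;
    }
    // Slow path: stands in for the synchronous dispatch to the authoritative loop.
    const shortfall = bytes - this.preAuthorized;
    this.syncRequests++;
    if (!this.authority.request(shortfall)) {
      return false;
    }
    this.preAuthorized = 0;
    return true;
  }
}
```

Most writes take the fast path and never leave the agent loop; only a write that exceeds the pre-authorized amount pays for the synchronous round trip.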

whatwg / storage

Rethinking storage proxy map #96