whatwg/storage

Storage Standard
https://storage.spec.whatwg.org/

Define size of all storage actions #110


annevk commented 4 years ago

In order to give developers a more consistent experience across browsers, while allowing browsers to compress, deduplicate, and otherwise optimize the stored data, we should standardize the upper bound for each storage action and have all browsers enforce that.

E.g., the size of localStorage[key] = value could be (key's code unit length + value's code unit length) × 2 + 16 bytes of safety padding or some such. (I did not put a lot of thought into this. If we go down this path we'd need to do that.)
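
To make the shape of such a formula concrete, here is a minimal TypeScript sketch of the charge for a localStorage write, using the illustrative constants above (2 bytes per UTF-16 code unit plus 16 bytes of padding). The function name and constants are placeholders, not proposed normative values.

```ts
// Illustrative only: mirrors the example formula above, not a proposed
// normative cost.
function localStorageSetCost(key: string, value: string): number {
  // string.length counts UTF-16 code units, so multiply by 2 for bytes,
  // then add a fixed amount of safety padding.
  return (key.length + value.length) * 2 + 16;
}

// e.g. localStorageSetCost("theme", "dark") === (5 + 4) * 2 + 16 === 34
```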

(See 6 in https://github.com/whatwg/storage/issues/95#issuecomment-656555686 and reply for context.)

asutherland commented 4 years ago

This seems desirable and has indeed come up before, specifically around allowing structured-serialized storage of data on things like ServiceWorker registrations and related data (e.g. Notification.data), where it would be desirable to place an upper bound on storage, but doing so is an interop nightmare without this issue addressed.

I believe this would require the serialization steps for [Serializable] to also produce a size/upper-bound value?

It seems like the most complex issues are:

  1. Blob/File and any similarly immutable abstractions that allow implementations like IndexedDB to store a single copy of the data on disk. Firefox only stores a single copy of a given Blob/File (based on object identity, independent of contents). I presume the only course of action is either to standardize this or to tally each time the blob is used in a structured serialization (which will de-duplicate internally via its "memory"). If standardized, interesting and terrifying new possibilities are raised, such as the BlobStore being its own storage endpoint, which could then be used by Notification.data and even ServiceWorker's Cache API storage.
  2. Compression. It would be unfortunate if implementations could store data natively in a CPU/power/disk-efficient way but had to charge a high quota cost for it, pushing content to perform less efficient compression in JS/WASM in order to be charged a lower quota cost while actually using more disk space (see the sketch below). Presumably the answer is Compression Streams? But this is still awkward because, for example, Firefox currently uses Snappy (for Cache API storage) and wants to use LZ4 (for Cache API storage and IndexedDB); neither of those is yet specified, and it would be arguably silly to run gzip against data just for the purposes of calculating a more generous quota charge while actually storing the data using LZ4.
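
To illustrate the gaming concern in item 2, here is a hedged sketch of content compressing its own payload with the Compression Streams API before storing it. The helper name is made up, and nothing here is part of any proposal.

```ts
// Hypothetical helper: gzip a buffer in content JS so that, under a
// "charge what you hand to the storage API" model, the page is charged
// for the compressed bytes rather than the original ones.
async function gzipBytes(data: Uint8Array): Promise<Uint8Array> {
  const compressed = new Blob([data]).stream()
    .pipeThrough(new CompressionStream("gzip"));
  return new Uint8Array(await new Response(compressed).arrayBuffer());
}

// The engine might have stored the original more efficiently on its own
// (e.g. with Snappy or LZ4), so this only shifts work onto the page.
```
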
pwnall commented 4 years ago

Thank you very much for opening a specific issue for this topic!

Reiterating here for clarity -- Chrome is supportive of this effort to come up with an abstract cost model for storage. We'd be willing to take on the (quite non-trivial) implementation costs if the model gains cross-browser acceptance.

I also really like that @asutherland brought up some of the complex issues early on. I'd be tempted to follow the solutions of other systems I'm aware of.

  1. Blobs: Charge a separate copy per item. I claim this approach is more intuitive to users -- you're charged for what you write, with decisions made locally. Implementers get the benefits from content de-duplication as operational cost reduction. I think this approach would also make the proposal more palatable, because we'd be avoiding asking browsers to implement content de-duplication to be compliant.

  2. Compression: Charge for uncompressed data. Same reasoning as above -- it's more intuitive to be charged for what you write. Also, unless we mandate that each object is compressed individually, compression ratios depend on adjacent data, so I think we'd end up with a lot of constraints around physical data layout. I'd strongly prefer that specs don't get into this business :smile:
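
As an illustration of the "charge for what you write" rule in both items above, here is a small hypothetical ledger; the class and its names are invented for this sketch and do not reflect any agreed-on model.

```ts
// Hypothetical quota ledger: every write is charged at its caller-visible
// size, regardless of whether the engine later de-duplicates or
// compresses the bytes on disk.
class QuotaLedger {
  private used = 0;
  constructor(private readonly limit: number) {}

  charge(blob: Blob): void {
    if (this.used + blob.size > this.limit) {
      throw new DOMException("Quota exceeded", "QuotaExceededError");
    }
    this.used += blob.size;
  }
}

// Writing the same Blob twice is charged twice under this model, even if
// an engine that de-duplicates by object identity stores the bytes once.
const ledger = new QuotaLedger(10 * 1024 * 1024);
const photo = new Blob([new Uint8Array(1024)]);
ledger.charge(photo);
ledger.charge(photo);
```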

On a brighter note, the zstd benchmarks suggest that the algorithms we'd consider have ratios within 2x of each other (and below 3x of uncompressed) for "typical" data. I claim this is well within the precision margin for the cost model we'd be building up here.

Along the same lines, I hope that we can avoid having apps play games (like manual compression) by being reasonably generous with quota. Ideally, apps without bugs should not run into quota problems.

pwnall commented 4 years ago

I found some notes from when I tried to sketch a storage cost model for IndexedDB. This was in 2018, and I knew a lot less about the implementation back then. So, the numbers are probably bad, but at least it's a list of things to consider.

Object cost:

I might have missed some other object. The idea is to assign a cost based on a straightforward representation for each clonable. The cost doesn't have to be exact, because we expect implementations to have their own overhead.
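
As a rough illustration of assigning each clonable a cost based on a straightforward representation, here is a hypothetical estimator. The per-type constants and the fallback are placeholders, not the numbers from these notes.

```ts
// Hypothetical estimator: charge each structured-clonable value a cost
// derived from a simple in-memory representation. Constants are
// placeholders, not proposed values.
function estimateCloneCost(value: unknown): number {
  if (value === null || value === undefined ||
      typeof value === "boolean" || typeof value === "number") {
    return 8; // small primitives: flat cost
  }
  if (typeof value === "string") {
    return 16 + value.length * 2; // header + UTF-16 code units
  }
  if (value instanceof ArrayBuffer) {
    return 16 + value.byteLength;
  }
  if (Array.isArray(value)) {
    return 16 + value.reduce(
      (sum: number, item) => sum + estimateCloneCost(item), 0);
  }
  if (typeof value === "object") {
    return 16 + Object.entries(value).reduce(
      (sum, [k, v]) => sum + k.length * 2 + estimateCloneCost(v), 0);
  }
  return 64; // types not modeled here (Blob, Map, Set, Date, ...)
}
```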

IndexedDB transaction costs (get refunded when the transaction completes):

This isn't a complete list. I hope it's a good starting point if someone is itching to start an explainer :smile:

asutherland commented 4 years ago

@pwnall Your simplifying proposal in https://github.com/whatwg/storage/issues/110#issuecomment-662493325 sounds good to me. It's also very consistent with reality: Mozilla's Servo project is an example of bringing up a browser more or less from scratch, and they have found implementing IndexedDB non-trivial, so further complicating the standard and raising the bar for building a compliant browser engine would not be a win for the web.

annevk commented 4 years ago

See also: https://github.com/whatwg/html/issues/4914.