ysono commented 3 years ago

Terminology:

Let me attempt to clarify the terminology first. (I'm open to calling things differently.)

One "item" is either a key or a value, and it's represented as

|datum_size|datum| -- format

|size_of(usize)|(variable length)| -- byte size

A key-value entry is presented as

Issue:

Right now, both the tombstone item and the "empty datum" item are presented as

|0usize|| -- serialized

Solutions:

Option 1: Change datum_size to be typed `isize`.

|size_of(isize)|(variable length)| -- byte size

|-1isize|| -- serialized tombstone |0isize|| -- seriazlied "empty datum"

Option 2: Add discriminant into the serialization format.

The presence of discriminant allows us to add explicit data type info in the serialized format, such as integer, string, tuple, maybe even custom types. I'd like to know if this is what other storage engines do, or whether they punt the interpretation to some mechanism outside serde, and if so what this mechanism is.

Let's call "span" := discriminant + datum.

"item" := span_size + span

|span_size|discriminant|datum| -- format

|size_of(usize)|size_of(u16)|(variable length)| -- byte size

Span size needs to come first (i.e. before discriminant) b/c that allows a reading client that is not interested in deserializing datum to seek ahead in file.

The choice of u16 to represent discriminant is arbitrary. In case we need to deprecate supported discriminants over time, this allows us (pow(2,16) - count_of_active_discriminants) deprecations, before rolling over to zero.

If we go with option 2, we should go ahead and add discriminant for both key and value. Not only would doing so simplify serde, it would add a new feature that keys can be integer, tuple, etc.

For in-memory interpretation, we would have a related change: Both key and value wrap datum.

enum Datum {
    Bytes(Vec<u8>),
    I64(i64),
    Str(String),
    Tuple(Vec<Datum>), // This allows infinite nesting. Is this necessary and sufficient for defining custom types?
}

struct Key(pub Datum);

enum Value {
    Tombstone,
    Datum(Datum),
}

The primary index (i.e. struct LSM) concepts should not need to change.

the commit log file lists (item, item) pairs; and the memtable maps {key: value}
each sstable file lists (item, item) pairs; and its sparse pointer maps {key : offset_in_file}

How would this affect secondary indexing?

A secondary index maps { a portion of a value extracted in a specific way : associated values }. It seems to me that Option 2 offers an opportunity to specify the extraction.

At a hand-waving level, one secondary index needs to have the following things:

("pk_cl_offset" := identifier of primary key's commit log and the file offset therein)

("pk_sst_offset" := identifier of one of primary key's sstables and the file offset therein)

secondary key extractor definition. Eg "for values typed Tuple(Str, I64, Str), extract the 2nd Str"
the commit log file lists (item, pk_cl_offset) pairs; and the memtable maps {extracted_key: pk_cl_offset}
each sstable file lists (item, [pk_sst_offset]) pairs; and its sparse pointer maps {extracted_key: [pk_sst_offset]}

This has been an idea dump based on a lot of guesswork in the head. Please point out blindspots, inconsistencies, gotchas, and also examples of rela-life DBs if you find them!

btc commented 3 years ago

Maybe this is a silly question: This consideration seems to stem from the need for an empty value. Is there a use-case you have in mind for empty values?

ysono commented 3 years ago

If we're choosing to not distinguish empty-valued entries and nonexistence, then that's a contract we need to specify. The client needs to be aware that using the put api may result in data deletion (no-op if there was no data before).

An empty value can be interpreted in several ways:

A non-interpreted placeholder indicating existence. Using the key-value store as a set data structure.

{
    "reservation:room:123:alice": _,
    "reservation:room:123:bob": _
}

The "room 123" set contains alice and bob.

Empty string, list, or any other data structure that might serialize to empty bytes.

Values are middle names.
Values are shopping carts.
Would empty bytes be a valid serialization result by things like cap'n proto? I'm not sure.

There can be arbitrary business logic that differentiates nonexistence and emptiness, say a previously deleted shopping cart and a newly created shopping cart.

When an operation reads one key, it might be fine -- say we're trying to append on item to a list-valued entry. But for an operation that reads two keys, user may desire different handling based on whether zero or one or both entries are missing.

If we were building a production-worthy db, I wouldn't make a low-level design decision that impacts high levels so widely. But since we're writing a tool to learn on, design decisions should keep things from becoming too complicated and should enable us to learn more stuff.

I think I'll read up on how DBs handle types, before moving forward.

btc commented 3 years ago

on contracts

If we're choosing to not distinguish empty-valued entries and nonexistence, then that's a contract we need to specify.

i agree it would be helpful to establish an API contract for public methods, and lean on those to guide design decisions.

(establishing a contract for get, put, scan, delete, and perhaps has_key)

then maybe it's clear what the underlying implementation must do in order to satisfy the contract?

in the wild

I think I'll read up on how DBs handle types, before moving forward.

Here's one from Scylla DB: https://docs.scylladb.com/architecture/sstable/sstable2/sstable-data-file/

and Cassandra: http://distributeddatastore.blogspot.com/2013/08/cassandra-sstable-storage-format.html

pancake contract e.g.

pub fn get(&self, k: Key) -> Result<Option<Value>>

Perhaps get ought to return Result<Value>? and get(k) a non-existent key k returns an "Err::NotExists"?

Am I misunderstanding the use-case for Option again? lol

Overall

I agree with the need to define a tombstone to be different from the empty value. Sorry it took me a while to wrap around it.

Perhaps now that I am onboard, we can bring the discussion back to your original offering of ways to represent tombstones and values.

🙏🏽

btc commented 3 years ago

discriminant

Is the discriminant essentially an enum which specifies the type of the value?

ysono commented 3 years ago

API, how to expose existence vs empty

No problem. I agree that Option can hide intention. IMO we should go ahead and convert it to explicit enums. We'd lose monad methods like .map() but IMO they make things harder to read.

The type that the pub fn get() returns should prob be different from the internal storage::api::*.

mod storage {
    pub enum Datum {
        ...
    }
    pub struct Key(Datum)
    pub enum Value {
        Tombstone,
        Datum(Datum),
    }
}
mod frontend {
    pub use storage::Datum;
    pub enum Value {
        None, // http 404
        Datum(Datum), // http 200 or 204
    }
}

Encoding datum type info

Yeah by discriminant I mean what rust calls the integer version of an enumerated type.

The scylla and cassandra docs (thanks for sharing) validated the need for there to be type info serialized. Cassandra uses a one-byte discriminant. Scylla seems to punt it to some deserialization mechanism, which I couldn't figure out at a quick look at the source code. I argue it's a good interface for our db to encode primitive info, support tuple, and make client be responsible for parsing tuples --ie clients don't have to parse primitives.

To clarify - most use cases would put externally serialized data into a bytes-typed blob in the db. The tuple thing is for db-aware separation of data, like searching by a non-primary key.

ysono / pancake

Distinguish tombstone vs empty datum #22

Terminology:

Issue:

Solutions:

Option 1: Change datum_size to be typed `isize`.

Option 2: Add discriminant into the serialization format.

A non-interpreted placeholder indicating existence. Using the key-value store as a set data structure.

Empty string, list, or any other data structure that might serialize to empty bytes.

on contracts

in the wild

pancake contract e.g.

Overall

discriminant

API, how to expose existence vs empty

Encoding datum type info

ysono / pancake

Distinguish tombstone vs empty datum #22

Terminology:

Issue:

Solutions:

Option 1: Change datum_size to be typed isize.

Option 2: Add discriminant into the serialization format.

A non-interpreted placeholder indicating existence. Using the key-value store as a set data structure.

Empty string, list, or any other data structure that might serialize to empty bytes.

on contracts

in the wild

pancake contract e.g.

Overall

discriminant

API, how to expose existence vs empty

Encoding datum type info

Option 1: Change datum_size to be typed `isize`.