ysono / pancake

7 stars 1 forks source link

implementing LSM::scan #23

Closed btc closed 3 years ago

btc commented 3 years ago

It would be nice to have a scan method on the key-value store. This is an issue to track its implementation.

SoryRawyer commented 3 years ago

Just to make sure I'm on the same page, should scan return all Values present? If so, how does this look as the signature for lsm::scan?:

pub fn scan(&self) -> Result<Iterator<Value>>

Given that the return value for Iterator::next is an Option<Value>, I think that should handle the case where we have no Values to return.

As for implementation, it seems like scan would return any items present in each of these locations:

Does that sound about right? Would we need to worry about duplicate elements between any of those locations?

ysono commented 3 years ago

scan could be a range scan, with the whole-db scan a specific case. Both would be nice. And it can help with testing too.

Iterator::next returning Option<Foo> is rust's ways of saying that a terminated iterator returns None.

One key can be located across any and all of memtalbe, memtable_in_flush, and sstables. Each key is deduped within each file, but not across files. For scan, we need to dedupe per key. LSM::get() doesn't so much dedupe as stop at the first instance found.

btc commented 3 years ago

the interface of scan

a minor tweak: @SoryRawyer scan ought to take start and end parameters. e.g.:

// Scan(start-key, end-key) - Retrieve all of the keys between start-key (inclusive) and end-key (exclusive).
pub fn scan(&self, start: Key, end: Key) -> Result<FooBarIterator>

⚠️ Note the return type above is left as FooBarIterator. Functionally, the iterator must provide both keys and values, so in pseudo-code, it would be akin to Result<Iterator<(Key, Value)>>. My Rust knowledge is weak so, off the top of my head, I don't know what the precise type would be.

⚠️ in practice, the Key types might be better as &Key references. (Sometimes the compiler and function calls have a way of dictating the terms.)

💡 Elsewhere in the project, there is a KeyValueIterator type which could potentially be helpful as a guide. However, it might make the task simpler relax code-reuse-concern constraints and duplicate a little code for simplicity.

regarding the implementation

As for implementation, it seems like scan would return any items present in each of these locations: memtable, memtable_in_flush, sstables.

Yup. It'll have to take into account the various sources of data.


Would we need to worry about duplicate elements between any of those locations?

Yup!. The same key k may appear in more than one source. The most recent value for k is the one to be returned.

The memtable is the most recent. The memtable in flush is the second most recent. The sstables are ordered from oldest to newest, and are all older than the memtables.

So the scan method requires merging of values from the three sources according to the above precedence rules.

pragmatics

the above guidance may be enough to proceed. however, perhaps it would be helpful for me to mention that it might be worthwhile to break the problem down into smaller, modular chunks:

  1. implement SSTable::scan for a single table, perhaps returning an iterator.
  2. implement Memtable::scan for a single memtable, perhaps returning an iterator.
  3. implement a function to combine the results of individual SSTable::scan and Memtable::scan calls, perhaps taking a list of iterators and returning a merged iterator.
  4. implement LSM::scan which combines all of the above

performance considerations

there are probably elegant and clever ways to optimize the computation to minimize allocations and comparisons, but perhaps it would be helpful to shoot for clarity and correctness in a first pass (and punt on performance).

checking in

Is this helpful? Did this explanation raise any new questions?

SoryRawyer commented 3 years ago

Thanks folks, this is super helpful. The start and end Key arguments make sense, I just wasn't sure if there was some kind of ordering involved with the Keys. After taking another look at the definition of Key, I see the Ord trait is derived. I think this is enough to get started. Thank you for such thorough and thoughtful responses!