Closed btc closed 3 years ago
Just to make sure I'm on the same page, should scan
return all Values present?
If so, how does this look as the signature for lsm::scan
?:
pub fn scan(&self) -> Result<Iterator<Value>>
Given that the return value for Iterator::next
is an Option<Value>
, I think that should handle the case where we have no Values to return.
As for implementation, it seems like scan
would return any items present in each of these locations:
memtable
memtable_in_flush
sstables
Does that sound about right? Would we need to worry about duplicate elements between any of those locations?
scan
could be a range scan, with the whole-db scan a specific case. Both would be nice. And it can help with testing too.
Iterator::next
returning Option<Foo>
is rust's ways of saying that a terminated iterator returns None
.
One key can be located across any and all of memtalbe, memtable_in_flush, and sstables. Each key is deduped within each file, but not across files. For scan, we need to dedupe per key. LSM::get()
doesn't so much dedupe as stop at the first instance found.
scan
a minor tweak: @SoryRawyer scan
ought to take start
and end
parameters. e.g.:
// Scan(start-key, end-key) - Retrieve all of the keys between start-key (inclusive) and end-key (exclusive).
pub fn scan(&self, start: Key, end: Key) -> Result<FooBarIterator>
⚠️ Note the return type above is left as FooBarIterator
. Functionally, the iterator must provide both keys and values, so in pseudo-code, it would be akin to Result<Iterator<(Key, Value)>>
. My Rust knowledge is weak so, off the top of my head, I don't know what the precise type would be.
⚠️ in practice, the Key types might be better as &Key
references. (Sometimes the compiler and function calls have a way of dictating the terms.)
💡 Elsewhere in the project, there is a KeyValueIterator
type which could potentially be helpful as a guide. However, it might make the task simpler relax code-reuse-concern constraints and duplicate a little code for simplicity.
As for implementation, it seems like scan would return any items present in each of these locations: memtable, memtable_in_flush, sstables.
Yup. It'll have to take into account the various sources of data.
Would we need to worry about duplicate elements between any of those locations?
Yup!. The same key k
may appear in more than one source. The most recent value for k
is the one to be returned.
The memtable is the most recent.
The memtable in flush is the second most recent.
The sstables
are ordered from oldest to newest, and are all older than the memtables.
So the scan method requires merging of values from the three sources according to the above precedence rules.
the above guidance may be enough to proceed. however, perhaps it would be helpful for me to mention that it might be worthwhile to break the problem down into smaller, modular chunks:
SSTable::scan
for a single table, perhaps returning an iterator.Memtable::scan
for a single memtable, perhaps returning an iterator.SSTable::scan
and Memtable::scan
calls, perhaps taking a list of iterators and returning a merged iterator.LSM::scan
which combines all of the abovethere are probably elegant and clever ways to optimize the computation to minimize allocations and comparisons, but perhaps it would be helpful to shoot for clarity and correctness in a first pass (and punt on performance).
Is this helpful? Did this explanation raise any new questions?
Thanks folks, this is super helpful. The start and end Key
arguments make sense, I just wasn't sure if there was some kind of ordering involved with the Keys. After taking another look at the definition of Key
, I see the Ord
trait is derived. I think this is enough to get started. Thank you for such thorough and thoughtful responses!
It would be nice to have a
scan
method on the key-value store. This is an issue to track its implementation.