zhangchunzhong / go-dedupe

dedupe algorithms in golang
MIT License

study fingerprint index #4

Open zhangchunzhong opened 1 year ago

zhangchunzhong commented 1 year ago

Is your feature request related to a problem? Please describe.
Read documents:

a) Avoiding the Disk Bottleneck in the Data Domain Deduplication File System @ FAST'08.
b) Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality @ FAST'09.
c) Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup @ MASCOTS'09.
d) SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput @ USENIX ATC'11.
e) Building a High-Performance Deduplication System @ USENIX ATC'11.
f) Block Locality Caching for Data Deduplication @ SYSTOR'13.
g) The Design of a Similarity-Based Deduplication System @ SYSTOR'09.

Describe the solution you'd like

Describe alternatives you've considered
n/a

Additional context
n/a

zhangchunzhong commented 1 year ago

The fingerprint index is a core concept in deduplication.
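
For context, here is a minimal sketch of the baseline these papers improve on: a full in-memory chunk index that maps every fingerprint to its storage location. The SHA-1 fingerprints and the uint64 container ID are illustrative assumptions of mine, not anything from this repo.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// Fingerprint is the SHA-1 hash of a chunk's contents.
type Fingerprint [sha1.Size]byte

// ChunkIndex is a naive, full in-memory fingerprint index:
// it maps every chunk fingerprint to where the chunk is stored.
// The papers above exist because, at scale, this no longer
// fits in RAM and a disk-based version is too slow.
type ChunkIndex struct {
	entries map[Fingerprint]uint64 // fingerprint -> container ID
}

func NewChunkIndex() *ChunkIndex {
	return &ChunkIndex{entries: make(map[Fingerprint]uint64)}
}

// Lookup reports whether a chunk is already stored and where.
func (ix *ChunkIndex) Lookup(fp Fingerprint) (uint64, bool) {
	loc, ok := ix.entries[fp]
	return loc, ok
}

// Insert records a newly stored chunk's location.
func (ix *ChunkIndex) Insert(fp Fingerprint, loc uint64) {
	ix.entries[fp] = loc
}

func main() {
	ix := NewChunkIndex()
	chunk := []byte("example chunk data")
	fp := Fingerprint(sha1.Sum(chunk))

	if _, dup := ix.Lookup(fp); !dup {
		ix.Insert(fp, 42) // hypothetical container ID
	}
	fmt.Println(ix.Lookup(fp))
}
```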

zhangchunzhong commented 1 year ago

Study a) Avoiding the Disk Bottleneck in the Data Domain Deduplication File System @ FAST'08.
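
One key trick in that paper is the Summary Vector: a Bloom filter kept in RAM so that lookups for definitely-new chunks never touch the on-disk index. Below is a rough Go sketch of just the filter part; the FNV-based probing and the sizes are my own illustrative choices, not the paper's.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// SummaryVector is a small Bloom filter held in RAM: a negative
// answer proves a chunk is new, so the on-disk index is only
// consulted on (possibly false) positives.
type SummaryVector struct {
	bits []byte
	k    uint32 // number of hash probes per fingerprint
}

func NewSummaryVector(nbits, k uint32) *SummaryVector {
	return &SummaryVector{bits: make([]byte, (nbits+7)/8), k: k}
}

// probe derives the i-th bit position for a fingerprint by
// salting an FNV hash with the probe number.
func (sv *SummaryVector) probe(fp []byte, i uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(i)})
	h.Write(fp)
	return h.Sum32() % uint32(len(sv.bits)*8)
}

// Add sets the k probe bits for a stored chunk's fingerprint.
func (sv *SummaryVector) Add(fp []byte) {
	for i := uint32(0); i < sv.k; i++ {
		p := sv.probe(fp, i)
		sv.bits[p/8] |= 1 << (p % 8)
	}
}

// MayContain returns false only if fp was definitely never added.
func (sv *SummaryVector) MayContain(fp []byte) bool {
	for i := uint32(0); i < sv.k; i++ {
		p := sv.probe(fp, i)
		if sv.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	sv := NewSummaryVector(1<<20, 4)
	fp := []byte("fingerprint-of-some-chunk")
	fmt.Println(sv.MayContain(fp)) // false: chunk is new, skip the disk
	sv.Add(fp)
	fmt.Println(sv.MayContain(fp)) // true: now worth checking on disk
}
```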

zhangchunzhong commented 1 year ago

Found a dedupe simulator: https://github.com/jkaiser/dedup_simulation

zhangchunzhong commented 1 year ago

https://github.com/yuanjingsong/PaperNoteCollection

zhangchunzhong commented 1 year ago

Study b) Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality @ FAST'09.

We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve, for large-scale backup (e.g., hundreds of terabytes), the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow.

We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM to disk ratio for effective deduplication. At the same time, only a few seeks are required per segment, so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into a number of Hewlett-Packard backup products.
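
A hedged Go sketch of the mechanism the abstract describes: sample a small fraction of chunk fingerprints as hooks, keep a sparse hook-to-segment index in RAM, and deduplicate each incoming segment against only the few prior segments ("champions") that share the most hooks with it. The sampling mask, the segment IDs, and all names here are illustrative assumptions, not the paper's.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"sort"
)

type Fingerprint [sha1.Size]byte

// isHook samples a small fraction of chunks: fingerprints whose
// low bits are zero, i.e. roughly a 1-in-2^maskBits sampling rate.
const maskBits = 6

func isHook(fp Fingerprint) bool {
	return fp[len(fp)-1]&((1<<maskBits)-1) == 0
}

// SparseIndex maps sampled fingerprints ("hooks") to the IDs of
// segments containing them; non-sampled chunks are never indexed,
// which is what keeps the index small enough for RAM.
type SparseIndex map[Fingerprint][]int

// Add records a new segment's hooks in the sparse index.
func (ix SparseIndex) Add(segID int, hooks []Fingerprint) {
	for _, h := range hooks {
		ix[h] = append(ix[h], segID)
	}
}

// Champions ranks stored segments by how many hooks they share
// with the incoming segment and returns the top few; the segment
// is then deduplicated against only those champions.
func (ix SparseIndex) Champions(hooks []Fingerprint, max int) []int {
	votes := map[int]int{}
	for _, h := range hooks {
		for _, seg := range ix[h] {
			votes[seg]++
		}
	}
	segs := make([]int, 0, len(votes))
	for s := range votes {
		segs = append(segs, s)
	}
	sort.Slice(segs, func(i, j int) bool { return votes[segs[i]] > votes[segs[j]] })
	if len(segs) > max {
		segs = segs[:max]
	}
	return segs
}

func main() {
	ix := SparseIndex{}
	// Toy demo: index every fingerprint as a hook so the output is
	// deterministic; a real system would sample with isHook.
	var hooks []Fingerprint
	for _, data := range []string{"chunk-a", "chunk-b", "chunk-c"} {
		hooks = append(hooks, Fingerprint(sha1.Sum([]byte(data))))
	}
	ix.Add(0, hooks)
	fmt.Println("champion segments:", ix.Champions(hooks, 2)) // -> [0]
}
```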