Is your feature request related to a problem? Please describe.
Read documents:

a) Avoiding the Disk Bottleneck in the Data Domain Deduplication File System @ FAST'08
b) Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality @ FAST'09

Abstract of b): "We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve for large-scale backup (e.g., hundreds of terabytes) the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM-to-disk ratio for effective deduplication. At the same time, only a few seeks are required per segment, so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into a number of Hewlett-Packard backup products."
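To see why the full chunk index is the bottleneck, a rough back-of-the-envelope (our numbers, not the paper's): 100 TB of backup data at an average 8 KB chunk size is about 1.25 × 10^10 chunks; at roughly 40 bytes per index entry (a 20-byte SHA-1 hash plus location metadata), a full chunk index would need ~500 GB, far beyond RAM, while a disk-resident index costs one seek per incoming chunk. The sketch below illustrates the sampling idea from the abstract: only sampled hashes live in the in-RAM sparse index, and each incoming segment is deduplicated against a handful of the most similar previous segments. This is a minimal illustration under assumed parameters; `SAMPLE_BITS`, `MAX_CHAMPIONS`, `segment_manifests`, and the sampling rule are our inventions, not the paper's actual design.

```python
# Minimal sketch of the sparse-indexing scheme the abstract describes.
# SAMPLE_BITS and MAX_CHAMPIONS are illustrative values, not the
# paper's parameters; real segmenting/sampling details differ.
import hashlib
from collections import Counter

SAMPLE_BITS = 6    # sample a chunk if its hash is 0 mod 2^6 (~1/64 rate)
MAX_CHAMPIONS = 2  # deduplicate against at most this many old segments

# Sampled hash -> ids of segments containing it; this is the only RAM index.
sparse_index: dict[bytes, set[int]] = {}
# Stand-in for on-disk per-segment chunk-hash manifests.
segment_manifests: dict[int, set[bytes]] = {}

def chunk_hash(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()

def is_sample(h: bytes) -> bool:
    # Content-defined sampling: deterministic, so repeated data
    # produces the same samples on every backup run.
    return int.from_bytes(h, "big") % (1 << SAMPLE_BITS) == 0

def dedup_segment(seg_id: int, chunks: list[bytes]) -> int:
    """Deduplicate one incoming segment; return how many chunks must be stored."""
    hashes = [chunk_hash(c) for c in chunks]
    samples = [h for h in hashes if is_sample(h)]

    # Previous segments sharing the most samples with us are the "champions":
    # by stream locality they likely hold most of our duplicate chunks.
    votes = Counter(seg for h in samples for seg in sparse_index.get(h, ()))
    champions = [seg for seg, _ in votes.most_common(MAX_CHAMPIONS)]

    # Only champion manifests are fetched: a few seeks per segment,
    # instead of one seek per incoming chunk.
    known = set().union(*(segment_manifests[seg] for seg in champions))
    new_chunks = [h for h in hashes if h not in known]

    # Record this segment; only its *sampled* hashes enter the RAM index.
    segment_manifests[seg_id] = set(hashes)
    for h in samples:
        sparse_index.setdefault(h, set()).add(seg_id)
    return len(new_chunks)
```

With a ~1/64 sampling rate, the RAM index is roughly 64× smaller than a full chunk index; the trade-off is that duplicates outside the chosen champion segments are missed, which the abstract argues stays rare because backup streams exhibit strong locality.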
Describe the solution you'd like
Study papers a) and b) above; the chunk-lookup index and its disk bottleneck are core concepts in deduplication. Also find a dedupe simulator to experiment with; one candidate is https://github.com/jkaiser/dedup_simulation.
Describe alternatives you've considered
n/a

Additional context
n/a