y-256 / libdivsufsort

A lightweight suffix-sorting library
MIT License

May I ask why you don't use the DC3 algorithm for building the suffix array? #17

Closed JimChengLin closed 6 years ago

JimChengLin commented 6 years ago

I am quite curious. DC3 is O(n), and a suffix array is typically used to handle considerably large amounts of data, such as DNA sequences, so I expect DC3 to beat O(n log n) algorithms easily.

Did you try DC3? I have implemented DC3 in Python, but don't have much motivation to port it to practical C++ at the moment :(.

Thanks, pal!

kloetzl commented 6 years ago

I expect DC3 will beat O(n log n) algorithms easily.

Unfortunately, your expectation does not match reality. The O-notation hides the constant factor. Asymptotically optimal algorithms, such as DC3, usually come with a high constant, while suboptimal algorithms are much faster in practice.

A different example: KMP is O(n + m), whereas a naïve search is O(n*m). However the latter is ridiculously fast for most input sizes, because of its cache-friendly behaviour and tight loops.
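For illustration, a minimal naïve search (not libdivsufsort code); the inner loop touches the text and the pattern strictly sequentially, which is what makes it so cache-friendly despite the O(n*m) worst case:

```c
#include <stddef.h>

/* Returns the offset of the first occurrence of pat in text, or -1. */
ptrdiff_t naive_search(const char *text, size_t n, const char *pat, size_t m) {
    if (m == 0) return 0;
    if (m > n)  return -1;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && text[i + j] == pat[j])   /* tight, sequential inner loop */
            j++;
        if (j == m) return (ptrdiff_t)i;
    }
    return -1;
}
```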

In essence: Measurements beat intuition.

JimChengLin commented 6 years ago

Thanks for the information. Since you guys have already done a good benchmark, I don't need to spend my time on it again.

akamiru commented 6 years ago

The currently most efficient O(n) algorithm is still Yuta Mori's SAIS implementation, which in my testing comes close to, and sometimes beats, libdivsufsort, but only for a few large files.

The problem is that suffix sorting is memory-latency bound for all (at least all well-known) algorithms, and accesses to caches further from the core need exponentially more time. Therefore e.g. libdivsufsort (or my own daware modification) can compete with the linear-time algorithms, because O(n * log(n)) ~= O(n * level_0_access + n * level_1_access + ... + n * level_(log n)_access) ~= O(n * level_0_access + n * level_0_access/2 + ... + n * level_0_access/2^log(n)) = O(n * level_0_access), which explains why it basically scales like its linear-time competitors in practice (which are also O(n * level_0_access)). Memory access times are hard to quantify (due to the many different and complex-to-model cache replacement algorithms and configurations) and are therefore usually assumed to be constant, which explains the theoretical superiority of the linear-time algorithms. [In this example, level_0 refers to the slowest cache level that will be accessed, i.e. the smallest cache that fits the dataset and working memory.]
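Restating that argument as one formula (a sketch, under the idealised assumption that level $\ell$ costs $t_\ell \approx t_0 / 2^{\ell}$, where $t_0$ is the access time of the slowest level actually touched):

$$ n \log n \;\text{accesses} \;\approx\; \sum_{\ell=0}^{\log n} n\,t_\ell \;\approx\; \sum_{\ell=0}^{\log n} \frac{n\,t_0}{2^{\ell}} \;<\; 2\,n\,t_0 \;=\; O(n\,t_0). $$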

BTW: The same problem applies to basic sorting too, e.g. quicksort vs. radix sort. Quicksort sorts the smaller groups first, so their elements are still in cache, while (LSB) radix sort always sorts the whole set in every pass. For that reason one could argue that MSB radix sort is superior, but sadly it needs additional counting at each stage (while LSB can count every radix in a single pass), which doubles the amount of work. This really makes it hard to optimise sorting for all platforms and all data sets.
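For illustration, a minimal LSB radix sort sketch (not libdivsufsort code) showing both properties: all byte histograms are gathered in one counting pass, but every distribution pass then streams over the whole array:

```c
#include <stdint.h>
#include <string.h>

/* Sorts n 32-bit keys; `tmp` is a caller-provided scratch buffer of size n. */
void lsb_radix_sort(uint32_t *a, uint32_t *tmp, size_t n) {
    size_t count[4][256] = {{0}};
    for (size_t i = 0; i < n; i++)            /* one pass counts all four digits */
        for (int d = 0; d < 4; d++)
            count[d][(a[i] >> (8 * d)) & 0xFF]++;
    for (int d = 0; d < 4; d++) {
        size_t sum = 0;                        /* exclusive prefix sums -> bucket starts */
        for (int b = 0; b < 256; b++) {
            size_t c = count[d][b];
            count[d][b] = sum;
            sum += c;
        }
        for (size_t i = 0; i < n; i++)         /* stable scatter over the whole set */
            tmp[count[d][(a[i] >> (8 * d)) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
}
```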

Bulat-Ziganshin commented 6 years ago
  1. This calls for hybrid methods - are they possible?
  2. What about prefetching? My tests of random memory access throughput on an i7-4770 led to the conclusion that prefetching may improve speed ~4x for a single-threaded program (compared to chained access), although that is still 1.5-2x slower than multi-threading + prefetching (i.e. the CPU in this case can't devote its full memory throughput to a single core). Probably, with DDR4 and 4-channel memory the difference will be even higher.

kloetzl commented 6 years ago

W.r.t. 2: Prefetching is not useful here because, roughly speaking, S[SA[i]+k] is pretty much random for increasing values of i. Thus, when iterating through the suffix array, you get a cache miss on each entry. Iterating through a built SA is already cache-unfriendly, so building one cannot be better.

Bulat-Ziganshin commented 6 years ago

Actually, unBWT is easily made multi-threaded by storing several starting points in the BWT output data, which is implemented by BscLib, in particular. Each thread (or even a single thread) can then follow multiple chains and use prefetching to shift performance from latency-bound to throughput-bound, which is implemented by JamPack. I even expect that on a GPU, unBWT may run at speeds close to 1 GB/s. I don't know, though, whether any of this is applicable to BWT construction itself.
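A rough sketch of that chain-following pattern (hypothetical `next` array; not BscLib or JamPack code): K independent chains are advanced round-robin, and right after advancing a chain the element it will need on its next visit is prefetched, so serving the other K-1 chains hides the miss latency.

```c
#include <stddef.h>

enum { K = 8 };   /* number of independent chains followed by one thread */

void follow_chains(const size_t *next, size_t pos[K], size_t steps) {
    for (int c = 0; c < K; c++)                      /* warm up round 0 */
        __builtin_prefetch(&next[pos[c]], 0, 0);
    for (size_t s = 0; s < steps; s++) {
        for (int c = 0; c < K; c++) {
            pos[c] = next[pos[c]];                   /* should hit cache by now */
            __builtin_prefetch(&next[pos[c]], 0, 0); /* needed in round s+1 */
            /* in unBWT, each step would also emit one output byte per chain */
        }
    }
}
```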

JimChengLin commented 6 years ago

Do you think a suffix tree is an option? The most critical part of building a suffix array is roughly like sorting an N*N matrix. No matter what sorting algorithm is used, e.g. radix sort or quicksort, pointers or structs will be swapped frequently. That leads to cache misses; the CPU doesn't know where to prefetch data from.

By contrast, a suffix tree preserves the tree structure, so nodes near the root are likely to stay cached. Everyone claims a suffix tree occupies more memory, which is true, but with some sophisticated trie techniques, e.g. a dynamic compact trie, the space usage may be reduced to an acceptable level.

I have already implemented a fairly optimized O(n) suffix tree. I am reading the paper about the "Bonsai Trie". The idea is simple: use a huge hash table to replace the pointers between trie nodes. I may reopen the issue in the future and post my suffix tree with a compact-trie version.
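A rough sketch of that idea (illustrative only, not the paper's exact scheme): every edge (parent_id, symbol) becomes a key in one big open-addressing hash table, and the slot index doubles as the child's node id.

```c
#include <stdint.h>
#include <stddef.h>

/* Allocation, load-factor handling and resizing are omitted. */
typedef struct {
    uint64_t *key;   /* packed (parent_id << 8) | symbol; 0 = empty slot */
    size_t    size;  /* number of slots, a power of two                  */
} hash_trie;

/* The root gets id 1 and children get ids >= 2, so a key is never 0. */
size_t trie_child(hash_trie *t, size_t parent_id, unsigned char symbol) {
    uint64_t key = ((uint64_t)parent_id << 8) | symbol;
    size_t h = (size_t)(key * 0x9E3779B97F4A7C15ULL) & (t->size - 1);
    while (t->key[h] != 0 && t->key[h] != key)     /* linear probing */
        h = (h + 1) & (t->size - 1);
    t->key[h] = key;        /* inserts the edge, or no-op if it exists */
    return h + 2;           /* slot index shifted past the reserved ids */
}
```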

UPDATE: I am confident to say a suffix tree can be just 2 times bigger than a suffix array. Yeah! That is much better than a naive suffix tree implementation, but not very sexy... And OMG, I just noticed this repo was created more than 10 years ago...

kloetzl commented 6 years ago

Searching within a suffix tree will be faster than in an ESA, because of the reduced number of memory accesses. Not sure about building, though.

akamiru commented 6 years ago

The suffix array is a very sophisticated data structure. It can be constructed a lot faster than a suffix tree (1-2 orders of magnitude in my tests) and it can be queried very fast (and fairly easily parallelized) in the form of an FM-index using backward search. And all of this while using a lot less memory than the suffix tree. Still, the memory usage of the suffix array is too big for many applications (e.g. whole-genome comparison). Therefore basically all current research looks into constructing the Burrows-Wheeler transform directly, which has led to research on succinct data structures, even though that still takes a lot more time.
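For reference, a minimal sketch of backward search (with a deliberately naive occ() so it is self-contained; a real FM-index replaces it with a constant-time rank structure):

```c
#include <stddef.h>

/* occ(c, i): occurrences of c in bwt[0..i-1]. Naive O(i) version. */
static size_t occ(const unsigned char *bwt, unsigned char c, size_t i) {
    size_t cnt = 0;
    for (size_t k = 0; k < i; k++)
        if (bwt[k] == c) cnt++;
    return cnt;
}

/* C[c] = number of symbols in the text strictly smaller than c.
 * Returns how many suffixes start with `pat`, i.e. the match count. */
size_t backward_search(const unsigned char *bwt, size_t n,
                       const size_t C[256],
                       const unsigned char *pat, size_t m) {
    size_t lo = 0, hi = n;            /* SA interval [lo, hi) of the empty string */
    for (size_t k = m; k-- > 0; ) {   /* extend the pattern right to left */
        unsigned char c = pat[k];
        lo = C[c] + occ(bwt, c, lo);
        hi = C[c] + occ(bwt, c, hi);
        if (lo >= hi) return 0;       /* pattern does not occur */
    }
    return hi - lo;
}
```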

In short, suffix trees are not an option worth looking into: people have tried for literally decades and abandoned them.

On prefetching: I've tried a lot in this direction but didn't get a reliable improvement, because most of the time daware (my SACA) works with data that is already in L1-L3. I did in fact get a nice speedup from reducing branch misses using BlockQuicksort, but I don't expect further optimization to pay off much. If I ever find the time I will implement a lot of ideas I have for suffix sorting (I have ideas for greatly improving GSACA - the first non-recursive linear-time SACA) and for bce (using AVX-512 or even a GPU implementation - it's extremely vectorizable except for the entropy-coding stage, which can be implemented using ANS, which is very fast). Sadly it doesn't look like that will happen any time soon, because I am currently working on vastly different topics (... rockets) and only have a limited amount of free time.

JimChengLin commented 6 years ago

@akamiru I found an interesting link: https://www.cs.helsinki.fi/group/suds/cst/ The guys from the University of Helsinki, where Linux was born, claim a Compressed Suffix Tree (CST) can use less space than a suffix array. The suffix tree does not seem to be dead yet... Those PhDs are just amazing! I hope I can be called doctor too, one day.

akamiru commented 6 years ago

@JimChengLin The suffix tree is dead.

The compressed suffix tree has nothing to do with what is commonly known as a suffix tree. The CST is actually built on top of the suffix array (or the relatively recently developed, slower direct BWT constructions I mentioned, which are currently being actively researched), and its most basic version is the FM-index I already cited. These structures actually use the Burrows-Wheeler transform and are only called compressed suffix trees because they support the same queries in nearly the same asymptotic time as the original suffix tree, and because they can be compressed thanks to the properties inherited from the BWT (e.g. bzip2 and other BWT compressors make use of these too). The suffix array can do all of this too. People simply first developed compressed suffix arrays and then extended their functionality until they started calling them compressed suffix trees.

Most CSTs are really slow because their basic idea is to sacrifice speed in order to save memory. This research is bound to become relatively irrelevant soon anyway because of Moore's law. Human DNA is of bounded size, and soon regular SACAs will be able to transform it completely in memory on regular PCs. Memory usage for constructing the suffix array in memory is around 27 GB (3 billion base pairs at 8 bytes each is 24 GB, plus up to 3 GB of genome data); not much for a scientific computer anyway.

JimChengLin commented 6 years ago

@akamiru You are right! I bet you have spent a lot of time in this field. Thanks so much for the valuable information!! For anti-plagiarism/data compression, the desire for smaller suffix {trees, arrays} will never be satisfied. The data growth probably follows Moore's law as well.

I will restart my student life in September, so I have nearly half a year completely free. I have assigned myself a task: how do you build an optimal dictionary/code? Huffman coding produces an optimal code for a fixed-size alphabet, typically the 256 8-bit bytes. If I loosen that constraint, so that the alphabet can consist of an arbitrary number of words of arbitrary bit length, what is the optimal code/dictionary that makes the source smallest?

Since most folks around me are not interested in this, would you mind sharing some ideas about it? Or some papers, if you have heard of any?

The main reason I am fond of the suffix tree over the suffix array is that I can gather statistics on the fly: inner nodes mean repetition, which implies potential to compress data.
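As a rough baseline for "how small can it get" (a sketch, not specific to any library): the order-0 entropy is the bound that Huffman-style codes over a fixed 256-symbol alphabet approach; choosing a different alphabet changes the model and therefore the bound.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Order-0 (per-byte) entropy in bits per symbol; multiply by n to get
 * the size bound in bits for any fixed per-symbol code such as Huffman. */
double order0_entropy(const uint8_t *buf, size_t n) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; i++) count[buf[i]]++;
    double h = 0.0;
    for (int c = 0; c < 256; c++) {
        if (count[c] == 0) continue;
        double p = (double)count[c] / (double)n;
        h -= p * log2(p);
    }
    return h;
}
```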

akamiru commented 6 years ago

Some starting points: http://mattmahoney.net/dc/dce.html https://en.wikipedia.org/wiki/Kolmogorov_complexity https://en.wikipedia.org/wiki/Entropy_(information_theory) https://encode.ru/ (You will find e.g. Bulat-Ziganshin and me over there, but I'm more of a passive reader).

While automatically finding optimal codes is impossible for basically all useful files, there are a few notable practical algorithms which perform well in practice:

2403772980ygy commented 4 years ago

A different example: KMP is O(n + m), whereas a naïve search is O(n*m). However the latter is ridiculously fast for most input sizes, because of its cache-friendly behaviour and tight loops.

Well, I don't think naive search is faster than KMP. I've built random strings as test cases in which the programs match $\pi$, and KMP is much faster when dealing with most long strings......

But I do believe that DC3 is slower than $\Theta(n \log n)$ approaches.