zhaoxiaofei / bindash

Fast and precise comparison of genomes and metagenomes (in the order of terabytes) on a typical personal laptop
Other
57 stars 7 forks source link

question on mergeability of one permutation Minhash with optimal densification #10

Closed jianshu93 closed 10 months ago

jianshu93 commented 1 year ago

Hello Xiaofei,

A quick question about the mergeability of MinHash like algorithms, when sketching a subset of a give dataset, and then sketch another subset of the same dataset, can we merge them as the exact sketch with the sketch from sketching those 2 subset together. This is relevant to genomes, often all are fragmented, several hundred sequences, we can sketching kmers of the first, second sequences upon reading the genome, and when all sequences of this genomes are done reading, sketching is also done and we just need to merge them. This will save some time for large number of genomes when sketching & reading/IO takes a lot of time. I am asking because the densification idea introduce some randomness and I am not sure whether is is mergeable.

Thanks,

Jianshu

jianshu93 commented 1 year ago

Hello Xiaofei,

Just found that you also studied in Tsinghua (Ting Chen's team?), I was in Tsinghua from 2016 to 2019, now at Georgia Tech. Happy to chat more about those Minhash related probabilistic data structures (WeChat is Jianshu_Zhao).

Jianshu

jianshu93 commented 1 year ago

See another even interesting new Minhash called C-Minhash, just published in ICML: https://proceedings.mlr.press/v162/li22m/li22m.pdf

only one permutation needed but circulant. Could be even faster right.

Thanks,

Jianshu

zhaoxiaofei commented 1 year ago

@jianshu93

In general, minhash (with or without densification) requires the entire genome to be sketched at once. In theory, it it possible to combine two sketches from the same genome, but in practice it introduces inconsistency: ideally, combining two sketches from two datasets should be equivalent to generating one single sketch using the two datasets, but a lot of information is already lost after minhashing. If you would like to recover such information, then you are probably better off just sketching the two datasets anyway.

C-Minhash provides an interesting idea. However, in genome analysis, the time complexity of sketching is O(n) whereas the time complexity of pairwise distance comparison is O(n^2) where n is the number of sketches or organisms (assuming each organism has one sketch and the genome sizes of organisms are not too different). Therefore, the overall saving in runtime is negligible even if C-Minhash is used because sketching is not at the runtime bottleneck.

I already sent you an invitation in wechat.

Best, Xiaofei

jianshu93 commented 1 year ago

Hello Xiaofei,

My understanding of Minhash-like ones, they are mergeable, meaning hash part of the genome and then merge. However, the densification step, is not mergeable, meaning we can run one permutation MinHash for part of the genomes when reading, and merge each part of the genome to have a merged one for the entire genome, and then use densification on the merged sketches, this should be fine and can reduce memory when reading many genome at a time into memory because we reading while hashing and when hashing is done, sequences can be released from memory. In a streaming fashion that is to say. Hashing in Bindash is very fast due to rolling hash, so not a big problem. The case I am talking about is when all those sketches, we need them right after getting them to do other things, so we keep them in memory instead of writing to disk. We need a lot of memory then, sketches + genomes that are being hashed (assume many genomes are hashed at the same time).

Thanks,

Jianshu

zhaoxiaofei commented 1 year ago

Hi @jianshu93 ,

Yes, your understanding is correct. You need a lot of memory to keep the raw hash values before the densification step. If you are interested in min-hash related research, I would like to suggest you to think about how to speed up the pairwise genome comparison so that the expected running time becomes sub-quadratic. For example, you can first partition the genomes into diferent clusters (using a min-hash index) and then perform pairwise comparison within each cluster.