wwood / galah

More scalable dereplication for metagenome assembled genomes
GNU General Public License v3.0
48 stars 11 forks source link

Question about clustering mechanism #5

Closed MrOlm closed 4 years ago

MrOlm commented 4 years ago

Hi Ben,

I just stumbled on this today while looking at coverM and it looks great. dRep can't really handle clustering huge numbers of genomes (~30,000 is the reasonable limit in my experience) because it does not employ greedy clustering, as you mention, making solutions like this a necessity as genome sets increase in size. Also fun to see dRep mentioned by name in the README- much appreciated.

My question is related to how the clustering is actually done in Galah. You mention in the README that:

Generated cluster representatives have 2 properties. If the ANI threshold was set to 99%, then:

1) Each representative is <99% ANI to each other representative.

2) All members are >=99% ANI to the representative.

but in practice this isn't possible to satisfy in many genomes sets. In my experience it's really common to hit the scenario where, for example, genome A is >99% ANI to genome B, genome B is >99% to genome C, but genome A is <99% to genome C. The way that dRep handles this is using the scipy.cluster.hierarchy.linkage package (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html), which let's the user specify how it should be handled (single, complete, average, etc.), and with the default setting of average, some averaging is done to handle these scenarios as best as possible. What is the algorithm that Galah uses for this secondary clustering?

Thanks in advance, Matt

wwood commented 4 years ago

Hi Matt,

Thanks - any advice you have is appreciated. Galah isn't quite as fast as it could be yet, though.

I would argue that those properties can always be satisfied. in your example, the process for the secondary clustering would be, assuming the input order of genomes was A, B then C:

  1. A is assigned as a rep
  2. A compared to B, >99%, so B is not a rep.
  3. C compared with A, >99%, so C is not a rep.
  4. All reps now determined, so then each non-rep is assigned to its closest rep. For both B and C, this is A, so both B and C belong to the A|BC cluster.

That doesn't contradict either of the 2 properties. There is no property that no 2 non-rep members must be >99% for instance. In fact B is never even compared to C in the secondary clustering.

Hope that makes sense. ben

wwood commented 4 years ago

Closing since I hope I've answered your question - happy to reconsider. ben

MrOlm commented 4 years ago

Yes you absolutely answered my question and I'm sorry about the radio-silence. This does all make sense, thank you for explanation, good luck with the development, and feel free to reach out if there is anything I could help with for some reason!