Closed MrOlm closed 4 years ago
Hi Matt,
Thanks - any advice you have is appreciated. Galah isn't quite as fast as it could be yet, though.
I would argue that those properties can always be satisfied. in your example, the process for the secondary clustering would be, assuming the input order of genomes was A, B then C:
That doesn't contradict either of the 2 properties. There is no property that no 2 non-rep members must be >99% for instance. In fact B is never even compared to C in the secondary clustering.
Hope that makes sense. ben
Closing since I hope I've answered your question - happy to reconsider. ben
Yes you absolutely answered my question and I'm sorry about the radio-silence. This does all make sense, thank you for explanation, good luck with the development, and feel free to reach out if there is anything I could help with for some reason!
Hi Ben,
I just stumbled on this today while looking at coverM and it looks great. dRep can't really handle clustering huge numbers of genomes (~30,000 is the reasonable limit in my experience) because it does not employ greedy clustering, as you mention, making solutions like this a necessity as genome sets increase in size. Also fun to see dRep mentioned by name in the README- much appreciated.
My question is related to how the clustering is actually done in Galah. You mention in the README that:
but in practice this isn't possible to satisfy in many genomes sets. In my experience it's really common to hit the scenario where, for example, genome A is >99% ANI to genome B, genome B is >99% to genome C, but genome A is <99% to genome C. The way that dRep handles this is using the
scipy.cluster.hierarchy.linkage
package (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html), which let's the user specify how it should be handled (single
,complete
,average
, etc.), and with the default setting ofaverage
, some averaging is done to handle these scenarios as best as possible. What is the algorithm that Galah uses for this secondary clustering?Thanks in advance, Matt