Closed apcamargo closed 4 years ago
Hmm, thanks. Are you able to point me to the genomes?
From: Antônio Pedro Camargo notifications@github.com Sent: Friday, July 17, 2020 12:42:33 PM To: wwood/galah galah@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [wwood/galah] Galah not clustering MAGs with ANI greater than the threshold (#7)
Hi Ben!
I wrote a script to cluster 522 dereplicated MAGs into 329 species clusters (inspired by https://www.nature.com/articles/s41587-020-0501-8). As my algorithm is quite similar to Galah's, I also tried to use it to see if the results would be similar:
galah cluster -t 64 --quality-formula Parks2020_reduced --checkm-tab-table dereplicated_mags_CheckM.txt -f Dereplicated_mags/* --prethreshold-ani 0 --ani 95 --min-aligned-fraction 60 --output-cluster-definition clusters.txt
Unexpectedly, Galah generated 418 clusters. I noticed that there are some MAGs that have more than 97% ANI and more than 90% aligned fraction (according to fastANI).
What might be going on here? I'm I using some parameter wrong?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/wwood/galah/issues/7, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAADX5B7DM3W3YLME3CAAK3R363BTANCNFSM4O5LV7CA.
I can't send them right now because it's not my data. In any case, I think I found the problem.
Galah executes fastANI in a pairwise manner, while my script executes it to do an all-versus-all comparison. Because fastANI is not "symmetric", I get results like this one:
MAG_A MAG_B 97.4762 228 629
MAG_B MAG_A 98.351 232 255
My script is taking the average ANI and the largest aligned fraction, so this pair would be clustered in the same species. I presume that Galah is getting a result that might be similar to the second line.
What do you think? If you want to, I can try to find some more examples with my own data so I can share with you.
Hmm, I'm not sure. If that were true then clustering these 2 test genomes wouldn't work (500kb is a subset of the 1mbp):
galah/tests/data/set1$ RUST_LOG=ERROR cargo run -q -- cluster -f 500kb.fna 1mbp.fna --min-aligned-fraction 0.7 -o /dev/stdout
500kb.fna 500kb.fna
500kb.fna 1mbp.fna
galah/tests/data/set1$ RUST_LOG=ERROR cargo run -q -- cluster -f 1mbp.fna 500kb.fna --min-aligned-fraction 0.7 -o /dev/stdout
1mbp.fna 1mbp.fna
1mbp.fna 500kb.fna
(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 500kb.fna -r 1mbp.fna -o /dev/stdout 2>/dev/null
500kb.fna 1mbp.fna 100 166 166
(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 1mbp.fna -r 500kb.fna -o /dev/stdout 2>/dev/null
1mbp.fna 500kb.fna 100 166 333
(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 500kb.fna -r 1mbp.fna -o /dev/stdout 2>/dev/null --minFraction 0.7
500kb.fna 1mbp.fna 100 166 166
(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 1mbp.fna -r 500kb.fna -o /dev/stdout 2>/dev/null --minFraction 0.7
1mbp.fna 500kb.fna 100 166 333
Are you able reproduce the error with only the genomes from an offending pair of clusters for instance? You've my word I won't look at those genomes other than for this bug i.e. only calculate ANI, clustering, if that helps sharing.
Thanks, ben
Hi Ben!
I wrote a script to cluster 522 dereplicated MAGs into 329 species clusters (inspired by https://www.nature.com/articles/s41587-020-0501-8). As my algorithm is quite similar to Galah's, I also tried to use it to see if the results would be similar:
Unexpectedly, Galah generated 418 clusters. I noticed that there are some MAGs that have more than 97% ANI and more than 90% aligned fraction (according to fastANI) that were not clustered.
What might be going on here? I'm I using some parameter wrong?