wwood / galah

More scalable dereplication for metagenome assembled genomes
GNU General Public License v3.0
48 stars 11 forks source link

Galah not clustering MAGs with ANI greater than the threshold #7

Closed apcamargo closed 4 years ago

apcamargo commented 4 years ago

Hi Ben!

I wrote a script to cluster 522 dereplicated MAGs into 329 species clusters (inspired by https://www.nature.com/articles/s41587-020-0501-8). As my algorithm is quite similar to Galah's, I also tried to use it to see if the results would be similar:

galah cluster -t 64 --quality-formula Parks2020_reduced --checkm-tab-table dereplicated_mags_CheckM.txt -f Dereplicated_mags/* --prethreshold-ani 0 --ani 95 --min-aligned-fraction 60 --output-cluster-definition clusters.txt

Unexpectedly, Galah generated 418 clusters. I noticed that there are some MAGs that have more than 97% ANI and more than 90% aligned fraction (according to fastANI) that were not clustered.

What might be going on here? I'm I using some parameter wrong?

wwood commented 4 years ago

Hmm, thanks. Are you able to point me to the genomes?


From: Antônio Pedro Camargo notifications@github.com Sent: Friday, July 17, 2020 12:42:33 PM To: wwood/galah galah@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [wwood/galah] Galah not clustering MAGs with ANI greater than the threshold (#7)

Hi Ben!

I wrote a script to cluster 522 dereplicated MAGs into 329 species clusters (inspired by https://www.nature.com/articles/s41587-020-0501-8). As my algorithm is quite similar to Galah's, I also tried to use it to see if the results would be similar:

galah cluster -t 64 --quality-formula Parks2020_reduced --checkm-tab-table dereplicated_mags_CheckM.txt -f Dereplicated_mags/* --prethreshold-ani 0 --ani 95 --min-aligned-fraction 60 --output-cluster-definition clusters.txt

Unexpectedly, Galah generated 418 clusters. I noticed that there are some MAGs that have more than 97% ANI and more than 90% aligned fraction (according to fastANI).

What might be going on here? I'm I using some parameter wrong?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/wwood/galah/issues/7, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAADX5B7DM3W3YLME3CAAK3R363BTANCNFSM4O5LV7CA.

apcamargo commented 4 years ago

I can't send them right now because it's not my data. In any case, I think I found the problem.

Galah executes fastANI in a pairwise manner, while my script executes it to do an all-versus-all comparison. Because fastANI is not "symmetric", I get results like this one:

MAG_A     MAG_B     97.4762    228     629
MAG_B     MAG_A     98.351      232     255

My script is taking the average ANI and the largest aligned fraction, so this pair would be clustered in the same species. I presume that Galah is getting a result that might be similar to the second line.

What do you think? If you want to, I can try to find some more examples with my own data so I can share with you.

wwood commented 4 years ago

Hmm, I'm not sure. If that were true then clustering these 2 test genomes wouldn't work (500kb is a subset of the 1mbp):

galah/tests/data/set1$ RUST_LOG=ERROR cargo run -q -- cluster -f 500kb.fna 1mbp.fna --min-aligned-fraction 0.7 -o /dev/stdout
500kb.fna   500kb.fna
500kb.fna   1mbp.fna
galah/tests/data/set1$ RUST_LOG=ERROR cargo run -q -- cluster -f 1mbp.fna 500kb.fna  --min-aligned-fraction 0.7 -o /dev/stdout
1mbp.fna    1mbp.fna
1mbp.fna    500kb.fna

(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 500kb.fna -r 1mbp.fna -o /dev/stdout 2>/dev/null
500kb.fna   1mbp.fna    100 166 166
(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 1mbp.fna -r 500kb.fna -o /dev/stdout 2>/dev/null
1mbp.fna    500kb.fna   100 166 333
(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 500kb.fna -r 1mbp.fna -o /dev/stdout 2>/dev/null --minFraction 0.7
500kb.fna   1mbp.fna    100 166 166
(galah-dev) ben@u2:~/git/galah/tests/data/set1$ fastANI -q 1mbp.fna -r 500kb.fna -o /dev/stdout 2>/dev/null --minFraction 0.7
1mbp.fna    500kb.fna   100 166 333

Are you able reproduce the error with only the genomes from an offending pair of clusters for instance? You've my word I won't look at those genomes other than for this bug i.e. only calculate ANI, clustering, if that helps sharing.

Thanks, ben