wwood / singlem

Novelty-inclusive microbial community profiling of shotgun metagenomes
http://wwood.github.io/singlem/
GNU General Public License v3.0

Taxonomic assignment discrepancy #81

Open ashwinssudarshan opened 3 years ago

ashwinssudarshan commented 3 years ago

Hello! I was using this package to generate OTU assignments and to understand the communities in a set of metagenomic raw reads that I have. While converting the table to a wide format suitable for phyloseq objects, I noticed that sequences within the same OTU had different levels of taxonomic assignment (for example, some resolved only to order level and others to genus level).

This conversion to a wide format was done with a custom R script, and it caused some problems, especially while building the taxonomy table, since there were varying taxonomic assignments for the same OTU. This issue didn't occur when I used the function in the package to get the wide-format OTU table.

My questions are:

1) Why is this discrepancy occurring in the first place?

2) As for resolving this through the functions in the package, does the code retain the taxonomic classification only down to the point where all the assignments agree? Is that how the package resolves this issue?

Thanks in advance!

wwood commented 3 years ago

Hello.

I think you have guessed right: the taxonomic assignment of each OTU is a summary of the taxonomic assignments of all reads that go into that OTU. So if an OTU has 2 sequences, one from the phylum Actinobacteria and the other from Acidobacteria, then the OTU is assigned to Bacteria only. But if you then ran another sample in which the OTU contained only 1 sequence, assigned to Actinobacteria, the OTU from that second sample would be assigned Bacteria; Actinobacteria.
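To make the example above concrete, here is a minimal Python sketch of that kind of summary. It is not SingleM's actual code: the function names `split_taxonomy` and `consensus_taxonomy` are hypothetical, and the rule used (keep ranks only while every read agrees) is an assumption that matches the two-sequence example, not a confirmed description of the package's algorithm.

```python
from typing import List


def split_taxonomy(taxonomy: str) -> List[str]:
    """Split a 'Bacteria; Actinobacteria' style string into a list of ranks."""
    return [rank.strip() for rank in taxonomy.split(";") if rank.strip()]


def consensus_taxonomy(taxonomies: List[str]) -> str:
    """Return the ranks shared by all lineages, from the root downwards.

    Assumption: the summary keeps a rank only if every read agrees on it,
    and stops at the first rank where any read disagrees or runs out.
    """
    if not taxonomies:
        return ""
    lineages = [split_taxonomy(t) for t in taxonomies]
    shared = []
    # zip stops at the shortest lineage, so a read classified only to a
    # shallow rank limits the depth of the consensus.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return "; ".join(shared)


# Sample 1: two reads in the OTU that disagree at phylum level.
sample_one = consensus_taxonomy(
    ["Bacteria; Actinobacteria", "Bacteria; Acidobacteria"]
)
# Sample 2: a single read in the same OTU.
sample_two = consensus_taxonomy(["Bacteria; Actinobacteria"])

print(sample_one)
print(sample_two)
```

Under this assumed rule, the same OTU sequence ends up with a shallower assignment in the first sample than in the second, which is the kind of per-sample difference a custom long-to-wide conversion has to reconcile when it builds a single taxonomy table.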

There is some pesky information loss when you summarise a pre-existing OTU table: the taxonomic classifications of the individual reads that went into each OTU are not known (they are not in the OTU table), so the summary may reach a different decision than it would if the full information were available.

In the bigger picture there are 2 things here:

HTH, ben