Closed lxsteiner closed 6 months ago
Hi,
I imagine what is happening here is a consequence of how CoverM predicts relative abundance in the presence of unmapped reads. The rest of the relative abundances are based on the mean values, which is roughly equivalent to the number of mapped reads divided by the genome size. When there are unmapped reads, CoverM assumes that there is 1 "mystery" genome with the same genome size as the average of the known genome sizes.
When most of the reads are from a Eukaryote, this assumption is very wrong - that is what you are seeing I think.
You might be interested in SingleM microbial_fraction.
Thank you for the swift response.
If I understand correctly, I don't think the unmapped reads are the problem here.
I know from mapping with Bowtie to Genome AA directly, the overall mapping rate for that genome is ~11.88% which fits with the rel.abund that CoverM reports in the 1st run (11.37%). But in the 2nd run, of the same metagenome sample, the rel.abund increases from 11%->76%. This would mean that more reads mapped to that genome based on your explanation: "The rest of the relative abundances are based on the mean values, which is roughly equivalent to the number of mapped reads divided by the genome size.", which can't be. I didn't alter the mapping settings.
The only difference between these two runs is that I included the eukaryotic host genome in the 2nd run with all those bacterial genomes (Genome A, B, C ... AA). Most of the reads in the metagenome sample belong to the eukaryotic host which is why the number of unmapped reads in the 1st run is high (88%). With the inclusion of the host genome, the number of unmapped reads should decrease, which it does (88%->14%), but instead of mapping the majority of the reads to the host genome, they are assigned to bacterial Genome AA.
You explained: "When there are unmapped reads, CoverM assumes that there is 1 "mystery" genome with the same genome size as the average of the known genome sizes." but the size of the mystery genome is always constant. And the sizes of all other genomes are known. I don't understand why the number of unmapped reads would influence the other rel.abundances, especially if those unmapped reads should be recruited somewhere else in the 2nd run. The problem being, they are recruited and reported for the wrong target.
Sorry if I'm misunderstanding something. I'll check the tool you suggested out. Thanks.
Edit: TL;DR this
CoverM run1 CoverM run2
--- ---
unmapped 88% unmapped 14%
Bacterium 11% Bacterium 76%
Misc. 1% Misc. 1%
Animal 9%
of the very same dataset.
Hi,
Relative abundance is found by taking the ratios of the mean coverages of the known genomes, dividing them amongst the % of the reads that are mapped.
In the first experiment only 12% are reads are mapped, so the relative abundance of genomeAA (which has by far the largest mean coverage) is 11%.
In the second experiment, 86% of reads are mapped. The mean coverage of genomeAA is ~10 times that of the eukaryote, so of the 86% of reads that are mapped, most of the relative abundance is given to genomeAA, not the euk.
Maybe I wasn't explaining well. In this case it seems that the euk has a higher number of reads mapped to it, but genomeAA has a higher mean coverage (by virtue of having a smaller genome). You can verify by using the -m mean count
.
Understand it is a bit counter-intuitive, but I'm not seeing evidence for a code bug here. It might make sense for you to report the relative abundance within the microbial community, instead of the total relative abundance e.g. by removing reads mapping to the euk first before CoverM, or by normalising the microbial relative abundances by what SingleM microbial_fraction reports.
Thank you for the further explanation!
Ah, I see now. I was fundamentally misunderstanding how relative abundance is reported despite the explanations of the calculations in the wiki. The numbers were unfortunately similar to expected read mapping rates (7-15% for Genome AA, and 60-70% for the Euk genome), which made me assume something was off since the reported numbers seemed switched. Apologies.
I will follow your suggestions. Thank you.
Thank you for the useful tool!
However I'm running into some odd behaviour when mapping to different numbers of genomes. I mapped once to bacterial genomes, and then again to bacterial genomes + eukaryotic host, and I expected a larger amount of mapped reads in the second run mapping predominantly to the eukaryotic host genome, but what happens is that it increases the relative abundance of all previously mapped bacterial genomes as well, quite drastically in one case.
Both runs were done with default parameters and
--min-covered-fraction 0
:Why does including the eukaryotic host genome increase the relative abundance of (most) bacterial genomes? Especially why would the relative abundance of Genome AA jump from 11 to 76? That makes no sense to me. Are the numbers maybe misreported? I would have expected most of the reads to map to the eukaryotic host genome and maybe some reads that previously mapped to bacterial genomes would be recruited to the eukaryotic genome as well (due to contamination). I suppose that in the 2nd run the 76% reported for Genome AA should be reported to the Eukaryotic host and vice versa,
Any clue what might be happening? Or am I misunderstanding something? Thanks.