wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
297 stars 30 forks source link

Increasing number of genomes increases relative abundance of all genomes? #208

Closed lxsteiner closed 6 months ago

lxsteiner commented 6 months ago

Thank you for the useful tool!

However I'm running into some odd behaviour when mapping to different numbers of genomes. I mapped once to bacterial genomes, and then again to bacterial genomes + eukaryotic host, and I expected a larger amount of mapped reads in the second run mapping predominantly to the eukaryotic host genome, but what happens is that it increases the relative abundance of all previously mapped bacterial genomes as well, quite drastically in one case.

Both runs were done with default parameters and --min-covered-fraction 0:

unmapped    88.3479 unmapped    14.8476
Genome A    0.0005  Genome A    0.0019
Genome B    0.0000  Genome B    0.0003
Genome C    0.0011  Genome C    0.0007
Genome D    0.0160  Genome D    0.0020
Genome E    0.0001  Genome E    0.0006
Genome F    0.0001  Genome F    0.0005
Genome G    0.2421  Genome G    0.0022
Genome H    0.0007  Genome H    0.0048
Genome I    0.0014  Genome I    0.0042
Genome J    0.0003  Genome J    0.0018
Genome K    0.0028  Genome K    0.0147
Genome L    0.0005  Genome L    0.0019
Genome M    0.0002  Genome M    0.0011
Genome N    0.0006  Genome N    0.0043
Genome O    0.0001  Genome O    0.0006
Genome P    0.0002  Genome P    0.0010
Genome Q    0.0012  Genome Q    0.0078
Genome R    0.0002  Genome R    0.0026
Genome S    0.0003  Genome S    0.0022
Genome T    0.0001  Genome T    0.0008
Genome U    0.0001  Genome U    0.0004
Genome V    0.0000  Genome V    0.0003
Genome W    0.0001  Genome W    0.0006
Genome X    0.0022  Genome X    0.0005
Genome Y    0.0005  Genome Y    0.0031
Genome Z    0.0001  Genome Z    0.0010
Genome AA   11.3763 Genome AA   76.2294
            Eukaryotic host 8.8540

Why does including the eukaryotic host genome increase the relative abundance of (most) bacterial genomes? Especially why would the relative abundance of Genome AA jump from 11 to 76? That makes no sense to me. Are the numbers maybe misreported? I would have expected most of the reads to map to the eukaryotic host genome and maybe some reads that previously mapped to bacterial genomes would be recruited to the eukaryotic genome as well (due to contamination). I suppose that in the 2nd run the 76% reported for Genome AA should be reported to the Eukaryotic host and vice versa,

Any clue what might be happening? Or am I misunderstanding something? Thanks.

wwood commented 6 months ago

Hi,

I imagine what is happening here is a consequence of how CoverM predicts relative abundance in the presence of unmapped reads. The rest of the relative abundances are based on the mean values, which is roughly equivalent to the number of mapped reads divided by the genome size. When there are unmapped reads, CoverM assumes that there is 1 "mystery" genome with the same genome size as the average of the known genome sizes.

When most of the reads are from a Eukaryote, this assumption is very wrong - that is what you are seeing I think.

You might be interested in SingleM microbial_fraction.

lxsteiner commented 6 months ago

Thank you for the swift response.

If I understand correctly, I don't think the unmapped reads are the problem here.

I know from mapping with Bowtie to Genome AA directly, the overall mapping rate for that genome is ~11.88% which fits with the rel.abund that CoverM reports in the 1st run (11.37%). But in the 2nd run, of the same metagenome sample, the rel.abund increases from 11%->76%. This would mean that more reads mapped to that genome based on your explanation: "The rest of the relative abundances are based on the mean values, which is roughly equivalent to the number of mapped reads divided by the genome size.", which can't be. I didn't alter the mapping settings.

The only difference between these two runs is that I included the eukaryotic host genome in the 2nd run with all those bacterial genomes (Genome A, B, C ... AA). Most of the reads in the metagenome sample belong to the eukaryotic host which is why the number of unmapped reads in the 1st run is high (88%). With the inclusion of the host genome, the number of unmapped reads should decrease, which it does (88%->14%), but instead of mapping the majority of the reads to the host genome, they are assigned to bacterial Genome AA.

You explained: "When there are unmapped reads, CoverM assumes that there is 1 "mystery" genome with the same genome size as the average of the known genome sizes." but the size of the mystery genome is always constant. And the sizes of all other genomes are known. I don't understand why the number of unmapped reads would influence the other rel.abundances, especially if those unmapped reads should be recruited somewhere else in the 2nd run. The problem being, they are recruited and reported for the wrong target.

Sorry if I'm misunderstanding something. I'll check the tool you suggested out. Thanks.

Edit: TL;DR this

CoverM run1     CoverM run2
---         ---
unmapped    88% unmapped    14%
Bacterium   11% Bacterium   76%
Misc.       1%  Misc.       1%
            Animal      9%

of the very same dataset.

wwood commented 6 months ago

Hi,

Relative abundance is found by taking the ratios of the mean coverages of the known genomes, dividing them amongst the % of the reads that are mapped.

In the first experiment only 12% are reads are mapped, so the relative abundance of genomeAA (which has by far the largest mean coverage) is 11%.

In the second experiment, 86% of reads are mapped. The mean coverage of genomeAA is ~10 times that of the eukaryote, so of the 86% of reads that are mapped, most of the relative abundance is given to genomeAA, not the euk.

Maybe I wasn't explaining well. In this case it seems that the euk has a higher number of reads mapped to it, but genomeAA has a higher mean coverage (by virtue of having a smaller genome). You can verify by using the -m mean count.

Understand it is a bit counter-intuitive, but I'm not seeing evidence for a code bug here. It might make sense for you to report the relative abundance within the microbial community, instead of the total relative abundance e.g. by removing reads mapping to the euk first before CoverM, or by normalising the microbial relative abundances by what SingleM microbial_fraction reports.

lxsteiner commented 6 months ago

Thank you for the further explanation!

Ah, I see now. I was fundamentally misunderstanding how relative abundance is reported despite the explanations of the calculations in the wiki. The numbers were unfortunately similar to expected read mapping rates (7-15% for Genome AA, and 60-70% for the Euk genome), which made me assume something was off since the reported numbers seemed switched. Apologies.

I will follow your suggestions. Thank you.