wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
273 stars 30 forks source link

A question about rpkm #179

Closed LRH1995 closed 10 months ago

LRH1995 commented 11 months ago

Hi,

I am working on the abundance calculation of MAGs with genome mode and rpkm method. I had some problems. Here are the specific issues.

Here are the commands I used: coverm genome -1 read1.fastq.gz -2 read2.fastq.gz --genome-fasta-files genome1.fna --mapper bwa-mem --min-read-percent-identity 0.99 --min-read-aligned-percent 0.80 --proper-pairs-only --methods rpkm -t 20 --min-covered-fraction 0.1 -o output.tsv

Given the basic data: total_reads=156483806 coverM_mapped_reads=248374 genome_size=940307bp my_ideal_result=248374*(10^9)/156483806/940307=1.688 rpkm

Here is the result I received: Genome read1.fastq.gz RPKM genome1 1063.4824

It seems that coverM works like it: result=(10^9)/940307=1063.4824

Do I have a wrong usage of coverM or a wrong understanding of the calculation formula of RPKM? Please give me help.

Thanks, Ronghua Liu

ilnamkang commented 11 months ago

I hope you find a Q&A on the link below helpful. https://github.com/wwood/CoverM/issues/157

"my_ideal_result = 248374*(10^9)/156483806/940307 = 1.688" -> This is an incorrect RPKM calculation.

You're supposed to use 248374 (not 156483806) as a denominator in this case. RPKM = 248374*(10^9)/248374/940307=1063.4824 (Note that 248374 appears twice because you used only one genome.)

As far as I know, RPKM is a calculation method devised in the RNA-seq field, where non-mapped reads have no or little meaning. IMHO, RPKM is not useful to estimate the relative abundance of genomes.

LRH1995 commented 11 months ago

Hi, Thanks for your answer and clear explanation. As you mentioned, RPKM is not useful to estimate the relative abundance of genomes. If metagenomic datasets with different sequencing throughputs are used for the abundance estimation of some dereplicated MAGs, should I choose the "relative_abundance" method? Whether the abundance of different metagenomic datasets can be compared? Please help me.

ilnamkang commented 11 months ago

I think that "relative_abundance" method is suitable for your purpose.

In case you want to compare multiple genomes, you may have to normalize the "relative_abundance" values by genome sizes, because the "relative_abundance" method does not consider genome sizes.

I'm not sure whether the "relative_abundance" method would be also suitable when your metagenomes have different lengths. Reads from 2*150 bp metagenomes might have better chance of being aligned to your genomes than those from 2*100 bp metagenomes.

LRH1995 commented 11 months ago

Thank you for your advice, it is very useful to me!

wwood commented 10 months ago

Thanks for the interest @LRH1995 and helpful responses @ilnamkang , much appreciated.

One point of clarification, relative abundance does implicitly account for genome size because it is derived from the average coverage of each base.

ilnamkang commented 10 months ago

Thank you for the clarification @wwood. I think I had a misconception about the "relative_abundnace" method.

@LRH1995, although I don't know exactly what you want to do, I think that the "relative_abundnace" method might not serve for your purpose.

LRH1995 commented 10 months ago

Thanks for your clarification @wwood and advice @ilnamkang. Sorry for not describing the problem clearly. I retrieved six novel MAGs from recent work. I plan to estimate and compare their relative abundance in the global ocean using metagenomes collected from public databases.

ilnamkang commented 10 months ago

@LRH1995, I did some analyses to get more clear understanding of the "relative_abundance" method, just for me. If interested, you can find my analyses at https://github.com/ilnamkang/CoverM_test.

I'm not sure, but it is likely that "relative_abundance" value of a specific genome can be affected by the inclusion of other genomes. Further, the sizes and abundances of other genomes may also have effects.

I think you may consider using "count" or "reads_per_base" for your analyses, which may be followed by normalization.

LRH1995 commented 10 months ago

Thank you for your clear test and advice.

wwood commented 10 months ago

Hi,

You are right, the abundance is affected by inclusion of the other genomes. Right now in order to calculate relative abundance, it assumes that the genome sizes of the genomes in the set with coverage are the same size (on average) as the (hypothetical) set of genomes where the unmapped reads come from. This assumption stems from the way that genomes with non-zero coverage are split up amongst the % of reads mapped, not 100%.

So if you add another genome, it won't have exactly the average size, so the numbers change.

HTH

ilnamkang commented 10 months ago

Hi,

Thanks for your explanation.