wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0

Best measurement for conducting differential abundance analysis #177

Closed Somebodyatthdoor closed 1 year ago

Somebodyatthdoor commented 1 year ago

Hi,

I've been trying to work out which CoverM output would be best for conducting differential abundance analysis of genomes between groups, but I have been struggling to get my head around it. Most differential abundance methods (e.g. ANCOM-BC, DESeq2) require input that has not been corrected for sequencing depth; they require raw counts. So from my understanding, measures such as % relative abundance, RPKM and TPM would not be appropriate. Could you advise on what might be the best method within CoverM? Perhaps something that corrects for genome size but not for sequencing depth?

Cheers, Laura

itsmisterbrown commented 1 year ago

hi Laura,

I'm happy to take a stab at this. My approach is to generate two sample matrices: one in TPM, which is the optimal method for comparing relative abundance across samples, and a second with the raw read counts (e.g. from coverm contig -m count). I use the count matrix to determine the raw read counts for each sample, then I calculate the geometric mean of read counts for each sample type (probably not relevant if you're only considering, say, stool; then you could just take the geomean of the full dataset). With this number in hand, I convert the TPM matrix to relative abundances (since each sample in the TPM matrix sums to the same value), and multiply the relative abundance of each taxon in each sample by that geometric mean read count. This gives you a normalized read count for each taxon in each sample. Happy to share some code if that was a little too hand-wavy?
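
A minimal base-R sketch of the geometric-mean step described above, assuming a count matrix exported from coverm contig -m count with contigs as rows and samples as columns (the file name and object names are placeholders):

# read the raw count matrix ("coverm_counts.tsv" is a placeholder file name)
counts <- read.table("coverm_counts.tsv", header = TRUE, row.names = 1, sep = "\t")

# total raw reads per sample
sample_totals <- colSums(counts)

# geometric mean of the per-sample totals (exp of the mean of logs)
geomean <- exp(mean(log(sample_totals)))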

Somebodyatthdoor commented 1 year ago

Hi @itsmisterbrown

Thanks for the info; that sounds like a really good approach. Would you mind sharing your code, so that I can make sure I am understanding fully?

Cheers, Laura

itsmisterbrown commented 1 year ago

Shoot, sorry for the delay in responding, Laura. I use phyloseq for managing my count matrices and metadata, so the following works within that framework, but it should also illustrate the gist of it.

# convert the TPM matrix to relative abundances (each sample then sums to 1)
phyloseq.relative <- phyloseq::transform_sample_counts(phyloseq.tpm, fun = function(x) x / sum(x))

# standardize counts: scale each sample to a common total and round
phyloseq.normCounts <- phyloseq::transform_sample_counts(phyloseq.relative, fun = function(x) round(x * 7855226))

where 7855226 represents the geometric mean read count for the sample type I was transforming. In practice, you could use whatever number you like here; I just like to maintain the approximate shifts in variance associated with different read depths between distinct sample types in my dataset. If you just had, say, stool, then it's less of a concern.
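
For anyone not using phyloseq, a minimal base-R sketch of the same two steps, assuming tpm is a plain numeric matrix with samples as columns and geomean is the chosen count total (both hypothetical names, e.g. from the sketch earlier in the thread):

# convert each sample (column) of the TPM matrix to relative abundances
rel_abund <- sweep(tpm, 2, colSums(tpm), "/")

# scale to the chosen total and round to integer pseudo-counts
norm_counts <- round(rel_abund * geomean)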

Somebodyatthdoor commented 1 year ago

Hi @itsmisterbrown,

Thanks, I'll try that out. Though looking further through the CoverM documentation, I wonder if there may be a simpler way. Essentially, for tools like DESeq2 and ANCOM you need the equivalent of raw counts, but you also need to account for genome size to avoid bias. We couldn't use "count" for this, as it doesn't control for genome size. But perhaps reads_per_base would work? As far as I can tell it does not control for sequencing library size, but it does control for genome size, so it sounds like exactly what's needed for ANCOM/DESeq2 analysis.
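
As a toy illustration of that distinction (all numbers made up): two genomes with the same raw count but different sizes get different reads_per_base values, while total library depth never enters the calculation.

# hypothetical example: equal raw counts over genomes of different sizes
counts  <- c(genomeA = 500, genomeB = 500)
lengths <- c(genomeA = 2e6, genomeB = 4e6)  # genome sizes in bp

# dividing by genome length halves the value for the genome twice as large,
# but nothing here rescales for sequencing library size
reads_per_base <- counts / lengths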

What do you think?

Cheers, Laura

itsmisterbrown commented 1 year ago

I had also played with the idea of the reads_per_base measure but ended up deciding against it, because for many taxa I had fractional units (e.g. 5 reads mapped over a 1000 bp contig = 0.005), which would have required a sequencing-depth normalization to convert to an integer, and it just ended up seeming like more trouble than it was worth.

The reason I chose the approach I outlined above is that the TPM calculation generates sequencing-depth- and contig-length-normalized abundance estimates that sum to a constant across samples. So relative abundance conversion is simple, and then it just becomes a choice of how to normalize to counts. I don't think there's a black-and-white right/wrong answer here.

My stab at it was taking the geometric mean read count for my various sample types, which each had different mean raw depths, and using that value as the normalized count total. In practice, I suspect this has a pretty minimal effect, but I wanted to maintain the variance shifts associated with lower raw sequencing depths that the TPM calcs were based on. My hunch is that you could use any number, say 1000000, normalize your relative abundances to that total, and your counts would turn out just fine.

Somebodyatthdoor commented 1 year ago

Ok, I think I see where you are coming from now. Though from my understanding, if you use the same number across the whole dataset, or across a group of samples, you would end up with counts that are not reflective of each sample's raw counts. Tools like ANCOM and DESeq2 would then assume that all samples within a group had been sequenced to the same depth, and so wouldn't correct for variation in depth between samples. I'm not sure what the advantage of using the geometric mean is over scaling each sample individually by its own total raw counts.
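
A minimal sketch of that per-sample alternative, reusing the hypothetical rel_abund and sample_totals objects from the earlier sketches:

# scale each sample's relative abundances by that sample's own raw read total,
# so the pseudo-counts reflect each sample's actual sequencing depth
norm_counts_per_sample <- round(sweep(rel_abund, 2, sample_totals, "*"))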

Somebodyatthdoor commented 1 year ago

I'm going to close this now, but just as info for anyone who might stumble upon it: I tried both techniques discussed above (normalizing to the geometric mean or to individual sample totals) as input for ANCOM-BC2, and they both gave very similar answers. So, at least for my dataset (all gastrointestinal tract samples), it made very little difference.

rzhan186 commented 10 months ago

Hello @Somebodyatthdoor and @itsmisterbrown,

Thank you for sharing your valuable insights! I'm currently calculating the coverage of a set of metagenome-assembled genomes (MAGs) across various metagenomic samples for differential abundance analysis. I have a question regarding the optimal way to determine read counts using contig mode.

Given that my MAGs were produced through a co-assembly approach, I believe using individual metagenomic assemblies as the reference might not be the most suitable method. Therefore, I'm considering the strategy of concatenating all MAGs to create a unified reference. This approach seems more aligned with the nature of my data and the co-assembly method used. Do you think this rationale is sound for accurate read count calculation per sample?
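
For what it's worth, CoverM's genome mode can take a set of MAG FASTA files directly, which avoids hand-concatenating a reference yourself; a sketch of that kind of invocation, with placeholder paths and an assumed .fna extension:

# map one sample's reads against all MAGs at once and report raw counts per genome
coverm genome \
    --coupled sample_R1.fastq.gz sample_R2.fastq.gz \
    --genome-fasta-directory MAGs/ \
    --genome-fasta-extension fna \
    --methods count \
    --output-file sample_counts.tsv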

I appreciate your guidance on this!

Best regards.