wwood / singlem

Novelty-inclusive microbial community profiling of shotgun metagenomes
http://wwood.github.io/singlem/
GNU General Public License v3.0
131 stars 16 forks

interpreting the output from coverM condense #88

Closed lauramason326 closed 2 years ago

lauramason326 commented 2 years ago

Hi Ben, Per our discussion on this post, I ran coverM in condense mode. However, I have a few clarifying questions:

  1. What does coverage mean here? Is it useful for understanding the number of OTUs with this taxonomy in the sample?
  2. How is the taxonomy assigned? What happens if different markers align with different portions of the same gene and the different portions correspond to different taxa?
  3. I am trying to import the data into lefse to obtain differentially enriched OTUs for my treatments. Is the output from condense what I want for this?

Thanks in advance, Laura

wwood commented 2 years ago

Hi Laura,

I take it you mean singlem condense, not coverm.

  1. What does coverage mean here? Is it useful for understanding the number of OTUs with this taxonomy in the sample?

No, not really. Coverage is an estimate of what the genome coverage would be if the whole genome were available and read mapping were done. It is based on the number of reads that have that taxonomy, and the length of those reads (like k-mer coverage).
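
To make that arithmetic concrete, here is a rough sketch of that kind of estimate; the function name and the exact formula are illustrative assumptions (the standard k-mer-coverage-style correction), not SingleM's internals:

```python
# Hypothetical sketch: converting the number of reads hitting a
# fixed-length marker window into an estimated whole-genome read
# coverage. The exact formula SingleM uses may differ.

def estimate_genome_coverage(num_hits, read_length, window_size=60):
    """A read fully contains the window only if it starts in one of
    (read_length - window_size + 1) positions, so the window hit count
    under-counts per-base coverage by that factor."""
    return num_hits * read_length / (read_length - window_size + 1)
```

Because the correction depends on read length, longer reads shrink the gap between window hits and per-base coverage.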

How is the taxonomy assigned? What happens if different markers align with different portions of the same gene and the different portions correspond to different taxa?

Currently, at least, only one portion of each gene is included in the packages, so I'm not sure how different portions could correspond to different taxa.

I am trying to import the data into lefse to obtain differentially enriched OTUs for my treatments. Is the output from condense what I want for this?

You could get differentially abundant taxa that way, not enriched OTUs.

lauramason326 commented 2 years ago

Hi Ben, Thanks for the info. I think your response may have been cut off though - "No, it isn't helpful except i" is where the post stops - at least on my end. Can you elaborate please? Thank you! Laura

wwood commented 2 years ago

Sorry, that last bit was leftover from an earlier draft - I forgot to delete it - it's edited now.

Is some part still unclear?

lauramason326 commented 2 years ago

Yes - thank you. Laura

Virginia-Rich commented 2 years ago

Hi @wwood this is Gin. I am working with two non-EMERGE scientists: PhD student Laura Mason @lauramason326, with whom you corresponded above (thank you so much; Laura works with co-supervisor Prof. Richard Dick on agricultural microbiology at OSU), and now-postdoc Dawson Fairbanks, who is co-supervised by Prof. Rachel Gallery at the University of Arizona. As we work through their SingleM analyses of their metaGs, I had a couple of Qs which I can't find in the readme (apologies if I'm missing them!):

  1. Is SingleM normalizing hits to gene length (or is the query always 60 bp and so inherently comparable among genes)?
  2. And hit # does not get normalized to metaG depth of sequencing, right?
  3. Does SingleM use phylogeny to inform taxonomy assignment or just sequence matching against the refDB?
  4. Re combining data from different marker genes, e.g. for Lefse... would you run Lefse on the results for each gene separately and then combine the results (removing duplicates), or make a combined file from all the marker genes, collapse to identical taxonomy, and then average or sum across multiple genes with the same taxonomy (prolly averaging)?
  5. I was psyched about the condense function you and colleagues describe, which I think is basically (?) doing just the combining I mention above, but the variability in taxonomic resolution among markers is an issue, isn't it (as your team flags)? I was thinking Laura & Dawson could run Lefse on the output from condense, as WELL as running it on the SCMGs separately, since the improved limit of detection and resolution provided by using coverage inferred from all the SCMGs - for the small subset of lineages with consistent taxonomy across all SCMGs - would improve the strength of the discrimination. So, you'd want to harness that resolving power where it's available. Would you agree?
  6. In Dawson's output from condense, there are samples with very few lineages showing up (e.g. one sample has just one condense output "lineage", with only a root-Archaea assignment!!). That doesn't seem possible. But maybe that just reflects a community poorly represented by the reference taxonomy, and so "bounciness" among taxonomic assignments for SCMGs from the same lineage? (In her case, it's montane forest soil, fwiw.)

Thanks and so sorry for the slew of Qs!! g

wwood commented 2 years ago

Hi @Virginia-Rich yes the documentation could be better. This is a consequence of it being under active development.

  1. Since the community is modeled as each cell having exactly one of each marker gene (treating archaea and bacteria separately), the number of reads hitting each OTU estimates the fraction of the community that is that OTU. So yes. The window is almost always 60 bp, though sometimes there is a codon insertion or deletion. The coverage column is calculated from the num_hits column, the size of the window (usually 60 bp), and the length of the reads. So, long story short, yes.
  2. Right. Their sum is not 100%.
  3. In the version you are using, it does a DIAMOND blastx of each read in the OTU against GTDB representatives, taking the best hit. The taxonomy of the OTU is then the "median" taxonomy of these hits, i.e. the most resolved taxonomy such that 50% of reads agree. I just implemented a version that matches the nucleotide window sequences to GTDB first, and then falls back to blastx when there is no close hit. So, changes are coming there.
  4. I would probably just run Lefse on the condensed output, but both sound fine. If you don't have much sequencing depth (or there's plant DNA etc.) then individual gene OTU tables might not be precise enough. The condensed version is more sensitive to lower-abundance taxa because it combines results across many genes.
  5. It isn't quite the same, because condense takes a trimmed mean to get rid of noise, and also estimates the abundance of higher taxonomic levels before lower ones, i.e. it works out archaeal and bacterial abundances first, then phyla, and so on. The trimmed mean is calculated at each level, so if e.g. there are only two bacterial phyla and both fall below the trim threshold, but together they are above it, then the final profile will include Bacteria but no phyla.
  6. That doesn't sound like bounciness, because the trim is calculated for archaea/bacteria as a whole first. Can you send me the input and command used, please? Since this is new code it might be good old software bugs.
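
The "median" taxonomy in point 3 can be sketched roughly like this - an illustration of the "most resolved taxonomy such that 50% of reads agree" rule, with made-up function and variable names; SingleM's actual implementation may differ:

```python
# Illustrative sketch of a "median" taxonomy: walk down the ranks,
# extending the consensus while at least half of all reads' best-hit
# taxonomies agree with the prefix chosen so far.

def median_taxonomy(taxonomies):
    """taxonomies: one list of rank names per read, e.g.
    ["d__Bacteria", "p__Proteobacteria", ...]."""
    consensus = []
    for rank in range(max(len(t) for t in taxonomies)):
        candidates = [t[rank] for t in taxonomies if len(t) > rank]
        if not candidates:
            break
        # most common name at this rank among reads resolved that deep
        best = max(set(candidates), key=candidates.count)
        agree = sum(1 for t in taxonomies
                    if len(t) > rank
                    and t[:rank] == consensus
                    and t[rank] == best)
        if agree * 2 < len(taxonomies):  # <50% agreement: stop here
            break
        consensus.append(best)
    return consensus
```

For example, if three of four reads hit Bacteria and two of those agree down to the same phylum, the consensus stops at that phylum.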

Sorry for the instability here, but hope that helps.
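
As a small illustration of the trimmed-mean step in point 5 (the trim fraction and function name here are assumptions for illustration; the real condense code may differ):

```python
# Illustrative trimmed mean across per-marker-gene coverage estimates
# for one taxon: extreme values (e.g. a gene with spuriously high or
# low hits) are discarded from both ends before averaging.
# The 10% trim fraction is an assumed value.

def trimmed_mean(values, trim_fraction=0.1):
    vals = sorted(values)
    k = int(len(vals) * trim_fraction)   # values to drop from each end
    kept = vals[k:len(vals) - k] if k else vals
    return sum(kept) / len(kept)
```

With, say, eight genes reporting coverage 1.0, one reporting 0.0 and one a spurious 100.0, the trimmed mean returns 1.0 where a plain mean would be badly inflated.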

Also, to your point about some genes having more taxonomic assignment potential than others - definitely agree. I'm trying to devise a condense implementation that handles that more gracefully; there are some scratchings on the whiteboard.

I also have an r207 metapackage, if that would help?

dawsonfairbanks commented 2 years ago

Hi @wwood; re: question 6, here is the code I used to run condense with the updated package:

singlem pipe --forward ${SAMPLE}_R1.fastq --reverse ${SAMPLE}_R2.fastq --otu-table $outdir/${SAMPLE}.otu_table.csv --threads 20 --singlem-metapackage /fs/project/PAS1212/bioinformatic_tools/singlem_db/S3.0.1.metapackage_20211101.smpkg

singlem summarise --input_otu_tables *.csv --output_otu_table combined.otu_table.csv

singlem summarise --input_otu_tables combined.otu_table.csv --cluster --cluster-id 0.95 --clustered_output_otu_table newcsvfile.csv

singlem condense --input_otu_tables newcsvfile.csv --output_otu_table condense.csv --singlem-metapackage /fs/project/PAS1212/bioinformatic_tools/singlem_db/S3.0.1.metapackage_20211101.smpkg

wwood commented 2 years ago

Thanks Dawson, would you mind emailing me the combined otu table please?