rarefying, clustering OTU tables

wwood / singlem

Novelty-inclusive microbial community profiling of shotgun metagenomes

http://wwood.github.io/singlem/

GNU General Public License v3.0


rarefying, clustering OTU tables #35

Closed piwling closed 5 years ago

piwling commented 5 years ago

Hi Ben, I'm a bit confused about singlem.

“singlem summarise --input_otu_tables otu_table.csv --cluster --clustered_output_otu_table clustered.otu_table.csv” For cluster, can I change the cluster standards, such as similarity 0.95?
“singlem summarise --input_otu_tables otu_table.csv other_samples.otu_table.csv --rarefied_output_otu_table rarefied.otu_table.csv --number_to_choose 100” For rarefied, can the input_otu_tables be the clustered.otu_table.csv? And what is the principle to filter the 100 number sequence?
I used clustered.otu_table.csv as the input file for rarefaction, and the sequences in the result file look like the initial sequences rather than the representative (clustered) sequences, so num_hits is small. I would like the rarefied output file to show the representative (clustered) sequences. How can I change that? Thanks a lot. Brynn

wwood commented 5 years ago

Hi,

Thanks for your interest and detailed questions.

For cluster, can I change the cluster standards, such as similarity 0.95?

Yes. Use --cluster_id. In case you didn't know, there's a full list of options when you run singlem summarise --full_help.

For rarefied, can the input_otu_tables be the clustered.otu_table.csv?

No, that file is the definition of the clusters. The actual result of the clustering is written to the file given by --output_otu_table.

And what is the principle to filter the 100 number sequence?

This is rarefaction: a way of dealing with different sequencing depths when comparing relative abundances. Each sample is randomly subsampled down to the same number of sequences (here, 100). This is a pretty standard thing in ecology, and it's no different here.

Does that answer all your questions? Thanks.
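
To make the rarefaction described above concrete, here is a minimal Python sketch of subsampling one sample's OTU counts to a fixed depth. It is an illustration only and not necessarily how singlem implements --number_to_choose. The function name rarefy, the toy sequences, and the choice to return None for samples with too few hits are all assumptions made for this example.

```python
import random
from collections import Counter


def rarefy(counts, number_to_choose, seed=None):
    """Randomly subsample `number_to_choose` hits without replacement.

    `counts` maps an OTU sequence to its num_hits in one sample.
    Returns None when the sample has fewer hits than requested
    (an assumption of this sketch, not documented singlem behaviour).
    """
    total = sum(counts.values())
    if total < number_to_choose:
        return None
    rng = random.Random(seed)
    # Expand to one entry per hit, then draw without replacement.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(pool, number_to_choose)))


# Two hypothetical samples with very different sequencing depths.
deep = {"ATGC": 600, "GGCA": 300, "TTAG": 100}
shallow = {"ATGC": 70, "GGCA": 50}

rarefied_deep = rarefy(deep, 100, seed=1)
rarefied_shallow = rarefy(shallow, 100, seed=1)
```

After rarefaction both samples contain exactly 100 hits, so their OTU counts can be compared without the deeper sample dominating simply because it was sequenced more.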

piwling commented 5 years ago

Hi, sorry for getting back to you so late. Yes, you have answered my questions, and I will try it following your advice. Thank you very much.

lauramason326 commented 2 years ago

Hello! I think I have a similar question. I would like to merge the output files from singlem summarise --input_otu_tables otu_table.csv other_samples.otu_table.csv --biom_prefix myprefix to calculate diversity metrics as in Woodcroft, B.J., Singleton, C.M., Boyd, J.A. et al. Nature 560, 49–54 (2018). To do this, I am trying to get a weighted average of any taxa identified by more than one marker gene. Is the flag --cluster_id 0.95 the way to do this? Thanks! Laura

wwood commented 2 years ago

Hi Laura,

There's no easy way to combine sequences across marker genes (except via taxonomy, which isn't what you want). I'd instead suggest calculating the diversity metric for each marker gene, and then taking the mean of those results for each sample.

HTH, ben
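
The per-gene approach suggested above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the input is reduced to (gene, sample, num_hits) tuples, one per OTU, the gene and sample names are made up, and the Shannon index is used only as an example metric (the cited paper may have used a different one).

```python
import math
from collections import defaultdict


def shannon(counts):
    """Shannon diversity (natural log) of a list of OTU counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)


def mean_diversity_per_sample(rows):
    """Compute diversity per (sample, gene), then average over genes.

    `rows` is an iterable of (gene, sample, num_hits) tuples, one per OTU.
    """
    per_gene = defaultdict(list)
    for gene, sample, num_hits in rows:
        per_gene[(sample, gene)].append(num_hits)

    per_sample = defaultdict(list)
    for (sample, _gene), counts in per_gene.items():
        per_sample[sample].append(shannon(counts))

    return {s: sum(vals) / len(vals) for s, vals in per_sample.items()}


# Hypothetical data: sample1 has two marker genes, sample2 has one.
rows = [
    ("geneA", "sample1", 10), ("geneA", "sample1", 10),
    ("geneB", "sample1", 5), ("geneB", "sample1", 5),
    ("geneB", "sample1", 5), ("geneB", "sample1", 5),
    ("geneA", "sample2", 7),
]
result = mean_diversity_per_sample(rows)
```

Because each marker gene is scored on its own, no OTU sequences ever need to be matched across genes; only the final per-gene numbers are combined.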

lauramason326 commented 2 years ago

Hi Ben, thanks for the advice - I hadn't thought of that! Laura

lauramason326 commented 2 years ago

Hi again Ben, a follow-up question: I am trying to put together an OTU table (with taxonomy) that combines the biom tables generated for each single-copy marker gene. I have clustered the OTU table at 0.95 (which I am assuming represents species-level clustering), but I am concerned that there are taxonomic repeats between marker genes, and I am unsure how to deal with them. Do you have any advice? Thanks, Laura

wwood commented 2 years ago

Hi Laura,

Unfortunately I'm not sure I have a good answer. Multiple OTUs can have the same taxonomy, because taxonomy isn't necessarily down to the species level, for instance. It also isn't clear how you intend to combine the results from the different genes, since the taxonomy for the reads from a single species might be different across the different genes e.g. one might be at species level and another only genus level.

In the dev branch, @rossenzhao implemented a "condense" mode which does provide a single taxonomy table calculated from all of the genes. To do this you'd have to rerun singlem pipe on all your samples again, but luckily the dev pipe is like 95% faster. You'd also get the advantage of using GTDB r202 taxonomy, which is much better.

Let me know if you wanted to go that route and I can provide some further details. ben

lauramason326 commented 2 years ago

Hi Ben, sure, I'd give the condense mode a try! I mainly want to run this data through lefse to pick out differentially enriched OTUs, and I was told that I would need to get the average counts of repeated taxa for that. Thanks, and sorry if this was confusing! Laura

wwood commented 2 years ago

Sure, OK, well check out the dev branch of this repo, and then use this "metapackage" with pipe https://zenodo.org/record/6469357

Then on the output of pipe run condense. HTH. ben