Automate spiec-easi for non amplicon data

pbelmann commented 1 year ago

Thank you for developing this tool! I'm interested in automating spiec-easi as part of a pipeline that takes a table containing the abundances of metagenome assembled genomes as input. However, I have a few questions regarding the use of spiec easi in this context.

1) Since I don't have integer counts but decimal values (mean coverage), would this be a problem?

2) I would like to automate the spiec easi call in my pipeline as much as possible. What I'm doing at the moment is that I increase the nlambda value if the stability value is not close enough to the stability threshold (0.05). However I'm not sure how to handle the lambda.min.ratio parameter. As far as I understand, setting this parameter to lower values, might leads to denser networks. My question is, if it would make sense to set the lambda.min.ratio parameter to a low value, such as 0.001, from the beginning, so that denser networks are always explored.

zdk123 commented 1 year ago

For 1) is the data still compositional? If so, does using a single pseudocount still make sense?

For 2) lambda.min.ratio determines the limits of the path we explore, but not the optimal value of lambda (which is what determines the density of the selected graph). Starting with small values is preferred, unless you have limited compute resources.

pbelmann commented 1 year ago

Thank you for your answer!

For 1) is the data still compositional? If so, does using a single pseudocount still make sense?

The data is compositional because abundance represents a fraction of sequence data mapped to a genome. Instead of using the coverage mean I could also use relative abundance (i.e genome1 20%, genome2 30%) but as far as I understand spiec easi prefers raw counts. Rergarding the pseudocounts I guess it would not harm to add a certain value to all entries in the abundance table. The data will stay compositional.

If the decimals are an issue I could also round the values which might also make sense since for example a mean abundance of 0.3 means that just a fraction of the genome is actually covered and I can not reliably tell if the genome is in the dataset, so I would round this value to zero.

For 2) lambda.min.ratio determines the limits of the path we explore, but not the optimal value of lambda (which is what determines the density of the selected graph). Starting with small values is preferred, unless you have limited compute resources.

Ok thanks, then I will start with a lower value like 0.01.

I have a third question regarding preprocessing. Do you recommend to filter genomes (or OTUs in amplicon data) that have a zero abundance in almost all samples (i.e. >70% of all samples).

zdk123 commented 1 year ago

Rergarding the pseudocounts I guess it would not harm to add a certain value to all entries in the abundance table. The data will stay compositional.

technically that's true, but the pseudocount serves another purpose as the prior to the Laplace smoothing of frequency estimates, and avoiding log[0] during the clr transformation. Adding the pseudo-count after normalization loses the former interpretation.

The decimal values are not the problem per se, but the Laplace smoother assumes the smallest count we could observe is 1. If you have can observe values between 0 and 1, that assumption no longer applies and a different pseudo-count might be better (e.g. a 0.5 pseudo count is also common).

There are more normalization options in the latentcor branch of this repo, and I'm hoping to release this soon.

Do you recommend to filter genomes (or OTUs in amplicon data) that have a zero abundance in almost all samples (i.e. >70% of all samples).

Yes, though I typically leave that filtering up to the user, since it's highly data-dependent.

pbelmann commented 1 year ago

Rergarding the pseudocounts I guess it would not harm to add a certain value to all entries in the abundance table. The data will stay compositional.

technically that's true, but the pseudocount serves another purpose as the prior to the Laplace smoothing of frequency estimates, and avoiding log[0] during the clr transformation. Adding the pseudo-count after normalization loses the former interpretation.

Ok. I understand. I will not normalize the data then.

The decimal values are not the problem per se, but the Laplace smoother assumes the smallest count we could observe is 1. If you have can observe values between 0 and 1, that assumption no longer applies and a different pseudo-count might be better (e.g. a 0.5 pseudo count is also common).

There are more normalization options in the latentcor branch of this repo, and I'm hoping to release this soon.

As long as the spiec-easi version with more normalization options is not released I will round the values which I believe makes sense in my case as described in my previous comment. This way there will be no values between 0 and 1. Once the new version is released it would be interesting to see the difference.

pbelmann commented 1 year ago

For 1) is the data still compositional? If so, does using a single pseudocount still make sense?

The data is compositional because abundance represents a fraction of sequence data mapped to a genome. Instead of using the coverage mean I could also use relative abundance (i.e genome1 20%, genome2 30%) but as far as I understand spiec easi prefers raw counts. Rergarding the pseudocounts I guess it would not harm to add a certain value to all entries in the abundance table. The data will stay compositional.

@zdk123 Sorry, just to be clear. Since you asked about the compositionality of the data. If I divide the values in my matrix by the total number of reads per sample, I will get relative values, so the data is compositional. Should I use the relative values as input or the raw not normalized counts for SpiecEasi?

zdk123 commented 1 year ago

the raw not normalized counts for SpiecEasi?

yes, raw counts

zdk123 / SpiecEasi

Automate spiec-easi for non amplicon data #236