ambient rna.txt variability

bpyenson commented 2 years ago

Hi,

How do I access the variance of ambient RNA calculated for each cluster? Or is the parameter p used for estimating ambient RNA the same across all clusters?

I have run the souporcell pipeline on my data and found that the value in ambient_rna.txt varies with the number of predefined clusters (k). Specifically, ambient RNA decreases from 10% for 2 clusters to 6.5% for 8 or more clusters. I am trying to justify this finding and could use your advice, given your experience designing the algorithm. Since souporcell measures ambient RNA at the level of cell clusters, I suppose that with more predefined clusters, transcripts that look like ambient RNA in one cell cluster can instead be assigned to another cell cluster. Is this correct? If so, the ambient RNA percentage at 8 or more clusters represents the 'ground truth' for the dataset, whereas the percentage at fewer than 8 clusters is inflated because k was set too low. Is this also correct?

Thanks,

wheaton5 commented 2 years ago

Ambient RNA should be the same for all clusters. It is an experiment-wide effect.

The measurement of ambient RNA depends on the number of clusters because it will be overestimated if there are more individuals than clusters: the allele fractions won't be 0, 0.5, or 1 if a cluster actually contains cells from multiple individuals. You should ideally choose k according to an elbow plot, silhouette score, or some other metric, and only for the "correct" k should you care about the ambient RNA estimate. But I will also say that in general ambient RNA is a bit overestimated in souporcell, and I'm not sure about its overall accuracy. It is also greatly affected by false positive variants, of which there are many. The statistical model does try to account for this, but cannot do so fully.
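To make the allele-fraction argument concrete, here is a minimal arithmetic sketch. It is not souporcell's statistical model; it only shows that pooling two individuals in one cluster produces expected alt-allele fractions that fall between the diploid values 0, 0.5, and 1. The function names and the `weight_a` mixing parameter are hypothetical and chosen for illustration.

```python
from itertools import product

# Expected alt-allele fractions for a single diploid individual:
# homozygous reference, heterozygous, homozygous alternate.
GENOTYPE_FRACTIONS = (0.0, 0.5, 1.0)


def mixed_cluster_fraction(g_a, g_b, weight_a=0.5):
    """Expected alt-allele fraction when one cluster pools two individuals.

    weight_a is the (assumed) share of reads contributed by individual A.
    """
    return weight_a * g_a + (1.0 - weight_a) * g_b


def off_grid_combinations(weight_a=0.5):
    """Genotype pairs whose pooled fraction is not 0, 0.5, or 1."""
    off = []
    for g_a, g_b in product(GENOTYPE_FRACTIONS, repeat=2):
        f = mixed_cluster_fraction(g_a, g_b, weight_a)
        if all(abs(f - g) > 1e-9 for g in GENOTYPE_FRACTIONS):
            off.append((g_a, g_b, f))
    return off


if __name__ == "__main__":
    for g_a, g_b, f in off_grid_combinations(weight_a=0.5):
        print(f"A={g_a}, B={g_b} -> pooled fraction {f:.2f}")
```

With an even mix, a hom-ref individual pooled with a het individual gives an expected fraction of 0.25, which matches none of the three diploid values. Deviations of this kind are what can be misread as ambient RNA when k is smaller than the true number of individuals.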

I hope this helps.

wheaton5 commented 2 years ago

I think it is probably best to just assume you have 3% ambient RNA for experiments on liquid source cells (blood) and 8ish % for solid tissue source cells (due to the stress cells go through in the dissociation process).

bpyenson commented 2 years ago

Hi Dr. Heaton,

I very much appreciate your quick and thorough responses. The software you designed is very useful!

I understand from the output that a certain proportion of cell barcodes are classified as doublets vs. singlets vs. others. In my case (from the clusters.tsv output), there were 412 doublet barcodes, 4259 singlet barcodes, and 10 unassigned barcodes. I do not think the ambient RNA was calculated as 10/4681 (about 0.2%), since the ambient RNA percentage (from ambient_rna.txt) for this experiment was 6.55%.

So it does not seem that the singlet barcodes in the clusters.tsv output from souporcell have been filtered for ambient RNA.
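The barcode counts quoted above can be reproduced with a short script. This is a sketch under the assumption that clusters.tsv is tab-separated with a header row containing a `status` column whose values include `singlet`, `doublet`, and `unassigned`; check the header of your own file first. The function names are hypothetical.

```python
import csv
from collections import Counter


def count_statuses(path, status_column="status"):
    """Count barcodes per status in a tab-separated clusters file.

    Assumes a header row with a column named `status_column`.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[status_column] for row in reader)


def status_fractions(counts):
    """Convert per-status counts into fractions of all barcodes."""
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


if __name__ == "__main__":
    # Counts reported in this thread, used here instead of a real file.
    counts = Counter({"singlet": 4259, "doublet": 412, "unassigned": 10})
    for status, fraction in status_fractions(counts).items():
        print(f"{status}: {fraction:.2%}")
```

With the counts from this thread, 10 of 4681 barcodes is roughly 0.21% unassigned. That is a per-barcode classification rate, a different kind of quantity from the 6.55% in ambient_rna.txt, which, per the reply above, is an experiment-wide estimate.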

I understand from your paper that ambient RNA is filtered out in clusters_genotypes.vcf. Or am I wrong? Regardless, I am having a difficult time processing the VCF at all, or using it in downstream analyses such as Seurat. Do you have any advice on how to access the output data without ambient RNA (presumably clusters_genotypes.vcf) in a useful form? Thanks,
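For anyone stuck on simply reading the VCF, the following is a minimal sketch that pulls the per-sample GT field out of a generic, uncompressed VCF using only the standard library. It assumes standard VCF layout (a `#CHROM` header line, a FORMAT column, and one column per sample). It does not settle whether the genotypes in that file are ambient RNA-corrected, and `read_genotypes` is a hypothetical helper, not part of souporcell.

```python
def read_genotypes(path):
    """Return one dict per variant with the GT call for each sample column.

    Assumes an uncompressed, tab-separated VCF with a FORMAT column.
    Records without a GT key in FORMAT are skipped.
    """
    samples = []
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("##"):
                continue  # skip blank lines and meta-information lines
            fields = line.split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the FORMAT column
                continue
            if len(fields) < 10:
                continue  # no FORMAT/sample columns on this line
            keys = fields[8].split(":")
            if "GT" not in keys:
                continue
            gt_index = keys.index("GT")
            calls = {
                sample: value.split(":")[gt_index]
                for sample, value in zip(samples, fields[9:])
            }
            records.append({
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4],
                "genotypes": calls,
            })
    return records
```

The result is a plain list of dictionaries, which can be written out as a table and loaded into R or another downstream tool as needed.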

mattbcvs commented 2 years ago

I'm also curious about whether/how we can obtain an ambient RNA-corrected output.

Thanks for the resource!

wheaton5 / souporcell

ambient rna.txt variability #150