arutik opened 4 years ago
In my paper I did a sweep and I think I got down to one genotype at 1-2% with no problems. Obviously there will be a limit, and it will also depend on the data (are the individuals related? how many UMIs/cell? etc.), but I haven't had a problem with skewed samples if they have enough data per cell.
Best, Haynes
Huh, weird. Having missed this insight in the paper, I ended up doing a computational experiment of my own to test the limits, and got less encouraging findings. I randomly sampled 100 cells from one 10X sample, extracted those cells' corresponding BAM contents, and retagged their cell barcodes to avoid potential overlap with the other sample (~2000-2500 cells, the exact number eludes me). I then stuck the two BAMs together and ran souporcell with -k 2 --skip_remap True --common_variants souporcell/filtered_2p_1kgenomes_GRCh38.vcf. The output was not indicative of success: the donor assignment was split 50/50, including in the 100 "contaminant" cells. I repeated this five times, and the closest it got to success was one run where one of the identified genotypes showed up in ~150 cells. That was still pretty far off, as only ~20 of those were the "contaminant".
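For anyone wanting to reproduce this kind of merge test, here is a minimal sketch of the barcode-retagging step, assuming 10x-style "-1" barcode suffixes. The function names are hypothetical, not part of souporcell or any 10x tool:

```python
# Toy sketch (not souporcell code) of the barcode-flagging step described
# above: 10x cell barcodes end in a "-1" style suffix, so rewriting the
# suffix in each sample before merging BAMs guarantees that barcodes from
# the two samples cannot collide. Function names are hypothetical.

def retag_barcode(barcode: str, sample_id: int) -> str:
    """Replace the 10x '-1' style suffix with a per-sample suffix."""
    stem, _, _ = barcode.rpartition("-")
    return f"{stem}-{sample_id}"

def merge_barcode_lists(sample_a, sample_b):
    """Retag each sample's barcodes, then pool them; collisions impossible."""
    merged = [retag_barcode(bc, 1) for bc in sample_a]
    merged += [retag_barcode(bc, 2) for bc in sample_b]
    return merged

# "AACGTT-1" appears in both samples, but no longer collides after retagging
barcodes = merge_barcode_lists(["AACGTT-1", "GGCATC-1"], ["AACGTT-1"])
```

The same suffix rewrite would be applied to the CB tags inside the BAM records before merging.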
Is this the --common_variants getting in the way of me being able to differentiate this somehow?
I'm guessing there is something else going on here.
I had 3 different skewed donor experiments in the paper.
Fig 2i: 5 synthetically mixed HipSci cell lines with 1000 cells per donor, sweeping a "minority" donor from 1000 cells down to 20 cells. It worked very well, with the final 20/1000/1000/1000/1000 experiment identifying all 20 minority singlets in the same cluster (I think it might have also had 2 or 3 doublets assigned as singlets in that cluster).
Fig 3a (top right panel) + Supp Fig 2 (a and b, second panel): Maternal/fetal tissue with one or the other being a minority due to it being placental or decidual tissue. These were of increased difficulty due to the individuals being mother/child related. Out of all 3 experiments, I think 2 minority cells were categorized into the majority cluster and 1 majority cell was categorized in the other direction. Of course there is no real ground truth here, but because the maternal cells and fetal cells are different cell types, the transcription profile and t-SNE can be used to visually examine this.
Ok, I misremembered this one, but it's still potentially relevant: Supp Fig 7b downsamples total cells down to an average of 40 cells per cluster. In that case, the donor with the smallest number of cells was 20 (because I sampled randomly from each donor). At 40*5 cells, with the smallest donor having 20, the ARI was 0.975, so a few errors but still pretty high.
I would try it without common variants and without --skip_remap. In your test data, how many UMI/cell do you have? Also was there anything else weird about the samples in question? Did one have way more UMI/cell than the other for instance? Were they related individuals? Just tell me everything you know about the data and I might have some insight.
Best, Haynes
I don't mean to imply you could make this mistake, but are you absolutely sure those two samples were from different individuals?
Thanks for all the insight, I'll try it without the common variants and without skipping the remap. The person who pointed me at the samples says that they're from different individuals.
False alarm. The collaborator messed up the metadata and I was merging two samples from the same individual. Once I replaced one of the samples with a different one, I got much more sensible results, even with --common_variants. Sorry about this; I asked them multiple times and only on the third time did they realise that it's actually the same patient.
Ah great! Best of both worlds. Not your fault. Not my fault. Cheers.
Hi,
Thanks a lot for the discussion, it's really useful.
Do you know what happens if one is unsure about the number of individuals in a sample?
I have a sample where I expect cells to be mostly from individual A, but there is a possibility that some cells are from individual B. I ran souporcell trying to cluster apart 2 individuals, but got something that didn't make much sense. I guess I'm trying to ask: is there any way of being confident that the number of individuals you are forcing is (or is not) the actual number of individuals multiplexed?
Thank you.
I never reported back with my findings; this might be of use to you.
I took a ~3000-cell sample from one individual and mixed in decreasing numbers of cells from a different 10X sample, repeating the downsampling five times each. Souporcell had no trouble recovering the contamination at 50 cells mixed in. At 25 contamination cells, half the time the results lined up pretty well with the contamination, and the other half the contamination was flagged as part of a ~50-cell cluster with some false positives. At 10 contamination cells, all five instances did the ~50-cell cluster with false positives, while at 5 contamination cells that happened half the time, and the other half was the scenario I think you're seeing (high unassigned fraction, about half and half between two "donors"). So it seems feasible your data's quite clean.
Is there any particular reason why 50 seems to be the magic number? Something on the algorithm side?
@arutik this is a general problem with all clustering methods. There are various methods to choose k, but they will all fall apart and not be confident at all in highly skewed clusters.
@ktpolanski this is about what I expect. My experiments had lots of UMIs per cell, so more data makes it easier. There is no reason in the algorithm that 50 would be a magic number. It's just a signal-to-noise issue, and also a question of whether other differences (due to noise, somatic mutations, RNA editing, mosaic X inactivation, etc.) cause a different split in the data to affect the total log likelihood more than the genotype differences do.
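As a rough illustration of that signal-to-noise argument, here is a toy back-of-envelope model (not souporcell's actual likelihood; every parameter value below is a made-up assumption):

```python
# Back-of-envelope toy model (NOT souporcell's actual objective): the
# evidence for splitting out a minority cluster scales roughly with
# (minority cells) x (UMIs/cell that hit discriminating SNPs) x (per-UMI
# log-likelihood ratio). All numbers here are illustrative assumptions.

def minority_cluster_evidence(n_minority_cells: int,
                              umis_per_cell: float,
                              informative_snp_fraction: float = 0.01,
                              llr_per_informative_umi: float = 1.0) -> float:
    """Approximate total log-likelihood gain from the genotype split."""
    informative_umis = umis_per_cell * informative_snp_fraction
    return n_minority_cells * informative_umis * llr_per_informative_umi

# Doubling UMIs/cell doubles the evidence, which is why deeper data lets
# you push to smaller minority fractions before noise-driven splits win.
low  = minority_cluster_evidence(50, 3000)   # 50 cells, 3k UMIs/cell
high = minority_cluster_evidence(50, 6000)   # same cells, deeper data
```

The split is only recovered when this genotype-driven gain exceeds whatever gain a noise-driven split can achieve, which is why there is no fixed magic cell count.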
My experiments had lots of UMIs per cell, so more data makes it easier.
Could you give an estimate of what sequencing depth (UMIs/cell) might be needed to identify a 99:1 mix of genotypes? I have 3000-4000 median UMIs per cell in my current data and can't demultiplex when I go lower than 8% minority genotype.
@cotedivoir can you let me know your application? I am working on a new tool that might be of interest to you depending on what you need. And if it's not a perfect fit, maybe knowing the application could give me ideas on a strategy that would work.
@wheaton5 Thanks for the reply! It is a cell transplant: two genotypes, and I need to identify donor cells mixed with host cells. I cannot be sure what percentage of transplanted cells will survive over time, so I have to consider the situation where the percentage of transplanted cells is very low.
@wheaton5 @cotedivoir @arutik @ktpolanski I realize this is an old thread, but it is very relevant to what I am hoping to accomplish and I would love your feedback if you have any. I am interested in using genetics in scRNA-seq data to identify potential contamination. The challenge being that the number of genotypes is unknown. I'm wondering how to interpret results if you force it into thinking there are 2 or 3 or n individuals in the sample? If you have any suggestions regarding methods to determine the optimal k value please share! Thank you!
I've mostly used the elbow plot method shown in figure 2 of the paper. But this is much less effective when there are very skewed cluster sizes, because the small clusters affect the loss function much less than the bigger clusters. I am working on a different method that may fit your use case as well. I'll try to remember to come back here and link it when I have a working version.
Thanks @wheaton5! It is not clear to me where to find the 'total log likelihood' plotted in figure 2. Would I obtain this by running souporcell with various k parameters and then summing the log likelihoods of every cell found in the clusters.tsv file? Thanks again.
Sorry, it's in one of the .out files (souporcell.out or clustering.out), on the final line.
I do not see any out files, sorry!
I am guessing it is this: ==> clusters.err <== best total log probability = -1219551.6
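Putting the pieces of this exchange together, a minimal sketch of the elbow procedure: run souporcell for several values of k, pull the "best total log probability" line out of each run (format as shown in the comment above), and look for where the improvement levels off. The elbow heuristic here (largest drop in successive gains) is just one reasonable choice, not souporcell's own:

```python
# Sketch of the elbow-plot method discussed above. The log-line format is
# taken from the clusters.err excerpt in this thread; how you organize
# per-k output directories is up to you and assumed here.
import re

def parse_total_log_prob(clusters_err_text: str) -> float:
    """Extract the value from a 'best total log probability = X' line."""
    matches = re.findall(r"best total log probability\s*=\s*(-?\d+\.?\d*)",
                         clusters_err_text)
    if not matches:
        raise ValueError("no total log probability line found")
    return float(matches[-1])  # take the final occurrence in the log

def elbow_k(log_probs_by_k: dict) -> int:
    """Pick the k after which the gain in log probability falls off most."""
    ks = sorted(log_probs_by_k)
    # gain going from each k to the next (log probs are negative, rising)
    gains = [log_probs_by_k[b] - log_probs_by_k[a]
             for a, b in zip(ks, ks[1:])]
    # elbow = k with the largest drop between its gain and the next gain
    drops = [g1 - g2 for g1, g2 in zip(gains, gains[1:])]
    return ks[1:][drops.index(max(drops))]
```

As the earlier comment warns, with very skewed cluster sizes the small cluster barely moves this total, so the elbow can be flat even when a real minority genotype is present.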
I've mostly used the elbow plot method shown in figure 2 of the paper. But this is much less effective when there are very skewed cluster sizes, because the small clusters affect the loss function much less than the bigger clusters. I am working on a different method that may fit your use case as well. I'll try to remember to come back here and link it when I have a working version.
Hi @wheaton5, I wanted to follow-up re the method you are working on. Is that cellector? I'd love to test it out when you have a working version!
Yeah, cellector, on my github. It is working very well for skewed datasets with 2 individuals (and it should be able to figure out when there is only 1, but I haven't been testing that yet). It is still in development, but definitely ready for beta use.
I'll be adding more statistical analysis of 1 vs 2 individuals and the variants that contribute the most soon.
@wheaton5 I'd love to beta test it on some problematic data I have. Will it work if there are more than 2 individuals? I'm mostly just after cleaning a dataset with contaminating cells from an unknown number of other samples - trying to get the major individual present.
The final step assumes that there are 2 individuals, but the anomaly-detection part should be able to pull apart contaminating cells whether they come from multiple individuals or not. It is available in rough form at https://github.com/wheaton5/cellector/ Currently the python version is in a fairly static state and we are working on the rust version, which is not yet at feature parity with the python version. To use only the anomaly-detection part, look at the iteration_#.tsv files; the column of interest for now is log_likelihood_loci_normalized. Find the median and IQR, compute median - IQR*5 (or 6), and anything below that is an anomalous cell.
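That thresholding rule could be sketched like this; the column name follows the comment above, while the data and helper name are illustrative:

```python
# Sketch of the anomaly rule described above: flag any cell whose
# log_likelihood_loci_normalized falls below median - 5*IQR (or 6*IQR).
# The column name comes from the cellector comment above; the function
# name and example data here are hypothetical.
from statistics import median, quantiles

def anomalous_cells(values_by_barcode: dict, iqr_mult: float = 5.0):
    """Return barcodes whose normalized loci log likelihood is an outlier."""
    vals = list(values_by_barcode.values())
    q1, _, q3 = quantiles(vals, n=4)          # quartile cut points
    cutoff = median(vals) - iqr_mult * (q3 - q1)
    return [bc for bc, v in values_by_barcode.items() if v < cutoff]
```

Because the threshold is built from the median and IQR rather than the mean and standard deviation, a handful of extreme contaminant cells cannot drag the cutoff toward themselves.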
Hi,
May I ask how souporcell handles situations with extremely uneven ratios of cells from different individuals (something like 95% of cells from individual 1 and 5% from individual 2, or even more skewed)? Generalising, is there a threshold on these ratios that allows one to assess the confidence of the deconvolution?
Thank you.
Sincerely, Anna Arutyunyan