wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Ratio of cells from different individuals #55

Open arutik opened 4 years ago

arutik commented 4 years ago

Hi,

May I ask how souporcell handles situations with extremely uneven ratios of cells from different individuals (something like 95% of cells from individual 1 and 5% from individual 2, or even more skewed)? Generalising, is there a threshold on these ratios that allows one to assess the confidence of the deconvolution?

Thank you.

Sincerely, Anna Arutyunyan

wheaton5 commented 4 years ago

In my paper I did a sweep and I think I got down to 1-2% one genotype with no problems. Obviously there will be a limit and it will also depend on the data (are the individuals related? how many UMIs/cell etc), but I haven't had a problem with skewed samples if they have enough data per cell.

Best, Haynes

ktpolanski commented 4 years ago

Huh, weird. Having missed this insight in the paper, I ended up doing a computational experiment of my own to try to test the limits, and got less encouraging findings. I randomly sampled 100 cells from one 10X sample, extracted those cells' corresponding BAM contents, and tagged their cell barcodes to avoid potential overlap with the other sample (~2000-2500 cells; the exact number eludes me). I then stuck the two BAMs together and ran souporcell with -k 2 --skip_remap True --common_variants souporcell/filtered_2p_1kgenomes_GRCh38.vcf. The output was not indicative of success - the donor assignment was split 50/50, including in the 100 "contaminant" cells. This was repeated five times, and the closest it got to success was one of the identified genotypes showing up in ~150 cells once. Even that was still pretty far off, as only ~20 of those cells were the "contaminant".

Is this the --common_variants getting in the way of me being able to differentiate this somehow?

wheaton5 commented 4 years ago

I'm guessing there is something else going on here.

I had 3 different skewed donor experiments in the paper.

  1. Fig 2i: 5 synthetically mixed HipSci cell lines with 1000 cells per donor, sweeping a "minority" donor from 1000 cells down to 20 cells. It worked very well with the final 20/1000/1000/1000/1000 experiment identifying all 20 minority singlets in the same cluster (I think it might have also had 2 or 3 doublets assigned as singlets in that cluster).

  2. Fig 3a (top right panel) + Supp Fig 2 (a and b, second panel): Maternal/fetal tissue with one or the other being a minority due to it being placental or decidual tissue. These were of increased difficulty due to the individuals being mother/child related. Out of all 3 experiments, I think 2 minority cells were categorized into the majority cluster and 1 majority cell was categorized in the other direction. Of course there is no real ground truth here, but because the maternal cells and fetal cells are different cell types, the transcription profile and t-SNE can be used to visually examine this.

  3. OK, I misremembered this one, but it's still potentially relevant: Supp Fig 7b downsamples total cells down to an average of 40 cells per cluster. In that case, the donor with the smallest number of cells was 20 (because I sampled randomly from each donor). At 40*5 = 200 total cells, with the smallest donor having 20, the ARI was 0.975, so a few errors, but still pretty high.

I would try it without --common_variants and without --skip_remap. In your test data, how many UMIs/cell do you have? Also, was there anything else weird about the samples in question? Did one have way more UMIs/cell than the other, for instance? Were they related individuals? Just tell me everything you know about the data and I might have some insight.

Best, Haynes

wheaton5 commented 4 years ago

I don't mean to imply you could make this mistake, but are you absolutely sure those two samples were from different individuals?

ktpolanski commented 4 years ago

Thanks for all the insight, I'll try it without the common variants and without skipping the remap. The person who pointed me at the samples says that they're from different individuals.

ktpolanski commented 4 years ago

False alarm. The collaborator messed up the metadata and I was merging two samples from the same individual. I'm getting a lot more sensible results, even with --common_variants, now that I've replaced one of the samples with a different one. Sorry about this - I asked them multiple times, and only on the third time did they realise that it's actually the same patient.

wheaton5 commented 4 years ago

Ah great! Best of both worlds. Not your fault. Not my fault. Cheers.

arutik commented 4 years ago

Hi,

Thanks a lot for the discussion, it's really useful.

Do you know what happens if one is unsure about the number of individuals in a sample?

I have a sample where I expect cells mostly to be from individual A, but there is a possibility that there are some cells from individual B, so I ran souporcell trying to cluster apart 2 individuals, but got something that didn't make much sense. I guess I'm trying to ask if there is any way of being confident that the number of individuals you are forcing is (or is not) the actual number of individuals multiplexed?

Thank you.

ktpolanski commented 4 years ago

I never reported back with my findings, so this might be of use to you.

I took a ~3000-cell sample from one individual and mixed in decreasing numbers of cells from a different 10X sample, repeating the downsampling five times each. Souporcell had no trouble recovering the contamination at 50 cells mixed in. At 25 contaminant cells, half the time the results lined up pretty well with the contamination, and the other half saw the contamination flagged as part of a ~50-cell cluster with some false positives. 10 contaminant cells had all five instances produce the ~50-cell cluster with false positives, while 5 contaminant cells had that happen half the time, with the other half being the scenario I think you're seeing (high unassigned fraction, split about half and half between two "donors"). So it seems feasible that your data is quite clean.
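For anyone wanting to set up a similar spike-in test, the contaminant-sampling step can be sketched like this (a rough sketch only: the barcodes shown are made up, and the actual BAM subsetting would still be done separately, e.g. with samtools):

```python
import random

def sample_contaminant_barcodes(barcodes, n, seed=0):
    """Randomly pick n cell barcodes to spike into the host sample."""
    rng = random.Random(seed)  # fixed seed so each repeat is reproducible
    return rng.sample(barcodes, n)

# Hypothetical barcode list from the contaminant 10X sample
barcodes = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1", "AAACCTGAGAAACCTA-1",
            "AAACCTGAGAAACGAG-1", "AAACCTGAGAAACGCC-1", "AAACCTGAGAAACGGA-1"]
picked = sample_contaminant_barcodes(barcodes, 3)
print(picked)  # 3 barcodes drawn without replacement
```

Repeating with different seeds gives the five independent downsamplings described above.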

Is there any particular reason why 50 seems to be the magic number? Something algorithm side?

wheaton5 commented 4 years ago

@arutik this is a general problem with all clustering methods. There are various methods to choose k, but they will all fall apart and not be confident at all in highly skewed clusters.

@ktpolanski this is about what I expect. My experiments had lots of UMIs per cell, and more data makes it easier. There is no reason in the algorithm that 50 would be a magic number. It's just a signal-to-noise issue, and also a question of whether there are other differences (due to noise, somatic mutations, RNA editing, mosaic X inactivation, etc.) which cause a different split in the data to affect the total log likelihood more than the genotype differences do.

cotedivoir commented 1 year ago

My experiments had lots of umi per cell so more data makes it easier.

Could you give an estimate of what sequencing depth (UMIs/cell) might be needed to identify a 99:1 mix of genotypes? I have 3000-4000 median UMIs per cell in my current data and can't demultiplex when I go lower than 8% minority genotype.

wheaton5 commented 1 year ago

@cotedivoir can you let me know your application? I am working on a new tool that might be of interest to you depending on what you need. And if it's not a perfect fit, maybe knowing the application could give me ideas on a strategy that would work.

cotedivoir commented 1 year ago

@wheaton5 Thanks for the reply! It is a cell transplant: two genotypes, and I need to identify donor cells mixed with host cells. I cannot be sure what percentage of transplanted cells will survive over time, so I should consider the situation where the percentage of transplanted cells is very low.

ONeillMB1 commented 11 months ago

@wheaton5 @cotedivoir @arutik @ktpolanski I realize this is an old thread, but it is very relevant to what I am hoping to accomplish and I would love your feedback if you have any. I am interested in using genetics in scRNA-seq data to identify potential contamination. The challenge being that the number of genotypes is unknown. I'm wondering how to interpret results if you force it into thinking there are 2 or 3 or n individuals in the sample? If you have any suggestions regarding methods to determine the optimal k value please share! Thank you!

wheaton5 commented 11 months ago

I've mostly used the elbow plot method shown in figure 2 of the paper. But this is much less effective when cluster sizes are very skewed, because the small clusters will affect the loss function much less than the bigger clusters. I am working on a different method that may fit your use case as well. I'll try to remember to come back here and link it when I have a working version.
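As a rough illustration of the elbow approach (all numbers below are invented; in practice each value would come from a separate souporcell run at that k):

```python
# Hypothetical best total log probabilities from souporcell runs at k = 1..6
log_probs = {1: -1.52e6, 2: -1.31e6, 3: -1.22e6,
             4: -1.215e6, 5: -1.213e6, 6: -1.212e6}

# The elbow is where adding another cluster stops improving the loss much
gains = {k: log_probs[k] - log_probs[k - 1] for k in sorted(log_probs) if k > 1}
for k, gain in gains.items():
    print(f"k={k}: gain {gain:.0f}")
# With these made-up numbers the gain collapses after k = 3, suggesting
# 3 genotypes; a tiny skewed cluster would barely move the loss at all.
```

The skew problem mentioned above is visible in this framing: a cluster holding 1% of the cells contributes only a small term to the total, so the elbow at its k can be hard to see.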

ONeillMB1 commented 11 months ago

Thanks @wheaton5! It is not clear to me where to find the 'total log likelihood' plotted in figure 2. Would I obtain this by running souporcell with various k parameters and then summing the log likelihood of every cell found in the clusters.tsv file? Thanks again.

wheaton5 commented 11 months ago

Sorry. It's on the final line of one of the .out files (souporcell.out or clustering.out).

ONeillMB1 commented 11 months ago

I do not see any out files, sorry!

ONeillMB1 commented 11 months ago

I am guessing it is this: ==> clusters.err <== best total log probability = -1219551.6
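For a sweep over k, that value can be pulled out programmatically; a sketch, assuming only the line format shown above:

```python
import re

def best_log_prob(err_text):
    """Extract 'best total log probability = X' from clusters.err contents."""
    m = re.search(r"best total log probability\s*=\s*(-?\d+(?:\.\d+)?)", err_text)
    return float(m.group(1)) if m else None

print(best_log_prob("best total log probability = -1219551.6"))  # -1219551.6
```

Running this over the clusters.err from each k gives the points for the elbow plot.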

ONeillMB1 commented 4 months ago

I've mostly used the elbow plot method shown in figure 2 of the paper. But this is much less effective when cluster sizes are very skewed, because the small clusters will affect the loss function much less than the bigger clusters. I am working on a different method that may fit your use case as well. I'll try to remember to come back here and link it when I have a working version.

Hi @wheaton5, I wanted to follow-up re the method you are working on. Is that cellector? I'd love to test it out when you have a working version!

wheaton5 commented 4 months ago

yeah, cellector on my github. It is working very well for skewed datasets with 2 individuals (and it should be able to figure out when there is only 1, but I haven't been testing that yet). It is still in development, but definitely ready for beta use.

wheaton5 commented 4 months ago

I'll be adding more statistical analysis of 1 vs 2 individuals and the variants that contribute the most soon.

ONeillMB1 commented 2 months ago

@wheaton5 I'd love to beta test it on some problematic data I have. Will it work if there are more than 2 individuals? I'm mostly just after cleaning a dataset with contaminating cells from an unknown number of other samples - trying to get the major individual present.

wheaton5 commented 2 months ago

The final step assumes that there are 2 individuals, but the anomaly detection part should be able to pull apart contaminating cells whether they come from multiple individuals or not. It is available in a rough form at https://github.com/wheaton5/cellector/ Currently the python version is in a fairly static state and we are working on the rust version, but right now it's not quite at feature parity with the python version. To use the anomaly detection part only, look at the iteration_#.tsv files; log_likelihood_loci_normalized is the column of interest for now. Find the median and IQR, compute median - IQR*5 (or 6), and anything below that is an anomalous cell.
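A minimal sketch of that median/IQR rule (the log_likelihood_loci_normalized column name is from the comment above; the "barcode" column name and the multiplier of 5 are assumptions):

```python
import statistics

def anomalous_cells(rows, mult=5.0):
    """Flag cells whose log_likelihood_loci_normalized falls far below the bulk.

    rows: list of dicts as read from an iteration_#.tsv, e.g. via
    csv.DictReader(f, delimiter='\t'). Returns barcodes below median - mult*IQR.
    """
    vals = [float(r["log_likelihood_loci_normalized"]) for r in rows]
    q1, median, q3 = statistics.quantiles(vals, n=4)  # quartile cut points
    threshold = median - mult * (q3 - q1)
    return [r["barcode"] for r in rows
            if float(r["log_likelihood_loci_normalized"]) < threshold]
```

Cells with values near the bulk of the distribution pass; a cell sitting many IQRs below the median gets flagged as anomalous (i.e. likely from another individual).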