wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Parameters to consider #19

Closed kerimsecener closed 4 years ago

kerimsecener commented 4 years ago

Hi,

Thanks for this amazing tool. I managed to replicate the placenta dataset from your paper without any problem! I am also working on single-cell placenta datasets and have tried running souporcell on them. I was expecting to see only a few maternal cells, but souporcell predicted that my dataset (~3000 cells) is half maternal and half placental cells. That cannot be right: I verified with female-specific XIST expression that only a few maternal cells are present. So I was wondering which parameters could be involved in producing a result like this.

[Image: Placenta_Villi]

Thanks,

Kerim

wheaton5 commented 4 years ago

Hi Kerim,

Sorry about this. I would try passing the 1000 Genomes common variants file (linked in the GitHub README) to the --common_variants option. This usually solves these types of bad clusterings, which usually only happen with low UMI counts. Is that the case here? The other reason poor clustering can occur is a large number of donors (not the case here), which --restarts helps with (and a new version I haven't released yet has algorithm changes that dramatically help with this).
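
As a rough sketch of what such an invocation might look like, the snippet below assembles a souporcell_pipeline.py command line that includes --common_variants and --restarts. All file paths, the thread count, the cluster count, and the restart count are placeholder assumptions, and build_souporcell_command is a hypothetical helper, not part of souporcell. Check the README for the options your installed version supports.

```python
import shlex


def build_souporcell_command(bam, barcodes, fasta, out_dir, k,
                             common_variants=None, restarts=None, threads=8):
    """Assemble an argument list for souporcell_pipeline.py.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    cmd = [
        "souporcell_pipeline.py",
        "-i", bam,            # aligned reads
        "-b", barcodes,       # cell barcodes file
        "-f", fasta,          # reference genome
        "-t", str(threads),
        "-o", out_dir,
        "-k", str(k),         # number of genotype clusters (donors)
    ]
    if common_variants is not None:
        # Restrict variant candidates to known common variant loci.
        cmd += ["--common_variants", common_variants]
    if restarts is not None:
        # More random restarts of the clustering to avoid local optima.
        cmd += ["--restarts", str(restarts)]
    return cmd


# Placeholder paths and values (assumptions, not taken from this thread).
cmd = build_souporcell_command(
    bam="possorted_genome_bam.bam",
    barcodes="barcodes.tsv",
    fasta="genome.fa",
    out_dir="souporcell_out",
    k=2,  # maternal + fetal
    common_variants="common_variants.vcf",
    restarts=100,
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be handed to subprocess.run once the paths point at real files.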

Best, Haynes

wheaton5 commented 4 years ago

I think we resolved this over email (feel free to reopen if that is not the case). Also v2.0 is now available.

kerimsecener commented 4 years ago

Hi Haynes,

I have tried using the 1kgenomes common variants file. The UMI count is 7,261,692 with an average of 2160 per gene, which I think is quite enough, right? The only difference is that mine is nuclei sequencing data. Do you think that might somehow be the cause of this random clustering?

Best,

Kerim

wheaton5 commented 4 years ago

Hi Kerim,

Do you know the median UMI/cell? The total UMI count doesn't tell me much unless I know several other statistics of the data. I'm guessing it is just above 2k, assuming 3k cells and 90% of reads in cell barcodes. I have had many people run single nuclei successfully, but not necessarily on related individuals or with highly skewed numbers of cells per donor. You mentioned that only a few maternal cells are present. How many would you say? souporcell has done very well in my downsampling of one cluster all the way down to 20 cells in the minority cluster, but then again that dataset had 25k UMI/cell and the individuals were not related. It may well be that a different clustering has a better overall likelihood under the probabilistic model (even though it is not capturing the differences you care about). You could also try the new version of souporcell, which has improvements to overcome local optima.
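
The back-of-envelope guess above can be written out as a short calculation. This is a minimal sketch: the 3,000 cells and 90% in-cell fraction are the assumptions stated in the comment, both function names are hypothetical, and the per-cell counts in the second example are made-up toy values that only show why the median can sit well below the mean.

```python
from statistics import median


def estimate_mean_umi_per_cell(total_umi, n_cells, fraction_in_cells):
    """Rough mean UMI per cell from run-level totals."""
    return total_umi * fraction_in_cells / n_cells


def median_umi_per_cell(umi_counts_per_cell):
    """Median UMI per cell from a list of per-cell UMI totals."""
    return median(umi_counts_per_cell)


# Total UMI from the thread; cell count and in-cell fraction are assumed.
estimate = estimate_mean_umi_per_cell(7_261_692, 3000, 0.90)
print(f"estimated mean UMI/cell: {estimate:.0f}")

# Toy per-cell totals (illustrative only): one very deep cell pulls the
# mean up, while the median stays at the typical cell.
toy_counts = [800, 1500, 2000, 2600, 12000]
print("toy mean:", sum(toy_counts) / len(toy_counts))
print("toy median:", median_umi_per_cell(toy_counts))
```

The estimate is a mean, so it is only a rough stand-in for the median, which is the statistic that matters for how much signal a typical cell contributes.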

So yeah, how many maternal cells? If it is under 50, with somewhat lower UMI/cell and related individuals, I would not be surprised if that caused the problem. You need more signal relative to noise.

wheaton5 commented 4 years ago

I could imagine a system that penalizes cluster centers for being near one another, which might split things out better, but I'm not sure how to do that in the E/M framework. I would need to go back to gradient descent. I'll think on this.
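
To make the idea concrete, here is one hypothetical form such a penalty could take. It is not something souporcell implements. Cluster centers are assumed to be lists of per-locus alternate allele fractions, and the inverse-distance form, the function name, and the strength and eps parameters are all illustrative assumptions. A term like this would be added to the negative log-likelihood in a gradient-based optimizer.

```python
from itertools import combinations


def center_repulsion_penalty(centers, strength=1.0, eps=1e-6):
    """Sum of inverse mean squared distances over all pairs of centers.

    Hypothetical penalty: it grows as any two centers approach each other.
    """
    penalty = 0.0
    for a, b in combinations(centers, 2):
        # Mean squared difference across loci between two cluster centers.
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        penalty += strength / (sq_dist + eps)
    return penalty


# Two nearly identical centers versus two well-separated ones (toy values).
near = [[0.0, 0.5, 1.0], [0.1, 0.5, 0.9]]
far = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
print(center_repulsion_penalty(near) > center_repulsion_penalty(far))  # True
```

With related individuals, the true centers are close at many loci, so the strength would need tuning to avoid pushing genuinely similar genotypes apart.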

kerimsecener commented 4 years ago

So I'm expecting 40-50 maternal cells in total, and I have a median of 1944 UMI/cell, which is much lower than in your dataset. So I guess that would cause the problem. I will try the new version just in case.