wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Estimated doublet count/percent input for souporcell #65

Closed mesnger closed 4 years ago

mesnger commented 4 years ago

Hello, wheaton. Thanks in advance for the great resource you provide.

I made a single-cell library with 10x Chromium by overloading cells (loading concentration of 57,000) and obtained roughly 30,000 cells. 16 samples were pooled in the library, not quite evenly: approximately 8 samples pooled evenly to 35,000 and the other 8 samples pooled evenly to 22,000. I estimated the doublet rate to be around 24-26% (calculated with satijalab costpercell), but souporcell gave me about 17.9% doublets and unassigned combined, with an ambient RNA content of 18.6%. Overloading would contribute to a higher ambient RNA concentration, but because I used cultured cells for the 10x library preparation with minimal downtime, I do not think the true ambient RNA percentage is that high.
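One piece of arithmetic that bears on the gap between the expected and the called doublet rate: a genotype-based method can only flag doublets whose two cells come from different samples. The sketch below estimates that share for this pool. It assumes the two groups of 8 samples total 35,000 and 22,000 cells (which adds up to the 57,000 loaded) and that cells pair up at random; the function name is made up for illustration.

```python
# Share of doublets that pair cells from two DIFFERENT samples.
# ASSUMPTIONS: two groups of 8 samples totalling 35,000 and 22,000 cells,
# random pairing of cells, and no other source of missed doublets.

def cross_sample_fraction(cells_per_sample):
    """1 minus the probability that two random cells share a sample."""
    total = sum(cells_per_sample)
    same_sample = sum((n / total) ** 2 for n in cells_per_sample)
    return 1.0 - same_sample

pool = [35000 / 8] * 8 + [22000 / 8] * 8
detectable = cross_sample_fraction(pool)

print(f"cross-sample share of doublets: {detectable:.3f}")
for expected in (0.24, 0.26):
    print(f"expected {expected:.0%} total -> {expected * detectable:.1%} detectable")
```

Under these assumptions about 93% of doublets are cross-sample, so same-sample doublets account for only a small part of the difference between the estimate and the 17.9% called.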

I am worried that souporcell called some doublets as singlets with high ambient RNA content, and therefore reported a lower doublet rate and a higher ambient RNA fraction, which you briefly mentioned in #14 and #30.

Is there a way to tell souporcell the loading cell concentration or the estimated doublet count/fraction? I assume the estimates should not differ much from the real counts, and in particular that real-world doublet rates should not come out lower than estimated. I thought of simply hard-filtering cells down to the estimated doublet rate using the log probabilities of singlet and doublet, but then the ambient RNA estimate would still be incorrect.
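The hard-filtering idea can be sketched as follows: rank cells by how much more likely the doublet model is than the singlet model, then flag the top fraction. This is only an illustration, not a souporcell feature. The column names `barcode`, `log_prob_singleton`, and `log_prob_doublet` are assumed to match the clustering output, the function name is hypothetical, and the table is made-up demo data.

```python
import csv
import io

def flag_top_doublets(rows, target_rate):
    """Return the barcodes of the most doublet-like cells, up to target_rate."""
    ranked = sorted(
        rows,
        # Larger difference means the doublet model fits better than the singlet model.
        key=lambda r: float(r["log_prob_doublet"]) - float(r["log_prob_singleton"]),
        reverse=True,
    )
    n_flag = round(target_rate * len(ranked))
    return {r["barcode"] for r in ranked[:n_flag]}

# Made-up rows in the ASSUMED column layout, for illustration only.
demo_tsv = (
    "barcode\tstatus\tassignment\tlog_prob_singleton\tlog_prob_doublet\n"
    "AAAC-1\tsinglet\t0\t-50.0\t-80.0\n"
    "AAAG-1\tsinglet\t1\t-60.0\t-62.0\n"
    "AAAT-1\tdoublet\t0/1\t-90.0\t-40.0\n"
    "AACA-1\tsinglet\t2\t-45.0\t-70.0\n"
)
rows = list(csv.DictReader(io.StringIO(demo_tsv), delimiter="\t"))
flagged = flag_top_doublets(rows, target_rate=0.5)
print(sorted(flagged))
```

With a target rate of 0.5 the demo flags the called doublet plus the borderline singlet `AAAG-1`. As noted above, this only changes which cells are removed; it does not by itself correct the ambient RNA estimate.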

wheaton5 commented 4 years ago

Hello mesnger,

Well, first off I can tell you have done your homework. That loading is roughly what you would use to maximize total singlet cell barcodes. I disagree with satijalab and have better estimates using Poisson modeling instead of linear modeling, but in this regime it is basically good enough. I am just not sure why, given that lab's statistical modeling expertise, they chose a linear model, which is clearly not how droplet loading works. And under a linear model, this loading isn't optimal. So clearly they have done the correct modeling to find the optimum and then backed it up with a linear model on their website?

Anyway, souporcell has not been tested in this type of setup. Currently there is no way to tell souporcell an estimate of the doublet rate or ambient RNA fraction. In my defense, demuxlet has the problem that as ambient RNA goes up, doublet calling goes up drastically, to the point that with 10% ambient RNA nearly all cells are called doublets. The more multiplexed samples there are, the harder doublet detection becomes. What is your median UMI/cell?

It is true that a higher-than-expected ambient RNA estimate would be expected when the true % doublet + unassigned is higher than what was detected.

I am happy to work with you on this issue, but there is no simple solution. You are working on the edge of several limiting factors for 10x and souporcell which are poorly tested at this time.

Best, Haynes
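To make the Poisson versus linear point concrete, here is a minimal sketch of idealized Poisson droplet loading. It is not the model behind either tool discussed here: `lam` (mean cells per droplet) is a hypothetical parameter, and capture efficiency and other 10x-specific details are ignored. The linear rule shown is simply the small-`lam` approximation of the Poisson expression.

```python
import math

def multiplet_fraction(lam):
    """Share of cell-containing droplets that hold 2+ cells (Poisson mean lam)."""
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    return 1.0 - p_single / (1.0 - p_empty)

def linear_approx(lam):
    """First-order approximation of multiplet_fraction, valid for small lam."""
    return lam / 2.0

def singlet_droplet_fraction(lam):
    """Share of ALL droplets that hold exactly one cell."""
    return lam * math.exp(-lam)

for lam in (0.05, 0.2, 0.5, 1.0):
    print(
        f"lam={lam:4.2f}  poisson={multiplet_fraction(lam):.3f}  "
        f"linear={linear_approx(lam):.3f}  "
        f"singlet droplets={singlet_droplet_fraction(lam):.3f}"
    )
```

In this idealized model the two curves agree at low loading and drift apart as loading increases, and the share of droplets containing exactly one cell peaks at `lam = 1`, which is the sense in which heavy loading can maximize singlet barcodes at the cost of many multiplets.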

wheaton5 commented 4 years ago

If there is enough signal to adjust the % doublets and unassigned called by souporcell, you may be able to fix ambient RNA detection by doing that separately with a modified input file. Since completing souporcell, I have come to the conclusion that % ambient RNA is quite difficult to measure in this way given the biases from multiplets and empty barcodes. With cell culture I would assume roughly 1-2% ambient RNA. With tissue I would generally assume 5-8% ambient RNA. For the purposes of cleaning up the expression profile I would use these numbers by default. For necrotic tissue it's anyone's guess.

mesnger commented 4 years ago

Thanks Haynes.

My data has 11,000 UMI/cell when counting both singlets and doublets, and 8,000 UMI/cell when counting only singlets after removing 18.6% ambient RNA with SoupX. I expect true singlets would have slightly more UMIs, as 18% removal is too strict.

Also, can I get a clarification?

"If there is enough signal to adjust the % doublets and unassigned called by souporcell, you may be able to fix ambient RNA detection by doing that separately with a modified input file."

Do I need to remove all the doublet-called cells from the matrix and run souporcell again? Or does "enough signal" literally mean more sequencing depth?

cheers, mesnger

wheaton5 commented 4 years ago

The clustering file output from troublet and input to consensus.py has things labeled as singlet, doublet, or unassigned. If you alter that file to label more cells as doublets, you can then rerun consensus.py for a more accurate ambient RNA estimate.
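A minimal sketch of that relabeling step, assuming the clustering file is a tab-separated table with `barcode` and `status` columns. The function name and any paths are hypothetical, and only the `status` column is changed, so check which columns consensus.py actually reads before relying on the result.

```python
import csv

def relabel_as_doublets(in_path, out_path, barcodes):
    """Copy a clustering TSV, setting status to 'doublet' for the given barcodes.

    Returns the number of rows whose status was changed.
    ASSUMPTION: the file has 'barcode' and 'status' columns.
    """
    changed = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            if row["barcode"] in barcodes and row["status"] != "doublet":
                row["status"] = "doublet"
                changed += 1
            writer.writerow(row)
    return changed

# Example call with placeholder paths (not run here):
# relabel_as_doublets("troublet_output.tsv", "troublet_output.relabeled.tsv", extra_doublets)
```

The relabeled file would then be passed to consensus.py in place of the original, using the same arguments as the original run.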

I agree that ambient RNA % is too high and that the doublet detection is too low. Not sure what to do about it though.

The UMI/cell is pretty good. It's just that the experimental design is quite a bit beyond what we have tried before. I could take a look at it, but I suspect it would take me a fair amount of time and I couldn't guarantee any fix. I realize this is sort of the dream of what these methods could attain, but at a certain point, idk, it's like adding complexity to solve complexity. When there are guarantees of correctness at different levels (like in circuit design), added complexity can disambiguate things, but when there aren't guarantees, each level not only perpetuates the previous level's errors but often amplifies them through the incorrect assumptions it makes about previous steps.

mesnger commented 4 years ago

Sorry for the late reply. I am also not entirely sure about the balance between ambient RNA % and doublet rate, as expectations and results may not always agree. I am trying more overloading with different pooling counts and loading cell counts. I hope multiple experiments may resolve some of the complexity. Also, I will try the consensus.py rerun, although for now the only option is to filter doublets stringently.