Quality/Cell Filtering Before Running Doublet Programs

Hi,

I think this is more of a "discussion" than an "issue," but I have a habit of posting under "issues" and I don't think "discussions" are enabled for this repository.

Essentially, in the discussion for Xi and Li 2021 it is mentioned that is mentioned as open-question 3: "How to distinguish doublets from droplets contaminated by ambient mRNA?" I think that is also consistent with the DoubletFinder feedback for this issue.

I think I have some evidence from a few datasets below that support this idea. I think those scRNA-Seq examples also could consistent with there being difficulties in automating a benchmark, which is consistent with my belief that there should be at least some methods testing for every project (from a bulk RNA-Seq perspective, I have some additional details from more samples here).

So, I think these are my main questions:

1) Am I correct that you did not run any cell / quality filtering before testing the doublet methods?

2) If somebody is trying to figure out what is best to try out before running a doublet detection method, do you have any general recommendations?

For example, do you have any feedback that might be related to the examples that I provide below?

Quality / Cell Filtering for 3 Samples

Again, I think trying to lock down 1 strategy for all samples may be problematic.

However, if I try a couple strategies, I think I can generate a qualitative impression of what you might call "quality" (or at least some sense of effort needed for troubleshooting):

Sample	Cell Ranger	EmptyDrops (DropletUtils Implementation, Cell Ranger UMI)	DropletQC (Cell Ranger UMI)
PBMC-5k (10x Genomics)	4,956 cells	4,866 cells	4,889 viable cells (67 empty droplets)
pdx-MULTI (SRP192397) (used in this study)	9,146 cells	7,514 cells (~20% less than Cell Ranger)	8,880 viable cells (266 empty droplets)
COVID19-lung (PRJNA688216)	9,186 cells	12,408 cells (more than Cell Ranger implementation)	3,554 viable cells (5,632 empty droplets) (skipped analysis for clusters, used for estimating damaged cells)

In other words, I might have a ranking as follows:

a) PBMC-5k (least complications for troubleshooting) b) pdx-MULTI c) COVID19-lung

Effect on Doublet Detection

For scDblFinder, you can see a table with the resulting estimated doublet fractions. Maybe this is most viable for methods that use gradient boosted trees (bcds and scDblFinder, if I understand correctly).

Sample	Cell Ranger	EmptyDrops (DropletUtils Implementation, Cell Ranger UMI)	DropletQC (Cell Ranger UMI)
PBMC-5k (10x Genomics)	6% doublets	[not tested]	[not tested]
pdx-MULTI (SRP192397) (used in this study)	12.4% doublets	9.6% doublets	[not tested]
COVID19-lung (PRJNA688216)	13.2% doublets	[not tested]	6.5% doublets

I also have plots for all of the original Cell Ranger cells, but I will just show selected examples where I have tested running some quality filtering (with a method that is different for pdx-MULTI and the COVID19-lung sample):

pdx-MULTI (additional EmptyDrops filtering):

DoubletFinder (Original Cell Ranger): FullReads-DoubletFinder-plot2

DoubletFinder (DropletUtils-EmptyDrops Filtered): FullReads-DoubletFinder-EmptyDrops-plot2

scds (Original Cell Ranger): FullReads-scds-DoubletScores

scds (DropletUtils-EmptyDrops Filtered): FullReads-scds-DoubletScores-EmptyDrops

COVID19-lung (additional DropletQC filtering):

DoubletFinder (Original Cell Ranger): FullReads-DoubletFinder-plot2

DoubletFinder (DropletQC Filtered): FullReads-DoubletFinder-DropletQC-plot2

scds (Original Cell Ranger): FullReads-scds-DoubletScores

scds (DropletQC Filtered): FullReads-scds-DoubletScores-DropletQC

Correlation with Percent Human Reads (for pdx-MULTI)

Qualitatively, maybe EmptyDrops filtering removes some cells that are ~50% human?

Red points correspond to those that would be removed by the DropletUtils implementation of EmptyDrops in the original Cell Ranger set of cells. There is no coloring among the smaller set of cells that are provided to the doublet program after additional filtering.

pdx-MULTI (additional EmptyDrops filtering):

DoubletFinder (Original Cell Ranger): Full-CR-percent_human_vs_DoubletFinder

DoubletFinder (DropletUtils-EmptyDrops Filtered): Full-EmptyDrops-percent_human_vs_DoubletFinder

scds (Original Cell Ranger): Full-CR-percent_human_vs_scds

scds (DropletUtils-EmptyDrops Filtered): Full-EmptyDrops-percent_human_vs_scds

scDblFinder (Original Cell Ranger): Full-CR-percent_human_vs_scDblFinder

scDblFinder (DropletUtils-EmptyDrops Filtered): Full-EmptyDrops-percent_human_vs_scDblFinder

Correlation with UMI Counts (for pdx-MULTI)

Most directly, I think samples with lower total UMI tend to be filtered with the smaller set of EmptyDrops cells.

pdx-MULTI (additional EmptyDrops filtering):

DoubletFinder (Original Cell Ranger):

Full-CR-nCount_RNA_vs_DoubletFinder

DoubletFinder (DropletUtils-EmptyDrops Filtered): Full-EmptyDrops-nCount_RNA_vs_DoubletFinder

scds (Original Cell Ranger): Full-CR-nCount_RNA_vs_scds

scds (DropletUtils-EmptyDrops Filtered): Full-EmptyDrops-nCount_RNA_vs_scds

scDblFinder (Original Cell Ranger): Full-CR-nCount_RNA_vs_scDblFinder

scDblFinder (DropletUtils-EmptyDrops Filtered): Full-EmptyDrops-nCount_RNA_vs_scDblFinder

If it helps, here are some templates for the analysis that I ran:

EmptyDrops (DropletUtils implementation):

library(DropletUtils)

sampleID = "[provide label]"
count_dir_10x = "/path/to/raw_feature_bc_matrix"

sce = read10xCounts(count_dir_10x)
sce_counts = counts(sce)

set.seed(0)
br.out = barcodeRanks(sce_counts)

# follows code from https://bioconductor.org/packages/release/bioc/vignettes/DropletUtils/inst/doc/DropletUtils.html
png(paste(sampleID,"-BarcodeRanks.png",sep=""))
plot(br.out$rank, br.out$total, log="xy", xlab="Rank", ylab="Total")
o <- order(br.out$rank)
lines(br.out$rank[o], br.out$fitted[o], col="red")

abline(h=metadata(br.out)$knee, col="dodgerblue", lty=2)
abline(h=metadata(br.out)$inflection, col="forestgreen", lty=2)
legend("bottomleft", lty=2, col=c("dodgerblue", "forestgreen"), 
    legend=c("knee", "inflection"))
dev.off()

knee_threshold = metadata(br.out)$knee

e.out = emptyDrops(sce_counts)

is.cell <- e.out$FDR <= 0.01
print(sum(is.cell, na.rm=TRUE))

e.out$FDR[is.na(e.out$FDR)]=1
rownames(e.out) = 1:nrow(e.out)
kept_cell_index = rownames(e.out)[e.out$FDR < 0.01]
kept_cells = colData(sce)$Barcode[as.numeric(kept_cell_index)]
write.table(data.frame(Barcode = kept_cells), paste(sampleID,"-EmptyDrops.txt",sep=""), quote=F, sep="\t", row.names=F)

print(table(Limited=e.out$Limited, Significant=is.cell))

png(paste(sampleID,"-EmptyDrops.png",sep=""))
plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
    xlab="Total UMI count", ylab="-Log Probability")
dev.off()

Droplet QC:

CR_dir = "/path/to/CellRanger/SampleID/outs"
ed_plot = "[sampleID]-DropletQC-empty_drops_estimation.png"
output_table = "[sampleID]-nuclear_fraction.txt"

library(DropletQC)
library(DropletUtils)

filtered_dir = paste(CR_dir,"/filtered_feature_bc_matrix",sep="")
clusters.file = paste(CR_dir,"/analysis/clustering/graphclust/clusters.csv",sep="")

clusters = read.csv(clusters.file)

sce = read10xCounts(filtered_dir)
sce_counts = counts(sce)
set.seed(0)
Barcodes = colData(sce)$Barcode
br.out = barcodeRanks(sce_counts)

nf1 = nuclear_fraction_tags(outs = CR_dir, tiles = 1, cores = 1, verbose = FALSE)

print(table(Barcodes == rownames(nf1)))
print(table(Barcodes == clusters$Barcode))
nf_umi = data.frame(nuclear_fraction_droplet_qc=nf1, umi_count=br.out$total)

nf1.ep = identify_empty_drops(nf_umi = nf_umi, include_plot = TRUE, plot_name=ed_plot, plot_path=".")
print(table(nf1.ep$cell_status))
write.table(data.frame(CB=rownames(nf1.ep),nf1.ep), output_table, quote=F, sep="\t")

nf1.dc = identify_damaged_cells(nf_umi_ed_ct=data.frame(nf1.ep, cell_type=clusters$Cluster), verbose=FALSE, output_plots = FALSE)
print(table(nf1.dc$df$cell_status))
write.table(data.frame(CB=rownames(nf1.dc$df),nf1.dc$df), output_table, quote=F, sep="\t")

DoubletFinder:

library(DropletUtils)

sampleID = "[provide label]"
count_dir_10x = "/path/to/filtered_feature_bc_matrix"
max.pc = 20
prop.doublets = 0.1 #I prefer to export the scores and assess this afterwards

sce = read10xCounts(count_dir_10x)
sce_counts = counts(sce)
sce_counts = as.matrix(sce_counts)

library(Seurat)
colnames(sce_counts) = colData(sce)$Barcode
meta.table = data.frame(Sample=colData(sce)$Barcode)
rownames(meta.table) = colData(sce)$Barcode
sce2 = CreateSeuratObject(sce_counts, project="DoubletFinder",min.cells = 3,
                    min.genes = 2000, is.expr=1, meta.data = meta.table)
print(dim(sce2))

sce2 = NormalizeData(sce2)
sce2 = FindVariableFeatures(sce2)
sce2 = ScaleData(sce2)
sce2 = RunPCA(sce2)

sce2=JackStraw(sce2, prop.freq=0.1)
sce2=ScoreJackStraw(sce2, dims = 1:20)
JackStrawPlot(sce2, dims = 1:20)
library(ggplot2)
ggsave(paste(sampleID,"-Seurat_sig_PCA_plot.png",sep=""))

print(dim(sce2))

#stop()

library(DoubletFinder)

#based upon https://github.com/xnnba1984/DoubletCollection/blob/master/R/FindDoublets.R
sweep.vector = paramSweep_v3(sce2, PCs = 1:max.pc, sct = FALSE)
sweep.table = summarizeSweep(sweep.vector)
png(paste(sampleID,"-DoubletFinder-find.pK-BCmvn.png",sep=""))
bcmvn = find.pK(sweep.table)
dev.off()

pK = bcmvn$pK[which.max(bcmvn$BCmetric)]
pK = as.numeric(levels(pK))[pK]
source("doubletFinder_v3.R")
sce2 = doubletFinder_v3(sce2,PCs = 1:max.pc, pK = pK, nExp = round(ncol(sce2) * prop.doublets))
df_result = sce2@meta.data
write.table(data.frame(Barcode=rownames(df_result), df_result), paste(sampleID,"-DoubletFinder.txt",sep=""), quote=F, row.names=F, sep="\t")
print(table(df_result[,6]))
print(table(df_result[,6])/nrow(df_result))

png(paste(sampleID,"-DoubletFinder.png",sep=""))
den = density(df_result[,5])
plot(den$x, den$y, type="l", xlab = "DoubletFinder score", ylab = "Density",
            col="blue", main = "DoubletFinder")
dev.off()

png(paste(sampleID,"-DoubletFinder-plot2.png",sep=""))
den = density(df_result[,5], from=0, to=1)
plot(den$x, den$y, type="l", xlab = "DoubletFinder score", ylab = "Density",
            col="blue", main = "DoubletFinder")
dev.off()

scds:

library(DropletUtils)
library(scds)

sampleID = "[provide label]"
count_dir_10x = "/path/to/filtered_feature_bc_matrix"

sce = read10xCounts(count_dir_10x)
sce_counts = counts(sce)
sce2 = SingleCellExperiment::SingleCellExperiment(assays = list(counts = sce_counts))

sce2 = cxds(sce2)
sce2 = bcds(sce2)
sce2 = cxds_bcds_hybrid(sce2)

png(paste(sampleID,"-scds-DoubletScores.png",sep=""), width=600, height=200)
par(mfcol=c(1,3))
#cxds
den = density(sce2$cxds_score)
plot(den$x, den$y, type="l", xlab = "cxds score", ylab = "Density",
            col="blue", main = "cxds")
#bcds
den = density(sce2$bcds_score)
plot(den$x, den$y, type="l", xlab = "bcds score", ylab = "Density",
            col="blue", main = "bcds")
#hybrid
den = density(sce2$hybrid_score)
plot(den$x, den$y, type="l", xlab = "hybrid score", ylab = "Density",
            col="blue", main = "hybrid")
dev.off()

output.table = data.frame(Barcode = colData(sce)$Barcode, colData(sce2))
write.table(output.table, paste(sampleID,"-scds-DoubletScores.txt",sep=""), quote=F, sep="\t", row.names=F)

scDblFinder:

library(DropletUtils)
library(scDblFinder)

sampleID = "[provide label]"
count_dir_10x = "/path/to/filtered_feature_bc_matrix"

sce = read10xCounts(count_dir_10x)
sce_counts = counts(sce)
sce2 = SingleCellExperiment::SingleCellExperiment(assays = list(counts = sce_counts))

result.scDblFinder_random = scDblFinder(sce2, clusters=FALSE)
print(table(colData(result.scDblFinder_random)$scDblFinder.class))

result.computeDoubletDensity = computeDoubletDensity(sce2)

png(paste(sampleID,"-scDblFinder-DoubletScores.png",sep=""), width=600, height=200)
par(mfcol=c(1,3))

den = density(colData(result.scDblFinder_random)$scDblFinder.score)
plot(den$x, den$y, type="l", xlab = "Doublet Score", ylab = "Density",
            col="blue", main = "scDblFinder_random")

den = density(colData(result.scDblFinder_random)$scDblFinder.weighted)
plot(den$x, den$y, type="l", xlab = "Doublet Weighted Score", ylab = "Density",
            col="blue", main = "scDblFinder_random")

den = density(result.computeDoubletDensity)
plot(den$x, den$y, type="l", xlab = "Doublet Score", ylab = "Density",
            col="blue", main = "computeDoubletDensity")
dev.off()

output.table = data.frame(Barcode = colData(sce)$Barcode, colData(result.scDblFinder_random))
write.table(output.table, paste(sampleID,"-scDblFinder_random-Doublet_Stats.txt",sep=""), quote=F, sep="\t", row.names=F)

output.table = data.frame(Barcode = colData(sce)$Barcode, result.computeDoubletDensity)
write.table(output.table, paste(sampleID,"-computeDoubletDensity-Doublet_Score.txt",sep=""), quote=F, sep="\t", row.names=F)

Please let me know if you have any questions and/or if you need any additional information.

Hi,

I think this is more of a "discussion" than an "issue," but I have a habit of posting under "issues" and I don't think "discussions" are enabled for this repository.

Essentially, in the discussion for Xi and Li 2021 it is mentioned that is mentioned as open-question 3: "How to distinguish doublets from droplets contaminated by ambient mRNA?" I think that is also consistent with the DoubletFinder feedback for this issue.

I think I have some evidence from a few datasets below that support this idea. I think those scRNA-Seq examples also could consistent with there being difficulties in automating a benchmark, which is consistent with my belief that there should be at least some methods testing for every project (from a bulk RNA-Seq perspective, I have some additional details from more samples here).

So, I think these are my main questions:

1) Am I correct that you did not run any cell / quality filtering before testing the doublet methods?

2) If somebody is trying to figure out what is best to try out before running a doublet detection method, do you have any general recommendations?

For example, do you have any feedback that might be related to the examples that I provide below?

Quality / Cell Filtering for 3 Samples

Again, I think trying to lock down 1 strategy for all samples may be problematic.

However, if I try a couple strategies, I think I can generate a qualitative impression of what you might call "quality" (or at least some sense of effort needed for troubleshooting):

Sample Cell Ranger EmptyDrops (DropletUtils Implementation, Cell Ranger UMI) DropletQC (Cell Ranger UMI) PBMC-5k (10x Genomics) 4,956 cells 4,866 cells 4,889 viable cells (67 empty droplets) pdx-MULTI (SRP192397) (used in this study) 9,146 cells 7,514 cells (~20% less than Cell Ranger) 8,880 viable cells (266 empty droplets) COVID19-lung (PRJNA688216) 9,186 cells 12,408 cells (more than Cell Ranger implementation) 3,554 viable cells (5,632 empty droplets)

(skipped analysis for clusters, used for estimating damaged cells) In other words, I might have a ranking as follows:

a) PBMC-5k (least complications for troubleshooting) b) pdx-MULTI c) COVID19-lung

Effect on Doublet Detection

For scDblFinder, you can see a table with the resulting estimated doublet fractions. Maybe this is most viable for methods that use gradient boosted trees (bcds and scDblFinder, if I understand correctly).

Sample Cell Ranger EmptyDrops (DropletUtils Implementation, Cell Ranger UMI) DropletQC (Cell Ranger UMI) PBMC-5k (10x Genomics) 6% doublets [not tested] [not tested] pdx-MULTI (SRP192397) (used in this study) 12.4% doublets 9.6% doublets [not tested] COVID19-lung (PRJNA688216) 13.2% doublets [not tested] 6.5% doublets I also have plots for all of the original Cell Ranger cells, but I will just show selected examples where I have tested running some quality filtering (with a method that is different for pdx-MULTI and the COVID19-lung sample):

pdx-MULTI (additional EmptyDrops filtering):

DoubletFinder (Original Cell Ranger):

DoubletFinder (DropletUtils-EmptyDrops Filtered):

scds (Original Cell Ranger):

scds (DropletUtils-EmptyDrops Filtered):

COVID19-lung (additional DropletQC filtering):

DoubletFinder (Original Cell Ranger):

DoubletFinder (DropletQC Filtered):

scds (Original Cell Ranger):

scds (DropletQC Filtered):

Correlation with Percent Human Reads (for pdx-MULTI)

Qualitatively, maybe EmptyDrops filtering removes some cells that are ~50% human?

Red points correspond to those that would be removed by the DropletUtils implementation of EmptyDrops in the original Cell Ranger set of cells. There is no coloring among the smaller set of cells that are provided to the doublet program after additional filtering.

pdx-MULTI (additional EmptyDrops filtering):

DoubletFinder (Original Cell Ranger):

DoubletFinder (DropletUtils-EmptyDrops Filtered):

scds (Original Cell Ranger):

scds (DropletUtils-EmptyDrops Filtered):

scDblFinder (Original Cell Ranger):

scDblFinder (DropletUtils-EmptyDrops Filtered):

Correlation with UMI Counts (for pdx-MULTI)

Most directly, I think samples with lower total UMI tend to be filtered with the smaller set of EmptyDrops cells.

pdx-MULTI (additional EmptyDrops filtering):

DoubletFinder (Original Cell Ranger):

DoubletFinder (DropletUtils-EmptyDrops Filtered):

scds (Original Cell Ranger):

scds (DropletUtils-EmptyDrops Filtered):

scDblFinder (Original Cell Ranger):

scDblFinder (DropletUtils-EmptyDrops Filtered):

Thank you for your insightful comments. In terms of your first question, the answer is no. There are two considerations. First, different datasets may have different quality control processes before publishing. Also, we are not pretending we know the "best" quality control for each one. So we decide to trust the original authors of those datasets and their way to process the data. Second, adding quality control first will exponentially increase the complexity of the benchmark, because different choices in those steps may significantly change the doublet detection result. Since we are the first to systematically benchmark those methods, only focusing on doublet detection and isolating other pre-steps may be a good move in practice. It can help quickly rank the methods. But as you mentioned later, the interaction between doublet detection and quality control matters, which I agree should be explored in future studies.

For the second question, I don't think I have a good answer. There are so many preprocessing that may affect the doublet detection and I didn't see any systematic discussion on their exact effect and sequence. I even saw people ask if they should do imputation before doublet detection. With this benchmark, at least I think we can get a sense of which methods are the top ones. Then if there are also well-recognized methods in quality control, then those methods can be used to "benchmark" if and how we use quality control before doublet detection. This may be a good topic to do some interesting research on.

I think your analysis seems interesting. I am not clear what kind of quality control those three datasets used in their original studies. Did they already check the empty droplet? I remember that should be a basic quality control process and should be done routinely? It seems that removing empty droplets has limited impact on scds but more on the other two methods. Did you check the quantitive measurement for doublet identification, like auprc? Is there huge difference before and after removing empty droplets on those measurements? In the Correlation with Percent Human Reads figures, a lot of droplets with high doublet scores are predicted as empty. But high doublet score usually comes with high UMI, so predicting high UMI as empty seems a little counter-intuitive.

Thank you very much for your responses!

I started from the .fastq.gz files, so that could be different than if you started from the count tables provided in GEO (especially if the lab has already performed noticeable filtering of cells, and uploaded something other than what is provided by Cell Ranger). For simplicity, I only used Cell Ranger when re-processing pdx-MULTI, but you generally can get some noticeable differences with some datasets if you use STARsolo and/or UMItools (with the recommended alignment using regular STAR).

I started to look at the pdx-MULTI study because of this paper, and I used a mm10+hg38 reference from 10x. I had already downloaded raw data and started to test the other 2 datasets for some earlier testing (using a custom human + SARS-COV-2 reference). The other 2 datasets do not have paired cell hashing data (or other metric to define doublets).

I created plots to always to 0 to 1 for DoubletFinder, but I did not predefine the x-axis for sdcs. Some differences for DoubletFinder are arguably more clear if the scale fits the values that exist. However, that makes the plot more comparable between conditions/samples.

For pdx-MULTI, I believe that there is a noticeable change in the shape of the cxds plot. For the COVID19-lung sample, the shape of cxds is arguably similar, but I the x-axis changes (within the labels from R, the largest tick mark plotted is 200,000 before DropletQC filtering and 80,000 after DropletQC filtering). I am not sure if that is a possible disadvantage to scaling the cxds scores to be between 0 and 1 when calculating the sum for the "hybrid" scores.

Within scds, I think that I can see how the shape might look similar for bcds, which I think might be intentional if the method helps define bimodal peaks near 0 and 1? However, if you were to define doublets as cells with close to 50% human reads, then I think there is some rough depletion. I think you used the MULTI-Seq cell hashing assignments instead of the species assignment for pdx-MULTI. If that library has 9 samples, then the true doublet rate should be much higher than that estimated using the species genome alignment rates. However, even if the fraction of the cells is small, I could create a table of cells with percent human estimates between 40% and 60% and I could see how many are filtered by EmptyDrops (versus the fraction filtered near the highest or lowest 20%, where the total counts are substantially higher). If I do this, then I will add a comment. I think that could be a useful measure, particularly for bcds and scDblFinder. However, even if there is some extra depletion, I don't think the doublet scores were especially high for all such cells defined based upon the human+mouse alignment rate to the human chromosomes (especially if you then look at how the filtered cells fall when doublet scores are plotted against the total UMI).

At least for these samples, I don't think any empty cells had high UMI values. In theory, maybe there could be damaged cells with high UMI counts, but DropletQC requires cell type assignments for that part (and the ~50% decrease in cells for the COVID19-lung sample comes from DropletQC's estimates of empty cells alone).

On the flip side, you would be correct: the additional empty cells predicted by open-source 3rd party programs tend to have low total UMI counts. If the doublets usually had high total UMI counts, then there shouldn't be too much overlap. However, you can see a number of cells with low UMI counts that were predicted to be doublets.

For DoubletFinder, the cells with the larger total UMI tended to have pANN > 0.25 (roughly halfway up the y-axis when plotting against either the percent human pdx-MULTI reads or the total UMI per cell for the pdx-MULTI sample). However, the extra positive correlation after quality filtering was largely within the main cluster of cells. So, at least for that sample, it seems like you were identifying a very large number of cells with "normal" total UMI counts if you wanted to capture those outliers with high total UMI counts. Also, for scds and scDblFinder, there was not even that much of a difference for the outlier cells in terms of the high total UMI counts. I think that is consistent with the predictive power being low for doublet finding methods in most samples, which you make very clear in your paper.

I think you have 2 samples with species mixes with extra complications in your paper (one of them being pdx-MULTI, which is the reason why I wanted to look more at that sample). I think you listed them in different categories (other than "species mix") because they had an experimental way to define doublets. However, if I consider hm-6k and hm-12k, then I think the AUROC might look kind of OK in Figure 1B but the AUPRC looked lower for lsize. I thought that lsize should have been the total UMI, but I am not sure if I misunderstood anything. Or, I also made a certain assumption that the "quality" metrics for hm-6k and hm-12k should be somewhat close to PBMC-5k, and I am not sure if some other assumption that I made might not be correct. Due to a variety of reasons, I probably won't be looking into this myself for hm-6k and hm-12k.

Also, in general, I am taking extra time to look into this paper for a Journal Club Presentation.

Would you be OK if I Tweeted about this discussion? I don't know how many people will click the link and then respond. However, I would like to do as much as I can to prepare before that presentation.

Finally, please let me know if you still have any questions and/or if you would like for me to provide additional information (if possible). Unless you would prefer to have additional discussion and/or leave the ticket open for others visitors to your GitHub page to see and respond to, I will close this "issue". While I hope there is additional feedback, I don't want to give visitors the impression that the issue was a specific problem that was not resolved (and that was not my expectation to begin with).

Thank you for your comments!

No problem and please go ahead to share the discussion. I think you raised a lot of important questions, and I don't have perfect answers to them. I think only the total UMI (lsize) or the number of expressed genes (ngene) cannot represent doublets. The mainstream methods try to merge two cells to mimic doublets and predict. But based on the benchmark result, it seems they also cannot perfectly represent the true doublets. I feel like there may be a more complicated mechanism in the formation of doublets, which is different from simply merging two random cells and adding or averaging their UMI. I think scDblFinder did a good try to introduce a more complex mechanism, and they do exhibit an advantage over other older methods. I am not sure if they got the idea from scratch or were motivated by experimentalists. But I believe the closer the artificial doublet to the real one, the better the method should be. And the key is to find the doublet formation mechanism, which is difficult to solve only by computation people, but the experimentalists. Also, the doublets label in the benchmark data set may not reflect the formation in the real world, because they are limited to certain techniques and species. So any result would be an approximation to an arbitrary real dataset. But at this stage, we have to treat them as the golden standard.

Thank you very much for your additional feedback!

I guess I will close the "issue," so that it is clear that you provided prompt assistance. I will share for additional potential feedback, but I will wait a little. So, if you think of anything else, then you still have a chance to see the content before others.

Also, I was not sure if you wanted to have the full content from my 1st post copied in your earlier response. If not, you still have a chance to edit that.

In the meantime, I ran the analysis that I think you were asking about (but I used a different measure than AUPRC, given the small number of relevant cells):

bcds (for pdx-MULTI)

High Doublet Probability (Doublet Probability > 0.8)

	Less than 20% Human (Expect Mouse)	Mixed Species	Greater than 80% Human (Expect Human)
Kept Cells	567	6	4
EmptyDrops Filtered Cells	5	1	0

Low Doublet Probability (Doublet Probability < 0.2)

	Less than 20% Human (Expect Mouse)	Mixed Species	Greater than 80% Human (Expect Human)
Kept Cells	4,665	1	395
EmptyDrops Filtered Cells	2,118	7	58

Statistical Test for EmptyDrops Filtering (Human Reads between 40% and 60%)

	Doublet Score > 0.8 (Likely Doublet)	Doublet Score < 0.2 (Likely Singlet)
Kept Cells	6	1
EmptyDrops Filtered Cells	1	7

The Fisher's Exact Test p-value is 0.010.

scDblFinder (for pdx-MULTI)

High Doublet Probability (Doublet Probability > 0.8)

	Less than 20% Human (Expect Mouse)	Mixed Species	Greater than 80% Human (Expect Human)
Kept Cells	823	7	3
EmptyDrops Filtered Cells	24	1	0

Low Doublet Probability (Doublet Probability < 0.2)

	Less than 20% Human (Expect Mouse)	Mixed Species	Greater than 80% Human (Expect Human)
Kept Cells	5,017	2	434
EmptyDrops Filtered Cells	2,082	7	58

Statistical Test for EmptyDrops Filtering (Human Reads between 40% and 60%)

	Doublet Score > 0.8 (Likely Doublet)	Doublet Score < 0.2 (Likely Singlet)
Kept Cells	7	2
EmptyDrops Filtered Cells	1	7

The Fisher's Exact Test p-value is 0.015.

Summary

"Mixed Species" cells were those with between 40% and 60% human reads.

There must be additional doublets. However, if the percent human reads is close to 50%, then I expect the doublet probability should be high.

So, it seems to me like quality filtering helps increase the fraction of cells with ~50% human reads that also have high doublet probabilities. However, the number of such cells to consider for the pdx-MULTI sample is small.

I am not sure how much it will help, but I have posted a link on Twitter:

https://twitter.com/cwarden45/status/1512484515708477444

Also, I have this earlier Biostars post, related to the expected doublet/multiplet rate as a function of total cells:

https://www.biostars.org/p/9467937/

I am not sure how much the true rate can vary from that expectation, but I thought it was potentially relevant for the general discussion.

Thank you again!

Thank you for sharing the information!

During a discussion with 10x, I provided a link to this discussion, but I think there was a "roundedness" to a plot that I didn't have images prepared and might not have explained as well.

So, to follow-up with 10x, here are the Cell Ranger plots for those 3 samples:

PBMC-5k:

CellRanger-PBMC_plot

pdx-MULTI:

CellRanger-pdxMULTI_plot

COVID19-lung:

CellRanger-COVID19_lung_plot

I am trying to say there appeared to be the least difficulty estimating cell counts for the PBMC sample and I think it looks like the drop is relatively sharp for that sample. In contrast, I think the most rounded / gradual decrease is for the COVID19 lung sample (which showed the greatest variation in possible cell counts).

I believe this also matches my expectation that the median UMI / cell can serve as a QC measure (if it is low), even if that also varies as a function of total reads. However, I think that is mostly a separate point.

xnnba1984 / Doublet-Detection-Benchmark