wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
157 stars 45 forks source link

Merging samples error #168

Open Peter-bbh opened 1 year ago

Peter-bbh commented 1 year ago

Hi,

Thanks for a very useful program.

Unfortunately, I have an issue with incorrect assignments after merging samples:

I'd like to use souporcell to find rare donor cells in tissue from bone marrow transplanted patients. Therefore, I have set up a pilot with two tissue samples and a PBMC sample from a female patient. From the Y chromosome expression pattern it is clear that the tissue samples contains a few percent donor cells and the PBMC are almost entirely donor cells. This is a problem for souporcell when running on the individual samples. However, if they are analysed together it should be an easy task for souporcell to differentiate the two individuals.

Here comes my problem: If I concatenate the barcodes for the three samples and merge the bam files soupercell are only able to assign correctly the cells in the first sample. And even more worrying, a small subset of barcodes are by random chance identical for two samples (as expected) and these should then be identified as doublets or at least have the same assignment. But they obtain different assignments and scores?

I have tried to sort and index the merged bam file without any change.

One tested code is as follows:

samtools merge FP03all.bam \ ../CellRanger/count_cellranger_FP03Ex/outs/possorted_genome_bam.bam \ ../CellRanger/count_cellranger_FP03Res/outs/possorted_genome_bam.bam \ ../CellRanger/count_cellranger_FP03PBMC/outs/possorted_genome_bam.bam

samtools sort FP03all.bam -o FP03all.sorted.bam samtools index FP03all.sorted.bam FP03all.sorted.bam.bai

cat ../CellBender/Chimerism_FP03Ex_CellBender_cell_barcodes.csv > FP03_sorted_CB_barcodes.csv cat ../CellBender/Chimerism_FP03Res_CellBender_cell_barcodes.csv >> FP03_sorted_CB_barcodes.csv cat ../CellBender/Chimerism_FP03PBMC_CellBender_cell_barcodes.csv >> FP03_sorted_CB_barcodes.csv

singularity exec --bind /home/projects/cu_10179/data/BBH153_snRNAseq_Chimerism \ /home/projects/cu_10179/data/souporcell/souporcell_latest.sif souporcell_pipeline.py \ -i FP03all.sorted.bam \ -b FP03_barcodes.csv -f genome.fa -t 40 -o FP03_sorted_CB -k 2

The result comes out with all the barcodes still in succession, but only the barcodes from the first sample are correctly assigned. And for the random duplicated barcodes the results are different. Example: barcode status assignment log_prob_singleton log_prob_doublet cluster0 cluster1 GAACGTTAGATGAATC-1 doublet 1/0 -5593.687989 -4541.57077 -5643.142575 -5593.687989 (barcode from sample 1) GAACGTTAGATGAATC-1 singlet 1 -1478.094517 -1669.25268 -2698.976511 -1478.094517 (barcode from sample 2)

It should be noted that there are a total of almost 40,000 cells (in fact nuclei) and the bam file contains 3e9 reads.

Am I missing something or is this a bug?

Best regards, Peter

wheaton5 commented 1 year ago

When merging bam files you should rename cell barcodes with -1, -2, -3 to indicate they are different. But this issue seems beyond this simple detail. Still, I would need a data set to test in order to debug this. I dont know the easiest way to create a minimal dataset for testing. But Im willing to debug if you are willing to share data. Let me know.

Peter-bbh commented 1 year ago

Great.

Yes, I'm willing to share, but I also have not figured out how to make a minimal dataset that can still be used for soupercell. Perhaps filter the bam files on a subset of cell barcodes and only one chromosome? The current dataset takes 19 hours to analyse.

wheaton5 commented 1 year ago

Well, less data also makes clustering harder. I can give you access to my server to transfer data but it sounds like this will take some time both in transfer and debugging. Then another issue is that i dont have funding for this work at this time. If you would consider coauthor on your work in exchange I would appreciate it. If not, I will still attempt to debug, but you understand Im a new assistant faculty and have many obligations. Such is the risk of making a tool that people use lol.

Best, Haynes

wheaton5 commented 1 year ago

But with k=2 and so many umi/cell (inferring from those log likelihood values) maybe one large chrom would be enough. Chrom 1 at least cuts the data in less than 1/10th.

Peter-bbh commented 1 year ago

Hi Hayes,

Thanks for your quick replies :)

I managed to generate a reproducible error using the Demuxlet testing data set.

By inserting a duplicate of the second barcode, AAACATACCGCTAA-1, at position 1076 (between ATTTGCACTTCTGT-1 and CAAACTCTAAGCAA-1) in GSM2560245_barcodes.tsv, the barcodes after ATTTGCACTTCTGT-1 did not fit the data. It seems like all the other data columns are identical to the result without the duplicate, so the duplicate is simply ignored. However, in the barcode column the duplicated barcode is retained and the rest of the data and barcodes are therefore not in sync.

So one solution is to replace the barcode column with barcodes were the duplicates are removed (order retained) or alternatively, simply remove the duplicated barcodes before running souporcell.

Best wishes, Peter