Assigning sample IDs based on a genotyping vcf

nbartonicek commented 3 years ago

Good day, We are about to start using souporcell as the main tool to demux our single cell data, for which we also have genotyping information. It turns out that using --commonvariants provides better results than only genotyping SNPs per se, but then there is the issue of annotating the souporcell clusters with sample IDs. Since we can't use "known_genotypes_sample_names" with --commonvariants, do you have any suggestion of tools and/or procedures what would be the best way to annotate the clusters? Thanks for any help! :)

Nenad Bartonicek, PhD Senior Bioinformatic Officer Centre for Clinical Genomics Garvan Institute of Medical Research 384 Victoria Street Sydney NSW 2010 Australia

wheaton5 commented 3 years ago

Hello.

Let me ask for some clarification. Are you saying that using --common_variants is giving you better results than --known_genotypes or just better than not using either? If it is better than using --known_genotypes, perhaps we should look into that further for a potential bug or just try to understand that. If you just want help assigning sample ids to clusters with fully de novo mode (using neither), without more data of course that is impossible. But what I do have is that if you have multiple experiments and the same or a subset of the same individuals are in those experiments, I have shared_samples.py script to match up which clusters are the same individual in those experiments.

Let me know about my clarification. Otherwise hopefully this helps. And thanks for your interest in souporcell.

Best, Haynes

wheaton5 commented 3 years ago

Sorry, I just reread your question and you say you do have genotype info. That is odd. What is the data source for your genotype information? What makes you say that it performs better with common_variants than known_genotypes? I would like to know your full command. We should rule out sample mixup (it happens at some rate everywhere). In any case, we can use my same script for multiple experiments I think, shared_samples.py, to match clusters to genotypes. It might need minor alteration, we will see. But it may also be helpful in ruling out sample mixup if the loss function to one of the cluster-genotype matches is far worse than the other cluster-genotype loss function values.

nbartonicek commented 3 years ago

Good day,

Thanks so much for your prompt response and all the clarifications!

We have genotype information from UK Biobank Axiom™ Array (ThermoFisher) on two samples that we have independent scRNA data on, and which we have artificially merged to test demultiplexing. From the total of 100.000 on the genotyping array, about 3.000 SNPs are used by souporcell to form "cluster_genotypes.vcf". When I impute from the original batch on Michigan imputation server, souporcell uses ~60.000 SNPs. On the other hand, using --common variants gets up to 170.000 SNPs. We get much better results with imputed SNPs and common variants - see the image below. The command is: singularity exec souporcell.sif souporcell_pipeline.py -b barcodes.tsv --known_genotypes sample.b38.vcf -i possorted_genome_bam.bam -f refdata-cellranger-GRCh38-3.0.0/fasta/genome.fa -t 8 -k 2 -o genotyping

As for the shared_samples.py, sound like that's exactly what we need. Going to fetch it from your github repository and try it out.

Thanks again for your help!

Nenad

wheaton5 commented 3 years ago

Interesting. Yes something is clearly wrong. Maybe reference mismatch, maybe sample mixup. I haven't dealt with snp arrays (I'm old but not that old lol, jk i know it is an actively used tech). I'm busy right now but will try to give this some attention tomorrow afternoon.

nbartonicek commented 3 years ago

Hmm, it can't be sample mixup, cause we do get them right when we call singletons in 99% of cases (see figure below). Not sure it's a reference mismatch, cause there's an overlap of the genotype vcf with your common SNPs vcfs, and it disappears when I shift it by 1 in either direction.

May I just ask, why do you think something is wrong? Could it be that it all boils down to the size of a vcf for calling variants?

wheaton5 / souporcell

Assigning sample IDs based on a genotyping vcf #77