wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
169 stars 46 forks source link

ambiguous results when matching souporcell clusters to genotyped donors #145

Open ccrobertson opened 2 years ago

ccrobertson commented 2 years ago

Thanks for a nice tool and your active support for users. Your answers to previous issues have been super helpful! Would welcome your insight with this one...

I have 10x snRNA-seq data from a pool of 5 donors and known genotypes for 4 out of 5 of these donors (genome-wide genotyping array imputed to the TOPMed imputation panel with the Michigan Imputation Server, so covering the vast majority of common variants, aligned to grch38).

I ran souporcell using the --common_variants option:

$singularity exec workflow/envs/souporcell_latest.sif souporcell_pipeline.py   -i  sample1_possorted_bam.bam -b sample1_pass_qc_barcodes.tsv -f hg38.fa -t 10 --cluster 5  --common_variants common_variants_grch38_fixed.vcf  --skip_remap True -o results_sample1_common_variants

To map the souporcell clusters back to the donors, I merged the resulting cluster_genotypes.vcf with my known donor genotypes (donor_genotypes.vcf) using bcftools and then ran relationship inference on the merged genotypes:

#merge souporcell and donor vcfs
bcftools merge -O z -o souporcell_clusters_and_donors.vcf.gz cluster_genotypes_reformatted.vcf.gz donor_genotypes.vcf.gz
tabix -p vcf souporcell_clusters_and_donors.vcf.gz

#convert merged vcf to plink
plink --vcf souporcell_clusters_and_donors.vcf.gz --double-id --make-bed --out souporcell_clusters_and_donors

#filter merged vcf for missingness
plink --bfile souporcell_clusters_and_donors --geno 0.05 --make-bed --out souporcell_clusters_and_donors_filtered

#run king relationship inference
plink2 --bfile souporcell_clusters_and_donors_filtered --make-king-table --out souporcell_clusters_and_donors_filtered

The merged and filtered genotype file (souporcell_clusters_and_donors_filtered.bed/bim/fam) contains ~300k common variants. This seemed sufficient to identify duplicate subject, so I expected to see 5 unique souporcell_cluster-to-donor matches with kinship>0.35 (indicating MZ twin/duplicate). Or I thought perhaps there would be some noise in souporcell cluster genotypes due to low coverage at some sites, so that maybe the kinship wouldn't quite clear the 0.35 bar but would be substantially higher for one donor than the remaining 4. However, the result is not so clear. I have run this across 8 different experiments (each with a set of 5 donors), and so far have seen anywhere from 0/5 to 3/5 souporcell clusters clearly mapping to a single donor. In some cases, multiple souporcell clusters will both weakly map (e.g., kinship 0.15-0.25) to the same donor.

Troubleshooting I've tried:

  1. instead of calculating relationships using king kinship coefficients, I tried using a simpler pairwise correlation between alt allele counts (Pearson r)
  2. removing reads from genes that are highly expressed in empty droplets from the same experiment (hoping to remove some of the ambient noise in the data that could be throwing of genotype calls for souporcell clusters)
  3. running with bam files from linked atac seq bam files (the GEX bam files are actually from a 10X multiome experiment, so we have atac seq reads from the same nuclei), thinking maybe broader coverage of variants across the genome would help
  4. running with the --known_genotypes options. In this case, since I only have genotypes on 4 out of 5 donors, I set k=4 (understanding this is suboptimal)

So far, the results have always been very similar. In all cases, only a subset of souporcell clusters show high correlation/kinship with a single donor, so I am still unable to confidently match souporcell clusters to donors. So I'm wondering:

(1) is there something about this approach to matching clusters to donors that is flawed? I've considered that maybe bcftools merge is not merging variants properly -- but then I don't think I'd expect to see MZ/duplicate matches for any of the souporcell clusters. (2) is there a better metric for matching souporcell clusters with donors? I see in your shared_samples.py, you use a "loss" metric based on alt allele counts per locus. maybe something like this could be adapted to my case. (3) is there something about my data that is making it difficult for souporcell to accurately cluster nuclei and/or call cluster genotypes accurately?

Thanks for any thoughts/suggestions!

alexlenail commented 2 years ago

I'd like to note I'd also love to see a recommended way to relate known genotypes to SoupOrCell output clusters post-hoc (some option to do --known_genotypes after having already run SoupOrCell without --known_genotypes).

BradBalderson commented 11 months ago

@ccrobertson maybe some of the samples actually do not match the genotypes? i.e. could the sample have been accidentally performed on the wrong individual maybe? I am seeing this issue now, for a larger dataset of 56 individuals. Some genotypes do not match, appears to be a sample mislabelling issue for me.