Open ccrobertson opened 2 years ago
I'd like to note I'd also love to see a recommended way to relate known genotypes to SoupOrCell output clusters post-hoc (some option to do --known_genotypes
after having already run SoupOrCell without --known_genotypes
).
@ccrobertson maybe some of the samples actually do not match the genotypes? i.e. could the sample have been accidentally performed on the wrong individual maybe? I am seeing this issue now, for a larger dataset of 56 individuals. Some genotypes do not match, appears to be a sample mislabelling issue for me.
Thanks for a nice tool and your active support for users. Your answers to previous issues have been super helpful! Would welcome your insight with this one...
I have 10x snRNA-seq data from a pool of 5 donors and known genotypes for 4 out of 5 of these donors (genome-wide genotyping array imputed to the TOPMed imputation panel with the Michigan Imputation Server, so covering the vast majority of common variants, aligned to grch38).
I ran souporcell using the
--common_variants
option:$singularity exec workflow/envs/souporcell_latest.sif souporcell_pipeline.py -i sample1_possorted_bam.bam -b sample1_pass_qc_barcodes.tsv -f hg38.fa -t 10 --cluster 5 --common_variants common_variants_grch38_fixed.vcf --skip_remap True -o results_sample1_common_variants
To map the souporcell clusters back to the donors, I merged the resulting
cluster_genotypes.vcf
with my known donor genotypes (donor_genotypes.vcf
) using bcftools and then ran relationship inference on the merged genotypes:The merged and filtered genotype file (souporcell_clusters_and_donors_filtered.bed/bim/fam) contains ~300k common variants. This seemed sufficient to identify duplicate subject, so I expected to see 5 unique souporcell_cluster-to-donor matches with kinship>0.35 (indicating MZ twin/duplicate). Or I thought perhaps there would be some noise in souporcell cluster genotypes due to low coverage at some sites, so that maybe the kinship wouldn't quite clear the 0.35 bar but would be substantially higher for one donor than the remaining 4. However, the result is not so clear. I have run this across 8 different experiments (each with a set of 5 donors), and so far have seen anywhere from 0/5 to 3/5 souporcell clusters clearly mapping to a single donor. In some cases, multiple souporcell clusters will both weakly map (e.g., kinship 0.15-0.25) to the same donor.
Troubleshooting I've tried:
--known_genotypes
options. In this case, since I only have genotypes on 4 out of 5 donors, I set k=4 (understanding this is suboptimal)So far, the results have always been very similar. In all cases, only a subset of souporcell clusters show high correlation/kinship with a single donor, so I am still unable to confidently match souporcell clusters to donors. So I'm wondering:
(1) is there something about this approach to matching clusters to donors that is flawed? I've considered that maybe bcftools merge is not merging variants properly -- but then I don't think I'd expect to see MZ/duplicate matches for any of the souporcell clusters. (2) is there a better metric for matching souporcell clusters with donors? I see in your shared_samples.py, you use a "loss" metric based on alt allele counts per locus. maybe something like this could be adapted to my case. (3) is there something about my data that is making it difficult for souporcell to accurately cluster nuclei and/or call cluster genotypes accurately?
Thanks for any thoughts/suggestions!