Validation/optimisation of souporcell output in mouse

mattbcvs commented 2 years ago

Hello,

Curious to use this resource to try and genotype nuclei in a set of pooled samples of tissues from C57BL/6J mice. Number of samples in each pool ranges from 2-4. Nuclei are reasonably well sequenced with median depth of 1.8k UMI/nucleus in the lowest depth sample. Couldn't find the answers to a few issues I'm having in the other comments so thought I'd leave them here:

1 Currently I have ambient RNA at ~70% in 3 of the pooled samples which is way over that predicted via CellRanger. These samples have low % mapping to the transcriptome (40-50% of reads) so wondering whether lower quality has hampered variant calling and later clustering. Would a common variants file be helpful? Or maybe increasing stringency on the ones I have called here (increase max depth or something)?

2 The other sample with higher % transcriptome mapping (~70%) has ~12% ambient RNA which matches CellRanger output. This is good but I'm missing whether souporcell does any ambient RNA correction (e.g. like SoupX) or just estimates the % contamination?

3 Trying to think of ways to get some validation of the genotyping calls per cell... thinking it might be useful to examine the quality of variant calls but I'm missing the bit of the output showing which loci were used. Also, is there a part of the output that could be used to generate a PCA (as in the souporcell paper) to help diagnose poor clustering/seperation of genotypes?

4 Hard to compare the outcome of my demuxlet trial run to the snippet of the table provided in the README, including more details of expected outcome would provide some reassurance

Thanks for the resource and engagement on comments of others so far!

wheaton5 commented 2 years ago

Yes, probably poor clustering. Common variants would be useful.
not currently. I plan to add this now that im working on the project again.
loci used are not output. That is a reasonable thing to add as well. The pca in the paper uses the clusters_tmp.tsv and normalizes each row to sum to 1 and then does a pca of that.
Im not sure what u mean here. Could u provide more details?

mattbcvs commented 2 years ago

Thanks, a few replies:

1 I will use a common variants file from here (https://www.mousegenomes.org/). As a plan B I'll probably try doubling min_alt

2 Great!

3 Also great!

4 So in the "Practice/Testing data... " section the demuxlet dataset is suggested as a trial, I did this but then couldn't really tell if I'd got the expected result. e.g. would be useful to see % of cells in each genotype, ambient RNA %, number of loci used for clustering, a PCA etc...

Also thought of a point 5:

5 Another bit of validation, could be useful to have option to create an elbow plot to show the selected k makes sense. I'm running it multiple times to attempt to create one by seeing how/if the final value in clusters.err changes with different ks but maybe this could be easier.

wheaton5 commented 2 years ago

For 5, the souporcell.out or clusters.out file has a number on the final line. That is the total log likelihood. You can use that for elbow plots.

For 4, if you look at the clusters.tsv file, if on each line, the value of the minimum is substantially smaller than the other ones, it is a good clustering. And you can also do the pca like I mentioned for 3.

wheaton5 commented 2 years ago

Obviously you would need to run it multiple times to get those multiple total log likelihood values. One trick would be to run it once, remove the clustering.done file and save the clusters.err (sorry, my previous post was incorrect, this is the file with the total log likelihood on the final line) file for that k. Then you can rerun on the same directory and it will not detect the clustering.done file and thus rerun the clustering with the new k.

mattbcvs commented 2 years ago

Thanks, managed to get a PCA running and it's clear it needs better separation - using a set of common variants from the mouse genome project for the C57BL6NJ strain also hasn't improved clustering separation.

I'm thinking to use the variants found by freebayes but try being a bit more stringent on those being put through to the genotyping stage. Thinking to filter "souporcell_merged_sorted_vcf.vcf" on quality and depth and then supply the .vcf as input for the --common_variants approach.

Is there any filtering by depth or quality score done by souporcell? Assume this would happen at the freebayes step but I have a lot of low quality (median of 0!) low depth variants in my "cluster_genotypes.vcf" so thinking I will have to do this manually...

wheaton5 commented 2 years ago

There probably should be some quality score filtering. There isnt currently. But there is min_alt and min_ref (default 10 each, so default min coverage of 20 which seems reasonable). And in the vartrix stage, we only count alleles on reads with mapq of 30 or higher which is also reasonable. Still some quality score filtering could be good, but I don't think tons of low qual variants get through both of those.

mattbcvs commented 2 years ago

Small update on this, managed to identify the variants used for the clustering, these have much lower depth (DP) and quality (QUAL) than the demuxlet test dataset variants. I get no improvement in PCA separation and much worse % unassigned cells through setting filters on either (min QUAL of 10, and min DP of 50) so I think my initial pool of variants is too poor quality to be used for genotyping.

Probably as there are some low UMI nuclei involved (~20% n each dataset are below 1000 UMI) and that I'm trying to separate closely related mice.

I'm currently trying a run on only the very high UMI nuclei (>3000 UMI) to see if I can get better variants with these.

wheaton5 / souporcell

Validation/optimisation of souporcell output in mouse #154