Decomposing count matrix per donor

ktpolanski commented 1 year ago

Hello,

Recently a bunch of variably niche uses have appeared where going more granular than barcode level is desirable - pulling apart the donor composition of the read space/count matrix would be of relevance. This is admittedly somewhat wishful thinking, but here's the general outline of what I've come up with as souporcell postprocessing.

1. Load souporcell's cluster_genotypes.vcf and the input BAM (I tend to run with skipping the remap)
2. Subset the VCF to "interesting" SNPs, defined as ones where the two donors differ in zygosity, allowing for at least partial distinguishing of the genotype of origin
3. Pull out reads covering this coordinate and ensure they're useful:
   - Have CB, UB, and gene call (in starsolo GX/GN) tags
   - Ensure the mapping quality at the base is no less than 30, matching what you set for vartrix when you run it
4. If the above are met, take note of the SNP call and store info from the read as appropriate
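
As a rough illustration of the SNP subsetting and read checks above, here is a minimal pure-Python sketch. It makes several assumptions that are not confirmed in this thread: the VCF has one sample column per donor with a GT entry in FORMAT, only biallelic SNPs are kept, and missing CB/UB/GX/GN values show up as "-". The function names `informative_snps` and `read_is_usable` are made up for the example. In practice the read tags and mapping quality would come from a BAM reader rather than a plain dict.

```python
def parse_gt(sample_field, fmt_keys):
    """Return the genotype as a sorted tuple of allele indices, or None if uncalled."""
    values = dict(zip(fmt_keys, sample_field.split(":")))
    alleles = values.get("GT", ".").replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    return tuple(sorted(int(a) for a in alleles))


def informative_snps(vcf_lines, donor_a=0, donor_b=1):
    """Yield (chrom, pos, ref, alt, gt_a, gt_b) for SNPs where the two donors differ.

    donor_a / donor_b are sample column indices (assumption: one column per donor).
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = cols[0], int(cols[1]), cols[3], cols[4]
        # Keep biallelic single-base substitutions only.
        if not ({ref, alt} <= set("ACGT")):
            continue
        fmt_keys = cols[8].split(":")
        samples = cols[9:]
        gt_a = parse_gt(samples[donor_a], fmt_keys)
        gt_b = parse_gt(samples[donor_b], fmt_keys)
        # Uncalled or identical genotypes cannot separate the donors.
        if gt_a is None or gt_b is None or gt_a == gt_b:
            continue
        yield chrom, pos, ref, alt, gt_a, gt_b


def read_is_usable(tags, mapq, min_mapq=30):
    """Check a read has barcode, UMI and gene tags, and passes the quality cutoff.

    tags is a plain dict of tag name to value; "-" as a missing marker is an assumption.
    """
    def present(key):
        return tags.get(key) not in (None, "", "-")

    has_gene = present("GX") or present("GN")
    return present("CB") and present("UB") and has_gene and mapq >= min_mapq
```

Note that a SNP where one donor is 0/0 and the other is 0/1 passes the filter, but only reads carrying the alternate allele are informative there, which is the "partial distinguishing" case.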

If trying to stay as close to what souporcell does as possible, I could mirror the pipeline's vartrix call with --umi added, on the "interesting" SNPs only. This would require some postprocessing to translate the SNP info to genes, and could technically pick up reads not formally quantified via the mapping pipeline.

Do you have any thoughts on the matter? I'm not expecting this to yield anything particularly amazing.

wheaton5 commented 1 year ago

I don't really know what you are trying to accomplish, and this makes it difficult to comment. What is the primary application? All I see is that you are gathering read info on interesting variants, but I'm not sure how you are using this information.

Note that if you aren't using known_genotypes or common_variants, running in skip_remap mode is not advisable. The variant calls on the STAR-aligned BAM are very bad. If the samples are human, using the provided 1kgenomes common variants file should be pretty good though (better than simply running with remap and without common variants).

ktpolanski commented 1 year ago

I am using --common_variants, yes. Sorry, forgot to specify that.

The idea is to try to post-process a case of Visium from a source with a transplant, and as such feasibly with dual genotypes mixed in some proportion in some spots. As such, ideally what I'd accomplish is adding a donor axis to the standard spot by gene count matrix, which I can do based on the various flags and SNP calls in the reads identified as outlined above. However, this will shrink the gene space in practice as it will be limited to the transcripts that actually have a SNP that can be used to differentiate the genotypes. Putting this sort of hobbled count matrix together with the output of something like cell2location that decomposes the spot's expression (based on single cell populations) could maybe guide what cell populations originate from what genotype in the spots?
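
To make the "donor axis" idea concrete, here is a small sketch of how per-read SNP calls could be collapsed into spot by gene by donor UMI counts. It is only an illustration under assumptions: the input tuples, the genotype dictionary layout and the function names are hypothetical, and reads are assumed to have already passed the tag and quality checks. An allele is attributed to a donor only when that donor is the sole carrier of it, so a reference read at a 0/0 versus 0/1 site is treated as uninformative.

```python
from collections import defaultdict


def donors_with_allele(allele, genotypes):
    """Return the donors whose genotype contains the given allele index."""
    return [donor for donor, gt in genotypes.items() if allele in gt]


def donor_resolved_counts(observations, snp_genotypes):
    """Count UMIs per (spot, gene, donor).

    observations: iterable of (spot, umi, gene, snp_id, allele_index)
    snp_genotypes: {snp_id: {donor: (allele, allele)}}
    UMIs supporting more than one donor are counted under the label "conflict";
    UMIs with only uninformative alleles are dropped.
    """
    votes = defaultdict(set)
    for spot, umi, gene, snp_id, allele in observations:
        carriers = donors_with_allele(allele, snp_genotypes[snp_id])
        if len(carriers) == 1:
            # Several reads from the same UMI collapse into one vote.
            votes[(spot, umi, gene)].add(carriers[0])

    counts = defaultdict(int)
    for (spot, _umi, gene), donors in votes.items():
        label = next(iter(donors)) if len(donors) == 1 else "conflict"
        counts[(spot, gene, label)] += 1
    return dict(counts)
```

The gene space shrinks exactly as described: a gene only appears in the output if at least one of its UMIs covers a SNP with an allele private to one donor.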

wheaton5 commented 1 year ago

Okay, so why is it not enough that the read had the same barcode as a barcode identified as being from one individual?

ktpolanski commented 1 year ago

Because let's assume a spot with multiple genotypes present. Souporcell justly flags it as a doublet. But there are cell populations in there; which come from which source? My clunky strategy may help decipher that if there's some correlation between cell population expression signatures and genotyping outcomes.

This is not the easiest thing to figure out, which is why I've come with my attempt at a plan to see if you've got anything better. Hopefully I've communicated the underlying reason in a manner that's clear enough by now.

wheaton5 commented 1 year ago

Right. I haven't dealt with spatial data yet. I see the issue now. I need to do some reading and thinking on this before giving an opinion. I'll get back to you.

wheaton5 commented 1 year ago

Okay. I guess what you are wanting is just how likely a spot is to contain 2 genotypes. But I think this would be pretty well indicated by the posterior on whether it's called as a doublet. If you want to know what evidence goes into that, I guess digging into which variants contribute the most to whether it's a doublet or not would be of use? I can get that information from troublet, I think. Might that be a useful additional output? I'm open to an ongoing collaboration if that would be of interest to you.

ktpolanski commented 1 year ago

I guess this depends on what is the exact context of the doublet variants. If this somehow leads to "gene A is pointing toward donor 1, but gene B is pointing toward donor 2" then that sounds like it could be of use. That's what I think I'm accomplishing via the proposed workflow at the top.
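
For the "gene A points toward donor 1, gene B toward donor 2" reading, a per-spot summary along these lines might be a starting point. This is a sketch only: it assumes counts keyed by (spot, gene, donor), which is a hypothetical layout rather than anything souporcell or troublet emits, and the function name is invented for the example.

```python
from collections import defaultdict


def gene_donor_fractions(counts, spot):
    """For one spot, return {gene: {donor: fraction of informative UMIs}}.

    counts: {(spot, gene, donor): n_umis}
    """
    totals = defaultdict(lambda: defaultdict(int))
    for (s, gene, donor), n in counts.items():
        if s == spot:
            totals[gene][donor] += n

    fractions = {}
    for gene, per_donor in totals.items():
        gene_total = sum(per_donor.values())
        fractions[gene] = {d: n / gene_total for d, n in per_donor.items()}
    return fractions
```

With only a handful of informative UMIs per gene in a spot, these fractions will be noisy, so they are probably best read alongside the raw counts.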

It would be an honour to have you involved. That said, I'm aware that you're very unlikely to find someone else showing up here asking for thoughts on their transplant Visium workflow.

wheaton5 / souporcell

Decomposing count matrix per donor #157