wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Matching Freebayes variants to cell barcodes #73

Open sylvia-science opened 4 years ago

sylvia-science commented 4 years ago

I'm trying to get the variants for each cell in my data, but I see that barcodes are not in the vcf file produced. Is there a way to match the variants with the barcodes?

wheaton5 commented 4 years ago

The alt.mtx and ref.mtx files produced by vartrix give the allele counts for each cell at each locus. The format is:

% comment
% comment
total_loci total_cells total_entries
locus_index cell_index value

where locus_index is the 1-indexed record number in the freebayes vcf and cell_index is the 1-indexed line number in barcodes.tsv.

The value depends on the file. For example, if cell 50 has 1 alt allele and 0 ref alleles at locus 400, ref.mtx will have the line 400 50 0 and alt.mtx will have the line 400 50 1. This is a sparse format, so if cell 50 observed neither a ref nor an alt allele at locus 400, neither ref.mtx nor alt.mtx will contain a line for locus 400 and cell 50.
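A minimal Python sketch of reading this layout, assuming the files are plain-text and follow the format described above. The function name read_mtx and the toy data are hypothetical, made up for illustration; they are not part of souporcell or vartrix.

```python
import io


def read_mtx(handle):
    """Parse the sparse layout described above.

    Returns (shape, counts): shape is (total_loci, total_cells, total_entries)
    and counts maps (locus_index, cell_index) -> value, both 1-indexed.
    """
    shape = None
    counts = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # comment lines start with '%'
        fields = line.split()
        if shape is None:
            # The first non-comment line holds the three totals.
            shape = tuple(int(x) for x in fields[:3])
            continue
        locus, cell = int(fields[0]), int(fields[1])
        counts[(locus, cell)] = int(float(fields[2]))
    return shape, counts


# Toy files mirroring the example: cell 50 has 1 alt and 0 ref at locus 400.
alt_mtx = io.StringIO("% comment\n% comment\n1000 100 1\n400 50 1\n")
ref_mtx = io.StringIO("% comment\n% comment\n1000 100 1\n400 50 0\n")

alt_shape, alt_counts = read_mtx(alt_mtx)
ref_shape, ref_counts = read_mtx(ref_mtx)

# A missing (locus, cell) key means nothing was observed there.
print(alt_counts.get((400, 50), 0), ref_counts.get((400, 50), 0))
print(alt_counts.get((401, 50), 0), ref_counts.get((401, 50), 0))
```

For real data the same function could be given open("alt.mtx") and open("ref.mtx") instead of the toy strings. If SciPy is installed, scipy.io.mmread is another option for Matrix Market files, provided the file carries the standard header line.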

I hope this helps. I have gotten this question a lot. If you think of a good, intuitive way for me to collect and display this data, I welcome it. I could make a vcf with an INFO field containing something like bc1_value,bc2_value;bc3_value etc., where bc1 is a barcode from barcodes.tsv and value is how many ref alleles it has (entries after the semicolon would carry the alt counts instead). Would that be more intuitive? It seems to me people will have to write their own scripts to extract that info for their own purposes anyway.

sylvia-science commented 4 years ago

Thank you for your reply!

I was looking at the mtx files to solve this, but I was having trouble finding what they actually represent so your response is very helpful and clear. Maybe putting your description of these files on the readme would be enough to guide people.

Balthasar-eu commented 4 years ago

Hi, I asked a similar question before and I also think putting this in the readme would be helpful. On the example dataset you wrote: "the important files are [...]" maybe add a section afterwards and mention the additional files that are not that often used.

I also have a follow-up: when I tried to use the mtx files to look up the loci, I found that the mtx files have a different number of loci than there are lines in the vcf file. The max in column 1 is 35000, but the vcf only has 34000 lines plus the header.
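One way to check whether a given vcf is the one the matrices were built from is to compare its record count with the number of loci declared in the mtx file. A rough sketch follows; the helper names and toy inputs are hypothetical, and real files may need gzip handling that is not shown here.

```python
import io


def count_vcf_records(handle):
    """Count data lines in a VCF, skipping '#' header lines and blanks."""
    return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def mtx_dimensions(handle):
    """Return the totals from the first non-comment line of an mtx file."""
    for line in handle:
        if line.strip() and not line.startswith("%"):
            total_loci, total_cells, total_entries = (int(x) for x in line.split()[:3])
            return total_loci, total_cells, total_entries
    raise ValueError("no dimension line found")


# Toy inputs: a VCF with 2 records and a matrix declaring 3 loci.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\n"
)
toy_mtx = io.StringIO("% comment\n3 10 1\n3 5 2\n")

n_records = count_vcf_records(toy_vcf)
total_loci, _, _ = mtx_dimensions(toy_mtx)
if n_records != total_loci:
    print(f"mismatch: {n_records} VCF records vs {total_loci} loci in the matrix")
```

A mismatch like this suggests the matrix indices refer to a different vcf than the one being counted.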

sylvia-science commented 4 years ago

I'd like to get some clarification as well. I noticed that the souporcell_merged_sorted_vcf.vcf file has the same number of rows as the ref.mtx and alt.mtx files, so does that mean that's the vcf file that should be used to match variants to locations?

I also looked at the cluster_genotypes.vcf files, but they have a smaller number of rows than the mtx files.

wheaton5 commented 4 years ago

Yes, I'm sorry if I misspoke. souporcell_merged_sorted_vcf.vcf is the vcf that gets sent to vartrix so that is the vcf which the matrix files refer to.
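Putting the pieces of this thread together, a lookup from matrix entries to barcodes and variants might look like the sketch below. It assumes the locus index counts non-header records of souporcell_merged_sorted_vcf.vcf and the cell index counts lines of the barcodes.tsv that was given to souporcell, both 1-indexed. The function names and toy data are hypothetical.

```python
import io


def load_vcf_records(handle):
    """Return [(chrom, pos, ref, alt), ...] for every non-header VCF line."""
    records = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        records.append((chrom, int(pos), ref, alt))
    return records


def load_barcodes(handle):
    """Return barcodes in file order, one per line."""
    return [line.strip() for line in handle if line.strip()]


def annotate(mtx_handle, records, barcodes):
    """Yield (barcode, chrom, pos, ref, alt, value) for each matrix entry."""
    seen_totals = False
    for line in mtx_handle:
        if not line.strip() or line.startswith("%"):
            continue
        if not seen_totals:
            seen_totals = True  # skip the totals line
            continue
        locus, cell, value = line.split()[:3]
        chrom, pos, ref, alt = records[int(locus) - 1]  # 1-indexed -> 0-indexed
        yield barcodes[int(cell) - 1], chrom, pos, ref, alt, int(float(value))


# Toy stand-ins for souporcell_merged_sorted_vcf.vcf, barcodes.tsv and alt.mtx.
toy_vcf = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\n"
)
toy_barcodes = io.StringIO("AAAC-1\nAAAG-1\nAAAT-1\n")
toy_alt = io.StringIO("% comment\n% comment\n2 3 2\n1 2 3\n2 3 1\n")

rows = list(annotate(toy_alt, load_vcf_records(toy_vcf), load_barcodes(toy_barcodes)))
for row in rows:
    print(*row, sep="\t")
```

Running the same lookup over ref.mtx and joining on (barcode, chrom, pos) would give ref and alt counts side by side for each cell.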

bpyenson commented 2 years ago

Hello,

Thank you so much for this posting and the excellent software. I am learning so much! Two questions below related to this topic.

  1. I am struggling to figure out how to extract the variant information from the appropriate vcf file that shows post-ambientRNA-filtered variants and input this to Seurat for downstream analyses. Any advice would be appreciated.

  2. I am also wondering whether my reasoning below correctly explains how ambient_rna.txt in the outputs is calculated, and whether it clarifies the different vcf files:

The metadata of souporcell_merged_sorted_vcf.vcf (604,752 variants, the same number of rows as alt.mtx and ref.mtx) says that it is filtered (FILTER=<ID=PASS,Description="All filters passed">). Does that mean ambient RNA has already been filtered out?

Is clusters_genotypes.vcf (565,568 variants) then the result of that filtering? (604,752 - 565,568) / 604,752 ≈ 6.48%, which is very close to (but not exactly the same as) the value in ambient_rna.txt (6.55039761636829%). What is missing from my calculation of ambient_rna.txt?
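For reference, the arithmetic in this question can be reproduced with a few lines of Python. This only checks the numbers quoted in the post; nothing in this thread establishes that ambient_rna.txt is derived from variant counts, so the comparison itself is an assumption.

```python
merged_variants = 604_752    # rows in souporcell_merged_sorted_vcf.vcf, per the post
cluster_variants = 565_568   # rows in the cluster genotypes vcf, per the post
ambient_rna_percent = 6.55039761636829  # value quoted from ambient_rna.txt

# Parentheses matter: subtract first, then divide by the merged total.
dropped = merged_variants - cluster_variants
dropped_percent = 100 * dropped / merged_variants

print(f"dropped variants: {dropped}")
print(f"dropped percent:  {dropped_percent:.2f}%")
print(f"gap to ambient_rna.txt: {ambient_rna_percent - dropped_percent:.2f} points")
```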