achamess closed this issue 4 years ago
I now have a script called shared_samples.py so if you do singularity exec souporcell.sif shared_samples.py -h the usage is
usage: shared_samples.py [-h] -1 EXPERIMENT1 -2 EXPERIMENT2 -n SHARED
determine which clusters are the shared clusters between two souporcell runs
with shared samples
optional arguments:
-h, --help show this help message and exit
-1 EXPERIMENT1, --experiment1 EXPERIMENT1
souporcell directory for experiment 1
-2 EXPERIMENT2, --experiment2 EXPERIMENT2
souporcell directory for experiment 2
-n SHARED, --shared SHARED
how many samples should these experiments share?
And the output should be a mapping of clusters in experiment 1 to clusters in experiment 2.
Wow. You’re on it! Way to think of all the needs of users of souporcell. Thanks so much. I’ll try it out.
I have tested this for the demuxlet paper datasets and my synthetic mixtures.
Btw. I used the 1000 genomes variants and 50 restarts and the clustering looked very good. From nuclei.
I tried to run this, but ran into an error. I think it's looking for a vcf file that doesn't exist.
Here are the outputs in my directories, using the current version of souporcell:
alt.mtx clusters.tsv fastqs.done souporcell_minimap_tagged_sorted.bam
ambient_rna.txt common_variants_covered_tmp.vcf minimap.err souporcell_minimap_tagged_sorted.bam.bai
cluster_genotypes.vcf common_variants_covered.vcf ref.mtx troublet.done
clustering.done consensus.done remapping.done variants.done
clusters.err depth.bed retag.err vartrix.done
clusters_tmp.tsv depth_merged.bed retagging.done
But the script is looking for souporcell_merged_sorted_vcf.vcf.gz
Right, I guess the vcf names are different when run with --common_variants vs run without common variants. I will need to update this. But right now I am on vacation and don't even have a computer on which I can build the singularity container. I think you should be able to get around this by using the cluster_genotypes.vcf files instead. So if you renamed both of the cluster_genotypes.vcf files to souporcell_merged_sorted_vcf.vcf and then gzip them, maaaaybe it will work?
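The renaming workaround described above could be scripted like this. This is only an illustrative sketch of the manual step (copy `cluster_genotypes.vcf` to the name `shared_samples.py` expects, gzipped); the function name and directory layout are assumptions, not part of the pipeline:

```python
import gzip
import shutil
from pathlib import Path

def stage_vcf(run_dir):
    """Copy cluster_genotypes.vcf to the file name shared_samples.py
    expects, gzipped. Workaround sketch only, not part of the pipeline."""
    src = Path(run_dir) / "cluster_genotypes.vcf"
    dst = Path(run_dir) / "souporcell_merged_sorted_vcf.vcf.gz"
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

You would run this once per experiment directory before invoking `shared_samples.py`.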
Thanks. No worries. Yeah that was my thought. I’ll rename and see what happens. Thanks for your quick replies. Enjoy your vacation.
Just leaving this here for the future. No rush. I changed file names. Ran the script. Got this:
Traceback (most recent call last):
File "/opt/souporcell/shared_samples.py", line 63, in <module>
(chr2, pos2, ref2, alt2) = vcf2.readline().strip().split()[0:4]
ValueError: not enough values to unpack (expected 4, got 0)
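The unpack error in the traceback above is what happens when `readline()` hands back an empty string (end of file) or a header line with too few fields. A defensive reader might look like the following sketch (this is not the actual `shared_samples.py` code; it just illustrates skipping meta/header/blank lines and the standard VCF column order CHROM, POS, ID, REF, ALT):

```python
def next_record(vcf):
    """Return (chrom, pos, ref, alt) for the next data line of an open
    VCF handle, or None at end of file. Skips '##' meta lines, the
    '#CHROM' header line, and blank lines."""
    for line in vcf:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # VCF columns: CHROM POS ID REF ALT ...
        return (fields[0], fields[1], fields[3], fields[4])
    return None
```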
Don't worry about bothering me, I am working today anyway. Sounds like that file is either empty or contains only a header. Check that (the one in experiment2).
OK. It's not empty.
##fileformat=VCFv4.3
##fileDate=05122018_15h52m43s
##source=IGSRpipeline
##reference=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=VT,Number=.,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##contig=<ID=chr4>
##contig=<ID=chr5>
##contig=<ID=chr6>
##contig=<ID=chr7>
##contig=<ID=chr8>
##contig=<ID=chr9>
##contig=<ID=chr10>
##contig=<ID=chr11>
##contig=<ID=chr12>
##contig=<ID=chr13>
##contig=<ID=chr14>
##contig=<ID=chr15>
##contig=<ID=chr16>
##contig=<ID=chr17>
##contig=<ID=chr18>
##contig=<ID=chr19>
##contig=<ID=chr20>
##contig=<ID=chr21>
##contig=<ID=chr22>
##contig=<ID=chrX>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0 1 2 3
1 1389777 . G A . PASS AF=0.02;AC=86;NS=2548;AN=5096;EAS_AF=0.0;EUR_AF=0.05;AFR_AF=0.0;AMR_AF=0.02;SAS_AF=0.01;VT=SNP;DP=20597 GT:AO:RO:T:E:GO:GN ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1 1/1:2:0:-1:-4:-9,-1,-2:-8,0,-1 1/1:1:0:-1:-4:-5,-1,-1:-4,0,-1 ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1
1 1389827 . C T . PASS AF=0.32;AC=1624;NS=2548;AN=5096;EAS_AF=0.02;EUR_AF=0.81;AFR_AF=0.08;AMR_AF=0.45;SAS_AF=0.34;VT=SNP;DP=20733 GT:AO:RO:T:E:GO:GN ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1 1/1:2:0:-1:-4:-9,-1,-2:-8,0,-1 1/1:1:0:-1:-4:-5,-1,-1:-4,0,-1 ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1
1 1721922 . C T . PASS AF=0.75;AC=3815;NS=2548;AN=5096;EAS_AF=0.78;EUR_AF=0.92;AFR_AF=0.5;AMR_AF=0.88;SAS_AF=0.78;VT=SNP;DP=23876 GT:AO:RO:T:E:GO:GN ./.:0:0:-2:-5:-1,-1,-1:-1,-1,-1 1/1:1:0:-2:-5:-5,-1,-1:-4,0,-1 1/1:5:0:-2:-5:-21,-1,-4:-20,0,-3 1/1:4:0:-2:-5:-17,-1,-3:-16,0,-2
1 1726053 . G C . PASS AF=0.78;AC=3962;NS=2548;AN=5096;EAS_AF=0.73;EUR_AF=0.91;AFR_AF=0.67;AMR_AF=0.89;SAS_AF=0.76;VT=SNP;DP=12939 GT:AO:RO:T:E:GO:GN ./.:0:0:-1:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:-1:-3:-1,-1,-1:-1,-1,-1 1/1:3:0:-1:-3:-13,-1,-3:-12,0,-2 ./.:0:0:-1:-3:-1,-1,-1:-1,-1,-1
1 1884183 . G A . PASS AF=0.2;AC=1014;NS=2548;AN=5096;EAS_AF=0.16;EUR_AF=0.23;AFR_AF=0.26;AMR_AF=0.19;SAS_AF=0.12;VT=SNP;DP=17234 GT:AO:RO:T:E:GO:GN ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 1/1:2:0:0:-3:-9,-1,-2:-8,0,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1 2411755 . T C . PASS AF=0.88;AC=4490;NS=2548;AN=5096;EAS_AF=0.89;EUR_AF=0.94;AFR_AF=0.87;AMR_AF=0.88;SAS_AF=0.83;VT=SNP;DP=20078 GT:AO:RO:T:E:GO:GN 1/1:1:0:0:-3:-5,-1,-1:-4,0,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1 3843565 . C T . PASS AF=0.79;AC=4025;NS=2548;AN=5096;EAS_AF=0.48;EUR_AF=0.94;AFR_AF=0.92;AMR_AF=0.87;SAS_AF=0.72;VT=SNP;DP=15702 GT:AO:RO:T:E:GO:GN 1/1:1:0:0:-3:-5,-1,-1:-4,0,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1 3843954 . A G . PASS AF=0.8;AC=4096;NS=2548;AN=5096;EAS_AF=0.48;EUR_AF=0.94;AFR_AF=0.97;AMR_AF=0.88;SAS_AF=0.72;VT=SNP;DP=16972 GT:AO:RO:T:E:GO:GN ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 1/1:1:0:0:-3:-5,-1,-1:-4,0,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1 3867056 . T C . PASS AF=0.53;AC=2700;NS=2548;AN=5096;EAS_AF=0.43;EUR_AF=0.83;AFR_AF=0.29;AMR_AF=0.63;SAS_AF=0.58;VT=SNP;DP=19142 GT:AO:RO:T:E:GO:GN 1/1:1:0:-2:-5:-5,-1,-1:-4,0,-1 1/1:7:0:-2:-5:-29,-1,-5:-28,0,-4 ./.:0:0:-2:-5:-1,-1,-1:-1,-1,-1 1/1:1:0:-2:-5:-5,-1,-1:-4,0,-1
Looking at the code again I see the problem. I pushed a fix but can't make a singularity container on this computer (no admin). If you clone the repo or just download that file, you can run it directly. All it requires is python3 with gzip and argparse modules.
I tried using the script. It took more than 24 hours, so I stopped it, thinking something was wrong. Should it take that long? Are you able to make a new singularity build with the updated code?
Must be an infinite loop bug. I'll take a look.
I think I have fixed this, but I don't have test data which has this bug, so here is hoping. The singularity build is now updated. Thanks for your patience and great feedback.
Thanks Haynes. I'll try it out and let you know how it goes.
I ran it. Took like 5 mins. But here is the output. Not sure what to make of that:
locus1 matchset 80242
cell clusters 5492
cell clusters 5154
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster experiment2_cluster loss
1 1 10.67231087908009
1 0 12.806791507134088
2 1 14.042522528560424
3 1 15.769906644087973
Yeah, this is clearly wrong. Sorry. I think I have an off-by-one error and as a result am not using the informative loci. Those loss values are way too low. I tested this on data with a lot of UMIs and while it worked, the loss values were way lower than I expected.
OK. Thanks for checking. Mine has a lot of UMIs too, for what it's worth. Like 10K UMIs/cell.
OK, I did find two bugs. But actually those loss values might not be too low. The main bug was that I was looking at matched loci, but not the right matched loci (the ones with lots of data). The other was that I was requiring each experiment to have a certain number of observations, but did not require every cluster to have a certain number of observations. Now I am requiring every cluster to have 8 observations at a locus for the locus to be counted. This might fall over when an experiment has a very small cluster, though.
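The per-cluster threshold described above (keep a locus only if every cluster has enough observations) can be sketched as follows. The function name and the `{locus: {cluster: count}}` input shape are hypothetical, chosen only to illustrate the filtering rule:

```python
def informative_loci(obs_counts, min_obs=8):
    """Return loci where *every* cluster has at least min_obs observations.
    obs_counts: {locus: {cluster_id: n_observations}} (hypothetical shape).
    Sketch of the filtering rule, not the script's actual code."""
    clusters = set()
    for per_cluster in obs_counts.values():
        clusters.update(per_cluster)
    return [
        locus for locus, per_cluster in obs_counts.items()
        if all(per_cluster.get(c, 0) >= min_obs for c in clusters)
    ]
```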
So explaining the output a bit. Here I had it output 2*shared_samples to show how the loss function for the first (true) shared samples should be muuuuch lower than other cluster pairings.
experiment1_cluster experiment2_cluster loss
2 3 13.555280719863028
4 0 17.26053627757567
0 4 18.42185017489233
3 2 19.511867466770433
1 1 19.85788870668313
0 2 1549.1748800588916
3 4 1575.188470350116
2 2 1578.961712371887
1 2 1582.7255752644771
4 2 1584.9752724054401
So this is just testing two different synthetic mixtures of 5 HipSci cell lines. So cluster 2 in experiment 1 is cluster 3 in experiment 2, and so on for the first 5 lines with low loss values. Then the loss values for incorrect pairings are > 1000.
But now I also realize that because I require just the two clusters in question to have > 100 observations for a locus to count it, not all cluster pairs are being judged by the same set of loci. This will especially be bad for situations where the number of cells per cluster is very skewed. So I need to first come up with a global set of loci which have decent counts in all clusters before computing this loss.
Another thing on this is that for the final match set, I just take the lowest loss value pairs. This was partially as a sanity check because if these are true matches, then matching should be one to one, so if cluster 0 in experiment 1 matches cluster 4 in experiment 2, it should not also match cluster 2 in experiment 2. I let this happen because if it does, that is a sign something is either wrong with the code or the experimental design (they don't actually share as many donors as you thought). But I don't throw an error if this happens, perhaps I should.
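The matching described above (take the lowest-loss pairs, and treat any cluster matched twice as a red flag rather than an error) could be sketched as a greedy one-to-one assignment. This is an illustrative sketch, not the script's implementation; the `(cluster1, cluster2, loss)` tuple input is a hypothetical shape:

```python
def match_clusters(losses, n_shared):
    """Greedily pick the n_shared lowest-loss pairs, refusing to reuse a
    cluster from either experiment; warn (don't error) if a reuse was
    attempted. losses: list of (cluster1, cluster2, loss) tuples."""
    used1, used2, matches = set(), set(), []
    for c1, c2, loss in sorted(losses, key=lambda t: t[2]):
        if len(matches) == n_shared:
            break
        if c1 in used1 or c2 in used2:
            print(f"warning: pair ({c1},{c2}) skipped, "
                  "a cluster matched more than once")
            continue
        used1.add(c1)
        used2.add(c2)
        matches.append((c1, c2))
    return matches
```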
Those fixes are up and I'm making the singularity build now.
Singularity build updated. Let me know how it goes. Even with those bugs I find your results puzzling.
Thanks Haynes. Here is the newest output:
locus1 matchset 286171
cell clusters 5492
cell clusters 5154
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster experiment2_cluster loss
0 3 3198.3017169046584
3 3 3325.3283535633473
2 3 3430.191326119036
0 0 3487.4457781562696
0 1 3495.7191406556963
0 2 3503.580329775917
1 3 3507.320727018593
3 1 3587.6705098776874
How should I interpret that?
Something is clearly wrong. There isn't a step change between the first 4 and the rest. Are you sure these experiments contain the same individuals? I have done this with replicates, and while the numbers aren't as stark as the example I gave you, the loss values for true pairs were at least 1/10th those of invalid pairs. Maybe something is wrong with its use with the --common_variants option? That doesn't make sense, but I haven't tested it on that. I will do that, but if I get the expected result, the only thing going forward would be to send me your output folders for testing. Sorry about this and thanks again for the patience.
Thanks for taking the time to look at this. I am sure they're same individuals. These are technical replicates of each other, literally same day, same 10x chip, same pooled input sample.
I'm happy to share the outputs with you if you're willing.
One other possible source of error is that I did what you described above, since the souporcell pipeline didn't make the right file outputs:
So if you renamed both of the cluster_genotypes.vcf files to souporcell_merged_sorted_vcf.vcf and then gzip them, maaaaybe it will work?
I took cluster_genotypes.vcf, made a copy, and gzipped it as souporcell_merged_sorted_vcf.vcf.gz. Any reason why that would be different from the output the pipeline would make?
I'll try this with --common_variants and get back to you. If I can't reproduce, I will get the relevant files from you.
This worked fine for me. So what I need to test yours is, for each experiment, the common_variants_covered.vcf, alt.mtx, and ref.mtx files.
edit: it also requires the clusters.tsv files.
If zipped/gzipped I think these will be small enough to email (whheaton@gmail.com) or if not, you could drop them in a google drive and share with me.
Thanks for sending me the files.
So the common_variants_covered.vcf and the souporcell_merged_sorted_vcf.vcf.gz are not at all the same. For experiment 1:
zcat souporcell_merged_sorted_vcf.vcf.gz | wc -l
704360
wc -l common_variants_covered.vcf
1874832 common_variants_covered.vcf
and that looks similar for experiment 2.
So I have since updated shared_samples.py to work with the --common_variants option just by detecting which file exists. So I just deleted the souporcell_merged_sorted_vcf.vcf.gz files and it worked perfectly:
experiment1_cluster experiment2_cluster loss
2 2 223.44688037585996
0 1 255.8817310295661
1 0 327.36936809969245
3 3 581.9606487497515
3 2 4981.107508541682
2 3 5093.439512586718
3 1 5232.686144786503
1 1 5305.419008070588
The values are generally higher in this case than mine because all clusters have enough observations in many loci. So this is quite good data. V3 or next gem chemistry I assume. The increased library complexity has really made my life easier. So the first 4 lines are the mapping from exp1 cluster to exp2 cluster.
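The "detect which file exists" behavior described a few messages up could look like the following sketch. The function name is hypothetical and the preference order between the two files is an assumption, not taken from the actual script:

```python
from pathlib import Path

def pick_vcf(run_dir):
    """Pick whichever variant file the souporcell run produced, mirroring
    the 'detect which file exists' idea (sketch; the real script's
    preference order is not shown here)."""
    for name in ("common_variants_covered.vcf",
                 "souporcell_merged_sorted_vcf.vcf.gz"):
        candidate = Path(run_dir) / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no variant file found in {run_dir}")
```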
Hi @wheaton5 Thanks for the shared_samples.py script. It's really helpful.
I tried to use it on my data, two wells each with 4 different samples with 2 shared samples, but I ran into the same issue here. I read the discussion but couldn't figure it out. Really appreciate if you can take a look. Here is my output:
experiment1_cluster experiment2_cluster loss
0 1 1.6833647513942072
1 0 8.76562974638987
1 3 8.847310621688328
1 2 10.870776921689766
as you can see loss is really low
Hello Haynes,
Thank you for the amazing tool!
Is there any way to run the shared_samples script between 3 experiments?
Hi Sayyam,
No, not currently. Seems like a clustering of clusterings... an interesting problem to think about. But currently I'd say run shared_samples each way pairwise and hopefully they are consistent. If not, there is probably something wrong with the clustering or data, so this sounds like a nice gut check. You could also combine data across experiments and run souporcell on that (you would need to assign an experiment identifier in the cell barcode, e.g. add a -1/2/3 to the end of the CB tag for each experiment). Or you could essentially make each cluster in each experiment its own cell, reconstruct what the alt.mtx and ref.mtx would look like, and then run souporcell (not souporcell_pipeline) on that. But that is some fairly complex scripting that I doubt you want to do.
If this is blocking for you, I am looking for new collaborations and we could discuss further.
Best, Haynes
Hello Haynes,
Thank you so much for the suggestions. I'll run a pairwise approach!
Just a note: for situations where we have several libraries with the same mixtures of genotypes, we doctor the CB tags in the BAM file to disambiguate the cell barcodes (simply prefixing them with the library name), then merge the BAM files and also the similarly disambiguated barcodes.tsv files, then run souporcell on that merged BAM file. The advantage is that you get a better genotype estimate and you use the same ones for both libraries.
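The prefixing idea above can be sketched in a few lines. Both function names are hypothetical; the first handles a barcodes.tsv, the second shows the CB-tag edit on a single SAM-format text line (in practice you would apply it across the whole BAM with pysam or samtools):

```python
def prefix_barcode_lines(lines, library):
    """Prefix each cell barcode from a barcodes.tsv with a library name
    (e.g. 'lib1_AAACCTG-1') so barcodes stay unique after merging runs."""
    return [f"{library}_{line.strip()}" for line in lines if line.strip()]

def prefix_cb_tag(sam_line, library):
    """Apply the same prefix to the CB:Z: tag of one SAM text line.
    Illustrative only; a real run would rewrite every alignment record."""
    fields = sam_line.rstrip("\n").split("\t")
    fields = [f"CB:Z:{library}_{f[5:]}" if f.startswith("CB:Z:") else f
              for f in fields]
    return "\t".join(fields)
```

The same prefix must be used for the BAM tags and the barcodes.tsv, or the merged run's clusters.tsv will not map back to your cells.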
@plijnzaad I'm in exactly this situation. Would you be willing to share your scripts for this? I've never amended BAM files or the barcodes before.
@wheaton5 Hi again. I'm trying some 10x multiome data on 5 libraries, all having the same 3 individuals. I ran souporcell on all of them, then tried a pairwise shared assignment. This is what I got:
experiment1_cluster experiment2_cluster loss
0 2 1260.8168001928484
2 1 1350.7020133927413
2 0 1936.0541779152948
1 1 2767.580876688788
0 1 4151.516522205379
2 2 8764.241892732287
That doesn't look right to me, unless I'm reading it wrong: cluster 2 from exp 1 is assigned twice.
Update: I ran another pairwise set using the same experiment 1 but a different experiment 2, and now I think I get a clear and accurate answer:
experiment1_cluster experiment2_cluster loss
0 1 1368.5622520419965
2 0 1520.4600025153557
1 2 1547.186411870913
2 2 5989.91234763886
2 1 8805.519739288886
1 0 9656.343401871285
Weird no?
@wheaton5: I have a question about the output. It seems 4 individuals (--shared 4) are involved. Which rows should be considered, the rows with "high" or with "low" loss?
Does one want a high or a low loss?
experiment1_cluster experiment2_cluster loss
2 2 223.44688037585996
0 1 255.8817310295661
1 0 327.36936809969245
3 3 581.9606487497515
3 2 4981.107508541682
2 3 5093.439512586718
3 1 5232.686144786503
1 1 5305.419008070588
I also had two experiments with 4 individuals, but I used "--shared 2" because I also got 8 lines instead of 4, so I was a little confused.
With --shared=2 (but data contains 4 individuals) I got 4 rows.
These are my data:
locus1 matchset 83638
cell clusters 4118
cell clusters 4583
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster experiment2_cluster loss
1 1 147.73398915292847
2 2 161.96103231157463
0 3 163.45816811325176
3 0 196.79821230361432
With --shared=4 I got this...
locus1 matchset 83638
cell clusters 4118
cell clusters 4583
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster experiment2_cluster loss
1 1 147.73398915292847
2 2 161.96103231157463
0 3 163.45816811325176
3 0 196.79821230361432
0 2 2638.1945883597723
2 3 2640.9090802577593
3 1 3387.936289148157
1 0 3394.9481022939603
Low loss is the ideal. I wanted to show the distinction between low and high loss to show it was working as intended, which is why I give extra lines. Your --shared 4 output is ideal: the first 4 lines have relatively low loss followed by much higher loss values. These represent the correct cluster matches.
Hi Dr. Heaton, sorry to bother you again about the usage of shared_samples.py. I intended to apply it to the cluster assignments of an experiment that failed to complete the souporcell pipeline when including --known_genotypes and --known_genotypes_sample_names.
I have Experiment1 that has 3 individuals of known genotypes and I could complete the souporcell pipeline with known genotype information, which gives me the following output:
alt.mtx fastqs.done
ambient_rna.txt minimap.err
cluster_genotypes.vcf input_k_known_genotypes.vcf.gz
clustering.done ref.mtx
clusters.err remapping.done
clusters_tmp.tsv retag.err
clusters.tsv retagging.done
common_variants_covered_tmp.vcf souporcell_minimap_tagged_sorted.bam
common_variants_covered.vcf souporcell_minimap_tagged_sorted.bam.bai
consensus.done troublet.done
depth_merged.bed variants.done
vartrix.done
Experiment2 has 3 individuals of known genotypes (2 of them shared with Experiment1), and the pipeline could be completed without including --known_genotypes and --known_genotypes_sample_names. Also, I didn't enable --common_variants. The output is as follows:
alt.mtx minimap.err
ambient_rna.txt ref.mtx
bcftools.err remapping.done
cluster_genotypes.vcf retag.err
clustering.done retagging.done
clusters.err souporcell_merged_sorted_vcf.vcf.gz
clusters_tmp.tsv souporcell_merged_sorted_vcf.vcf.gz.tbi
clusters.tsv souporcell_minimap_tagged_sorted.bam
consensus.done souporcell_minimap_tagged_sorted.bam.bai
doublets.err troublet.done
fastqs.done variants.done
vartrix.done
shared_samples.py requires either souporcell_merged_sorted_vcf.vcf.gz or common_variants_covered.vcf to be present in both experiments' folders, which might not line up with the names of the output VCFs in these 2 folders. I read this post, and initially I only renamed Experiment1's cluster_genotypes.vcf and bgzipped it into souporcell_merged_sorted_vcf.vcf.gz, and the results don't seem right:
locus1 matchset 16182
cell clusters 11087
cell clusters 13475
clusters for experiment1 3
clusters for experiment2 3
experiment1_cluster experiment2_cluster loss
2 2 93.87543847925971
2 1 96.71386482576936
0 2 97.66517526423038
1 2 100.40247252105199
To correct this, I noticed you mentioned that this script was updated to work with --common_variants, which I did not include when running Experiment2.
So I have since updated shared_samples.py to work with the --common_variants option just by detecting which file exists. So I just deleted the souporcell_merged_sorted_vcf.vcf.gz files and it worked perfectly.
Just want to double check: re-running the pipeline for Experiment2 with this option (perhaps after deleting the .done files from Step 2, calling candidate variants, onward) is needed, right?
If this is the case, could the VCF file that I previously used for --known_genotypes (that attempt failed and I do not know why) be used for --common_variants instead? I see other people mentioned this for running souporcell (https://github.com/wheaton5/souporcell/issues/126#issuecomment-1296383744).
Thank you so much if you have a minute to look into this!
I have multiple libraries that are technical replicates of each other, each containing a pool of the same 4 individuals. When I do downstream analysis, I combine them, but I'm assuming the assignments to (0,1,2,3, where k=4) are random, and won't be congruent between samples. Is there a good way to make them the same, so that library 1's cluster 0 = library 2's cluster 0?