wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Consistent assignment between samples that have the same individuals #23

Closed achamess closed 4 years ago

achamess commented 4 years ago

I have multiple libraries that are technical replicates of each other, each containing a pool of the same 4 individuals. When I do downstream analysis, I combine them, but I'm assuming the assignments to (0,1,2,3, where k=4) are random, and won't be congruent between samples. Is there a good way to make them the same, so that library 1's cluster 0 = library 2's cluster 0?

wheaton5 commented 4 years ago

I now have a script called shared_samples.py. If you run singularity exec souporcell.sif shared_samples.py -h, the usage is

usage: shared_samples.py [-h] -1 EXPERIMENT1 -2 EXPERIMENT2 -n SHARED

determine which clusters are the shared clusters between two souporcell runs
with shared samples

optional arguments:
  -h, --help            show this help message and exit
  -1 EXPERIMENT1, --experiment1 EXPERIMENT1
                        souporcell directory for experiment 1
  -2 EXPERIMENT2, --experiment2 EXPERIMENT2
                        souporcell directory for experiment 2
  -n SHARED, --shared SHARED
                        how many samples should these experiments share?

And the output should be a mapping of clusters in experiment 1 to clusters in experiment 2.
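
(For readers wanting to apply that mapping downstream, here is a minimal sketch of relabeling experiment 2's clusters.tsv to match experiment 1's labels. It assumes souporcell's tab-separated clusters.tsv with an "assignment" column and takes a hypothetical mapping dict; it is not part of shared_samples.py.)

```python
import csv

def relabel_clusters(clusters_tsv_in, clusters_tsv_out, mapping):
    """Rewrite the 'assignment' column of a souporcell clusters.tsv so that
    experiment 2's cluster labels match experiment 1's.

    mapping: experiment-2 label -> experiment-1 label, e.g. {"0": "2", "1": "1"}
    (hypothetical values taken from shared_samples.py output).
    Doublet assignments like "0/1" are remapped piecewise.
    """
    with open(clusters_tsv_in, newline="") as fin, \
         open(clusters_tsv_out, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        header = next(reader)
        writer.writerow(header)
        col = header.index("assignment")
        for row in reader:
            row[col] = "/".join(mapping.get(c, c) for c in row[col].split("/"))
            writer.writerow(row)
```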

achamess commented 4 years ago

Wow. You’re on it! Way to think of all the needs of users of souporcell. Thanks so much. I’ll try it out.

wheaton5 commented 4 years ago

I have tested this for the demuxlet paper datasets and my synthetic mixtures.

achamess commented 4 years ago

By the way, I used the 1000 Genomes variants and 50 restarts, and the clustering looked very good, even from nuclei.

achamess commented 4 years ago

I tried to run this, but ran into an error. I think it's looking for a vcf file that doesn't exist.

Here are the outputs in my directories, using the current version of souporcell:

alt.mtx                clusters.tsv                     fastqs.done     souporcell_minimap_tagged_sorted.bam
ambient_rna.txt        common_variants_covered_tmp.vcf  minimap.err     souporcell_minimap_tagged_sorted.bam.bai
cluster_genotypes.vcf  common_variants_covered.vcf      ref.mtx         troublet.done
clustering.done        consensus.done                   remapping.done  variants.done
clusters.err           depth.bed                        retag.err       vartrix.done
clusters_tmp.tsv       depth_merged.bed                 retagging.done

But the script is looking for souporcell_merged_sorted_vcf.vcf.gz

wheaton5 commented 4 years ago

Right, I guess the vcf names are different when run with --common_variants vs run without common variants. I will need to update this. But right now I am on vacation and don't even have a computer on which I can build the singularity container. I think you should be able to get around this by using the cluster_genotypes.vcf files instead. So if you renamed both of the cluster_genotypes.vcf files to souporcell_merged_sorted_vcf.vcf and then gzip them, maaaaybe it will work?
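
(The rename-and-gzip workaround above can be scripted; a minimal sketch, assuming only the file names mentioned in this thread. It copies rather than renames so the original cluster_genotypes.vcf stays in place.)

```python
import gzip
import shutil
from pathlib import Path

def fake_merged_vcf(experiment_dir):
    """Copy cluster_genotypes.vcf to the gzipped name that shared_samples.py
    expects (souporcell_merged_sorted_vcf.vcf.gz), leaving the original file
    untouched. Run once per souporcell output directory."""
    src = Path(experiment_dir) / "cluster_genotypes.vcf"
    dst = Path(experiment_dir) / "souporcell_merged_sorted_vcf.vcf.gz"
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```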

achamess commented 4 years ago

Thanks. No worries. Yeah that was my thought. I’ll rename and see what happens. Thanks for your quick replies. Enjoy your vacation.

achamess commented 4 years ago

Just leaving this here for the future. No rush. I changed file names. Ran the script. Got this:

Traceback (most recent call last):
  File "/opt/souporcell/shared_samples.py", line 63, in <module>
    (chr2, pos2, ref2, alt2) = vcf2.readline().strip().split()[0:4]
ValueError: not enough values to unpack (expected 4, got 0)
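
(The traceback falls out of how Python file reads behave: readline() returns "" at end of file, so strip().split() yields an empty list and unpacking four values raises ValueError. A guarded read loop, shown here as a hypothetical illustration rather than the script's actual code, avoids the crash by skipping headers and stopping at EOF.)

```python
import io

# readline() at end-of-file returns "", so there is nothing to unpack:
assert io.StringIO("").readline().strip().split()[0:4] == []

def read_records(handle):
    """Yield the first four whitespace-separated fields of each data line,
    skipping blank lines and '#' header lines (hypothetical helper)."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, ref, alt = line.strip().split()[0:4]
        yield chrom, pos, ref, alt

records = list(read_records(io.StringIO("##header\nchr1\t100\tG\tA\n")))
```
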

wheaton5 commented 4 years ago

Don't worry about bothering me, I am working today anyway. Sounds like that file is either empty or contains only the header. Check that (the one in experiment 2).

achamess commented 4 years ago

OK. It's not empty.

##fileformat=VCFv4.3
##fileDate=05122018_15h52m43s
##source=IGSRpipeline
##reference=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=VT,Number=.,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##contig=<ID=chr2>
##contig=<ID=chr3>
##contig=<ID=chr4>
##contig=<ID=chr5>
##contig=<ID=chr6>
##contig=<ID=chr7>
##contig=<ID=chr8>
##contig=<ID=chr9>
##contig=<ID=chr10>
##contig=<ID=chr11>
##contig=<ID=chr12>
##contig=<ID=chr13>
##contig=<ID=chr14>
##contig=<ID=chr15>
##contig=<ID=chr16>
##contig=<ID=chr17>
##contig=<ID=chr18>
##contig=<ID=chr19>
##contig=<ID=chr20>
##contig=<ID=chr21>
##contig=<ID=chr22>
##contig=<ID=chrX>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  0       1       2       3
1       1389777 .       G       A       .       PASS    AF=0.02;AC=86;NS=2548;AN=5096;EAS_AF=0.0;EUR_AF=0.05;AFR_AF=0.0;AMR_AF=0.02;SAS_AF=0.01;VT=SNP;DP=20597 GT:AO:RO:T:E:GO:GN      ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1 1/1:2:0:-1:-4:-9,-1,-2:-8,0,-1  1/1:1:0:-1:-4:-5,-1,-1:-4,0,-1  ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1
1       1389827 .       C       T       .       PASS    AF=0.32;AC=1624;NS=2548;AN=5096;EAS_AF=0.02;EUR_AF=0.81;AFR_AF=0.08;AMR_AF=0.45;SAS_AF=0.34;VT=SNP;DP=20733     GT:AO:RO:T:E:GO:GN      ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1 1/1:2:0:-1:-4:-9,-1,-2:-8,0,-1  1/1:1:0:-1:-4:-5,-1,-1:-4,0,-1  ./.:0:0:-1:-4:-1,-1,-1:-1,-1,-1
1       1721922 .       C       T       .       PASS    AF=0.75;AC=3815;NS=2548;AN=5096;EAS_AF=0.78;EUR_AF=0.92;AFR_AF=0.5;AMR_AF=0.88;SAS_AF=0.78;VT=SNP;DP=23876      GT:AO:RO:T:E:GO:GN      ./.:0:0:-2:-5:-1,-1,-1:-1,-1,-1 1/1:1:0:-2:-5:-5,-1,-1:-4,0,-1  1/1:5:0:-2:-5:-21,-1,-4:-20,0,-3        1/1:4:0:-2:-5:-17,-1,-3:-16,0,-2
1       1726053 .       G       C       .       PASS    AF=0.78;AC=3962;NS=2548;AN=5096;EAS_AF=0.73;EUR_AF=0.91;AFR_AF=0.67;AMR_AF=0.89;SAS_AF=0.76;VT=SNP;DP=12939     GT:AO:RO:T:E:GO:GN      ./.:0:0:-1:-3:-1,-1,-1:-1,-1,-1 ./.:0:0:-1:-3:-1,-1,-1:-1,-1,-1 1/1:3:0:-1:-3:-13,-1,-3:-12,0,-2        ./.:0:0:-1:-3:-1,-1,-1:-1,-1,-1
1       1884183 .       G       A       .       PASS    AF=0.2;AC=1014;NS=2548;AN=5096;EAS_AF=0.16;EUR_AF=0.23;AFR_AF=0.26;AMR_AF=0.19;SAS_AF=0.12;VT=SNP;DP=17234      GT:AO:RO:T:E:GO:GN      ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  1/1:2:0:0:-3:-9,-1,-2:-8,0,-1   ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1       2411755 .       T       C       .       PASS    AF=0.88;AC=4490;NS=2548;AN=5096;EAS_AF=0.89;EUR_AF=0.94;AFR_AF=0.87;AMR_AF=0.88;SAS_AF=0.83;VT=SNP;DP=20078     GT:AO:RO:T:E:GO:GN      1/1:1:0:0:-3:-5,-1,-1:-4,0,-1   ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1       3843565 .       C       T       .       PASS    AF=0.79;AC=4025;NS=2548;AN=5096;EAS_AF=0.48;EUR_AF=0.94;AFR_AF=0.92;AMR_AF=0.87;SAS_AF=0.72;VT=SNP;DP=15702     GT:AO:RO:T:E:GO:GN      1/1:1:0:0:-3:-5,-1,-1:-4,0,-1   ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1       3843954 .       A       G       .       PASS    AF=0.8;AC=4096;NS=2548;AN=5096;EAS_AF=0.48;EUR_AF=0.94;AFR_AF=0.97;AMR_AF=0.88;SAS_AF=0.72;VT=SNP;DP=16972      GT:AO:RO:T:E:GO:GN      ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  1/1:1:0:0:-3:-5,-1,-1:-4,0,-1   ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1  ./.:0:0:0:-3:-1,-1,-1:-1,-1,-1
1       3867056 .       T       C       .       PASS    AF=0.53;AC=2700;NS=2548;AN=5096;EAS_AF=0.43;EUR_AF=0.83;AFR_AF=0.29;AMR_AF=0.63;SAS_AF=0.58;VT=SNP;DP=19142     GT:AO:RO:T:E:GO:GN      1/1:1:0:-2:-5:-5,-1,-1:-4,0,-1  1/1:7:0:-2:-5:-29,-1,-5:-28,0,-4        ./.:0:0:-2:-5:-1,-1,-1:-1,-1,-1 1/1:1:0:-2:-5:-5,-1,-1:-4,0,-1
wheaton5 commented 4 years ago

Looking at the code again I see the problem. I pushed a fix but can't make a singularity container on this computer (no admin). If you clone the repo or just download that file, you can run it directly. All it requires is python3 with gzip and argparse modules.

achamess commented 4 years ago

I tried using the script. It took more than 24 hours, so I stopped it, thinking something was wrong. Should it take that long? Are you able to make a new singularity build with the updated code?

wheaton5 commented 4 years ago

Must be an infinite loop bug. I'll take a look.

wheaton5 commented 4 years ago

I think I have fixed this, but I don't have test data which has this bug, so here is hoping. The singularity build is now updated. Thanks for your patience and great feedback.

achamess commented 4 years ago

Thanks Haynes. I'll try it out and let you know how it goes.

achamess commented 4 years ago

I ran it. Took like 5 mins. But here is the output. Not sure what to make of that:

locus1 matchset 80242
cell clusters 5492
cell clusters 5154
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster     experiment2_cluster     loss
1       1       10.67231087908009
1       0       12.806791507134088
2       1       14.042522528560424
3       1       15.769906644087973
wheaton5 commented 4 years ago

Yeah, this is clearly wrong, sorry. I think I have an off-by-one error and as a result am not using the informative loci. Those loss values are way too low. I tested this on data with a lot of UMIs and while it worked, the loss values were way lower than I expected.

achamess commented 4 years ago

OK, thanks for checking. Mine has a lot of UMIs too, for what it's worth. Like 10K UMIs/cell.

wheaton5 commented 4 years ago

Ok, I did find two bugs. But actually those loss values might not be too low. The main bug was that I was looking at matched loci, but not the right matched loci (the ones with lots of data). The next issue was that I was requiring each experiment to have a certain number of observations, but did not require every cluster to have a certain number of observations. Now I require every cluster to have 8 observations at a locus for that locus to be counted. This might fall over when an experiment has a very small cluster, though.

So explaining the output a bit. Here I had it output 2*shared_samples to show how the loss function for the first (true) shared samples should be muuuuch lower than other cluster pairings.

experiment1_cluster experiment2_cluster loss
2   3   13.555280719863028
4   0   17.26053627757567
0   4   18.42185017489233
3   2   19.511867466770433
1   1   19.85788870668313
0   2   1549.1748800588916
3   4   1575.188470350116
2   2   1578.961712371887
1   2   1582.7255752644771
4   2   1584.9752724054401

So this is just testing two different synthetic mixtures of 5 HipSci cell lines. So cluster 2 in experiment 1 is cluster 3 in experiment 2, and so on for the first 5 lines with low loss values. Then the loss values for incorrect pairings are > 1000.

But now I also realize that because I require just the two clusters in question to have > 100 observations for a locus to count it, not all cluster pairs are being judged by the same set of loci. This will especially be bad for situations where the number of cells per cluster is very skewed. So I need to first come up with a global set of loci which have decent counts in all clusters before computing this loss.

Another thing on this is that for the final match set, I just take the lowest loss value pairs. This was partially as a sanity check because if these are true matches, then matching should be one to one, so if cluster 0 in experiment 1 matches cluster 4 in experiment 2, it should not also match cluster 2 in experiment 2. I let this happen because if it does, that is a sign something is either wrong with the code or the experimental design (they don't actually share as many donors as you thought). But I don't throw an error if this happens, perhaps I should.
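
(The "take the lowest loss value pairs" step with that one-to-one sanity check can be sketched as follows; this is a hypothetical reimplementation for illustration, not the code in shared_samples.py.)

```python
def match_clusters(pairs, n_shared):
    """Greedy one-to-one matching: walk (cluster1, cluster2, loss) rows in
    order of increasing loss and accept a pair only if neither cluster is
    already matched. Low-loss rows skipped because a cluster was already
    taken are returned as conflicts -- a sign of a code or
    experimental-design problem, as described above."""
    matched, conflicts, used2 = {}, [], set()
    for c1, c2, loss in sorted(pairs, key=lambda p: p[2]):
        if len(matched) == n_shared:
            break
        if c1 in matched or c2 in used2:
            conflicts.append((c1, c2, loss))
            continue
        matched[c1] = (c2, loss)
        used2.add(c2)
    return matched, conflicts
```

On the synthetic five-sample example above, the five low-loss rows would be accepted with no conflicts.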

Those fixes are up and I'm making the singularity build now.

wheaton5 commented 4 years ago

Singularity build updated. Let me know how it goes. Even with those bugs I find your results puzzling.

achamess commented 4 years ago

Thanks Haynes. Here is the newest output:

locus1 matchset 286171
cell clusters 5492
cell clusters 5154
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster experiment2_cluster loss
0   3   3198.3017169046584
3   3   3325.3283535633473
2   3   3430.191326119036
0   0   3487.4457781562696
0   1   3495.7191406556963
0   2   3503.580329775917
1   3   3507.320727018593
3   1   3587.6705098776874

How should I interpret that?

wheaton5 commented 4 years ago

Something is clearly wrong: there isn't a step change between the first 4 and the rest. Are you sure these experiments contain the same individuals? I have done this with replicates, and while the numbers aren't as stark as the example I gave you, the loss values for true pairs were at least 1/10th those of invalid pairs. Maybe something is wrong with its use with the --common_variants option? That doesn't make sense, but I haven't tested it that way. I will do that, but if I get the expected result, the only thing going forward would be to send me your output folders for testing. Sorry about this and thanks again for the patience.
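
(The "at least 1/10th" rule of thumb can be turned into a rough sanity check: sort the losses and look for a step change after the first n_shared values. A minimal sketch; the 10x threshold is an assumption carried over from the comment above, not a calibrated cutoff.)

```python
def has_step_change(losses, n_shared, ratio=10.0):
    """Return True if the best invalid pairing's loss is at least `ratio`
    times the worst accepted pairing's loss -- a rough indicator that the
    top n_shared matches are trustworthy (threshold is an assumption)."""
    losses = sorted(losses)
    if len(losses) <= n_shared:
        return True  # no invalid pairings reported to compare against
    return losses[n_shared] >= ratio * losses[n_shared - 1]
```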

achamess commented 4 years ago

Thanks for taking the time to look at this. I am sure they're same individuals. These are technical replicates of each other, literally same day, same 10x chip, same pooled input sample.

I'm happy to share the outputs with you if you're willing.

One other possible source of error is that I did what you described above, since the souporcell pipeline didn't make the right file outputs:

So if you renamed both of the cluster_genotypes.vcf files to souporcell_merged_sorted_vcf.vcf and then gzip them, maaaaybe it will work?

I took cluster_genotypes.vcf, made a copy, and gzipped it into souporcell_merged_sorted_vcf.vcf.gz. Any reason why that would be different from the output the pipeline would make?

wheaton5 commented 4 years ago

I'll try this with --common_variants and get back to you. If I can't reproduce, I will get the relevant files from you.

wheaton5 commented 4 years ago

This worked fine for me. So what I need to test yours is, for each experiment, the common_variants_covered.vcf, alt.mtx, and ref.mtx files.

wheaton5 commented 4 years ago

edit: it also requires the clusters.tsv files.

If zipped/gzipped I think these will be small enough to email (whheaton@gmail.com) or if not, you could drop them in a google drive and share with me.

wheaton5 commented 4 years ago

Thanks for sending me the files.

So the common_variants_covered.vcf and the souporcell_merged_sorted_vcf.vcf.gz are not at all the same:

exp1: zcat souporcell_merged_sorted_vcf.vcf.gz | wc -l
704360
exp1: wc -l common_variants_covered.vcf
1874832 common_variants_covered.vcf

and that looks similar for experiment 2.

So I have since updated shared_samples.py to work with the --common_variants option just by detecting which file exists. So I just deleted the souporcell_merged_sorted_vcf.vcf.gz files and it worked perfectly:

experiment1_cluster experiment2_cluster loss
2   2   223.44688037585996
0   1   255.8817310295661
1   0   327.36936809969245
3   3   581.9606487497515
3   2   4981.107508541682
2   3   5093.439512586718
3   1   5232.686144786503
1   1   5305.419008070588

The values are generally higher in this case than mine because all clusters have enough observations at many loci. So this is quite good data; v3 or Next GEM chemistry, I assume. The increased library complexity has really made my life easier. So the first 4 lines are the mapping from exp1 clusters to exp2 clusters.

mary77 commented 2 years ago

Hi @wheaton5, thanks for the shared_samples.py script. It's really helpful.

I tried to use it on my data, two wells each with 4 different samples, 2 of them shared, but I ran into the same issue described here. I read the discussion but couldn't figure it out. I'd really appreciate it if you could take a look. Here is my output:

experiment1_cluster experiment2_cluster loss
0   1   1.6833647513942072
1   0   8.76562974638987
1   3   8.847310621688328
1   2   10.870776921689766

As you can see, the loss is really low.

Sayyam-Shah commented 1 year ago

Hello Haynes,

Thank you for the amazing tool!

Is there any way to run the shared_samples script between 3 experiments?

wheaton5 commented 1 year ago

Hi Sayyam,

No, not currently. It seems like a clustering of clusterings... an interesting problem to think about. But for now I'd say run shared_samples each way pairwise and hopefully they are consistent. If not, there is probably something wrong with the clustering or the data, so this sounds like a nice gut check. You could also combine data across experiments and run souporcell like that (you would need to assign an experiment identifier in the cell barcode, e.g. add -1/2/3 to the end of the CB tag for each experiment). Or you could essentially make each cluster in each experiment its own cell, reconstruct what the alt.mtx and ref.mtx would look like, and then run souporcell (not souporcell_pipeline) on that. But that is some fairly complex scripting that I doubt you want to do.

If this is blocking for you, I am looking for new collaborations and we could discuss further.

Best, Haynes

Sayyam-Shah commented 1 year ago

Hello Haynes,

Thank you so much for the suggestions. I'll run a pairwise approach!

plijnzaad commented 1 year ago

Just a note: for situations where we have several libraries with the same mixtures of genotypes, we doctor the CB tags in the BAM files to disambiguate the cell barcodes (simply prefixing them with the library name), then merge the BAM files along with the similarly disambiguated barcodes.tsv files, and run souporcell on that merged BAM file. The advantage is that you get a better genotype estimate and you use the same genotypes for both libraries.
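
(The barcodes.tsv half of this trick is simple to script; a minimal sketch, with hypothetical file names. The matching CB-tag prefixing in the BAM files would typically be done with pysam or similar and is not shown here.)

```python
def merge_barcode_lists(libraries, out_path):
    """Merge per-library barcodes.tsv files, prefixing each barcode with its
    library name so identical barcodes from different libraries stay
    distinct in the merged run.

    libraries: list of (library_name, barcodes_tsv_path) tuples. The same
    prefix must be applied to the CB tags in the corresponding BAM before
    merging, or souporcell will not find the cells."""
    with open(out_path, "w") as out:
        for name, path in libraries:
            with open(path) as f:
                for line in f:
                    barcode = line.strip()
                    if barcode:
                        out.write(f"{name}_{barcode}\n")
```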

achamess commented 1 year ago

@plijnzaad I'm in exactly this situation. Would you be willing to share your scripts for this? I've never amended BAM files or the barcodes before.

@wheaton5 Hi again. I'm trying some 10x multiome data on 5 libraries, all having the same 3 individuals. I ran souporcell on all of them, then tried a pairwise shared assignment. This is what I got:

experiment1_cluster experiment2_cluster loss
0   2   1260.8168001928484
2   1   1350.7020133927413
2   0   1936.0541779152948
1   1   2767.580876688788
0   1   4151.516522205379
2   2   8764.241892732287

That doesn't look right to me, unless I'm reading it wrong: cluster 2 from experiment 1 is assigned twice.

Update: I ran another pairwise set, using the same experiment 1 but a different experiment 2, and now I think I get a clear and accurate answer:

experiment1_cluster experiment2_cluster loss
0   1   1368.5622520419965
2   0   1520.4600025153557
1   2   1547.186411870913
2   2   5989.91234763886
2   1   8805.519739288886
1   0   9656.343401871285

Weird no?

robingarcia commented 1 year ago

@wheaton5: I have a question about the output. It seems 4 individuals (--shared 4) are involved. Which rows should be considered: the rows with "high" or with "low" loss?

Does one want a high or a low loss?

experiment1_cluster experiment2_cluster loss
2   2   223.44688037585996
0   1   255.8817310295661
1   0   327.36936809969245
3   3   581.9606487497515
3   2   4981.107508541682
2   3   5093.439512586718
3   1   5232.686144786503
1   1   5305.419008070588

I also had two experiments with 4 individuals, but I used "--shared 2" because I also got 8 lines instead of 4, so I was a little confused.

With --shared=2 (but data contains 4 individuals) I got 4 rows.

These are my data:

locus1 matchset 83638
cell clusters 4118
cell clusters 4583
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster experiment2_cluster loss
1   1   147.73398915292847
2   2   161.96103231157463
0   3   163.45816811325176
3   0   196.79821230361432

With --shared=4 I got this...

locus1 matchset 83638
cell clusters 4118
cell clusters 4583
clusters for experiment1 4
clusters for experiment2 4
experiment1_cluster experiment2_cluster loss
1   1   147.73398915292847
2   2   161.96103231157463
0   3   163.45816811325176
3   0   196.79821230361432
0   2   2638.1945883597723
2   3   2640.9090802577593
3   1   3387.936289148157
1   0   3394.9481022939603

wheaton5 commented 1 year ago

Low loss is what you want. I wanted to show the distinction between low and high loss to demonstrate it was working as intended, which is why I give extra lines. Your --shared 4 output is ideal: the first 4 lines have relatively low loss, followed by much higher loss values. Those first 4 represent the correct cluster matches.

Angel-Wei commented 5 months ago

Hi Dr. Heaton, sorry to bother you again about the usage of shared_samples.py. I intended to apply it to the cluster assignments of an experiment that failed to complete the souporcell pipeline when including --known_genotypes and --known_genotypes_sample_names. I have Experiment1, which has 3 individuals of known genotypes, and I could complete the souporcell pipeline with the known genotype information, which gives me the following output:

alt.mtx              fastqs.done
ambient_rna.txt          minimap.err
cluster_genotypes.vcf        input_k_known_genotypes.vcf.gz
clustering.done          ref.mtx
clusters.err             remapping.done
clusters_tmp.tsv         retag.err
clusters.tsv             retagging.done
common_variants_covered_tmp.vcf  souporcell_minimap_tagged_sorted.bam
common_variants_covered.vcf  souporcell_minimap_tagged_sorted.bam.bai
consensus.done           troublet.done
depth_merged.bed         variants.done
vartrix.done

Experiment2 has 3 individuals of known genotypes (2 of them shared with Experiment1), and the pipeline completed without --known_genotypes and --known_genotypes_sample_names. Also, I didn't enable --common_variants. The output is as follows:

alt.mtx                       minimap.err
ambient_rna.txt                   ref.mtx
bcftools.err                      remapping.done
cluster_genotypes.vcf                 retag.err
clustering.done                   retagging.done
clusters.err                      souporcell_merged_sorted_vcf.vcf.gz
clusters_tmp.tsv                  souporcell_merged_sorted_vcf.vcf.gz.tbi
clusters.tsv                      souporcell_minimap_tagged_sorted.bam
consensus.done                    souporcell_minimap_tagged_sorted.bam.bai
doublets.err                      troublet.done
fastqs.done                   variants.done
vartrix.done

shared_samples.py requires either souporcell_merged_sorted_vcf.vcf.gz or common_variants_covered.vcf to be present in both experiments' folders, which might not line up with the names of the output VCFs in these 2 folders. I read this post, and initially I only renamed Experiment1's cluster_genotypes.vcf and bgzipped it into souporcell_merged_sorted_vcf.vcf.gz, and the results don't seem right:

locus1 matchset 16182
cell clusters 11087
cell clusters 13475
clusters for experiment1 3
clusters for experiment2 3
experiment1_cluster experiment2_cluster loss
2   2   93.87543847925971
2   1   96.71386482576936
0   2   97.66517526423038
1   2   100.40247252105199

To correct this, I noticed you mentioned that this script was updated to work with --common_variants, which I did not include when running Experiment2.

So I have since updated shared_samples.py to work with --common_variants options just by detecting which file exists. So I just deleted the souporcell_merged_sorted_vcf.vcf.gz files and it worked perfectly

Just want to double-check: re-running the pipeline for Experiment2 with this option (perhaps deleting the .done files after Step 2, calling candidate variants) is needed, right? If that is the case, could the VCF file that I previously used for --known_genotypes (that attempt failed and I don't know why) be used for --common_variants instead? I see other people have mentioned this for running souporcell (https://github.com/wheaton5/souporcell/issues/126#issuecomment-1296383744). Thank you so much if you have a minute to look into this!