wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
160 stars 46 forks source link

troublet command error #212

Open zchatt opened 10 months ago

zchatt commented 10 months ago

Thank you for this great tool. I am trialing to determine whether we can pool cell-lines for 10X runs and deconvolute using variants. To test I have a xenograft snRNAseq run (10X v3) that is from a single-cell line which I would like to detect using the .vcf from WES of the cell-line. The human population of this xenograft is small and there are only 212 barcodes wihtin my filtered_feature_bc_matrix/barcodes.tsv

I have run (singularity) without specifying "--known_genotypes" or "--known_genotypes_sample_names" and the pipeline runs to completion.

However, when I specify "--known_genotypes" and "--known_genotypes_sample_names" (as below), the pipeline fails. The error message is from the troublets command but I believe something is happening at the vartix step.

There are 4203 variants within my common_variants_covered.vcf. My "--known_genotypes" vcf (WES) has 93219 variants, but the alt.mtx and ref.mtx look strange i.e. dont start from 1 and only have 89 lines.

Scratching my head how to troubleshoot. Any insights would be greatly appreciated.

`# Run script

soup_or_cell_sif="/home/zac/souporcell/souporcell_latest.sif" num_clusters=4 num_threads_to_use=24 output_dir_name="/data/zac/asap_snrnaseq/souporcell/run_HJ377DRX2_12w_RM35vcf4" reference_fasta="/data/zac/asap_snrnaseq/reference/hg38.fa" barcodes_tsv="/data/zac/asap_snrnaseq/snrnaseq/run_HJ377DRX2_12w/outs/filtered_feature_bc_matrix/barcodes.tsv" bam_file="/data/zac/asap_snrnaseq/snrnaseq/run_HJ377DRX2_12w/outs/possorted_genome_bam.bam" vcf="/data/zac/asap_snrnaseq/souporcell/merged_ppmi.feb.1.2015_RM35.vcf" export SINGULARITY_BIND="/data/zac/"

mkdir -p $output_dir_name cd $output_dir_name

singularity exec $soup_or_cell_sif souporcell_pipeline.py \ -i $bam_file \ -b $barcodes_tsv \ -f $reference_fasta \ -t $num_threads_to_use \ -o $output_dir_name \ -k $num_clusters \ --known_genotypes $vcf \ --known_genotypes_sample_names 7_21_RM35 PPMI_SI_3868`

`# Error Message

Traceback (most recent call last): File "/opt/souporcell/souporcell_pipeline.py", line 596, in doublets(args, ref_mtx, alt_mtx, cluster_file) File "/opt/souporcell/souporcell_pipeline.py", line 541, in doublets subprocess.check_call([directory+"/troublet/target/release/troublet", "--alts", alt_mtx, "--refs", ref_mtx, "--clusters", cluster_file], stdout = dub, stderr = err) File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['/opt/souporcell/troublet/target/release/troublet', '--alts', '/data/zac/asap_snrnaseq/souporcell/run_HJ377DRX2_12w_RM35vcf4/alt.mtx', '--refs', '/data/zac/asap_snrnaseq/souporcell/run_HJ377DRX2_12w_RM35vcf4/ref.mtx', '--clusters', '/data/zac/asap_snrnaseq/souporcell/run_HJ377DRX2_12w_RM35vcf4/clusters_tmp.tsv']' returned non-zero exit status 101.`

`head alt.mtx

%%MatrixMarket matrix coordinate real general % written by sprs 4203 212 86 316 5 0 316 10 0 316 17 0 316 24 0 316 33 0 316 36 0 316 43 0

head clusters_tmp.tsv

AAACCCAAGCTGGCCT-1 0 -0.6931472 -0.6931472 AACCAACTCTCTATGT-1 0 -0.6931472 -0.6931472 AACGGGAAGTGATCGG-1 0 -0.6931472 -0.6931472 AACGTCAAGCATTGTC-1 0 -0.6931472 -0.6931472 AAGCCATCAATACGCT-1 0 -0.6931472 -0.6931472 AATGAAGTCGTGGACC-1 0 -0.6931472 -0.6931472 AATTCCTGTTTGACAC-1 0 -0.6931472 -0.6931472 AATTTCCCACGTTGGC-1 0 -0.6931472 -0.6931472 AATTTCCCATTAAAGG-1 0 -0.6931472 -0.6931472 ACAACCACAGGCGATA-1 0 -0.6931472 -0.6931472

head clusters.tsv

barcode status assignment log_prob_singleton log_prob_doublet cluster0 cluster1`

wheaton5 commented 10 months ago

Make sure the known genotype vcf is on the same reference as the fasta given (like hg19 vs grch38) and that chrom naming convention is the same (chr1 vs 1). This is most likely. Its finding near 0 overlap in variants. Otherwise paste all .err files here.

zchatt commented 10 months ago

Thanks for quick reply! Yes they are both "chr", ucsc hg38 reference. I've LiftoverVcf (picard) from hg19.

Please see error files attached. Thanks for your help, greatly appreciated.

err_files.zip

wheaton5 commented 10 months ago

hmm, those arent useful. The problem is that there are essentially no variants detected. I give the known_genotypes vcf to vartrix and it looks in those regions.

the command is cmd = ["vartrix", "--mapq", "30", "-b", final_bam, "-c", barcodes, "--scoring-method", "coverage", "--threads", str(args.threads), "--ref-matrix", ref_mtx, "--out-matrix", alt_mtx, "-v", final_vcf, "--fasta", args.fasta]

where final_vcf is a depth filtered (>4 and <100,000) version of known_genotypes vcf. This file should be called common_variants_covered.vcf (same name using common variants or known genotypes). See if it is there and how many non-header lines there are (cat filename | grep -v "#" | wc -l)

wheaton5 commented 10 months ago

if that vcf has lots of variants in it, you can try to rerun the vartrix command yourself. Possibly there was a problem in the liftover? Idk

wheaton5 commented 10 months ago

yeah, because even if it only finds ref alleles, vartrix has output for every location in the input vcf which is the depth filtered vcf (using depth from the bam, not from the dp field or something like that in the vcf)

wheaton5 commented 10 months ago

for the known genotypes loci your bam must not have much coverage or it has way too much coverage

wheaton5 commented 10 months ago

how big is your bam after remapping? and before remapping (your input bam)? when using common variants or known genotypes, remapping is much less needed and you can use --skip_remap TRUE