wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
153 stars 44 forks source link

Clustering issues #181

Open Ben-Cossins opened 1 year ago

Ben-Cossins commented 1 year ago

I have several done files (in alphabetical order: fastqs, remapping, retagging, variants, and vartrix) so think I'm stuck at the clustering stage (also as clusters_tmp.tsv is 0 bytes)

the web summary for the sample from cellranger is
Estimated Number of Cells   22,352
Fraction Reads in Cells 92.7%
Mean Reads per Cell 34,049
Median UMI Counts per Cell  6,316
Median Genes per Cell   2,242
Total Genes Detected    32,236

pipeline stalls with this error message

merging sorted bams
cleaning up tmp samfiles
using known genotypes
16
***** WARNING: File output/p1/depth_merged.bed has inconsistent naming convention for record:
KI270728.1  135669  135670

***** WARNING: File output/p1/depth_merged.bed has inconsistent naming convention for record:
KI270728.1  135669  135670

running vartrix
running souporcell clustering
/opt/souporcell/souporcell/target/release/souporcell -k 8 -a output/p1/alt.mtx -r output/p1/ref.mtx --restarts 100 -b data/p1/raw_barcodes.tsv --min_ref 10 --min_alt 10 --threads 16 --known_genotypes output/p1/common_variants_covered.vcf --known_genotypes_sample_names c301 c207 c101 c272 c119 c143 c028 c218
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 593, in <module>
    souporcell(args, ref_mtx, alt_mtx, final_vcf)
  File "/opt/souporcell/souporcell_pipeline.py", line 531, in souporcell
    subprocess.check_call(cmd, stdout = log, stderr = err) 
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/souporcell/target/release/souporcell', '-k', '8', '-a', 'output/p1/alt.mtx', '-r', 'output/p1/ref.mtx', '--restarts', '100', '-b', 'data/p1/raw_barcodes.tsv', '--min_ref', '10', '--min_alt', '10', '--threads', '16', '--known_genotypes', 'output/p1/common_variants_covered.vcf', '--known_genotypes_sample_names', 'c301', 'c207', 'c101', 'c272', 'c119', 'c143', 'c028', 'c218']' returned non-zero exit status 101.

output using head from alt.mtx

%%MatrixMarket matrix coordinate real general
% written by sprs
578434 2127322 46062470
1 61181 0
1 70083 0
1 120954 1
1 154000 0
1 185250 0
1 195664 1
1 258177 1

and from ref.mtx

%%MatrixMarket matrix coordinate real general
% written by sprs
578434 2127322 46062470
1 61181 1
1 70083 1
1 120954 0
1 154000 1
1 185250 1
1 195664 0
1 258177 0

as I said, clusters_tmp.tsv is empty, and clusters.err contains this

total loci used 157371
thread '<unnamed>' panicked at 'thread 'no entry found for key<unnamed>', ' panicked at 'src/main.rsno entry found for key:', 325src/main.rs::30325
:note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread 'thread 'thread '<unnamed><unnamed><unnamed>' panicked at '' panicked at '' panicked at 'no entry found for keyno entry found for keyno entry found for key', ', ', src/main.rssrc/main.rssrc/main.rs:::325325325:::303030

thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'no entry found for keyno entry found for key', ', src/main.rssrc/main.rs::325325::3030

thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30

I'd be most appreciative for some help with this Thanks Ben

wheaton5 commented 1 year ago

Are you giving it the unfiltered barcodes.tsv or the cell containing barcodes file? Seems like unfiltered. That is one problem but might not be the cause of this error.

The line it fails on gets the genotype (GT) from the sample name provided by known genotype samples. So maybe either sample name is wrong or some entries dont have a GT?

first try making sure you are using filtere barcodes file.

Ben-Cossins commented 1 year ago

I was using the raw barcodes, I will rerun with the filtered barcodes instead.

should I filter the vcf to remove incomplete records? (also this vcf is what I received following imputation, should imputed genotypes also be removed, or is it able to process the ./. nomenclature?)

thanks Ben

wheaton5 commented 1 year ago

I think ./. Works, but im not 100% sure

Ben-Cossins commented 1 year ago

reran with the filtered barcodes and hit the same error.

total loci used 147980
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread 'note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>thread '' panicked at 'thread '<unnamed>no entry found for keythread 'thread '<unnamed>' panicked at '', <unnamed>thread '<unnamed>thread '' panicked at 'no entry found for keysrc/main.rs' panicked at '<unnamed>' panicked at 'thread '<unnamed>no entry found for key', :no entry found for key' panicked at 'no entry found for key<unnamed>' panicked at 'thread '', src/main.rs325', no entry found for key', ' panicked at 'no entry found for key<unnamed>src/main.rs::src/main.rs', src/main.rsno entry found for key', ' panicked at ':32530:src/main.rs:', src/main.rsno entry found for key325:
325:325src/main.rs:', :30:325::325src/main.rs30
30:30325::

30
:30325
30
:
30
thread 'thread '<unnamed>thread '<unnamed>' panicked at '<unnamed>' panicked at 'no entry found for key' panicked at 'no entry found for key', no entry found for key', src/main.rs', src/main.rs:src/main.rs:325:325:325:30:30
30

I dont know if it matters but here is the command I ran for it, if I'm missing something. singularity exec --bind /media/ben/Bulk_storage/Souporcell/ bin/souporcell_latest.sif souporcell_pipeline.py -i data/p1/possorted_genome_bam.bam -b data/p1/filtered_barcodes.tsv -f resources/refdata-gex-GRCh38-2020-A/fasta/genome.fa -t 16 -o output/p1 -k 8 --known_genotypes data/genotypes/merge_filtered.reheadered_with_chr.vcf --known_genotypes_sample_names c301 c207 c101 c272 c119 c143 c028 c218

wheaton5 commented 1 year ago

Yeah so now the issue is the known genotypes vcf. Either a sample name is wrong or some entries have no GT field. Probably I should check this, but currently it is assumed to exist. So try to determine if there are such fields and removing them if so.

Ben-Cossins commented 1 year ago

So I processed the VCF to only use genotyped snps only, and require 100% completeness of the data, so all the samples here have the GT field, unfortunately this now results in the pipeline failing before the common variants are done.

I ran it on a different run (same pooled samples, experimentally these were the stimulated cells, just was a smaller file to run)

sorting retagged bam files
merging sorted bams
cleaning up tmp samfiles
using known genotypes
16
***** WARNING: File output/p2/depth_merged.bed has inconsistent naming convention for record:
KI270728.1  135596  135649

terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_replace
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 584, in <module>
    final_vcf = freebayes(args, bam, fasta)
  File "/opt/souporcell/souporcell_pipeline.py", line 396, in freebayes
    subprocess.check_call(["bedtools", "intersect", "-wa", "-a", args.common_variants, "-b", args.out_dir + "/depth_merged.bed"], stdout = vcf)
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bedtools', 'intersect', '-wa', '-a', 'data/genotypes/merge_filtered.reheadered_with_chr.genotyped_only_complete_cases.vcf', '-b', 'output/p2/depth_merged.bed']' died with <Signals.SIGABRT: 6>.
turns out that I was an idiot here, and the vcf file wasn't valid, however rerunning it with a valid VCF file (with complete genotypes) has it stall at the clustering stage.

I'm now in the process of running it with the common genotypes format

Ben-Cossins commented 1 year ago

running the pipeline with the common variants (used the 1000genomes variants from the test dataset (modified to be in NCBI chromosome numbering)) was successful. so it seems to be an issue with the the known genotypes flag.

running vartrix
running souporcell clustering
/opt/souporcell/souporcell/target/release/souporcell -k 8 -a output/p2/alt.mtx -r output/p2/ref.mtx --restarts 100 -b data/p2/filtered_barcodes.tsv --min_ref 10 --min_alt 10 --threads 16
running souporcell doublet detection
running co inference of ambient RNA and cluster genotypes
186327 excluded for potential RNA editing
3040 doublets excluded from genotype and ambient RNA estimation
10313 not used for soup calculation due to possible RNA edit
Initial log joint probability = -1.41911e+06
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
Exception: log_mix: lambda1 is -nan, but must not be nan!  (in 'unknown file name' at line 73)

Exception: log_mix: lambda1 is -nan, but must not be nan!  (in 'unknown file name' at line 73)

Exception: log_mix: lambda1 is -nan, but must not be nan!  (in 'unknown file name' at line 73)

Exception: log_mix: lambda1 is -nan, but must not be nan!  (in 'unknown file name' at line 73)

Exception: log_mix: lambda1 is -nan, but must not be nan!  (in 'unknown file name' at line 73)

       7       -576701    0.00196475      0.607341           1           1       26   
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
done
Angel-Wei commented 5 months ago

Thank you @Ben-Cossins and @wheaton5 for a comprehensive discussion on this clustering problem. I wonder if Ben @Ben-Cossins has any trouble shooting updates following up this issue? The reason I ask is that I encountered the same situation in my demultiplexing with known genotypes. The issue I have is pretty similar to what you described after I went through most of the posts for clustering failure. I have a few multiplexed samples (each with 3 or 4 individuals; the # of individuals and genotypes for each individual are all known) sequenced by NovaSeq X scRNA-seq. I have some of the samples went through the souporcell pipeline successfully and some others failed and stalled at the clustering step. I'd really appreciate it if @wheaton5 can also give me some suggestions to solve this issue.

The slurm batch job's error message is as follows:

***** WARNING: File sample4/depth_merged.bed has inconsistent naming convention for record:
KI270728.1  181297  181396

***** WARNING: File sample4/depth_merged.bed has inconsistent naming convention for record:
KI270728.1  181297  181396

Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 593, in <module>
    souporcell(args, ref_mtx, alt_mtx, final_vcf)
  File "/opt/souporcell/souporcell_pipeline.py", line 531, in souporcell
    subprocess.check_call(cmd, stdout = log, stderr = err) 
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/souporcell/target/release/souporcell', '-k', '4', '-a', 'sample4/alt.mtx', '-r', 'sample4/ref.mtx', '--restarts', '100', '-b', '/data/CellRangerOuts/sample4/outs/filtered_feature_bc_matrix/barcodes.tsv', '--min_ref', '10', '--min_alt', '10', '--threads', '8', '--known_genotypes', 'sample4/common_variants_covered.vcf', '--known_genotypes_sample_names', 'genotype1', 'genotype2', 'genotype3', 'genotype4']' returned non-zero exit status 101.

I checked that the filtered barcode tsv file was used and donor names and number of donors are correct with right correspondance in my job submission scripts.

The clusters_tmp.tsv is empty and clusters.err is:

total loci used 36592
thread 'thread 'thread '<unnamed>thread '<unnamed><unnamed>' panicked at '<unnamed>' panicked at '' panicked at 'thread 'no entry found for key' panicked at 'no entry found for keyno entry found for key<unnamed>', no entry found for key', ', ' panicked at 'src/main.rs', src/main.rssrc/main.rsno entry found for key:src/main.rs::', 325:325325src/main.rs:325:::30:3030325
30

:note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30

The head 10 lines of alt.mtx is:

%%MatrixMarket matrix coordinate real general
% written by sprs
312917 21476 9605316
1 5646 0
1 11283 0
2 1744 0
2 2430 0
2 3217 0
2 5931 0
2 8405 0

The head 10 lines of ref.mtx is:

%%MatrixMarket matrix coordinate real general
% written by sprs
312917 21476 9605316
1 5646 1
1 11283 1
2 1744 1
2 2430 1
2 3217 1
2 5931 1
2 8405 1

We have another multiplexed sample replicate of the same 4 genotyped individuals as this one, and the pipeline was completed for that one. The generation steps for respective VCF files that are used for --known_genotypes flag include: subset, sort, normalize and make biallelic, removal of non-passing and monomorphic variants. I was wondering if this issue perhaps relates to the reference VCF file, because the pipeline could be finished if I resume where it stopped by deleting cluster checkpoint files and re-running singularity exec souporcell.sif souporcell_pipeline.py without specifying reference VCF file and the list of genotype names. But I also feel confused as I do have some other samples completed the souporcell pipeline using the same procedure. I tried to filter the VCF variants by at least 5% MAF using this command bcftools view -q 0.05:minor Passing_Normalized*.vcf and I still got the same error for this sample. I wonder if there is anything more we can try to check the validity of our VCF files? Hi @wheaton5 , in your previous answer

Yeah so now the issue is the known genotypes vcf. Either a sample name is wrong or some entries have no GT field. Probably I should check this, but currently it is assumed to exist. So try to determine if there are such fields and removing them if so.

Does this imply that any variant sites that have missing GT field (./. or 0/0) for all samples should be filtered out? I m really new to demultiplexing and I might not capture this correctly. Sorry for this long post and It would be so helpful to me if I can receive any suggestions on this! Thank you so much!

wheaton5 commented 5 months ago

Hi angel, thanks for your interest in my tool. First off, it seems that you are upset and i politely suggest that you maybe cool off just a bit. While I have applied for grants to support my work on this project, none have yet been funded. And while I am happy to continue its support, I am a new faculty member with many obligations and am not being paid for this.

I do think what you said is correct in that in its current state, variants in which all samples have ./. Or 0/0 should be filtered out. It is challenging when designing a tool to account for all possible data input potentialities. We often make assumptions that turn out to not be 100%. Apologies for that.

With 4 samples, using known genotypes should not be necessary. You can run normally and after the fact assign clusters to known genotypes using the shared_samples.py script available in the repository. If this does not work, I might be able to help debug your issue.

Best Regards

Angel-Wei commented 5 months ago

Hi Dr. Heaton, thank you so much for your quick response. Sorry that I was working quite late and trying to just provide the background info as much as possible in my inquiry and I didn't intend to disturb at a late hour. I wasn't being upset and was just a bit anxious after trouble shooting my errors for a while. Souporcell is a really cool tool and the tutorials were very user-friendly for beginners. Thank you so much for your suggestions and I'll keep working on processing the VCF file a bit and also try re-run the program without known genotype information and do cluster assignment later.

Best Regards