Error in clustering step - panicked at 'no entry found for key'

RachellyN commented 2 years ago

Thanks for a great tool and for being on top of open issues - this helped with a lot of trouble shooting. I now get to an issue I cannot resolve. I'm running Souporcell with known genotypes: /usr/bin/singularity exec --bind /projects/Blood_Cell_Atals/Disease_Data/Pilot_experiments/NextSeq_HCV_buck_pilot/souporcell/souporcell_genetics/ /projects/singularity/souporcell_latest.sif souporcell_pipeline.py -i possorted_genome_bam.bam -b barcodes.tsv.gz --known_genotypes cellSNP.cells_3samp_filt.vcf -k 3 -f ./hg38_no_chr.fa -t 1000 -o .

It runs fine until the clustering step, I then get the following error:

checking modules
imports done
checking bam for expected tags
checking fasta
restarting pipeline in existing directory .
using known genotypes
1000
running vartrix
running souporcell clustering
/opt/souporcell/souporcell/target/release/souporcell -k 3 -a ./alt.mtx -r ./ref.mtx --restarts 100 -b barcodes.tsv.gz --min_ref 10 --min_alt 10 --threads 1000 --known_genotypes ./common_variants_covered.vcf --known_genotypes_sample_names HCV_05H HCV_06-91 HCV_11-14
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 593, in <module>
    souporcell(args, ref_mtx, alt_mtx, final_vcf)
  File "/opt/souporcell/souporcell_pipeline.py", line 531, in souporcell
    subprocess.check_call(cmd, stdout = log, stderr = err)
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/souporcell/target/release/souporcell', '-k', '3', '-a', './alt.mtx', '-r', './ref.mtx', '--restarts', '100', '-b', 'barcodes.tsv.gz', '--min_ref', '10', '--min_alt', '10', '--threads', '1000', '--known_genotypes', './common_variants_covered.vcf', '--known_genotypes_sample_names', 'HCV_05H', 'HCV_06-91', 'HCV_11-14']' returned non-zero exit status 101.

alt.mtx and ref.mtx have ~18K lines. Clusters.err contain the following content (it is cut and the full file is attached below):

total loci used 622
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'no entry found for keyno entry found for key', ', src/main.rssrc/main.rs::325325::3030

thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'no entry found for keyno entry found for key', ', src/main.rssrc/main.rs::325325::3030

thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'no entry found for keyno entry found for key', ', src/main.rssrc/main.rs::325325::3030

I have noticed a similar error was reported before, and for some reason the fix was not merged. So I tried to fix this by removing all lines with missing GT from coomon_variants and the known_genotypes files:

grep -vP "\t\.\\:\.\\:\.\\:\.\\:\.\\:\.\t" old2/common_variants_covered.vcf > common_variants_covered.vcf
grep -vP "\t\.\\:\.\\:\.\\:\.\\:\.\\:\.\t" old2/cellSNP.cells_3samp.vcf > cellSNP.cells_3samp_filt.vcf

But I got the same error and clusters.err as before.

I don't know if this matters, but the data includes 4 samples, but the bulk sequencing didn't work for one sample, so I just manually removed it from the knoen_genotypes VCF files (all the SNPs were missing anyways" and I run souporcell with '-k 3'.

I hope you can help me figure this out! Thanks!

clusters.err.txt

micans commented 2 years ago

Hello, we are experiencing a similar error (see below). In our case the input possorted_genome_bam.bam file is 151G, could this be indicative of a memory problem? We are now rerunning with RUST_BACKTRACE=1 in case this is useful. Our code is (with suitably filled in variables, and where our Souporcell image is from 2 April 2020, 990707712 bytes).

<snip>singularity-v3.6.4/bin/singularity exec -B /lustre -B /nfs \
      $im RUST_BACKTRACE=1 souporcell_pipeline.py \
      -o $outdir              \
      -i $psb                 \
      -b $bcd                 \
      -f $fa                  \
      -k $thek                \
      -t $cpu                 \
      --skip_remap True     \
      --no_umi True        \
      --known_genotypes $kg

The reported error is

total loci used 13826
thread '<unnamed>' panicked at 'no entry found for key', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/collections/hash/map.rs:1025:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'no entry found for key', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/collections/hash/map.rs:1025:9
thread '<unnamed>' panicked at 'no entry found for key', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/collections/hash/map.rs:1025:9
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'no entry found for keyno entry found for key', ', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/collections/hash/map.rs/rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/collections/hash/map.rs::10251025::99

wheaton5 commented 2 years ago

Hopefully the full backtrace will help. Is this smartseq2 data?

micans commented 2 years ago

Hi Haynes, it is Chromium single cell 3 prime v3. I should have waited untill we had the backtrace really, but then this issue caught my eye. Will update!

micans commented 2 years ago

Hello,

I've attached the backtrace. It now offers the opportunity to run with RUST_BACKTRACE=full, let me know if that would help!

Thanks, Stijn backtrace.txt

micans commented 2 years ago

The VCF file has lots of fields like this in the genotype field(s):

.:.:.:.:.:.:.:.:.:.:.:.:.:.:.   .:.:.:.:.:.:.:.:.:.:.:.:.:.:.   .:.:.:.:.:.:.:.:.:.:.:.:.:.:.   .:.:.:.:.:.:.:.:.:.:.:.:.:.:.

So this does chime with @RachellyN observations. In our case we have six donors and many colon/period-filled fields. Hm, upon checking the format field, it's FAZ:FCZ:FGZ:FTZ:RAZ:RCZ:RGZ:RTZ:MTR:WTR:DEP:MDR:WDR:VAF:OFS. Does this mean the VCF is defunct?

wheaton5 commented 2 years ago

Sadly there is nothing in my code that this backtrace is pointing to. I'm sure it is my fault, but its difficult to figure out where to start if I don't have a line number. If you could send me each of the .err files maybe that would help. And just gut check things like how many variants come out of the freebayes step? How many counts are there in the alt.mtx and ref.mtx (just send me the first couple of lines for each). How many UMI/cell in this data set? Etc.

micans commented 2 years ago

Thanks Haynes for bearing with me. I completely get that this is not much to go on, I'll have a closer look and gather information and maybe also engage my brain a bit more. Will update again.

wheaton5 commented 2 years ago

Not at all “bearing with u”. U know i highly respect u. Im teaching graph algorithms next semester and markov clustering is on my syllabus. Also, your very humble but insightful description of the discovery process on your website was both inspirational to me and the best description ive ever heard of the creative algorithmic process. I love this quote “The time-span in which the MCL process and MCL algorithm came to me was no more than five minutes. It was something that happened to me, not something wrought by me. It feels like a nice discovery, a phenomenon of nature.” Perhaps to some this would seem hyperbolic, but I think it is spot on. And I wish we had more chance to collaborate when we were both at the sanger. Anyway, Im sure this is my fault and we can figure it out eventually.

micans commented 2 years ago

Dear Haynes, thanks for your kind words -- let me say the only reason I used 'bearing with me' is that I feel my issue report is really lacking information and effort on my part to drill down, and that I am saddling a fellow scientist, colleague and software developer with an inadequate report - part of it is due to divided attention. I remember when you left during the pandemic and made it across the pond .. hope you are keeping well. Souporcell is used frequently at cellular genetics, is is a much valued tool. We are running now with a different VCF, I'll keep an eye on it.

wxicu commented 2 years ago

Hallo， I have met the same issue when I ran souporcell on the tutorial dataset of demuxlet on github with known genotype, either skipped or not skipped the remap step. The dataset is small with only 500 droplets. So I think that is not an issue with memory. Is there any solution by far?

Thanks, Xichen

SimonDMurray commented 1 year ago

Hi @wxicu,

I worked with Micans on the issue when we experienced it in the summer and have since experienced it one more time. In the summer it failed because there was no Genotype field in the VCF (GT). Adding this field fixed the error for us.

The other time it failed, the issue was using a VCF with 3 genotypes when souporcell was submitted with a K value of 2. Remaking the VCF with only the 2 appropriate genotypes solved it that time.

I hope this helps!

Thanks, Simon

SimonDMurray commented 1 year ago

Hi @wheaton5,

I have received a similar error and was wondering is you had any insight?

The error output is:

Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 560, in <module>
    souporcell(args, ref_mtx, alt_mtx, final_vcf)
  File "/opt/souporcell/souporcell_pipeline.py", line 500, in souporcell
    subprocess.check_call(cmd, stdout = log, stderr = err) 
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/souporcell/target/release/souporcell', '-k', '2', '-a', 'soc/alt.mtx', '-r', 'soc/ref.mtx', '--restarts', '100', '-b', 'barcodes.tsv', '--min_ref', '4', '--min_alt', '4', '--threads', '8', '--known_genotypes', 'soc/common_variants_covered.vcf', '--known_genotypes_sample_names', 'FX1074', 'FX1111']' returned non-zero exit status 101.

The clusters.err file when run with RUST_TRACEBACK=FULL is:

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32
stack backtrace:
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:329:32

I am running with the --known_genotypes option and a VCF that has been generated from old bed files and converted to GRCh38 using pyliftover. The donor value is 2 and there are 2 genotypes in the VCF.

I have checked chromosome nomenclature, chromosome lengths as well as running both with and without additional contigs in the VCF and am not sure what to do next.

Do you have any advice?

wheaton5 / souporcell

Error in clustering step - panicked at 'no entry found for key' #137