wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Exception: int variable contained non-int values #111

Open bensesbg opened 3 years ago

bensesbg commented 3 years ago

Greetings! I was wondering if you could help resolve an issue we are encountering during the consensus.py step, which generates the following output and error:

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_c58d6755a445ee1723e096eb7e36ea75 NOW.
14884452 excluded for potential RNA editing
25990 doublets excluded from genotype and ambient RNA estimation
0 not used for soup calculation due to possible RNA edit
Traceback (most recent call last):
  File "/opt/souporcell/consensus.py", line 348, in <module>
    fit = sm.optimizing(data=counts_dat)
  File "/usr/local/lib/python3.8/site-packages/pystan/model.py", line 542, in optimizing
    fit = self.fit_class(data, seed)
  File "pystan_yvxd5ae2/stanfit4anon_model_c58d6755a445ee1723e096eb7e36ea75_7285297220659911018.pyx", line 479, in stanfit4anon_model_c58d6755a445ee1723e096eb7e36ea75_7285297220659911018.StanFit4Model.__cinit__
RuntimeError: Exception: int variable contained non-int values; processing stage=data initialization; variable name=cluster_allele_counts_soup; base type=int (in 'unknown file name' at line 9)

Our current workflow calls the compile_stan_model.py and consensus.py steps of the pipeline with the following commands:

python3.8 /opt/souporcell/compile_stan_model.py && python3.8 /opt/souporcell/consensus.py -a out_matrix.mtx -c clusters.tsv -r ref_matrix.mtx -v 1000G_acan_hg38_snps_mainchr.vcf --soup_out ambient_rna.txt --vcf_out cluster_genotypes.vcf --output_dir .

This works for the majority of our samples, but a couple of them hit an edge case that throws this error. Any help in determining the cause would be highly appreciated.

TessaGillett commented 3 years ago

Did you ever find out what causes this? I'd be very interested to know.

slowkow commented 3 years ago

I am seeing a very similar error right now:

29689 doublets excluded from genotype and ambient RNA estimation
0 not used for soup calculation due to possible RNA edit

Traceback (most recent call last):
  File "/opt/souporcell/consensus.py", line 348, in <module>
    fit = sm.optimizing(data=counts_dat)
  File "/opt/conda/lib/python3.6/site-packages/pystan/model.py", line 542, in optimizing
    fit = self.fit_class(data, seed)
  File "stanfit4anon_model_c58d6755a445ee1723e096eb7e36ea75_355834653533342947.pyx", line 459, in stanfit4anon_model_c58d6755a445ee1723e096eb7e36ea75_355834653533342947.StanFit4Model.__cinit__
RuntimeError: Exception: int variable contained non-int values; processing stage=data initialization; variable name=cluster_allele_counts; base type=int  (in 'unknown file name' at line 8)

Do we need to tell PyStan that the cluster_allele_counts variable contains integers?
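
One way to investigate this before calling sm.optimizing is to look for entries that Stan cannot read as integers, such as NaN, infinity, or fractional values. The following is a minimal sketch, not souporcell's own code: the helper names non_integer_positions and to_int_rows are hypothetical, and it assumes the count matrix can be iterated as rows of numbers (a nested list or a 2-D NumPy array). How cluster_allele_counts is stored inside counts_dat is not shown in this thread, so the lookup of that array is left out.

```python
import math


def non_integer_positions(rows):
    """Return (row, column) positions whose value is NaN, infinite, or fractional.

    Hypothetical helper: `rows` is any iterable of rows of numbers.
    """
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            v = float(value)
            # Check finiteness first: math.floor raises on NaN and infinity.
            if not math.isfinite(v) or v != math.floor(v):
                bad.append((i, j))
    return bad


def to_int_rows(rows):
    """Convert a matrix to plain Python ints, refusing to round or drop values."""
    rows = [list(row) for row in rows]
    bad = non_integer_positions(rows)
    if bad:
        raise ValueError(f"{len(bad)} non-integer entries, first at {bad[0]}")
    return [[int(float(v)) for v in row] for row in rows]


if __name__ == "__main__":
    # Made-up example data, not taken from any real sample.
    example = [[3.0, 0.0], [float("nan"), 2.5]]
    print(non_integer_positions(example))
```

If the check reports NaN positions, an explicit integer cast alone would not help, because NaN has no integer representation; the question then becomes where the NaN values come from.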

wheaton5 commented 3 years ago

Behavior changes between PyStan versions, and which PyStan version you get is itself sensitive to the Python version. If you use my conda environment, I think this should go away.

pl-ki commented 1 year ago

I see exactly the same issue, running souporcell in the singularity container that I downloaded a couple weeks ago. Several samples have worked fine, now this error:

169910 excluded for potential RNA editing
5971 doublets excluded from genotype and ambient RNA estimation
0 not used for soup calculation due to possible RNA edit
Traceback (most recent call last):
  File "/opt/souporcell/consensus.py", line 348, in <module>
    fit = sm.optimizing(data=counts_dat)
  File "/usr/local/envs/py36/lib/python3.6/site-packages/pystan/model.py", line 472, in optimizing
    fit = self.fit_class(data, seed)
  File "stanfit4anon_model_c58d6755a445ee1723e096eb7e36ea75_355834653533342947.pyx", line 459, in stanfit4anon_model_c58d6755a445ee1723e096eb7e36ea75_355834653533342947.StanFit4Model.__cinit__
RuntimeError: Exception: int variable contained non-int values; processing stage=data initialization; variable name=cluster_allele_counts_soup; base type=int  (in 'unknown file name' at line 9)

I notice in the 'clusters.tsv' file for this sample that basically all cells are 'unassigned':

Count   Assignment
      3 doublet 0/1
      1 singlet 0
      1 status  assignment
   5739 unassigned  0
    104 unassigned  0/1
     86 unassigned  1
     39 unassigned  1/0

Can the absence of a singlet with class=1 be the cause of the error?
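
To check other samples for the same pattern, the tally above can be computed per cluster. This is a minimal sketch under stated assumptions: it relies only on the status and assignment columns that appear in the tally, the function names singlets_per_cluster and clusters_without_singlets are hypothetical, and the rows in the example are made up. Whether a cluster with no singlets actually triggers the Stan error is not confirmed anywhere in this thread.

```python
import csv
import io
from collections import Counter


def singlets_per_cluster(tsv_file, n_clusters):
    """Count rows with status 'singlet' for each cluster label 0..n_clusters-1.

    `tsv_file` is an open text file (or file-like object) in tab-separated
    format with a header row containing 'status' and 'assignment'.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter(
        row["assignment"] for row in reader if row["status"] == "singlet"
    )
    return {str(k): counts.get(str(k), 0) for k in range(n_clusters)}


def clusters_without_singlets(tsv_file, n_clusters):
    """Return the cluster labels that have no singlet cells at all."""
    per_cluster = singlets_per_cluster(tsv_file, n_clusters)
    return [label for label, n in per_cluster.items() if n == 0]


if __name__ == "__main__":
    # Illustrative rows only; the barcodes are invented.
    sample = (
        "barcode\tstatus\tassignment\n"
        "AAAC-1\tsinglet\t0\n"
        "AAAG-1\tunassigned\t1\n"
        "AAAT-1\tdoublet\t0/1\n"
    )
    print(clusters_without_singlets(io.StringIO(sample), n_clusters=2))
```

For a real sample, the same functions could be called with open("clusters.tsv", newline="") and the number of clusters used for that run.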