weberlab-hhu / Helixer

Using Deep Learning to predict gene annotations
GNU General Public License v3.0

Failed to add_ngs_coverage #106

Closed EdgarLW closed 8 months ago

EdgarLW commented 9 months ago

Hello, I've been trying to add RNA-Seq coverage to the h5 data files using add_ngs_coverage.py as described in the tutorial, but I keep running into the error below. The output is more or less the same for all five h5 files I've tried.

[(True, 55360)]
start, end 0 55360
(b'Chr02', 0, 7012)
Chr02: chunks from 0-7012
(b'Chr01', 7012, 12008)
Chr01: chunks from 7012-12008
[E::bgzf_uncompress] Inflate operation failed: invalid distance too far back
[E::bgzf_read] Read block operation failed with error 1 after 0 of 4 bytes
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/shared/ifbstor1/software/miniconda/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/shared/ifbstor1/software/miniconda/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/helixer_user/Helixer/helixer/evaluation/add_ngs_coverage.py", line 333, in cov_by_chrom
    for read in htseqbam.fetch(region="{}:1-{}".format(chromosome, length)):
  File "/shared/home/ewaschburger/.local/lib/python3.9/site-packages/HTSeq/__init__.py", line 920, in fetch
    for pa in self.sf.fetch(reference, start, end, region):
  File "pysam/libcalignmentfile.pyx", line 2107, in pysam.libcalignmentfile.IteratorRowRegion.__next__
OSError: truncated file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/helixer_user/Helixer/helixer/evaluation/add_ngs_coverage.py", line 581, in <module>
    main(args.species,
  File "/home/helixer_user/Helixer/helixer/evaluation/add_ngs_coverage.py", line 509, in main
    cage_coverage_from_coord_to_h5(
  File "/home/helixer_user/Helixer/helixer/evaluation/add_ngs_coverage.py", line 441, in cage_coverage_from_coord_to_h5
    coverage_out = p.map(cov_by_chrom, mapargs)
  File "/shared/ifbstor1/software/miniconda/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/shared/ifbstor1/software/miniconda/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
OSError: truncated file

Since the error reports a truncated file, I checked the BAM files for signs of corruption. None of the following checks reported any errors, which leads me to believe the problem is not in the BAM files themselves:

samtools (1.15.1)
    samtools quickcheck
    samtools flagstats
    samtools stats
picard (2.23.5)
    picard ValidateSamFile

Commands and software versions are listed below:

$ singularity pull --dir $HOME/sif docker://gglyptodon/helixer-docker:helixer_v0.3.2_cuda_11.8.0-cudnn8
$ singularity exec --nv $HOME/sif/helixer* \
    python3 /home/helixer_user/Helixer/helixer/evaluation/add_ngs_coverage.py \
    -s $sp --unstranded --bam $sp/input/$sp.bam --h5-data h5s/$sp/$sp.h5 \
    --dataset-prefix rnaseq --threads 32

(checked using 1 thread as well)

$ singularity version
apptainer version 1.1.7-1

CUDA Version 11.8.0
Python 3.9.2

$ pip list
Package                Version
---------------------- ---------
brotlipy               0.7.0
certifi                2022.12.7
cffi                   1.14.5
charset-normalizer     2.1.1
colorama               0.4.6
conda                  22.9.0
conda-package-handling 1.9.0
cryptography           38.0.3
Cython                 3.0.2
h5py                   3.9.0
h5tree                 1.0
HTSeq                  2.0.4
idna                   3.4
libmambapy             0.27.0
mamba                  0.27.0
numpy                  1.26.0
pip                    22.3.1
pycosat                0.6.4
pycparser              2.21
pyOpenSSL              22.1.0
pysam                  0.21.0
PySocks                1.7.1
requests               2.28.1
ruamel-yaml-conda      0.15.80
setuptools             65.5.1
termcolor              2.3.0
toolz                  0.12.0
tqdm                   4.64.1
urllib3                1.26.11
wheel                  0.38.4

Any help or leads are very much appreciated. Thanks in advance!

EdgarLW commented 9 months ago

Update: after taking a look at the manual installation docs, I updated the pip packages. The installed list is now much longer, but the issue persists.

alisandra commented 9 months ago

Hmmm, this isn't an error I recognize.

I see that it occurs during multiprocessing, so just to narrow it down, could you try it with --threads=1 and let me know how that goes?

(Multithreading only helps anyway when adding more than one BAM file at the same time.)

felicitas215 commented 9 months ago

I had that error before. For me the cause was either a truncated BAM file or a truncated H5 file. Can you open/look into the H5 file you want to add coverage to with h5py or the BAM file with samtools view?
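For a quick check of the BAM side that needs neither samtools nor pysam, here is a minimal sketch using only the Python standard library. It is not part of Helixer, and the function names `has_bgzf_eof` and `inflates_cleanly` are made up for illustration. It relies on two facts about the format: BAM files are BGZF-compressed (a series of concatenated gzip blocks), and a complete file ends with a fixed 28-byte empty block. The first function looks only for that end marker; the second decompresses the whole file, so it should also surface inflate errors in the middle of the file. It does not validate the BAM records themselves.

```python
import gzip
import sys
import zlib

# The 28-byte empty BGZF block that marks the end of a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200"
    "1b000300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    with open(path, "rb") as fh:
        fh.seek(0, 2)  # jump to the end to learn the file size
        if fh.tell() < len(BGZF_EOF):
            return False
        fh.seek(-len(BGZF_EOF), 2)
        return fh.read() == BGZF_EOF


def inflates_cleanly(path, chunk_size=1 << 20):
    """Decompress every block of the file; return (ok, message)."""
    try:
        # gzip.open reads concatenated gzip members, which is what BGZF is.
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as err:
        return False, f"{type(err).__name__}: {err}"
    return True, "all blocks decompressed"


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_bam.py sample.bam
    for bam_path in sys.argv[1:]:
        ok, message = inflates_cleanly(bam_path)
        print(bam_path, "EOF marker:", has_bgzf_eof(bam_path), "|", message)
```

If both checks pass, the compressed container is at least intact from start to finish. For the H5 side, opening the file with `h5py.File(path, "r")` and listing its keys is the equivalent smoke test.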

EdgarLW commented 8 months ago

After explicitly passing --threads 1, Helixer ran from start to finish without issues. Thank you very much! I assume this means that neither the BAM nor the H5 files are truncated, despite what the error suggests?

alisandra commented 8 months ago

Glad to hear it!

I assume so too, but I would have to check more carefully to know.