weberlab-hhu / Helixer

Using Deep Learning to predict gene annotations
GNU General Public License v3.0

helixer_post_bin not correctly installed? #116

Closed LeoVincenzi closed 3 months ago

LeoVincenzi commented 4 months ago

Hi, I'm running into issues when running Helixer on a genome as a 3-step analysis. The first 2 steps seem to have worked well, but in the third one the command 'helixer_post_bin' does not appear to be installed, and I cannot find it in any subdirectory of the Helixer folder. I've seen that main() in Helixer.py is supposed to call the command, and it ran without any problems/errors, so I assumed it had found it. Any help in understanding this would be great!

Thank you, Leo

soi commented 4 months ago

Hello @LeoVincenzi ,

Please provide the exact commands you were using and how Helixer was installed so we can track down the problem.

All the best, Felix

alisandra commented 3 months ago

Hi Leo,

If you are not using the container option (Docker / Singularity / Apptainer) then helixer_post_bin needs to be installed separately according to the instructions here:

https://github.com/TonyBolger/HelixerPost
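
In short, it is a standard Rust/cargo build; a rough sketch, assuming a working Rust toolchain and the HDF5 development headers (e.g. libhdf5-dev on Ubuntu) — see the linked repository for the authoritative steps:

# the hdf5-sys crate needs the HDF5 headers; install them first (Ubuntu example)
sudo apt-get install libhdf5-dev
# clone and build HelixerPost
git clone https://github.com/TonyBolger/HelixerPost.git
cd HelixerPost
cargo build --release
# put the resulting binary somewhere on your PATH (exact output location may differ)
cp target/release/helixer_post_bin ~/.local/bin/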

Cheers, Ali

LeoVincenzi commented 3 months ago

Hi @alisandra, I was running Helixer via Singularity, but HelixerPost wasn't found. I tried to install it separately (that worked fine). Anyway, when I now feed it the .h5 file obtained previously, it returns the following error:

thread 'main' panicked at helixer_post_bin/src/main.rs:30:10:
Failed to open input files: Duplicate Value: Block Start 0 at index 348016 already occurred at index 347968
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)

I haven't seen any similar issue reported. I don't understand which file the problem could be related to. Cheers, Leo

LeoVincenzi commented 3 months ago

To @soi, sorry for my delayed response. I performed the following two commands:

fasta2h5.py --species my_genome --h5-output-path my_genome.h5 --fasta-path my_genome.fa
HybridModel.py --load-model-path ~/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 my_genome.h5 --overlap -v

For the installation, I followed the instructions in https://github.com/gglyptodon/helixer-docker

Leo

alisandra commented 3 months ago

Hi Leo,

I am honestly a little confused and am trying to think about the best way to debug this. It's definitely unexpected that helixer_post_bin is not installed when running the singularity image.

In general, it may or may not be worth rerunning e.g. the singularity pull step to see whether there was some problem during the download. But since you got helixer_post_bin installed in the end, I guess that's not really the place to start.

Instead, since HelixerPost's error indicates a problem with an input file, I would start by checking the contents of both intermediate files, e.g. with:

# check the size (in particular that the files are non-empty)
ls -sh my_genome.h5
ls -sh predictions.h5
# results of each should be > 0, 
# with the former being in the same order of magnitude as the input fasta, 
# and the latter being substantially larger

# check the expected datasets are there
# the following command is from the ubuntu package `hdf5-tools`, install if necessary
h5ls my_genome.h5/data
# here you expect a list of datasets with e.g. 'X', 'y', 'phase', 'species', 'seqids', and more...
h5ls predictions.h5
# here you expect two datasets, 'predictions', and 'predictions_phase'

If your results don't match the expectations in the comments, try rerunning the step that generated the file in question and then repeat the ls and h5ls commands. If expectations are then met, you can continue directly; if not, at least we've narrowed it down: check all errors and output from the command in question, and please share them with us for help interpreting them.

Cheers, Ali

LeoVincenzi commented 3 months ago

Hi Ali, thank you for your answer. I tried the commands you gave me. Regarding the files generated so far, I see nothing strange with respect to your indications:

Size: 
3.6G Genome.fa
3.7G genome.h5
73G predictions.h5

Using h5ls, I see for genome.h5: 
X                        Dataset {356756/Inf, 21384, 4}
seqids                   Dataset {356756/Inf}
species                  Dataset {356756/Inf}
start_ends               Dataset {356756/Inf, 2}

and for predictions.h5:
predictions              Dataset {356756/Inf, 21384, 4}
predictions_phase        Dataset {356756/Inf, 21384, 4}

So everything seems to be fine till now.

I tried to install HelixerPost again following Tony Bolger's instructions; the problem seems to occur during the cargo build:

error: failed to run custom build command for 'hdf5-sys v0.8.1'
 --- stderr
  thread 'main' panicked at /home/nanopore/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hdf5-sys-0.8.1/build.rs:548:13:
  Unable to locate HDF5 root directory and/or headers.
  note: run with 'RUST_BACKTRACE=1' environment variable to display a backtrace

I tried to make sense of it, but without success. Any tips would be useful.

Leo

alisandra commented 3 months ago

Ok, I think I misunderstood; I had assumed the install had completed successfully before.

So let's take a step back, since you should not have to install HelixerPost when using Singularity. We'll double-check things on our end, but in the meantime, and as you'll probably be faster, I'd encourage you to try re-pulling the Singularity image. It's at least worth a shot.
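
For reference, a sketch of the re-pull; the registry and tag below are guesses based on your .sif filename and the gglyptodon/helixer-docker repository you linked, so adjust them if yours differ:

# re-pull the image into a fresh .sif (default output name matches the tag)
singularity pull docker://gglyptodon/helixer-docker:helixer_v0.3.2_cuda_11.8.0-cudnn8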

LeoVincenzi commented 3 months ago

Okay, I have retried with Singularity, running the single command:

singularity run --nv helixer-docker_helixer_v0.3.2_cuda_11.8.0-cudnn8.sif Helixer.py --fasta-path genome.fa.gz --lineage land_plant --gff-output-path genome_helixer.gff3

It started well but then stopped at:

.........
Testing whether helixer_post_bin is correctly installed
HelixerPost <genome.h5> <predictions.h5> <windowSize> <edgeThresh> <peakThresh> <minCodingLength> <gff>
Helixer.py config loaded. Starting FASTA to H5 conversion.
storing temporary files under /tmp/tmpbom1bkbi
Traceback (most recent call last):
  File "/usr/local/bin/Helixer.py", line 248, in <module>
    main()
  File "/usr/local/bin/Helixer.py", line 206, in main
    controller.export_fasta_to_h5(chunk_size=args.subsequence_length, compression=args.compression,
  File "/usr/local/lib/python3.8/dist-packages/helixer/export/exporter.py", line 115, in export_fasta_to_h5
    for i, (seqid, seq) in enumerate(fasta_seqs):
  File "/usr/local/lib/python3.8/dist-packages/geenuff/applications/importer.py", line 1034, in parse_fasta
    for fasta_header, seq in fp.read_fasta(seq_file):
  File "/usr/local/lib/python3.8/dist-packages/dustdas/fastahelper.py", line 64, in read_fasta
    fasta = text_or_gzip_open(fasta, "r")
  File "/usr/local/lib/python3.8/dist-packages/dustdas/fastahelper.py", line 11, in text_or_gzip_open
    with gzip.open(path, 'r') as tmp:
  File "/usr/lib/python3.8/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'genome.fa.gz'

At least that confirms that helixer_post_bin is found in the Singularity image, but it seems unable to find the starting genome. I don't think it's a permissions problem, because all the folders are 777. I'm wondering whether the analysis should be run inside the Helixer installation folder (/home/leo/.local/share/Helixer/).

alisandra commented 3 months ago

Taking a guess here:

Singularity automatically mounts the user's home directory, so if your files are stored in your home directory (/home/leo/) or a sub-directory of it, I would expect it to find the file as long as the path is valid from your working directory; if you're getting the error in that case, let me know.

However, Singularity does not automatically mount other directories, such as /mnt/data/ or /gpfs/data/ or wherever large files may be stored on a given personal or HPC system. If this is the case, you will need to tell Singularity to mount the directory by adding something like --bind /your/path/here; see the Singularity documentation on bind paths and mounts for more information.
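
For example, a sketch of the command with an explicit bind, assuming the genome lives under a hypothetical /mnt/data/leo/ that is not below your home directory:

# bind the data directory into the container so the path resolves inside it
singularity run --nv --bind /mnt/data/leo \
    helixer-docker_helixer_v0.3.2_cuda_11.8.0-cudnn8.sif \
    Helixer.py --fasta-path /mnt/data/leo/genome.fa.gz --lineage land_plant \
    --gff-output-path /mnt/data/leo/genome_helixer.gff3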

alisandra commented 3 months ago

When in doubt, trying full paths instead of relative ones is surely not a bad idea either (particularly if you were already working in your home directory and it's not a binding issue).

LeoVincenzi commented 3 months ago

Hi @alisandra, I tried running with Singularity directly from the /home/leo/.local/share/Helixer/ directory, and I must say it started working well, completing all the first steps up to the neural network prediction. However, it then stopped again at the post-processing stage:

Neural network prediction done. Starting post processing.
thread 'main' panicked at 'Failed to open input files: Duplicate Value: Block Start 0 at index 116012 already occurred at index 115996', helixer_post_bin/src/main.rs:30:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

An error occurred during post processing. Exiting.

I do not understand what that index error relates to. Also, I do not see any intermediate (.h5) files kept in the working directory. Let me know.

LeoVincenzi commented 3 months ago

An update: I succeeded in running with Singularity on a smaller fasta and got the final gff3! Now I'm retrying on the whole genome. I was wondering: is Helixer also able to return a repeat annotation?

alisandra commented 3 months ago

Glad to hear it!

Regarding the full genome, and the previous error:

thread 'main' panicked at 'Failed to open input files: Duplicate Value: Block Start 0 at index 116012 already occurred at index 115996', helixer_post_bin/src/main.rs:30:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

An error occurred during post processing. Exiting.

A possible cause of this would be the input fasta file having duplicate sequence IDs (after truncating at the first space). I'm adding an upfront check and a more interpretable error to the todo list now. Until then the fix is at least easy: make sure the IDs are unique (including after truncation).
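
For example, a quick sketch for spotting offending IDs (assuming an uncompressed fasta; pipe through zcat first for a .gz file):

# print any sequence IDs that collide after truncating at the first space
grep '^>' genome.fa | cut -d' ' -f1 | sort | uniq -d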

If that's not it, let me know.

alisandra commented 3 months ago

Regarding

isn't Helixer able to return also a repeat annotation?

Unfortunately not; while it could certainly be done in theory (and I love this idea :+1: ), getting the training set together would require time and expertise I don't have available at the moment.

LeoVincenzi commented 3 months ago

Hi @alisandra, I've worked around the error by running the pipeline separately on each chromosome and then merging the final gff3 files. I will also take a look at the IDs anyway. So now everything works! Thank you for all the support and all the explanations.
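
For the record, a rough sketch of the per-chromosome workaround (assuming seqkit is available for splitting; the .sif name is abbreviated here):

# split the genome into one fasta per sequence, annotate each, then concatenate the gff3s
seqkit split --by-id genome.fa -O per_chrom/
for f in per_chrom/*.fa; do
    singularity run --nv helixer.sif Helixer.py --fasta-path "$f" \
        --lineage land_plant --gff-output-path "${f%.fa}.gff3"
done
cat per_chrom/*.gff3 > genome_helixer.merged.gff3

(Naive concatenation keeps each file's header lines; those may need cleaning up afterwards.)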

Cheers, Leo