weberlab-hhu / Helixer

Using Deep Learning to predict gene annotations
GNU General Public License v3.0

Splitting up assemblies and running helixer #110

Closed: Toffeeladd closed this issue 5 months ago

Toffeeladd commented 6 months ago

Hi Guys,

Thank you for such a great software!

I am trying to annotate 39 plant accessions, each about 4 GB, and I can only run a GPU node on my cluster for a maximum of 4 days. Unfortunately my jobs are timing out, so I have resorted to splitting each genome into its individual contigs and running Helixer on each one. My question is: does doing this affect the quality of the prediction? I understand the program uses the pre-trained land_plant model, but does it also use the input data (i.e. the assembly itself) as part of the prediction/training?
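
For reference, this is roughly the kind of per-contig split I mean; a minimal sketch assuming Biopython is installed, with placeholder file and directory names:

```python
# Minimal sketch: write each contig of one assembly to its own FASTA file,
# so Helixer can then be run on the pieces separately.
# Assumes Biopython is installed; file and directory names are placeholders.
from pathlib import Path
from Bio import SeqIO

assembly = Path("accession_01.fa")   # placeholder: one of the 39 assemblies
out_dir = Path("contigs")
out_dir.mkdir(exist_ok=True)

for record in SeqIO.parse(str(assembly), "fasta"):
    SeqIO.write(record, str(out_dir / f"{record.id}.fa"), "fasta")
```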

I just want to make sure I am not affecting the quality of my annotations by splitting my assemblies up and using Helixer in this way.

Any pointers would be greatly appreciated!

Cheers,

Noah

soi commented 6 months ago

Hello @Toffeeladd,

Helixer's predictions are a bit worse towards the ends of an input sequence. This means the first and last couple of thousand bases will, on average, have slightly worse annotations (see Figure 2 in https://doi.org/10.1093/bioinformatics/btaa1044). The more often you split up a sequence, the more such endpoints there are from Helixer's point of view.

However, annotating a 4 GB genome on a GPU node should be possible within 4 days. What GPU(s) are you using, and did you change any of the default hyperparameters? The parameters concerning overlapping in particular have an outsized impact on runtime, and in your case it could be reasonable to relax them a bit.
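
To make the overlap point concrete, here is a back-of-the-envelope sketch (not Helixer's actual windowing code) of how the number of overlapping windows, and therefore the amount of extra computation, depends on how far apart consecutive windows start; the numbers are illustrative assumptions, not Helixer defaults:

```python
# Back-of-the-envelope sketch only: the cost of overlapping grows with the number
# of overlapping windows a sequence is cut into, which is controlled by the step
# between consecutive window starts. All numbers below are illustrative.
def n_windows(seq_len: int, window_len: int, step: int) -> int:
    """Rough count of overlapping windows covering one sequence."""
    if seq_len <= window_len:
        return 1
    return 1 + -(-(seq_len - window_len) // step)  # ceiling division

contig = 50_000_000        # a 50 Mb contig, illustrative
window = 21_384            # example window (subsequence) length
print(n_windows(contig, window, window // 2))   # dense overlap: about twice the windows
print(n_windows(contig, window, window))        # relaxed overlap: roughly half the work
```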

Best,

Felix

Toffeeladd commented 6 months ago

Hi Felix,

Thank you for the quick response, and sorry for my delayed reply. The GPU nodes available on my cluster are Nvidia A100 and Nvidia H100 nodes: https://docs.hpc.shef.ac.uk/en/latest/stanage/cluster_specs.html#stanage-gpu-specs

I have not yet tried changing the default hyperparameters, as the recommendations in the README are about adjusting for gene length. I also can't seem to find the default overlapping parameters, so I am unsure what to relax them to.

Thank you for your help

Noah

soi commented 6 months ago

Hi Noah,

Given that you have 80 GB GPUs at your disposal, I would suggest increasing the --batch-size parameter first, to at least 128 and possibly much higher depending on the --subsequence-length parameter (assuming all other settings are unchanged). In general, --batch-size should be as large as possible for maximum GPU utilization.

This should improve performance massively.
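
For illustration, a minimal sketch of what such a run could look like when driven from Python. Only --batch-size and --subsequence-length are mentioned in this thread; the entry point name and the remaining flags are my assumptions, so please check `Helixer.py --help` for the exact interface and defaults:

```python
# Hedged sketch of a Helixer run with a larger batch size, called via subprocess.
# Only --batch-size is taken from this thread; the entry point and the other
# flags (--fasta-path, --lineage, --gff-output-path) are assumptions about the
# CLI -- verify them with `Helixer.py --help` on your installation.
import subprocess

subprocess.run(
    [
        "Helixer.py",
        "--fasta-path", "accession_01.fa",         # placeholder input assembly
        "--lineage", "land_plant",
        "--gff-output-path", "accession_01.gff3",  # placeholder output annotation
        "--batch-size", "128",                     # raise until GPU memory (80 GB) becomes the limit
    ],
    check=True,
)
```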

Best, Felix

Toffeeladd commented 6 months ago

Hi Felix,

Thank you for the advice. I retried with --batch-size 128, but it still timed out. I am going to try some higher --batch-size values and see if I can adjust the --subsequence-length parameter to suit my species.

Cheers, Noah

Toffeeladd commented 5 months ago

Hi Felix,

After increasing --batch-size to 512, the run managed to complete within the 4 days!

Thank you for your help!

Noah