lpzoaa opened 6 days ago
Hi, thank you for posting this issue. I looked into your log file, and the reference/Helixer annotation you use for fine-tuning seems to be of poor quality. Your fine-tuning run shows a perfect F1 score (1.0) for the intergenic regions, while every other class (UTR, exon, intron) has a score of 0, and these scores don't change during training. Looking at the confusion matrices, it seems that your data mostly contains intergenic regions. Did you check the annotation GFF3 file you got from Helixer, or could you upload a zipped version of it? Which species does "Ac" refer to?
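That failure mode can be illustrated with a toy confusion matrix (hypothetical counts, not from the actual log): when nearly every base is intergenic and the model predicts only that class, the intergenic F1 is near 1 while every other class scores 0.

```python
import numpy as np

# Toy 4-class confusion matrix (rows = true, cols = predicted), in the
# order intergenic, UTR, exon, intron. Hypothetical counts: almost every
# base is intergenic and the model predicts only that class.
cm = np.array([
    [9_000_000, 0, 0, 0],  # intergenic
    [1_000,     0, 0, 0],  # UTR
    [5_000,     0, 0, 0],  # exon
    [2_000,     0, 0, 0],  # intron
])

def per_class_f1(cm):
    tp = np.diag(cm).astype(float)
    col_sums = cm.sum(axis=0)  # predicted totals -> precision denominator
    row_sums = cm.sum(axis=1)  # true totals -> recall denominator
    precision = np.divide(tp, col_sums, out=np.zeros_like(tp), where=col_sums > 0)
    recall = np.divide(tp, row_sums, out=np.zeros_like(tp), where=row_sums > 0)
    denom = precision + recall
    return np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

print(per_class_f1(cm))  # intergenic close to 1, all other classes exactly 0
```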
Hi,
"Ac" refers to onion (Allium cepa), which has a large genome size of 16 Gb. I have reviewed the GFF3 file obtained from Helixer and extracted gene sequences based on the provided annotations. Helixer performed well for our genome, successfully predicting nearly 80% of the genes, which is better compared to other de novo annotation tools.
I attempted to fine-tune the model using only one chromosome, following the process above (since the entire genome is too large), but encountered an error. I suspect the issue is that the reference, which is 2.3 Gb in size, contains only 5,000 genes. This means intergenic regions make up the vast majority of the reference, and since the 'filter-to-most-certain.py' script keeps only 20% of the Helixer results, very few genes remain for training.
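That suspicion can be checked with a quick script (an illustrative helper, not part of Helixer) that sums the gene feature spans in the GFF3 and compares them to the sequence length:

```python
# Rough sanity check: what fraction of a sequence is covered by gene
# features in a GFF3? This simple version ignores overlapping genes,
# which is acceptable for a ballpark figure.
def genic_fraction(gff3_path, seq_length):
    covered = 0
    with open(gff3_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 5 and cols[2] == "gene":
                # GFF3 coordinates are 1-based and end-inclusive
                covered += int(cols[4]) - int(cols[3]) + 1
    return covered / seq_length

# e.g. 5,000 genes of ~5 kb each on a 2.3 Gb chromosome would give a
# genic fraction of roughly 0.01, i.e. ~99% intergenic.
```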
I also tried using gene annotations predicted by TransDecoder for training, and this worked successfully.
However, the new model I generated was only 8.5 MB in size, whereas the land_plant_v0.3_a_0080.h5 model is 25 MB. Additionally, the performance of the new model was not as good as the original—it only predicted 60% of the genes, which is even worse than before. Could this be because I only used one chromosome for training?
I would appreciate any advice you could offer.
Hi,
First of all, I'm happy to hear that Helixer performs well for your onion genome.
It could very well be that fine-tuning doesn't perform well when you only use one chromosome instead of the entire genome. Is there a reason why you didn't use the entire genome? If retraining takes too long because of the genome size, you can try setting the --batch-size parameter higher than 50 (but only as high as your GPU allows, because it will eventually run out of GPU RAM, so you may need to try out different batch sizes).
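The trial-and-error search for a workable batch size can be sketched generically. The helper below is illustrative, not part of Helixer, and MemoryError stands in for the framework's GPU out-of-memory exception:

```python
# Double the batch size until one training step runs out of memory,
# then return the last size that worked.
def largest_batch_size(try_step, start=50, limit=3200):
    best = None
    size = start
    while size <= limit:
        try:
            try_step(size)      # run a single training step at this size
            best = size
            size *= 2
        except MemoryError:     # stand-in for the GPU OOM error
            break
    return best
```

In practice you would pass the candidate --batch-size values to HybridModel.py by hand; the loop just captures the search strategy.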
Hi, I am trying to fine-tune the model with the commands below.
singularity run --nv ~/software/singularity/helixer.sif fasta2h5.py --species Ac --h5-output-path Ac.h5 --fasta-path Ac.fa
singularity run --nv ~/software/singularity/helixer.sif HybridModel.py --load-model-path $HOME/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 --test-data Ac.h5 --overlap --val-test-batch-size 32 -v
singularity run --nv ~/software/singularity/helixer.sif helixer_post_bin Ac.h5 predictions.h5 100 0.1 0.8 60 Ac.post.gff3
singularity run --nv ~/software/singularity/helixer.sif import2geenuff.py --fasta Ac.fa --gff3 Ac.post.gff3 --db-path Ac.post.sqlite3 --log-file Ac.log --species Ac
singularity run --nv ~/software/singularity/helixer.sif geenuff2h5.py --h5-output-path Ac.post.h5 --input-db-path Ac.post.sqlite3
python3 filter-to-most-certain.py --write-by 6415200 --h5-to-filter Ac.post.h5 --predictions predictions.h5 --keep-fraction 0.2 --output-file Ac.post.filtered.h5
mkdir -p Ac_data
python3 n90_train_val_split.py --write-by 6415200 --h5-to-split Ac.post.filtered.h5 --output-pfx Ac_data/
singularity run --nv ~/software/singularity/helixer.sif HybridModel.py -v --batch-size 50 --val-test-batch-size 100 -e100 --class-weights "[0.7, 1.6, 1.2, 1.2]" --transition-weights "[1, 12, 3, 1, 12, 3]" --learning-rate 0.0001 --resume-training --fine-tune --load-model-path $HOME/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 --data-dir Ac_data --save-model-path ./Ac_best_model.h5
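For intuition, per-class weights like "[0.7, 1.6, 1.2, 1.2]" typically act as multipliers on each base's cross-entropy loss, so the rarer classes (UTR, exon, intron) count more per base than the abundant intergenic class. A minimal sketch of that general mechanism (an assumption about the technique, not Helixer's actual implementation):

```python
import numpy as np

# Weighted categorical cross-entropy: each base's loss is scaled by the
# weight of its true class.
def weighted_cross_entropy(y_true, y_pred, class_weights):
    # y_true: one-hot labels (n, 4); y_pred: predicted probabilities (n, 4)
    per_base = -np.sum(y_true * np.log(y_pred + 1e-12), axis=1)
    weights = y_true @ np.asarray(class_weights)  # weight of each base's true class
    return np.mean(weights * per_base)
```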
But I encountered the following error while fine-tuning the model.
/usr/local/lib/python3.8/dist-packages/helixer/prediction/Metrics.py:74: RuntimeWarning: invalid value encountered in divide
  normalized_cm = self.cm / class_sums[:, None]  # expand by one dim so broadcast work properly
slurm-9993101.log

Could you please help me understand what might be causing this issue? Thank you.
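For reference, that warning can be reproduced in isolation: it comes from normalizing a confusion matrix whose row sums (true-class totals) are zero when a class never occurs in the data, so 0/0 produces NaN. A guarded version of the same division (a sketch, not Helixer's code):

```python
import numpy as np

# Confusion matrix where three classes never occur in the data, so their
# row sums are zero and a plain division would yield NaN rows and the
# "invalid value encountered in divide" RuntimeWarning.
cm = np.array([[100.0, 0.0, 0.0, 0.0],
               [0.0,   0.0, 0.0, 0.0],   # class absent from the data
               [0.0,   0.0, 0.0, 0.0],
               [0.0,   0.0, 0.0, 0.0]])

class_sums = cm.sum(axis=1)
# Only divide where the row sum is nonzero; absent classes stay all-zero.
normalized_cm = np.divide(cm, class_sums[:, None],
                          out=np.zeros_like(cm),
                          where=class_sums[:, None] != 0)
print(normalized_cm[0])  # first row becomes 1, 0, 0, 0; no NaNs anywhere
```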