Open Wenfei-Xian opened 7 months ago
Hi @Wenfei-Xian
Not sure if this comes too late, but to troubleshoot this I would first recommend using h5tree (available from https://github.com/johnaparker/h5tree; it can be installed with pip install h5tree), like so:
h5tree -va Col-CC.v2.predictions_training.h5
h5tree -va Col-CC.v2.fa_predictions.h5
This is to check whether the h5 files have 0-length datasets, or non-matching lengths between the predictions and the training data. If so, the underlying error (perhaps an incomplete run) probably occurred at an earlier step; I'd progressively check the previous h5 files and try rerunning from one that looks good before troubleshooting further.
If that didn't make sense or you don't see anything suspicious, please simply post the output of the h5tree commands here.
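For anyone who prefers to script the same check, here is a minimal sketch using h5py. It is not part of Helixer; the function names are made up for illustration, and the dataset names in the usage comments (`data/X`, `predictions`) are assumptions based on the h5tree output posted further down in this thread.

```python
def collect_shapes(path):
    """Return {dataset_name: shape} for every dataset in an HDF5 file."""
    import h5py  # imported lazily so the pure-Python checks below work without it

    shapes = {}

    def visit(name, obj):
        # visititems passes groups as well as datasets; keep datasets only
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape

    with h5py.File(path, "r") as h5:
        h5.visititems(visit)
    return shapes


def empty_datasets(shapes):
    """Names of datasets that have a zero-length dimension."""
    return sorted(name for name, shape in shapes.items() if 0 in shape)


def length_mismatches(shapes, reference):
    """Datasets whose first dimension differs from that of `reference`."""
    n_chunks = shapes[reference][0]
    return {
        name: shape[0]
        for name, shape in shapes.items()
        if shape and shape[0] != n_chunks
    }


# Hypothetical usage (file names taken from the commands above):
#   train = collect_shapes("Col-CC.v2.predictions_training.h5")
#   preds = collect_shapes("Col-CC.v2.fa_predictions.h5")
#   print(empty_datasets(train), empty_datasets(preds))
#   print(length_mismatches(train, "data/X"))
#   print(train["data/X"][0] == preds["predictions"][0])
```

Metadata arrays may legitimately have a different length, so a reported mismatch is something to look at rather than proof of a problem.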
Hi @alisandra,
I got the same error when fine-tuning with RNA-seq data (aligned with hisat2) on a ~5 Gb plant genome. These are the commands I ran:
fasta2h5.py --species Ang_v1 --h5-output-path Ang_v1.h5 --fasta-path Ang_hardsoft.fa
HybridModel.py --load-model-path $HOME/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 --test-data Ang_v1.h5 --overlap --val-test-batch-size 32 -v
helixer_post_bin Ang_v1.h5 predictions.h5 100 0.1 0.8 60 Ang_v1_helixer.gff3
import2geenuff.py --fasta Ang_hardsoft.fa --gff3 Ang_v1_helixer.gff3 --db-path Ang_v1.sqlite3 --log-file Ang_v1.log --species Ang_v1
geenuff2h5.py --h5-output-path Ang_v1_helixer_post.h5 --input-db-path Ang_v1.sqlite3
cp Ang_v1_helixer_post.h5 Ang_v1_helixer_post_backup.h5
python3 add_ngs_coverage.py -s Ang_v1 --second-read-is-sense-strand --bam out.sorted.bam --h5-data Ang_v1_helixer_post.h5 --dataset-prefix rnaseq --threads 20
python3 filter-to-most-certain.py --write-by 6415200 --h5-to-filter Ang_v1_helixer_post.h5 --predictions predictions.h5 --keep-fraction 0.2 --output-file filtered.h5
The error output of filter-to-most-certain.py is:
ptg024299l: chunks from 452064-452068
(b'ptg023992l', 452068, 452072)
ptg023992l: chunks from 452068-452072
(b'ptg022744l', 452072, 452076)
ptg022744l: chunks from 452072-452076
(b'ptg024273l', 452076, 452080)
ptg024273l: chunks from 452076-452080
(b'ptg024019l', 452080, 452084)
ptg024019l: chunks from 452080-452084
(b'ptg023811l', 452084, 452088)
ptg023811l: chunks from 452084-452088
(b'ptg020677l', 452088, 452092)
ptg020677l: chunks from 452088-452092
(b'ptg023630l', 452092, 452096)
ptg023630l: chunks from 452092-452096
(b'ptg023926l', 452096, 452098)
ptg023926l: chunks from 452096-452098
(b'ptg024354l', 452098, 452100)
ptg024354l: chunks from 452098-452100
(b'ptg012681l', 452100, 452102)
ptg012681l: chunks from 452100-452102
(b'ptg020814l', 452102, 452104)
ptg020814l: chunks from 452102-452104
(b'ptg019385l', 452104, 452106)
ptg019385l: chunks from 452104-452106
(b'ptg023819l', 452106, 452108)
ptg023819l: chunks from 452106-452108
(b'ptg021327l', 452108, 452110)
ptg021327l: chunks from 452108-452110
(b'ptg009655l', 452110, 452112)
ptg009655l: chunks from 452110-452112
(b'ptg015198l', 452112, 452114)
ptg015198l: chunks from 452112-452114
selecting 90422 with average normalized distances below in each genic proportion ranking [0.0036187248643845867]
INFO: the following arrays will be copied in their entirety and not be subset,
these are expected to relate to metadata:
['evaluation/rnaseq_meta/bam_files']
Traceback (most recent call last):
File "filter-to-most-certain.py", line 117, in <module>
main(args)
File "filter-to-most-certain.py", line 102, in main
copy_groups_recursively(h5_in, h5_out, skip_arrays=skip_groups, start_i=si, end_i=si + max_n_chunks,
File "n90_train_val_split.py", line 121, in copy_groups_recursively
h5_in.visititems(maybe_copy_some_data)
File "conda/Helixer/env/lib/python3.8/site-packages/h5py/_hl/group.py", line 668, in visititems
return h5o.visit(self.id, proxy)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 354, in h5py.h5o.visit
File "h5py/h5o.pyx", line 301, in h5py.h5o.cb_obj_simple
File "conda/Helixer/env/lib/python3.8/site-packages/h5py/_hl/group.py", line 667, in proxy
return func(name, self[name])
File "n90_train_val_split.py", line 119, in maybe_copy_some_data
copy_some_data(h5_in, h5_out, name, mask, start_i, end_i)
File "n90_train_val_split.py", line 105, in copy_some_data
keep_idxs = keep_idxs[mask]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 300
I used h5tree to check:
h5tree -va Ang_v1_helixer_post.h5
Ang_v1_helixer_post.h5 (2 objects, 4 attributes)
│ ├── geenuff_commit v0.3.2-19-g72bcb23
│ ├── helixer_commit v0.3.2-19-g72bcb23
│ ├── input_path Angiopteris_v1.sqlite3
│ ├── timestamp 2024-08-08 19:59:14.802168
├── data (12 objects)
│ ├── X (452114, 21384, 4), float16
│ ├── err_samples (452114,), bool
│ ├── fully_intergenic_samples (452114,), bool
│ ├── gene_lengths (452114, 21384), uint32
│ ├── is_annotated (452114,), bool
│ ├── phases (452114, 21384, 4), int8
│ ├── sample_weights (452114, 21384), int8
│ ├── seqids (452114,), |S50
│ ├── species (452114,), |S25
│ ├── start_ends (452114, 2), int64
│ ├── transitions (452114, 21384, 6), int8
│ └── y (452114, 21384, 4), int8
└── evaluation (3 objects)
    ├── rnaseq_coverage (452114, 21384, 1), int64
    ├── rnaseq_meta (1 object)
    │   └── bam_files (1,), |S512
    └── rnaseq_spliced_coverage (452114, 21384, 1), int64
h5tree -va predictions.h5
predictions.h5 (2 objects, 6 attributes)
│ ├── model_config {"class_name": "Functional", "config": {"name": "model", "layers": [{"class_name": "InputLayer", "config": {"batch_input_shape": [null, null, 4], "dtype": "float32", "sparse": false, "ragged": false, "name": "main_input"}, "name": "main_input", "inbound_nodes": []}, {"class_name": "Conv1D", "config": {"name": "conv1d", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d", "inbound_nodes": [[["main_input", 0, 0, {}]]]}, {"class_name": "BatchNormalization", "config": {"name": "batch_normalization", "trainable": true, "dtype": "float32", "axis": [2], "momentum": 0.99, "epsilon": 0.001, "center": true, "scale": true, "beta_initializer": {"class_name": "Zeros", "config": {}}, "gamma_initializer": {"class_name": "Ones", "config": {}}, "moving_mean_initializer": {"class_name": "Zeros", "config": {}}, "moving_variance_initializer": {"class_name": "Ones", "config": {}}, "beta_regularizer": null, "gamma_regularizer": null, "beta_constraint": null, "gamma_constraint": null}, "name": "batch_normalization", "inbound_nodes": [[["conv1d", 0, 0, {}]]]}, {"class_name": "Conv1D", "config": {"name": "conv1d_1", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, 
"activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d_1", "inbound_nodes": [[["batch_normalization", 0, 0, {}]]]}, {"class_name": "BatchNormalization", "config": {"name": "batch_normalization_1", "trainable": true, "dtype": "float32", "axis": [2], "momentum": 0.99, "epsilon": 0.001, "center": true, "scale": true, "beta_initializer": {"class_name": "Zeros", "config": {}}, "gamma_initializer": {"class_name": "Ones", "config": {}}, "moving_mean_initializer": {"class_name": "Zeros", "config": {}}, "moving_variance_initializer": {"class_name": "Ones", "config": {}}, "beta_regularizer": null, "gamma_regularizer": null, "beta_constraint": null, "gamma_constraint": null}, "name": "batch_normalization_1", "inbound_nodes": [[["conv1d_1", 0, 0, {}]]]}, {"class_name": "Conv1D", "config": {"name": "conv1d_2", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d_2", "inbound_nodes": [[["batch_normalization_1", 0, 0, {}]]]}, {"class_name": "BatchNormalization", "config": {"name": "batch_normalization_2", "trainable": true, "dtype": "float32", "axis": [2], "momentum": 0.99, "epsilon": 0.001, "center": true, "scale": true, "beta_initializer": {"class_name": "Zeros", "config": {}}, "gamma_initializer": {"class_name": "Ones", "config": {}}, "moving_mean_initializer": {"class_name": "Zeros", "config": {}}, "moving_variance_initializer": {"class_name": "Ones", "config": {}}, "beta_regularizer": null, "gamma_regularizer": null, "beta_constraint": null, "gamma_constraint": null}, "name": 
"batch_normalization_2", "inbound_nodes": [[["conv1d_2", 0, 0, {}]]]}, {"class_name": "Conv1D", "config": {"name": "conv1d_3", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d_3", "inbound_nodes": [[["batch_normalization_2", 0, 0, {}]]]}, {"class_name": "Reshape", "config": {"name": "reshape", "trainable": true, "dtype": "float32", "target_shape": [-1, 864]}, "name": "reshape", "inbound_nodes": [[["conv1d_3", 0, 0, {}]]]}, {"class_name": "Bidirectional", "config": {"name": "bidirectional", "trainable": true, "dtype": "float32", "layer": {"class_name": "LSTM", "config": {"name": "lstm", "trainable": true, "dtype": "float32", "return_sequences": true, "return_state": false, "go_backwards": false, "stateful": false, "unroll": false, "time_major": false, "units": 128, "activation": "tanh", "recurrent_activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}, "shared_object_id": 29}, "recurrent_initializer": {"class_name": "Orthogonal", "config": {"gain": 1.0, "seed": null}, "shared_object_id": 30}, "bias_initializer": {"class_name": "Zeros", "config": {}, "shared_object_id": 31}, "unit_forget_bias": true, "kernel_regularizer": null, "recurrent_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "recurrent_constraint": null, "bias_constraint": null, "dropout": 0.0, "recurrent_dropout": 0.0, "implementation": 2}}, "merge_mode": "concat"}, "name": "bidirectional", "inbound_nodes": [[["reshape", 0, 0, {}]]]}, 
{"class_name": "Bidirectional", "config": {"name": "bidirectional_1", "trainable": true, "dtype": "float32", "layer": {"class_name": "LSTM", "config": {"name": "lstm_1", "trainable": true, "dtype": "float32", "return_sequences": true, "return_state": false, "go_backwards": false, "stateful": false, "unroll": false, "time_major": false, "units": 128, "activation": "tanh", "recurrent_activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}, "shared_object_id": 35}, "recurrent_initializer": {"class_name": "Orthogonal", "config": {"gain": 1.0, "seed": null}, "shared_object_id": 36}, "bias_initializer": {"class_name": "Zeros", "config": {}, "shared_object_id": 37}, "unit_forget_bias": true, "kernel_regularizer": null, "recurrent_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "recurrent_constraint": null, "bias_constraint": null, "dropout": 0.0, "recurrent_dropout": 0.0, "implementation": 2}}, "merge_mode": "concat"}, "name": "bidirectional_1", "inbound_nodes": [[["bidirectional", 0, 0, {}]]]}, {"class_name": "Bidirectional", "config": {"name": "bidirectional_2", "trainable": true, "dtype": "float32", "layer": {"class_name": "LSTM", "config": {"name": "lstm_2", "trainable": true, "dtype": "float32", "return_sequences": true, "return_state": false, "go_backwards": false, "stateful": false, "unroll": false, "time_major": false, "units": 128, "activation": "tanh", "recurrent_activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}, "shared_object_id": 41}, "recurrent_initializer": {"class_name": "Orthogonal", "config": {"gain": 1.0, "seed": null}, "shared_object_id": 42}, "bias_initializer": {"class_name": "Zeros", "config": {}, "shared_object_id": 43}, "unit_forget_bias": true, "kernel_regularizer": null, "recurrent_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, 
"kernel_constraint": null, "recurrent_constraint": null, "bias_constraint": null, "dropout": 0.0, "recurrent_dropout": 0.0, "implementation": 2}}, "merge_mode": "concat"}, "name": "bidirectional_2", "inbound_nodes": [[["bidirectional_1", 0, 0, {}]]]}, {"class_name": "Dense", "config": {"name": "dense", "trainable": true, "dtype": "float32", "units": 72, "activation": "linear", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "dense", "inbound_nodes": [[["bidirectional_2", 0, 0, {}]]]}, {"class_name": "TFOpLambda", "config": {"name": "tf.split", "trainable": true, "dtype": "float32", "function": "split"}, "name": "tf.split", "inbound_nodes": [["dense", 0, 0, {"num_or_size_splits": 2, "axis": -1}]]}, {"class_name": "Reshape", "config": {"name": "reshape_1", "trainable": true, "dtype": "float32", "target_shape": [-1, 9, 4]}, "name": "reshape_1", "inbound_nodes": [[["tf.split", 0, 0, {}]]]}, {"class_name": "Reshape", "config": {"name": "reshape_2", "trainable": true, "dtype": "float32", "target_shape": [-1, 9, 4]}, "name": "reshape_2", "inbound_nodes": [[["tf.split", 0, 1, {}]]]}, {"class_name": "Activation", "config": {"name": "genic", "trainable": true, "dtype": "float32", "activation": "softmax"}, "name": "genic", "inbound_nodes": [[["reshape_1", 0, 0, {}]]]}, {"class_name": "Activation", "config": {"name": "phase", "trainable": true, "dtype": "float32", "activation": "softmax"}, "name": "phase", "inbound_nodes": [[["reshape_2", 0, 0, {}]]]}], "input_layers": [["main_input", 0, 0]], "output_layers": [["genic", 0, 0], ["phase", 0, 0]]}}
│ ├── model_md5sum f0e00efcbea83c66b69258d11119a691 /home/lwh/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5
│ ├── model_path /home/lwh/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5
│ ├── n_bases_removed 0
│ ├── test_data_path Angiopteris_yunnanensis_v1.h5
│ ├── timestamp 2024-08-07 19:59:57.633520
├── predictions (452114, 21384, 4), float16
└── predictions_phase (452114, 21384, 4), float16
0 groups, 2 datasets
I would greatly appreciate any advice on how to solve these errors!
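As a side note on the traceback above: NumPy raises this IndexError whenever a boolean mask is applied to an array whose first dimension has a different length. The sketch below only reproduces that NumPy behaviour and shows an explicit length check; it is not the code from n90_train_val_split.py, and `subset_rows` is a hypothetical helper. Which dataset is involved is not confirmed by the traceback. One candidate worth checking is `evaluation/rnaseq_meta/bam_files`, since it is the only dataset in the h5tree output with a first dimension of 1, but that is a guess.

```python
import numpy as np


def subset_rows(values, mask):
    """Apply a boolean row mask, failing early with a clearer message on a length mismatch."""
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)
    if values.shape[0] != mask.shape[0]:
        raise ValueError(
            f"mask has {mask.shape[0]} entries but the array has {values.shape[0]} rows"
        )
    return values[mask]


# Reproduce the NumPy behaviour: a 300-entry mask applied to a length-1 array.
mask = np.ones(300, dtype=bool)
try:
    np.arange(1)[mask]
    raised = False
except IndexError as err:
    raised = True
    print(err)
```

Checking lengths before indexing, as `subset_rows` does, makes it easier to report which array and which mask disagree.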
Hey, many thanks for this awesome tool!
I am trying to fine-tune for Arabidopsis thaliana with more RNA-seq data. Below are the commands I used, but I got an error when I ran filter-to-most-certain.py
https://raw.githubusercontent.com/weberlab-hhu/helixer_scratch/master/data_scripts/filter-to-most-certain.py
https://raw.githubusercontent.com/weberlab-hhu/helixer_scratch/master/data_scripts/n90_train_val_split.py
commands:
Error message of filter-to-most-certain.py