weberlab-hhu / Helixer

Using Deep Learning to predict gene annotations
GNU General Public License v3.0

boolean index did not match indexed array along dimension 0 #125

Open Wenfei-Xian opened 7 months ago

Wenfei-Xian commented 7 months ago

Hey, many thanks for this awesome tool!

I am trying to fine-tune Helixer for Arabidopsis thaliana with additional RNA-seq data. Below are the scripts and commands I used; the error occurs when I run filter-to-most-certain.py.

https://raw.githubusercontent.com/weberlab-hhu/helixer_scratch/master/data_scripts/filter-to-most-certain.py
https://raw.githubusercontent.com/weberlab-hhu/helixer_scratch/master/data_scripts/n90_train_val_split.py

Commands:

singularity exec -B /tmp/global2/wxian/software/Helixer_fine_tuning:/tmp/global2/wxian/software/Helixer_fine_tuning --no-home ../helixer-docker_helixer_v0.3.2_cuda_11.8.0-cudnn8.sif fasta2h5.py --subsequence-length 213840 --species Arabidopsis_thaliana --h5-output-path Col-CC.v2.fa.h5 --fasta-path Col-CC.v2.fa

singularity exec --nv -B /tmp/global2/wxian/software/Helixer_fine_tuning:/tmp/global2/wxian/software/Helixer_fine_tuning --no-home ../helixer-docker_helixer_v0.3.2_cuda_11.8.0-cudnn8.sif HybridModel.py --load-model-path /tmp/global2/wxian/software/Helixer/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 --test-data Col-CC.v2.fa.h5 --prediction-output-path Col-CC.v2.fa_predictions.h5 --overlap --overlap-offset 106920  --batch-size 9 --val-test-batch-size 9 -v

singularity exec --nv -B /tmp/global2/wxian/software/Helixer_fine_tuning:/tmp/global2/wxian/software/Helixer_fine_tuning --no-home ../helixer-docker_helixer_v0.3.2_cuda_11.8.0-cudnn8.sif helixer_post_bin Col-CC.v2.fa.h5 Col-CC.v2.fa_predictions.h5 100 0.1 0.8 60 Col-CC.v2.fa.helixer.gff3

singularity exec -B /tmp/global2/wxian/software/Helixer_fine_tuning:/tmp/global2/wxian/software/Helixer_fine_tuning --no-home ../helixer-docker_helixer_v0.3.2_cuda_11.8.0-cudnn8.sif import2geenuff.py --fasta Col-CC.v2.fa --gff3 Col-CC.v2.fa.helixer.gff3 --db-path Col-CC.v2.sqlite3 --log-file Col-CC.v2.log --species Arabidopsis_thaliana

singularity exec -B /tmp/global2/wxian/software/Helixer_fine_tuning:/tmp/global2/wxian/software/Helixer_fine_tuning --no-home ../helixer-docker_helixer_v0.3.2_cuda_11.8.0-cudnn8.sif geenuff2h5.py --h5-output-path Col-CC.v2.predictions_training.h5 --input-db-path Col-CC.v2.sqlite3 --subsequence-length 213840

cp Col-CC.v2.predictions_training.h5 Col-CC.v2.predictions_training.backup.h5

python3 ../Helixer/helixer/evaluation/add_ngs_coverage.py -s Arabidopsis_thaliana --unstranded --bam RNA_seq_stress/SRX1882551.sorted.bam --h5-data Col-CC.v2.predictions_training.h5 --dataset-prefix rnaseq --threads 128

python3 filter-to-most-certain.py --write-by 6415200 --h5-to-filter Col-CC.v2.predictions_training.h5 --predictions Col-CC.v2.fa_predictions.h5 --keep-fraction 0.2 --output-file Col-CC.v2.fa.filtered.h5

Error message from filter-to-most-certain.py:

selecting 265 with average normalized distances below in each genic proportion ranking [0.032440142162364384, 0.034895248784137675, 0.03238402543958099, 0.03226711560044893, 0.015221076505798728]
INFO: the following arrays will be copied in their entirety and not be subset,
these are expected to relate to metadata:
 ['evaluation/rnaseq_meta/bam_files']
Traceback (most recent call last):
  File "/tmp/global2/wxian/software/Helixer_fine_tuning/filter-to-most-certain.py", line 116, in <module>
    main(args)
  File "/tmp/global2/wxian/software/Helixer_fine_tuning/filter-to-most-certain.py", line 101, in main
    copy_groups_recursively(h5_in, h5_out, skip_arrays=skip_groups, start_i=si, end_i=si + max_n_chunks,
  File "/tmp/global2/wxian/software/Helixer_fine_tuning/n90_train_val_split.py", line 121, in copy_groups_recursively
    h5_in.visititems(maybe_copy_some_data)
  File "/tmp/global2/wxian/conda/envs/htseq/lib/python3.10/site-packages/h5py/_hl/group.py", line 668, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/tmp/global2/wxian/conda/envs/htseq/lib/python3.10/site-packages/h5py/_hl/group.py", line 667, in proxy
    return func(name, self[name])
  File "/tmp/global2/wxian/software/Helixer_fine_tuning/n90_train_val_split.py", line 119, in maybe_copy_some_data
    copy_some_data(h5_in, h5_out, name, mask, start_i, end_i)
  File "/tmp/global2/wxian/software/Helixer_fine_tuning/n90_train_val_split.py", line 105, in copy_some_data
    keep_idxs = keep_idxs[mask]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 30

alisandra commented 4 months ago

Hi @Wenfei-Xian

Not sure if this comes too late, but to troubleshoot this I would first recommend using h5tree (available from https://github.com/johnaparker/h5tree; it can be installed with pip install h5tree).

So, for your files:

h5tree -va Col-CC.v2.predictions_training.h5
h5tree -va Col-CC.v2.fa_predictions.h5

This is to check whether the h5 files contain 0-length datasets, or whether dataset lengths differ between the predictions and the training data. If so, the root cause (perhaps an incomplete run) probably occurred in an earlier step. In that case I'd check the earlier h5 files one by one and rerun from the last one that looks good before troubleshooting further.

If that doesn't make sense or you don't see anything suspicious, please simply post the output of the h5tree commands here.
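
For anyone who prefers to script the check described above, here is a minimal sketch using h5py. It is only an illustration, not part of Helixer: the helper names `collect_shapes` and `find_length_problems` are made up for this example, and it assumes that the first dimension of each dataset is the chunk count you want to compare.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def collect_shapes(path: str) -> Dict[str, Shape]:
    """Return {dataset_name: shape} for every dataset in an HDF5 file."""
    # Imported here so the comparison logic below works without h5py installed.
    import h5py

    shapes: Dict[str, Shape] = {}

    def record(name, obj):
        # visititems calls this for groups and datasets; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape

    with h5py.File(path, "r") as h5:
        h5.visititems(record)
    return shapes


def find_length_problems(shapes: Dict[str, Shape], expected_len: int) -> List[str]:
    """List datasets that are empty/scalar or whose first dimension differs."""
    problems = []
    for name, shape in sorted(shapes.items()):
        if len(shape) == 0 or shape[0] == 0:
            problems.append(f"{name}: empty or scalar, shape={shape}")
        elif shape[0] != expected_len:
            problems.append(f"{name}: first dimension {shape[0]} != {expected_len}")
    return problems
```

To use it, one could call `collect_shapes("Col-CC.v2.predictions_training.h5")` and `collect_shapes("Col-CC.v2.fa_predictions.h5")`, take the chunk count from one reference dataset (for example the first dimension of `data/X`, assuming that dataset exists in your file), and pass each shape dictionary to `find_length_problems`. Small metadata arrays will be flagged too, so the output still needs a human look.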

WenhaoLiu0218 commented 3 months ago

Hi @alisandra,

I got the same error when fine-tuning with RNA-seq data (aligned with hisat2) on a ~5 Gb plant genome. These are the commands I ran:

fasta2h5.py --species Ang_v1 --h5-output-path Ang_v1.h5 --fasta-path Ang_hardsoft.fa
HybridModel.py --load-model-path $HOME/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5 --test-data Ang_v1.h5 --overlap --val-test-batch-size 32 -v
helixer_post_bin Ang_v1.h5 predictions.h5 100 0.1 0.8 60 Ang_v1_helixer.gff3
import2geenuff.py --fasta Ang_hardsoft.fa --gff3 Ang_v1_helixer.gff3 --db-path Ang_v1.sqlite3 --log-file Ang_v1.log  --species Ang_v1
geenuff2h5.py --h5-output-path Ang_v1_helixer_post.h5  --input-db-path Ang_v1.sqlite3
cp Ang_v1_helixer_post.h5 Ang_v1_helixer_post_backup.h5
python3 add_ngs_coverage.py -s Ang_v1 --second-read-is-sense-strand --bam out.sorted.bam --h5-data Ang_v1_helixer_post.h5 --dataset-prefix rnaseq --threads 20
python3 filter-to-most-certain.py --write-by 6415200  --h5-to-filter Ang_v1_helixer_post.h5 --predictions predictions.h5  --keep-fraction 0.2 --output-file filtered.h5

The error output of filter-to-most-certain.py is:

ptg024299l: chunks from 452064-452068
(b'ptg023992l', 452068, 452072)
ptg023992l: chunks from 452068-452072
(b'ptg022744l', 452072, 452076)
ptg022744l: chunks from 452072-452076
(b'ptg024273l', 452076, 452080)
ptg024273l: chunks from 452076-452080
(b'ptg024019l', 452080, 452084)
ptg024019l: chunks from 452080-452084
(b'ptg023811l', 452084, 452088)
ptg023811l: chunks from 452084-452088
(b'ptg020677l', 452088, 452092)
ptg020677l: chunks from 452088-452092
(b'ptg023630l', 452092, 452096)
ptg023630l: chunks from 452092-452096
(b'ptg023926l', 452096, 452098)
ptg023926l: chunks from 452096-452098
(b'ptg024354l', 452098, 452100)
ptg024354l: chunks from 452098-452100
(b'ptg012681l', 452100, 452102)
ptg012681l: chunks from 452100-452102
(b'ptg020814l', 452102, 452104)
ptg020814l: chunks from 452102-452104
(b'ptg019385l', 452104, 452106)
ptg019385l: chunks from 452104-452106
(b'ptg023819l', 452106, 452108)
ptg023819l: chunks from 452106-452108
(b'ptg021327l', 452108, 452110)
ptg021327l: chunks from 452108-452110
(b'ptg009655l', 452110, 452112)
ptg009655l: chunks from 452110-452112
(b'ptg015198l', 452112, 452114)
ptg015198l: chunks from 452112-452114
selecting 90422 with average normalized distances below in each genic proportion ranking [0.0036187248643845867]
INFO: the following arrays will be copied in their entirety and not be subset,
these are expected to relate to metadata:
 ['evaluation/rnaseq_meta/bam_files']
Traceback (most recent call last):
  File "filter-to-most-certain.py", line 117, in <module>
    main(args)
  File "filter-to-most-certain.py", line 102, in main
    copy_groups_recursively(h5_in, h5_out, skip_arrays=skip_groups, start_i=si, end_i=si + max_n_chunks,
  File "n90_train_val_split.py", line 121, in copy_groups_recursively
    h5_in.visititems(maybe_copy_some_data)
  File "conda/Helixer/env/lib/python3.8/site-packages/h5py/_hl/group.py", line 668, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 354, in h5py.h5o.visit
  File "h5py/h5o.pyx", line 301, in h5py.h5o.cb_obj_simple
  File "conda/Helixer/env/lib/python3.8/site-packages/h5py/_hl/group.py", line 667, in proxy
    return func(name, self[name])
  File "n90_train_val_split.py", line 119, in maybe_copy_some_data
    copy_some_data(h5_in, h5_out, name, mask, start_i, end_i)
  File "n90_train_val_split.py", line 105, in copy_some_data
    keep_idxs = keep_idxs[mask]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 300

I used h5tree to check:

h5tree -va Ang_v1_helixer_post.h5

Ang_v1_helixer_post.h5  (2 objects, 4 attributes)
│   ├── geenuff_commit  v0.3.2-19-g72bcb23
│   ├── helixer_commit  v0.3.2-19-g72bcb23
│   ├── input_path  Angiopteris_v1.sqlite3
│   ├── timestamp  2024-08-08 19:59:14.802168
├── data  (12 objects)
│   ├── X  (452114, 21384, 4), float16
│   ├── err_samples  (452114,), bool
│   ├── fully_intergenic_samples  (452114,), bool
│   ├── gene_lengths  (452114, 21384), uint32
│   ├── is_annotated  (452114,), bool
│   ├── phases  (452114, 21384, 4), int8
│   ├── sample_weights  (452114, 21384), int8
│   ├── seqids  (452114,), |S50
│   ├── species  (452114,), |S25
│   ├── start_ends  (452114, 2), int64
│   ├── transitions  (452114, 21384, 6), int8
│   └── y  (452114, 21384, 4), int8
└── evaluation  (3 objects)
    ├── rnaseq_coverage  (452114, 21384, 1), int64
    ├── rnaseq_meta  (1 object)
    │   └── bam_files  (1,), |S512
    └── rnaseq_spliced_coverage  (452114, 21384, 1), int64

h5tree -va predictions.h5

predictions.h5  (2 objects, 6 attributes)
│   ├── model_config  {"class_name": "Functional", "config": {"name": "model", "layers": [{"class_name": "InputLayer", "config": {"batch_input_shape": [null, null, 4], "dtype": "float32", "sparse": false, "ragged": false, "name": "main_input"}, "name": "main_input", "inbound_nodes": []}, {"class_name": "Conv1D", "config": {"name": "conv1d", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d", "inbound_nodes": [[["main_input", 0, 0, {}]]]}, {"class_name": "BatchNormalization", "config": {"name": "batch_normalization", "trainable": true, "dtype": "float32", "axis": [2], "momentum": 0.99, "epsilon": 0.001, "center": true, "scale": true, "beta_initializer": {"class_name": "Zeros", "config": {}}, "gamma_initializer": {"class_name": "Ones", "config": {}}, "moving_mean_initializer": {"class_name": "Zeros", "config": {}}, "moving_variance_initializer": {"class_name": "Ones", "config": {}}, "beta_regularizer": null, "gamma_regularizer": null, "beta_constraint": null, "gamma_constraint": null}, "name": "batch_normalization", "inbound_nodes": [[["conv1d", 0, 0, {}]]]}, {"class_name": "Conv1D", "config": {"name": "conv1d_1", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": 
null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d_1", "inbound_nodes": [[["batch_normalization", 0, 0, {}]]]}, {"class_name": "BatchNormalization", "config": {"name": "batch_normalization_1", "trainable": true, "dtype": "float32", "axis": [2], "momentum": 0.99, "epsilon": 0.001, "center": true, "scale": true, "beta_initializer": {"class_name": "Zeros", "config": {}}, "gamma_initializer": {"class_name": "Ones", "config": {}}, "moving_mean_initializer": {"class_name": "Zeros", "config": {}}, "moving_variance_initializer": {"class_name": "Ones", "config": {}}, "beta_regularizer": null, "gamma_regularizer": null, "beta_constraint": null, "gamma_constraint": null}, "name": "batch_normalization_1", "inbound_nodes": [[["conv1d_1", 0, 0, {}]]]}, {"class_name": "Conv1D", "config": {"name": "conv1d_2", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d_2", "inbound_nodes": [[["batch_normalization_1", 0, 0, {}]]]}, {"class_name": "BatchNormalization", "config": {"name": "batch_normalization_2", "trainable": true, "dtype": "float32", "axis": [2], "momentum": 0.99, "epsilon": 0.001, "center": true, "scale": true, "beta_initializer": {"class_name": "Zeros", "config": {}}, "gamma_initializer": {"class_name": "Ones", "config": {}}, "moving_mean_initializer": {"class_name": "Zeros", "config": {}}, "moving_variance_initializer": {"class_name": "Ones", "config": {}}, "beta_regularizer": null, "gamma_regularizer": null, "beta_constraint": null, "gamma_constraint": null}, "name": 
"batch_normalization_2", "inbound_nodes": [[["conv1d_2", 0, 0, {}]]]}, {"class_name": "Conv1D", "config": {"name": "conv1d_3", "trainable": true, "dtype": "float32", "filters": 96, "kernel_size": [12], "strides": [1], "padding": "same", "data_format": "channels_last", "dilation_rate": [1], "groups": 1, "activation": "relu", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "conv1d_3", "inbound_nodes": [[["batch_normalization_2", 0, 0, {}]]]}, {"class_name": "Reshape", "config": {"name": "reshape", "trainable": true, "dtype": "float32", "target_shape": [-1, 864]}, "name": "reshape", "inbound_nodes": [[["conv1d_3", 0, 0, {}]]]}, {"class_name": "Bidirectional", "config": {"name": "bidirectional", "trainable": true, "dtype": "float32", "layer": {"class_name": "LSTM", "config": {"name": "lstm", "trainable": true, "dtype": "float32", "return_sequences": true, "return_state": false, "go_backwards": false, "stateful": false, "unroll": false, "time_major": false, "units": 128, "activation": "tanh", "recurrent_activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}, "shared_object_id": 29}, "recurrent_initializer": {"class_name": "Orthogonal", "config": {"gain": 1.0, "seed": null}, "shared_object_id": 30}, "bias_initializer": {"class_name": "Zeros", "config": {}, "shared_object_id": 31}, "unit_forget_bias": true, "kernel_regularizer": null, "recurrent_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "recurrent_constraint": null, "bias_constraint": null, "dropout": 0.0, "recurrent_dropout": 0.0, "implementation": 2}}, "merge_mode": "concat"}, "name": "bidirectional", "inbound_nodes": [[["reshape", 0, 0, {}]]]}, 
{"class_name": "Bidirectional", "config": {"name": "bidirectional_1", "trainable": true, "dtype": "float32", "layer": {"class_name": "LSTM", "config": {"name": "lstm_1", "trainable": true, "dtype": "float32", "return_sequences": true, "return_state": false, "go_backwards": false, "stateful": false, "unroll": false, "time_major": false, "units": 128, "activation": "tanh", "recurrent_activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}, "shared_object_id": 35}, "recurrent_initializer": {"class_name": "Orthogonal", "config": {"gain": 1.0, "seed": null}, "shared_object_id": 36}, "bias_initializer": {"class_name": "Zeros", "config": {}, "shared_object_id": 37}, "unit_forget_bias": true, "kernel_regularizer": null, "recurrent_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "recurrent_constraint": null, "bias_constraint": null, "dropout": 0.0, "recurrent_dropout": 0.0, "implementation": 2}}, "merge_mode": "concat"}, "name": "bidirectional_1", "inbound_nodes": [[["bidirectional", 0, 0, {}]]]}, {"class_name": "Bidirectional", "config": {"name": "bidirectional_2", "trainable": true, "dtype": "float32", "layer": {"class_name": "LSTM", "config": {"name": "lstm_2", "trainable": true, "dtype": "float32", "return_sequences": true, "return_state": false, "go_backwards": false, "stateful": false, "unroll": false, "time_major": false, "units": 128, "activation": "tanh", "recurrent_activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}, "shared_object_id": 41}, "recurrent_initializer": {"class_name": "Orthogonal", "config": {"gain": 1.0, "seed": null}, "shared_object_id": 42}, "bias_initializer": {"class_name": "Zeros", "config": {}, "shared_object_id": 43}, "unit_forget_bias": true, "kernel_regularizer": null, "recurrent_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, 
"kernel_constraint": null, "recurrent_constraint": null, "bias_constraint": null, "dropout": 0.0, "recurrent_dropout": 0.0, "implementation": 2}}, "merge_mode": "concat"}, "name": "bidirectional_2", "inbound_nodes": [[["bidirectional_1", 0, 0, {}]]]}, {"class_name": "Dense", "config": {"name": "dense", "trainable": true, "dtype": "float32", "units": 72, "activation": "linear", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}, "name": "dense", "inbound_nodes": [[["bidirectional_2", 0, 0, {}]]]}, {"class_name": "TFOpLambda", "config": {"name": "tf.split", "trainable": true, "dtype": "float32", "function": "split"}, "name": "tf.split", "inbound_nodes": [["dense", 0, 0, {"num_or_size_splits": 2, "axis": -1}]]}, {"class_name": "Reshape", "config": {"name": "reshape_1", "trainable": true, "dtype": "float32", "target_shape": [-1, 9, 4]}, "name": "reshape_1", "inbound_nodes": [[["tf.split", 0, 0, {}]]]}, {"class_name": "Reshape", "config": {"name": "reshape_2", "trainable": true, "dtype": "float32", "target_shape": [-1, 9, 4]}, "name": "reshape_2", "inbound_nodes": [[["tf.split", 0, 1, {}]]]}, {"class_name": "Activation", "config": {"name": "genic", "trainable": true, "dtype": "float32", "activation": "softmax"}, "name": "genic", "inbound_nodes": [[["reshape_1", 0, 0, {}]]]}, {"class_name": "Activation", "config": {"name": "phase", "trainable": true, "dtype": "float32", "activation": "softmax"}, "name": "phase", "inbound_nodes": [[["reshape_2", 0, 0, {}]]]}], "input_layers": [["main_input", 0, 0]], "output_layers": [["genic", 0, 0], ["phase", 0, 0]]}}
│   ├── model_md5sum  f0e00efcbea83c66b69258d11119a691  /home/lwh/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5
│   ├── model_path  /home/lwh/.local/share/Helixer/models/land_plant/land_plant_v0.3_a_0080.h5
│   ├── n_bases_removed  0
│   ├── test_data_path  Angiopteris_yunnanensis_v1.h5
│   ├── timestamp  2024-08-07 19:59:57.633520
├── predictions  (452114, 21384, 4), float16
└── predictions_phase  (452114, 21384, 4), float16

0 groups, 2 datasets

I would greatly appreciate any advice on how to solve this error!
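
One observation that may help narrow this down (a sketch, not a confirmed diagnosis): in both reports the "boolean dimension" in the IndexError equals the `--write-by` value divided by the subsequence length, and the only dataset in the h5tree output whose first dimension is 1 is `evaluation/rnaseq_meta/bam_files`. The snippet below just redoes that arithmetic. The function names are hypothetical, the shapes are copied from the h5tree output of Ang_v1_helixer_post.h5, and the subsequence length 21384 for the second report is read from that same output.

```python
def chunks_per_write(write_by: int, subsequence_length: int) -> int:
    """Number of chunks covered by one --write-by window."""
    return write_by // subsequence_length


def datasets_with_first_dim(shapes: dict, n: int) -> list:
    """Names of datasets whose first dimension equals n."""
    return [name for name, shape in shapes.items() if shape and shape[0] == n]


# Shapes copied from the h5tree output of Ang_v1_helixer_post.h5.
SHAPES = {
    "data/X": (452114, 21384, 4),
    "data/err_samples": (452114,),
    "data/fully_intergenic_samples": (452114,),
    "data/gene_lengths": (452114, 21384),
    "data/is_annotated": (452114,),
    "data/phases": (452114, 21384, 4),
    "data/sample_weights": (452114, 21384),
    "data/seqids": (452114,),
    "data/species": (452114,),
    "data/start_ends": (452114, 2),
    "data/transitions": (452114, 21384, 6),
    "data/y": (452114, 21384, 4),
    "evaluation/rnaseq_coverage": (452114, 21384, 1),
    "evaluation/rnaseq_meta/bam_files": (1,),
    "evaluation/rnaseq_spliced_coverage": (452114, 21384, 1),
}

# First report: --write-by 6415200, --subsequence-length 213840
print(chunks_per_write(6415200, 213840))   # 30, the mask length in that error
# Second report: --write-by 6415200, chunk length 21384
print(chunks_per_write(6415200, 21384))    # 300, the mask length in that error
# Which dataset has "dimension is 1"?
print(datasets_with_first_dim(SHAPES, 1))  # ['evaluation/rnaseq_meta/bam_files']
```

This suggests, but does not prove, that the per-chunk mask is being applied to the `bam_files` metadata array, even though the INFO message says that array should be copied in its entirety. Whether that is what happens inside `copy_some_data` would need to be confirmed against n90_train_val_split.py.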