Question about the structure of HDF5 file

ShanshanGu533 commented 1 year ago

Hi! I am having some difficulty understanding the structure of the HDF5 file produced by the second Helixer step, the prediction of base-wise probabilities with the deep-learning-based model (helixer/prediction/HybridModel.py). The predictions.h5 file has two arrays: predictions and predictions_phase. Each has dimensions 54572 x 21384 x 4, but I am not sure what each dimension means. Can you help me with this? Thank you very much!

alisandra commented 1 year ago

Dear ShanshanGu533,

Thanks for your question.

Helixer divides the genome into subsequences for processing so that they can fit in the GPU memory.

So your genome has been split into 54572 of these subsequences, each 21384 bp long. The last dimension holds the categories.

For predictions the categories are [intergenic, UTR, CDS, intron]

For predictions_phase the categories are [None, 0, 1, 2], where the network has been trained to predict 'None' for any non-CDS region; and within the CDS regions the network has been trained to predict the phase of each basepair in the codon.
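To make the layout concrete, here is a minimal sketch of how one might read a slice of the two arrays and turn the per-base values into category labels. It assumes h5py and NumPy are installed, and that the dataset names match those mentioned in the question (`predictions`, `predictions_phase`). The helper names `load_slice` and `call_classes` are made up for illustration and are not part of Helixer.

```python
import numpy as np

# Category order along the last axis, as described above.
GENIC_CLASSES = ["intergenic", "UTR", "CDS", "intron"]
PHASE_CLASSES = ["None", "0", "1", "2"]


def load_slice(path, first, last):
    """Read subsequences first..last-1 from both arrays (hypothetical helper)."""
    import h5py  # imported here so the rest of the sketch runs without h5py

    with h5py.File(path, "r") as f:
        # Slicing the dataset reads only the requested subsequences,
        # which avoids loading the whole array into memory.
        genic = f["predictions"][first:last]
        phase = f["predictions_phase"][first:last]
    return genic, phase


def call_classes(values, class_names):
    """Pick the highest-scoring category for every base.

    values has shape (n_subsequences, subsequence_length, 4).
    """
    best = np.argmax(values, axis=-1)
    return np.array(class_names)[best]


# Toy stand-in for a predictions array: 2 subsequences of 3 bp each.
toy = np.array([
    [[0.90, 0.05, 0.03, 0.02], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]],
    [[0.10, 0.10, 0.10, 0.70], [0.1, 0.1, 0.7, 0.1], [0.9, 0.05, 0.03, 0.02]],
])
labels = call_classes(toy, GENIC_CLASSES)
print(labels)
```

With the real file, something like `genic, phase = load_slice("predictions.h5", 0, 10)` would give two arrays of shape 10 x 21384 x 4. Reading in slices matters here: the full array holds 54572 x 21384 x 4 values, about 4.7 billion, per dataset.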

You would need the information from the input data to track which basepair in the h5 file corresponds to which basepair in the genome, and be warned that there are some tricky details there. I just updated this doc, https://github.com/weberlab-hhu/Helixer/blob/main/docs/h5_data.md, which would be the place to start.
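As a rough illustration of the kind of bookkeeping involved, the sketch below maps an offset within one subsequence back to a genome coordinate. Everything in it is an assumption to be checked against the linked doc: that each subsequence comes with a (start, end) pair in the input data, that start > end marks a minus-strand subsequence stored in reverse-complement orientation, that coordinates are 0-based with an exclusive upper bound, and that subsequences shorter than the full length are padded at the end. The function name `chunk_offset_to_genome` is hypothetical.

```python
def chunk_offset_to_genome(start, end, offset):
    """Map an offset inside one subsequence to (genome_position, strand).

    Returns None when the offset falls into padding. All conventions here
    (0-based, exclusive upper bound, start > end meaning minus strand) are
    assumptions, not confirmed Helixer behaviour.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")

    if start < end:
        # Plus strand: positions run upwards from start.
        if offset >= end - start:
            return None  # past the real sequence, i.e. padding
        return start + offset, "+"

    # Minus strand: the subsequence is read from start - 1 downwards to end.
    if offset >= start - end:
        return None  # padding
    return start - 1 - offset, "-"


print(chunk_offset_to_genome(0, 21384, 10))      # plus-strand example
print(chunk_offset_to_genome(21384, 0, 0))       # minus-strand example
print(chunk_offset_to_genome(42768, 50000, 8000))  # offset inside padding
```

The sequence identifier for each subsequence would have to be looked up separately in the input data; this sketch covers only the position arithmetic.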

Of course the easiest option, assuming you're using Helixer for gene calling and not something experimental, is to simply let HelixerPost take care of those details for you.

ShanshanGu533 commented 1 year ago

Hi Alisandra,

Thanks for your reply and the new explanation doc! It's really helpful.

Hope you have a nice day.

Best wishes, Shanshan

(Sent by email in reply to Alisandra Denton's comment of Mon, 29 May 2023, 03:53.)

weberlab-hhu / Helixer

Question about the structure of HDF5 file #98