y-hwang / gLM

Genomic language model predicts protein co-regulation and function
https://www.biorxiv.org/content/10.1101/2023.04.07.536042v3

"list index out of range" when analysing short contigs #2

Closed zwmuam closed 8 months ago

zwmuam commented 11 months ago

Your gLM is fascinating.

Unfortunately, I ran into a problem.

If I try to embed my own proteins/contigs, "glm_embed.py" crashes if any contig has FEWER than 30 genes (I think...).

This seems to be an index error in the sequence_id <-> embedding matching loop, connected with the fact that in "batch.pkl" the missing proteins are represented by 0s, which generate embeddings that no longer have any protein names to match.

Could you help? Can one embed proteins from shorter (sub)contigs?

_File "/media/solificatus/14_TB/gLM/gLM/./gLM/glm_embed.py", line 140, in infer glm_embs.append((ori_protids[i],emb)) IndexError: list index out of range

y-hwang commented 11 months ago

How was the batch.pkl file created? If batch_data.py was used, short contigs should be handled, because the padded elements will be assigned an attention_mask of 0. If you could print out batch['prot_ids'] and batch['attention_mask'] for the batch that causes this error, that would be helpful. If not, feel free to send me the smallest batch.pkl file with this problem by email: yhwang@g.harvard.edu. Thank you!
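
For example, something along these lines (a minimal sketch; it assumes batch.pkl is a plain pickle of a dict keyed by 'prot_ids' and 'attention_mask', as the field names in this thread suggest):

    import pickle

    # Load the batch dictionary (assumption: batch.pkl is a plain pickle
    # of a dict with 'prot_ids' and 'attention_mask' entries)
    with open('batch.pkl', 'rb') as fh:
        batch = pickle.load(fh)

    print(batch['prot_ids'])
    print(batch['attention_mask'])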

zwmuam commented 11 months ago

Yes, we used "batch_data.py".

batch['prot_ids']:

    [ 1  2  3  4  5  6  7  8  9 10 11 12 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
    [14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42  0]
    [43 44 45 46 47 48 49 50 51  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
    [52 53 54 55 56 57  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
    [58 59 60 61 62 63 64 65 66 67 68  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]

batch['attention_mask']:

    [1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
    [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0]
    [1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
    [1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
    [1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Many thanks for the quick reaction!

zwmuam commented 10 months ago

Would it confuse the model if I just filtered the zeros out at some stage before the final embedding? Or should I filter the results after the procedure?

y-hwang commented 10 months ago

Hello, I ran the batch.pkl file that you shared, and glm_embed.py ran as expected with no errors. Perhaps you can pdb into the line that is causing the error, print out both ori_prot_ids and i, and debug from there? Could you also please check that glm_embed.py has not been modified since git clone, as line 140 in the error message does not match the line number in the git repo.

zwmuam commented 10 months ago

The lines are shifted by my debug prints (sorry). "ori_prot_ids" is a list of 58 elements:

    ['lcl|NC_000913.3_prot_NP_414542.1_1', 'lcl|NC_000913.3_prot_NP_414543.1_2', 'lcl|NC_000913.3_prot_NP_414544.1_3', 'lcl|NC_000913.3_prot_NP_414545.1_4', 'lcl|NC_000913.3_prot_NP_414546.1_5', 'lcl|NC_000913.3_prot_NP_414547.1_6', 'lcl|NC_000913.3_prot_NP_414548.1_7', 'lcl|NC_000913.3_prot_NP_414549.1_8', 'lcl|NC_000913.3_prot_NP_414550.1_9', 'lcl|NC_000913.3_prot_NP_414551.1_10', 'lcl|NC_000913.3_prot_NP_414552.1_11', 'lcl|NC_000913.3_prot_YP_009518733.1_12', 'lcl|NC_000913.3_prot_NP_414554.1_13', 'lcl|NC_000913.3_prot_NP_414555.1_14', 'lcl|NC_000913.3_prot_NP_414556.1_15', 'lcl|NC_000913.3_prot_NP_414557.1_16', 'lcl|NC_000913.3_prot_NP_414559.1_17', 'lcl|NC_000913.3_prot_YP_025292.1_18', 'lcl|NC_000913.3_prot_NP_414560.1_19', 'lcl|NC_000913.3_prot_NP_414561.1_20', 'lcl|NC_000913.3_prot_NP_414562.1_21', 'lcl|NC_000913.3_prot_NP_414563.1_22', 'lcl|NC_000913.3_prot_NP_414564.1_23', 'lcl|NC_000913.3_prot_NP_414565.1_24', 'lcl|NC_000913.3_prot_NP_414566.1_25', 'lcl|NC_000913.3_prot_NP_414567.1_26', 'lcl|NC_000913.3_prot_NP_414568.1_27', 'lcl|NC_000913.3_prot_NP_414569.1_28', 'lcl|NC_000913.3_prot_NP_414570.1_29', 'lcl|NC_000913.3_prot_NP_414571.1_30', 'lcl|NC_000913.3_prot_NP_414572.1_31', 'lcl|NC_000913.3_prot_NP_414573.1_32', 'lcl|NC_000913.3_prot_NP_414574.1_33', 'lcl|NC_000913.3_prot_NP_414576.4_34', 'lcl|NC_000913.3_prot_NP_414577.2_35', 'lcl|NC_000913.3_prot_NP_414578.2_36', 'lcl|NC_000913.3_prot_NP_414579.4_37', 'lcl|NC_000913.3_prot_NP_414580.1_38', 'lcl|NC_000913.3_prot_NP_414581.1_39', 'lcl|NC_000913.3_prot_NP_414582.1_40', 'lcl|NC_000913.3_prot_NP_414583.2_41', 'lcl|NC_000913.3_prot_NP_414584.1_42', 'lcl|NC_000913.3_prot_NP_414585.1_43', 'lcl|NC_000913.3_prot_NP_414586.1_44', 'lcl|NC_000913.3_prot_NP_414587.1_45', 'lcl|NC_000913.3_prot_NP_414588.1_46', 'lcl|NC_000913.3_prot_NP_414589.1_47', 'lcl|NC_000913.3_prot_NP_414590.1_48', 'lcl|NC_000913.3_prot_NP_414591.1_49', 'lcl|NC_000913.3_prot_NP_414592.1_50', 'lcl|NC_000913.3_prot_NP_414593.1_51', 'lcl|NC_000913.3_prot_NP_414594.1_52', 'lcl|NC_000913.3_prot_NP_414595.1_53', 'lcl|NC_000913.3_prot_NP_414596.1_54', 'lcl|NC_000913.3_prot_NP_414597.1_55', 'lcl|NC_000913.3_prot_YP_009518734.1_56', 'lcl|NC_000913.3_prot_YP_009518735.1_57', 'lcl|NC_000913.3_prot_NP_414600.1_58']

The "hidden_embs" that guides the iteration is an array with (60, 1280) shape. Thus last "i" is 58 and then ori_prot_ids[i] runs out of IDs to match. (i used gLM/gLM/glm_embed.py with just minor debugging modifications)

zwmuam commented 10 months ago

The issue seems to be fixed if one slightly modifies the "get_original_prot_ids" function to add provisional string identifiers for the "0" numerical IDs (e.g. so they can be filtered out later).

def get_original_prot_ids(ids, id_dict):
    ori_ids = []
    for i in ids:
        if i != 0:
            # Real protein: map the numerical ID back to its original name,
            # falling back to the raw number if it is missing from the dict
            if i not in id_dict.keys():
                ori_ids.append(str(i))
            else:
                ori_ids.append(id_dict[i])
        # DIRTY FIX: padded positions (ID 0) get a placeholder name,
        # so the ID list stays aligned with the embedding rows
        else:
            ori_ids.append('dummy')
    return ori_ids
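
The placeholder entries can then be dropped from the final output, e.g. (a sketch; glm_embs as in the traceback above):

    # Filter out the 'dummy' placeholders added for the padded (0) positions
    glm_embs = [(pid, emb) for pid, emb in glm_embs if pid != 'dummy']
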
zwmuam commented 10 months ago

I just wanted to know if this is the correct approach from the point of view of the model / embedding process.

y-hwang commented 10 months ago

hidden_embs should be of shape (30, 1280), as max_seq_length is 30. Additionally, batch_data.py already pads all prot_ids to max_seq_length (30). I emailed you the processed file that works fine on our end.
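
For reference, a quick shape check along those lines (a sketch; this assumes hidden_embs holds a single contig):

    # With max_seq_length = 30 and 1280-dim embeddings, a single contig's
    # hidden states should come out as (30, 1280)
    assert hidden_embs.shape == (30, 1280), hidden_embs.shape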

mcn3159 commented 8 months ago

I also received the same error as originally reported. I would also like to know if @zwmuam's fix is correct.

If it helps with debugging, my hidden_embs shape is (630, 1280) for the input I gave it, which was a tsv with 21 rows. Thanks!

y-hwang commented 8 months ago

Thank you for flagging this again; this issue is now fixed with the latest commit ca8ce07. Closing this issue now.