zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

Error when the original contig exceeds 500Mb (2^29) #73

Closed. lpzoaa closed this issue 1 month ago

lpzoaa commented 1 month ago

Thank you very much for developing such an efficient tool! However, I encountered an issue while using it: when the length of an original contig exceeds 500 Mb (2^29), the following error occurs:

OverflowError: signed integer is greater than maximum

Could you please confirm whether this is a limitation of the software itself?

zengxiaofei commented 1 month ago

Yes, there is a limit on contig length.

def update_clm_dict(clm_dict, ctg_name_pair, len_i, len_j, coord_i_0, coord_j_0):
    # append four coordinate sums, one for each relative orientation
    # of the contig pair
    clm_dict[ctg_name_pair].extend((
        len_i - coord_i_0 + coord_j_0,
        len_i - coord_i_0 + len_j - coord_j_0,
        coord_i_0 + coord_j_0,
        coord_i_0 + len_j - coord_j_0))

The values of clm_dict here are Python arrays with a signed integer data type, which ranges from -2,147,483,648 to 2,147,483,647. When the four integers above are calculated, the maximum absolute value can be up to twice the length of the longest contig. Therefore, I suspect there is at least one contig in your assembly longer than 1.07 Gb, rather than just 500 Mb.
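For illustration, here is a minimal sketch (not HapHiC's actual code) that reproduces the error with a plain Python array, assuming the signed 32-bit typecode 'i':

```python
from array import array

# On most platforms, typecode 'i' is a 32-bit signed integer,
# matching the range quoted above.
clm_values = array('i')
clm_values.append(2_147_483_647)   # the 32-bit maximum still fits

try:
    # roughly 2 x a 1.2 Gb contig length exceeds the 32-bit range
    clm_values.append(2_400_000_000)
except OverflowError as err:
    print(err)   # "signed integer is greater than maximum"
```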

Although such a long contig is uncommon, if my suspicion is correct, I could develop a feature to dynamically set the data type of clm_dict based on the maximum contig length. However, I am uncertain whether downstream tools like ALLHiC will encounter issues, so I need some time to test this. Alternatively, you could break these long contigs and record the breakpoints, then rejoin them after completing the scaffolding process.
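A minimal sketch of that idea, assuming the fix simply switches the array typecode from 'i' to 64-bit 'q' when the longest contig could overflow 32 bits; `make_clm_array` is a hypothetical helper for illustration, not an existing HapHiC function:

```python
from array import array

INT32_MAX = 2_147_483_647

def make_clm_array(max_ctg_len):
    # The four values appended above can reach ~2x the longest contig,
    # so fall back to a 64-bit signed array ('q') when that would overflow.
    typecode = 'q' if 2 * max_ctg_len > INT32_MAX else 'i'
    return array(typecode)

clm_values = make_clm_array(1_200_000_000)  # e.g. a 1.2 Gb contig
clm_values.append(2_400_000_000)            # no OverflowError with 'q'
```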

lpzoaa commented 1 month ago

Thank you for your reply. As you suspected, the longest contig reaches 1.2 Gb. I will follow your suggestion to split the contigs that exceed 1 Gb. Once again, thank you for developing such an efficient tool.
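For anyone facing the same situation, a rough sketch of the split-and-record-breakpoints workaround; this is a hypothetical standalone script, and the 1 Gb threshold, the midpoint split, and the file names are all assumptions, not part of HapHiC:

```python
MAX_LEN = 1_000_000_000  # assumed threshold: split contigs longer than 1 Gb

def read_fasta(path):
    # Minimal FASTA reader: {contig name: sequence}
    seqs, name = {}, None
    with open(path) as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):
                name = line[1:].split()[0]
                seqs[name] = []
            else:
                seqs[name].append(line)
    return {n: ''.join(parts) for n, parts in seqs.items()}

def split_long_contigs(fasta_in, fasta_out, breakpoints_out):
    # Break each over-length contig at its midpoint and record the
    # breakpoint so the two halves can be rejoined after scaffolding.
    with open(fasta_out, 'w') as fout, open(breakpoints_out, 'w') as fbrk:
        for name, seq in read_fasta(fasta_in).items():
            if len(seq) <= MAX_LEN:
                fout.write(f'>{name}\n{seq}\n')
                continue
            mid = len(seq) // 2
            fbrk.write(f'{name}\t{mid}\n')
            fout.write(f'>{name}_part1\n{seq[:mid]}\n')
            fout.write(f'>{name}_part2\n{seq[mid:]}\n')

split_long_contigs('contigs.fa', 'contigs.split.fa', 'breakpoints.tsv')
```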