zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

TypeError: object of type 'NoneType' has no len() #7

Closed by yt619 8 months ago

yt619 commented 10 months ago

Hi Xiaofei,

When I run HapHiC, I get an error that does not seem to be caused by the installation. Could you please advise on how to solve it? I am combining the hap1 and hap2 outputs from hifiasm and scaffolding them to chromosome level with Hi-C reads. My log is attached (attachments: HapHiC_cluster.log, 1704627444152).

Best regards, Tuo

yt619 commented 10 months ago

Hi Xiaofei,

The file corrected_ctgs.txt is empty. Does this indicate that no contigs have been corrected?

Best regards, Tuo

yt619 commented 10 months ago

This seems to be caused by --remove_allelic_links 4. What I have assembled is a segmental allopolyploid, and chromosomal exchanges have led to high similarity in some regions. Should I not filter out MAPQ 1 (mapping quality ≥ 1) from the BAM file?

zengxiaofei commented 10 months ago

Hi Tuo,

> The file corrected_ctgs.txt is empty. Does this indicate that no contigs have been corrected?

Yes, you are correct.

> Should I not filter out MAPQ 1 (mapping quality ≥ 1) from the BAM file?

Probably not. In my opinion, filtering at MAPQ ≥ 1 is already a basic criterion.
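For readers unfamiliar with the filter being discussed, here is a minimal sketch of what a MAPQ ≥ 1 cutoff does, written against SAM text lines using only the Python standard library. The function name `filter_sam_by_mapq` and the example records are made up for illustration; this is not HapHiC's code. In practice the same cutoff is usually applied with `samtools view -q 1`.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=1):
    """Yield header lines plus alignments whose MAPQ (SAM column 5) is >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed records (SAM has 11 mandatory fields)
        if int(fields[4]) >= min_mapq:
            yield line


# Hypothetical records: r1 has MAPQ 0 (ambiguous placement), r2 has MAPQ 30.
example = [
    "@SQ\tSN:ctg1\tLN:1000\n",
    "r1\t0\tctg1\t10\t0\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r2\t0\tctg1\t50\t30\t5M\t*\t0\t0\tACGTA\tIIIII\n",
]
kept = list(filter_sam_by_mapq(example))
print("".join(kept), end="")
```

With the default cutoff, only the header line and `r2` survive, because `r1` maps equally well to more than one location and is reported with MAPQ 0.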

> This seems to be caused by --remove_allelic_links 4.

Yes. It appears that the removal of allelic Hi-C links unexpectedly resulted in an empty flank_link_dict.

I suggest trying out the quick view mode first and showing me the Hi-C contact map in Juicebox. This will help me better understand the problem.

Best regards, Xiaofei

yt619 commented 10 months ago

Hi Xiaofei,

Thanks for answering these questions. This genome is a segmental allopolyploid. Due to chromosomal exchanges, there are numerous identical sequence regions between homologous chromosomes. Because Hi-C reads are short, reads from these regions cannot be placed unambiguously, which results in a large number of reads with MAPQ = 0. Initial filtering tends to mask the signals in these areas, and the effective sequencing rate is only 75%. Could this be the reason why --remove_allelic_links 4 fails to output flank_link_dict? I am trying to solve this problem using Pore-C, which has an effective rate of 91%. I noticed that you mentioned the use of Pore-C data in another issue; is it possible to use Pore-C data for genome assembly?
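To put a number on how many alignments are affected, one could count the share of mapped records with MAPQ = 0. The sketch below is a rough illustration that reads SAM text lines (for example, the output of `samtools view`); the helper name `mapq_zero_fraction` and the sample records are hypothetical, and the resulting fraction is not necessarily the same quantity as the "effective rate" quoted above.

```python
def mapq_zero_fraction(sam_lines):
    """Return the fraction of mapped alignments whose MAPQ (column 5) is 0."""
    zero = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed records
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 marks an unmapped read
        total += 1
        if int(fields[4]) == 0:
            zero += 1
    return zero / total if total else 0.0


# Hypothetical records: one MAPQ 0 alignment, three uniquely mapped, one unmapped.
records = [
    "r1\t0\tctg1\t10\t0\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r2\t0\tctg1\t50\t30\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r3\t16\tctg2\t70\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r4\t0\tctg2\t90\t12\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n",
]
fraction = mapq_zero_fraction(records)
print(f"{fraction:.2%} of mapped alignments have MAPQ 0")
```

Comparing this fraction between the Hi-C and Pore-C alignments would show how much signal each data type loses to a MAPQ ≥ 1 cutoff.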

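Best regards, Tuo

[Images: Hi-C anchoring (both haplotype sets together); Pore-C anchoring (both haplotype sets together); Hi-C anchoring rate; Pore-C anchoring rate]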

zengxiaofei commented 10 months ago

Hi Tuo,

> Could this be the reason why --remove_allelic_links 4 fails to output flank_link_dict?

I'm not sure. --remove_allelic_links only deals with allelic contig pairs that have diagonally distributed Hi-C links between them. This kind of distribution pattern can be observed in the second contact map you provided. However, it seems that these Hi-C links are absent in the first contact map. I am wondering if there are any differences in the mapping and filtering methods for the Hi-C data?

Another unexpected observation is that none of the contigs were filtered before clustering. In particular, during the rank-sum filtering, both Q1 and Q3 were calculated to be 120. I can reproduce this result when the BAM file is either empty or does not match the FASTA file. Therefore, I would suggest checking the input BAM file as a first step.
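One way to carry out that check is to compare the contig names and lengths in the FASTA file with the `@SQ` lines of the BAM header (as printed by `samtools view -H`). The following is a minimal sketch using only the standard library; the function names and the toy inputs are hypothetical and are not part of HapHiC.

```python
def fasta_lengths(fasta_lines):
    """Map each contig name (first word after '>') to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def sam_header_lengths(header_lines):
    """Map each reference name to its length using the @SQ SN/LN tags."""
    lengths = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def compare_references(fasta, header):
    """Return (contigs missing from the BAM, contigs only in the BAM, length mismatches)."""
    missing = sorted(set(fasta) - set(header))
    extra = sorted(set(header) - set(fasta))
    mismatched = sorted(n for n in set(fasta) & set(header) if fasta[n] != header[n])
    return missing, extra, mismatched


# Toy inputs: ctg2 is absent from the BAM header and ctg3 is absent from the FASTA.
fasta = fasta_lengths([">ctg1 hap1", "ACGT", "AC", ">ctg2", "GG"])
header = sam_header_lengths(["@HD\tVN:1.6", "@SQ\tSN:ctg1\tLN:6", "@SQ\tSN:ctg3\tLN:5"])
missing, extra, mismatched = compare_references(fasta, header)
print("missing from BAM:", missing)
print("only in BAM:", extra)
print("length mismatches:", mismatched)
```

If all three lists are empty and the BAM still contains alignments, the two inputs at least refer to the same set of contigs.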

Best regards, Xiaofei