zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
BSD 3-Clause "New" or "Revised" License

How to improve scaffolding efficiency #49

Closed vergilback closed 2 weeks ago

vergilback commented 3 weeks ago

Hello, I used HapHiC for chromosome scaffolding. However, since my species has a large number of chromosomes including microchromosomes, there is a significant size difference between the chromosomes (ranging from 1M to 100M). I lowered the min_group_len to 1M, which improved the anchoring rate to some extent. After zooming in on the Hi-C heatmap, I noticed that some contigs that were not anchored to the linkage groups still show significant Hi-C signals. Do you have a suitable parameter combination to improve the scaffolding of these contigs? Or will I need to anchor them manually? Thank you.

image

zengxiaofei commented 3 weeks ago

It seems that you are attempting to scaffold a haplotype-resolved assembly, and these unanchored contigs originate from approximately three chromosomes with much higher Hi-C link density than the other chromosomes. These contigs might have been filtered out during the clustering step by the default value of --density_upper, and they may not have been rescued into any group during the reassignment step because the whole chromosomes were filtered out together.

This issue could be due to chromosome-level collapses (where the assembler merged homologous chromosomes into a single one) or to significant differences between homologous chromosomes (e.g., the non-PAR regions of the human X and Y chromosomes).

To address this problem, you may try adding the following two parameters, as we did when scaffolding the human HG002 genome in our paper:
--density_upper 1: prevents the filtering of contigs with much higher Hi-C link density.
--normalize_by_nlinks: normalizes the contact matrix used for clustering by the number of Hi-C links on each contig.
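
As a rough sketch of how the two parameters might be passed on the command line: the input file names (asm.fa, HiC.filtered.bam) and the chromosome count (30) below are placeholders, not values from this thread, and the snippet assumes the one-line `haphic pipeline` entry point with `haphic` on your PATH. It only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder inputs (hypothetical; substitute your own assembly, BAM and chromosome number).
ASM="asm.fa"
BAM="HiC.filtered.bam"
NCHRS=30

# Assemble the command as positional parameters so it can be inspected before running.
# --density_upper 1     : do not filter contigs with high Hi-C link density
# --normalize_by_nlinks : normalize the contact matrix by per-contig link counts
set -- haphic pipeline "$ASM" "$BAM" "$NCHRS" --density_upper 1 --normalize_by_nlinks

# Print the command for review; execute it only when DRY_RUN=0 is set explicitly.
echo "$*"
if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
fi
```

Any other options you already use (for example the lowered min_group_len) would be appended to the same command. Comparing the resulting heatmap with the one from the default run should show whether the high-density contigs are now retained.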

Alternatively, in your case, manually scaffolding them in Juicebox is also a straightforward option.

vergilback commented 2 weeks ago

Hello, I tried adding the two parameters separately:

zengxiaofei commented 2 weeks ago

There are too many collapses in the assembly. When --density_upper was set to 1, the collapsed contigs were also retained during clustering, which resulted in this error. So I think the best option is to manually rescue these contigs in Juicebox, starting from your first set of results.