zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
BSD 3-Clause "New" or "Revised" License
93 stars 4 forks source link

How to set the chromosome number if I use p_ctg.fa from hifiasm when running haphic pipeline #28

Closed utpala101 closed 5 days ago

utpala101 commented 1 month ago

Thank you for your excellent work! I wonder if I run the haphic pipeline correctly. I have a genome of 4G (a diploid, 2n=26), and I just use the p_ctg.fa file from hifiasm and HiC data. When I run the pipeline I set the nchr as 13, and I have a good result. But when I saw the issue I found that some users used combined hap*.p_ctg file to run the pipeline, and set the nchr as doubled (in my case as 26). So I wonder if I run it correctly, I just want a scaffold genome of haploid, and the HiC contact map of 13 chromosomes by the way.

Appreciate for your early reply!

zengxiaofei commented 1 month ago

Yes, you are correct. For your primary or haplotype-collapsed assembly, please set the nchr as 13.

utpala101 commented 1 month ago

@zengxiaofei Thank you so much for your early reply, I really appreciate your great work!

melop commented 1 month ago

To follow up with this question, the HIFIASM output that I have only produces gfa format files. Should I use gfa2fa tool to convert .hap1.p_ctg.gfa and .hap2.p_ctg.gfa into fasta format, concatenate them into one fasta file (asm.fa), then use BWA to map the HiC reads? But in this case, wouldn't most HIC reads receive a MAPQ of 0 due to multiple mapping and become filtered out by the recommended MAPQ filter of 1?

Best regards, Rongfeng Cui

zengxiaofei commented 1 month ago

Hi Rongfeng @melop,

Overall, there are two main strategies for scaffolding haplotype-resolved assemblies (hap*.p_ctg). The first is aligning all Hi-C reads to each haplotype and scaffolding them separately. Although this approach can preserve much more Hi-C reads for scaffolding, it may disregard possible chromosomal structural variations between haplotypes. For the second strategy, Hi-C reads are aligned to all haplotypes simultaneously, as you mentioned. Despite the fact that many Hi-C reads may be filtered out based on the MAPQ > 0 criterion, this method allows for the differentiation of possible chromosomal structural variations between haplotypes on a contact map.

The percentage of reads filtered out due to MAPQ > 0 is influenced by the heterozygosity level of the genome. When filtered Hi-C reads are sufficient for scaffolding, I would suggest using the second strategy. If the heterozygosity level is too low to maintain enough Hi-C reads, the first strategy may be the only choice.

Best, Xiaofei

melop commented 1 month ago

Hi Xiaofei, I'm wondering if it would be possible to develop a tool to compare the agp files produced by both strategies and merge them together, so that you get a better assignment rate and can pickup any structural variation between haplotypes.

zengxiaofei commented 1 month ago

Hi Xiaofei, I'm wondering if it would be possible to develop a tool to compare the agp files produced by both strategies and merge them together, so that you get a better assignment rate and can pickup any structural variation between haplotypes.

I think there is a paradox. If the Hi-C reads after filtering are sufficient to scaffold the haplotype-resolved assemblies, there is no need to use the first strategy. Conversely, If the Hi-C reads are inadequate for scaffolding, they will also fail to differentiate the SVs between haplotypes.

Additionally, Hi-C-based scaffolding may produce many errors in chromosome assignment, and ordering and orientation, especially for those complex genomes. Therefore, it is essential to carefully evaluate these potential SVs by using a contact map. Developing a "one-click" tool that can generate publication-quality pseudo-chromosomes remains a challenging task at present.

melop commented 1 month ago

Or perhaps if I rephrase my question: is there a way to identify genomic regions where the two haplotypes agree or disagree from HiC data? If so, one could partition the HiC data into haplotype-shared and haplotype-specific for scaffolding.

zengxiaofei commented 2 weeks ago

if I rephrase my question: is there a way to identify genomic regions where the two haplotypes agree or disagree from HiC data? If so, one could partition the HiC data into haplotype-shared and haplotype-specific for scaffolding.

Firstly, there could be a misunderstanding. The criterion MAPQ > 0 that we discussed above is used to filter out multiple mapping reads. That means some regions between haplotypes are identical in their sequences, Hi-C reads (often PE150 in length) cannot differentiate which position is correct (i.e. ambiguous mapping). When you are talking about haplotype-shared and -specific Hi-C reads, maybe you mean the Hi-C reads that can be used to identify SVs. They are totally different as Hi-C reads provide information for scaffolding mainly through the long-range interaction information within them. Imagine that there is a big SV between two haplotypes (e.g., an inversion in one of the haplotype), but the sequences of the two haplotypes are almost identical (except for the breakpoint of the inversion). Only a very small amount of Hi-C reads mapped to the breakpoint (± <150 bp) can be kept after MAPQ >0 filtering.

If you are talking about the Hi-C reads that can be used to identify SVs, there are two converse and contradictory tasks (a "chicken or the eggs" problem): (1) find shared and specific genomic regions between haplotypes using Hi-C reads; (2) find shared and specific Hi-C reads between haplotypes using genome sequences. Or you mean you want to find shared and specific Hi-C reads between haplotypes at the contig level? Even if this is possible, there are too many problems: (1) you need an accurate genome alignment between two haplotypes to liftover the genomic positions from one haplotype to another at the contig level, which may result in new errors; (2) how to integrate the results of haplotype-shared and -specific Hi-C reads?

If you are talking about the Hi-C reads that are filtered out or not filtered out by MAPQ>0. There is still a question: how to use the Hi-C reads with a MAPQ==0? The contigs that are not anchored to any chromosomes due to the MAPQ>0 criterion mean that Hi-C reads on them cannot accurately differentiate which chromosomes these contigs should be assigned to.