zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
141 stars 10 forks source link

[Help] for choosing MAPQ settings on F1 Hybrid genome #74

Closed Isoris closed 1 month ago

Isoris commented 1 month ago

Hello,

I am currently working on scaffolding an F1 hybrid genome, which has 55 chromosomes per haplotype: 27 chromosomes from parental species 1 and 28 from parental species 2. I am using HiFi / Hi-C / ONT data generated from Hifiasm for both haplotypes separately.

While running the 3D-DNA assembly pipeline in diploid mode, I achieved a reasonably decent result but had to use individual haploid haplotypes in diploid mode, -r 5 for five rounds of misassembly detection and -q 30 (MAPQ30) to ensure that Hi-C reads would map correctly to a single parent due to the low divergence between the parental species.

However, the final scaffolds have an unexpected genome size:

711 Mb for the 27 chromosomes from species 1 (expected size ~820 Mb)
810 Mb for the 28 chromosomes from species 2 (expected size ~910 Mb)

I suspect the size discrepancy is due to haplotype collapse or the merging of heterozygous regions between the two species. The low divergence between the parental genomes might be causing the 3D-DNA pipeline to collapse these regions into a single sequence during scaffolding, despite the fact that the Hifiasm contigs are correctly phased and match the expected lengths when mapped back to their respective parental genomes.

Given the nature of this challenge, I am seeking your suggestions on whether your software or approach could provide better results. Specifically:

Would using MAPQ30 from the start (instead of MAPQ1) yield better scaffolding results in such a case of low divergence between parental species?
Are there additional parameters or strategies you would recommend to avoid collapsing heterozygous regions?

Thank you very much for your time and suggestions. I look forward to your guidance.

Best regards, Quentin

zengxiaofei commented 1 month ago

Hi Quentin,

Are there additional parameters or strategies you would recommend to avoid collapsing heterozygous regions?

I have not yet tried the diploid mode of 3D-DNA. However, most scaffolders, including HapHiC, do not collapse any sequences during scaffolding. This results in the total length of contigs in the output scaffolds being the same as those in the input contigs.

Would using MAPQ30 from the start (instead of MAPQ1) yield better scaffolding results in such a case of low divergence between parental species?

Based on my experience, the Juicer pipeline tends to introduce more mapping errors compared to bwa mem -5SP. Therefore, I always recommend using bwa mem -5SP other than the Juicer pipeline for Hi-C read mapping.To maintain sufficient Hi-C links without introducing too many mapping errors, the MAPQ cutoff can be set to 1 (please see: the mapping and filtering method we recommend).

When scaffolding the haplotype-resolved assembly of a diploid using HapHiC, I recommend adding --remove_allelic_links 2 to remove Hi-C links between allelic contig pairs during clustering based on the distribution pattern of Hi-C links. Alternatively, you can use --gfa *hap1.*.p_ctg.gfa,*hap2.p_ctg.gfa to prevent clustering contigs from different haplotypes into the same group (refer to: work with hifiasm).

Best, Xiaofei

zengxiaofei commented 1 month ago

Close this issue as there has been no response for two weeks.