zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
141 stars 10 forks source link

Question for Work with hifiasm #44

Closed lijl459 closed 3 months ago

lijl459 commented 3 months ago

Hello,

I have a question regarding the use of haphic for genome assembly and processing results with hifiasm.

The genome I'm working with is diploid, and the haploid chromosome number is 11. In the command:

haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa hap1.p_ctg.gfa,hap2.p_ctg.gfa

should I set nchrs to 11 or 22?

Additionally, what is the difference between this approach and directly using p_utg.fa? Will I obtain haplotype-phased assemblies if I run:

haphic pipeline p_utg.fa HiC.filtered.bam nchrs --gfa p_utg.gfa?

And in this case, should nchrs be set to 11 or 22?

Thank you!

zengxiaofei commented 3 months ago

In both cases, the nchrs parameter should be set to 22. Because there should be two sets of haplotypes in allhaps.fa or p_utg.fa.

Additionally, what is the difference between this approach and directly using p_utg.fa?

This question is complicated. You first need to understand the principle of how hifiasm uses Hi-C data to phase the assembly graph. Please refer to this paper of hifiasm. In short, the hap.p_ctg sequences are fully phased and should have a higher contig N50 than that of p_utg. Although hifiasm may have some problems in dealing with autopolyploid genomes. However, in the case of a diploid genome, hifiasm is expected to work well. Hence, I would recommend using the hap.p_ctg rather than the p_utg.

Please note that the scaffolding strategy could be different depending on the heterozygosity level. If the heterozygosity is high and there are still enough Hi-C reads after filtering with MAPQ>=1, you can use your first command to run HapHiC. If the heterozygosity is too low and there are not enough Hi-C reads after MAPQ>=1 filtering, it is suggested to align and scaffold the two sets of haplotypes separately.

lijl459 commented 3 months ago

Thanks very much!

DayTimeMouse commented 3 months ago

Hi, zengxiaofei

If I use this command: haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa"

Is HiC.filtered.bam generated by allhap.fa(cat hap1.fa hap2.fa)?Like:

bwa mem -5SP -t 28 allhap.fa read1_fq.gz read2_fq.gz | samblaster | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam HapHiC/utils/filter_bam HiC.bam 1 --nm 3 --threads 14 | samtools view - -b -@ 14 -o HiC.filtered.bam

Best wishes.

zengxiaofei commented 3 months ago

@DayTimeMouse Yes.

awesomedeer commented 3 months ago

“Please note that the scaffolding strategy could be different depending on the heterozygosity level. If the heterozygosity is high and there are still enough Hi-C reads after filtering with MAPQ>=1, you can use your first command to run HapHiC. If the heterozygosity is too low and there are not enough Hi-C reads after MAPQ>=1 filtering, it is suggested to align and scaffold the two sets of haplotypes separately.” @zengxiaofei Hi, Zeng. For the filtered BAM file, around how much left (like how much percent?) would be considered as enough and indicated high heterozygosity?