zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

How to assemble a Haplotype-phased genome? #70

Closed chaofan520 closed 1 month ago

chaofan520 commented 1 month ago

The Hi-C heatmap created from out_JBAT.hic and out_JBAT.assembly is very complex, and it is too difficult to curate manually in Juicebox. Should I work on one haplotype at a time?

hifiasm -o Solanum_commersonii.asm -t48 03.hifiasm/00.raw_data/CRR1072648.fq.gz 2> Solanum_commersonii.asm.log
awk '/^S/{print ">"$2;print $3}' 03.hifiasm/Solanum_commersonii.asm.bp.hap1.p_ctg.gfa >  03.hifiasm/hap1.genome.fa
awk '/^S/{print ">"$2;print $3}' 03.hifiasm/Solanum_commersonii.asm.bp.hap2.p_ctg.gfa >  03.hifiasm/hap2.genome.fa

# Before running cat, check whether the sequence IDs in hap1.fa and hap2.fa are identical
cat 03.hifiasm/hap1.genome.fa 03.hifiasm/hap2.genome.fa > 04.HapHic/allhap.fa

# Align Hi-C data to the assembly, remove PCR duplicates and filter out secondary and supplementary alignments
bwa index 04.HapHic/allhap.fa
bwa mem -5SP -t 28 04.HapHic/allhap.fa 01.clean_data/CRR1072651_f1.fq.gz 01.clean_data/CRR1072651_r2.fq.gz | samblaster | samtools view - -@ 14 -S -h -b -F 3340 -o 04.HapHic/HiC.bam

# Filter the alignments with MAPQ 1 (mapping quality ≥ 1) and NM 3 (edit distance < 3)
./HapHiC/utils/filter_bam 04.HapHic/HiC.bam 1 --nm 3 --threads 14 | samtools view - -b -@ 14 -o 04.HapHic/HiC.filtered.bam

# One-line command
cd 04.HapHic
../HapHiC/haphic pipeline allhap.fa HiC.filtered.bam 24 --gfa "../03.hifiasm/Solanum_commersonii.asm.bp.hap1.p_ctg.gfa,../03.hifiasm/Solanum_commersonii.asm.bp.hap2.p_ctg.gfa" --threads 24 --processes 24 --remove_allelic_links 2 --correct_nrounds 2

cd 04.build && bash juicebox.sh
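
The sequence ID check mentioned in the comment before the `cat` step is not shown in the commands above. A minimal sketch of one way to do it follows. The function name `dup_ids` is hypothetical and not part of HapHiC or hifiasm, and it assumes the ID is the first whitespace-delimited word of each FASTA header line.

```shell
# Hypothetical helper (not part of HapHiC or hifiasm):
# print every sequence ID that appears more than once across the given FASTA files.
# Assumption: the ID is the first whitespace-delimited word after ">" in each header.
dup_ids() {
    cat "$@" | awk '/^>/{print substr($1, 2)}' | sort | uniq -d
}
```

For example, `dup_ids 03.hifiasm/hap1.genome.fa 03.hifiasm/hap2.genome.fa` prints nothing when no ID is shared between the two haplotype files, in which case concatenating them does not create duplicate sequence names.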

chaofan520 commented 1 month ago

2024.09.26.19.12.22.HiCImage.pdf

chaofan520 commented 1 month ago

image

zengxiaofei commented 1 month ago

Based on the contact map you provided, the genome exhibits a high level of heterozygosity, resulting in sufficient Hi-C links for manual curation in Juicebox after MAPQ filtering. Additionally, given that you supplied haplotype-phased GFA files to HapHiC, it appears that there are many incorrectly phased and misassembled contigs in the assembly. This could be the primary reason for the difficulties you are experiencing with manual curation. Therefore, I would not recommend scaffolding each haplotype separately.

chaofan520 commented 1 month ago

Thank you for your reply.

> the genome exhibits a high level of heterozygosity

Yes.

> Therefore, I would not recommend scaffolding each haplotype separately.

Thank you for your advice.