zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
140 stars 10 forks source link

workflow assistance with hifiasm assembly for a triploid #67

Closed gunjanpandey closed 1 month ago

gunjanpandey commented 1 month ago

Could you please help me refine my workflow for assembling a phased triploid genome using hifiasm and HapHiC?

Here is the current plan:

  1. Run the hifiasm pipeline with Hi-C data for triploid assembly.
  2. Convert *.hic.p_utg.gfa to asm.fa. Or should I convert three .hic.hap.p_ctg.gfa files to respective fasta, and concatenate these three to generate asm.fa ? In my assembly, these three *.fasta would require individual purging for duplicates.
  3. Align Hi-C data to the assembly and prepare a BAM file:
    $ bwa index asm.fa
    $ bwa mem -5SP -t 28 asm.fa /path/to/read1_fq.gz /path/to/read2_fq.gz | samblaster | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam
    $ /path/to/HapHiC/utils/filter_bam HiC.bam 1 --nm 3 --threads 14 | samtools view - -b -@ 14 -o HiC.filtered.bam
  4. Run the HapHiC pipeline:
    $ /path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa,hap3.p_ctg.gfa"

I would also appreciate your feedback on the following:

  1. When should I remove duplicates in this process?
  2. At what stage should I handle assembly decontamination?
  3. What is the expected output of step 4 above? Would I receive three sets of hap*.hic and hap*.assembly files for manual curation? If now, how to generate those?

Thank you for your assistance!

zengxiaofei commented 1 month ago

Convert *.hic.p_utg.gfa to asm.fa. Or should I convert three *.hic.hap*.p_ctg.gfa files to respective fasta, and concatenate these three to generate asm.fa?

Although contigs (p_ctg) are longer than unitigs (p_utg), hifiasm may produce imbalanced haplotypes. Therefore, we often use unitigs for scaffolding. However, when hifiasm outputs balanced haplotypes, you can try using *.hic.hap*.p_ctg.gfa. Please note that hifiasm may have a bug in generating contig IDs for polyploids, so you may need to modify them before concatenating them into a single asm.fa.

When should I remove duplicates in this process?

I have mentioned this question in this issue:

"Purging duplicates" is somewhat a misleading term. These duplicates are derived from the heterozygous sites and this term is defined from the perspective of constructing a primary or a collapsed assembly. Hence, when you are attempting to construct a haplotype-resolved assembly, these duplicates are actually what you preferred and purging them is an incorrect choice.

From the current perspective, the statement above may be a bit hasty. In trio mode, hifiasm sets -l to 0 by default, which aligns with my previous understanding. However, in Hi-C mode, hifiasm sets -l to 3 by default. This may be due to a different algorithm used in Hi-C-based phasing. In this mode, hifiasm relies on determining homologous coverage and a similarity threshold to identify haplotigs. Therefore, we can understand why the authors recommend tuning the --hom-cov and -s parameters in such scenarios.

Please refer to the FAQs above if you encounter related issues. I would not recommend standalone tools for purging duplicates.

At what stage should I handle assembly decontamination?

You can do it before scaffolding or during Juicebox curation.

What is the expected output of step 4 above? Would I receive three sets of hap*.hic and hap*.assembly files for manual curation? If now, how to generate those?

You will get two AGP files, a scaffold-level FASTA file, and a bash script named juicebox.sh (refer to "Final outputs" in Run HapHiC scaffolding pipeline). If you want to manual curate your assembly in Juicebox, execute the bash script juicebox (refer to Juicebox curation). This will generate out_JBAT.hic file and out_JBAT.assembly files, which contain all the haplotypes you input into HapHiC.

gunjanpandey commented 1 month ago

Thank you very much for your quick reply @zengxiaofei

Could you please elaborate on how to generate the three fasta files of phased haplotypes from step 04 output? It is not clear to me from the link.

Would this step generate three fasta files of three haplotypes?


$ /path/to/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa```
zengxiaofei commented 1 month ago

Could you please elaborate on how to generate the three fasta files of phased haplotypes from step 04 output?

Current Hi-C scaffolders cannot guarantee perfect chromosome assignment results. Some output scaffolds (groups) may originate from the same chromosome, while a single scaffold may contain sequences from different chromosomes. Therefore, it is essential to verify the results in Juicebox first. Consequently, HapHiC does not infer haplotypes, as the process could be problematic and misleading. After manual curation in Juicebox, you can perform either self-alignment or align the assembly to a reference genome to distinguish each haplotype. In addition, it is possible to identify homologous chromosomes based on the diagonally distributed Hi-C links on the contact map. For hap*.p_ctg, you can also distinguish them according to the contig IDs in each scaffold.

Please note that Hi-C-based scaffolding is chromosome-level phasing, while only trio-based scaffolding achieves true whole-genome phasing (refer to the hifiasm paper).

gunjanpandey commented 1 month ago

Thank you @zengxiaofei

Could you please suggest a suitable program to rename/fix contig names of hifiasm generated *gfa files. As you suspected, I am getting duplicate contig names error in the mapping step?

zengxiaofei commented 1 month ago

The contig IDs for all haplotypes may start with h1tg. To address this issue, you need to rename h1tg to h2tg for haplotype 2 and h3tg for haplotype 3 using awk.

gunjanpandey commented 1 month ago

Thank you again @zengxiaofei,

I have been able to run the pipeline up to the Juicebox manual curation step.

Here’s what I have done so far:

  1. Fixed the naming of the contigs in the GFA file as you suggested.
  2. Generated three FASTA files and concatenate them as asm.fa.
  3. Ran the pipeline until I obtained the out_JBAT.hic, out_JBAT.assembly, and other files using the juicer.sh script from the 04.build directory.
  4. Loaded the .hic and .assembly files into the Juicebox program and manually curated the contigs and chromosomes.
  5. Exported the assembly to out_JBAT.review.assembly from Juicebox.

However, I am stuck at this point.

Could you please guide me on how to proceed with the following tasks?

  1. Generating FASTA files of the three haplotypes from out_JBAT.review.assembly.
    I tried the following command, which generates out_JBAT.FINAL.agp and out_JBAT.FINAL.fasta files.

    juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa 

While trying to generate three haplotype fasta files, I tried the following command

  ragtag_agp2fa.py  out_JBAT.FINAL.agp out_JBAT.FINAL.fa 

But despite all contigs name present in both agp and fa files, I get sequence 'h2tg001085l' not present error. But I am not sure if this is the correct way to get three final haplotype files.

Please help me understand.

  1. Generating figures for the assembly.
zengxiaofei commented 1 month ago

While trying to generate three haplotype fasta files, I tried the following command

The ragtag_agp2fa.py script is used to convert a contig-level FASTA file into a chromosome-level FASTA file using an AGP file. However, the out_JBAT.FINAL.fa file is already the final chromosome-level FASTA file, so you do not need to use this script. As mentioned earlier:

HapHiC does not infer haplotypes, as the process could be problematic and misleading. After manual curation in Juicebox, you can perform either self-alignment or align the assembly to a reference genome to distinguish each haplotype.

the final output of HapHiC includes only one AGP file and one FASTA file. HapHiC does not generate three haplotypes; you will need to create these yourself. In addition, it is possible to identify homologous chromosomes based on the diagonally distributed Hi-C links on the contact map.

zengxiaofei commented 1 month ago

Close this issue as there has been no response for two weeks.