Closed gunjanpandey closed 1 month ago
Convert
*.hic.p_utg.gfa
toasm.fa
. Or should I convert three*.hic.hap*.p_ctg.gfa
files to respective fasta, and concatenate these three to generateasm.fa
?
Although contigs (p_ctg) are longer than unitigs (p_utg), hifiasm may produce imbalanced haplotypes. Therefore, we often use unitigs for scaffolding. However, when hifiasm outputs balanced haplotypes, you can try using *.hic.hap*.p_ctg.gfa
. Please note that hifiasm may have a bug in generating contig IDs for polyploids, so you may need to modify them before concatenating them into a single asm.fa
.
When should I remove duplicates in this process?
I have mentioned this question in this issue:
"Purging duplicates" is somewhat a misleading term. These duplicates are derived from the heterozygous sites and this term is defined from the perspective of constructing a primary or a collapsed assembly. Hence, when you are attempting to construct a haplotype-resolved assembly, these duplicates are actually what you preferred and purging them is an incorrect choice.
From the current perspective, the statement above may be a bit hasty. In trio mode, hifiasm sets -l
to 0
by default, which aligns with my previous understanding. However, in Hi-C mode, hifiasm sets -l
to 3
by default. This may be due to a different algorithm used in Hi-C-based phasing. In this mode, hifiasm relies on determining homologous coverage and a similarity threshold to identify haplotigs. Therefore, we can understand why the authors recommend tuning the --hom-cov
and -s
parameters in such scenarios.
Please refer to the FAQs above if you encounter related issues. I would not recommend standalone tools for purging duplicates.
At what stage should I handle assembly decontamination?
You can do it before scaffolding or during Juicebox curation.
What is the expected output of step 4 above? Would I receive three sets of
hap*.hic
andhap*.assembly
files for manual curation? If now, how to generate those?
You will get two AGP files, a scaffold-level FASTA file, and a bash script named juicebox.sh
(refer to "Final outputs" in Run HapHiC scaffolding pipeline). If you want to manual curate your assembly in Juicebox, execute the bash script juicebox
(refer to Juicebox curation). This will generate out_JBAT.hic
file and out_JBAT.assembly
files, which contain all the haplotypes you input into HapHiC.
Thank you very much for your quick reply @zengxiaofei
Could you please elaborate on how to generate the three fasta files of phased haplotypes from step 04 output? It is not clear to me from the link.
Would this step generate three fasta files of three haplotypes?
$ /path/to/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa```
Could you please elaborate on how to generate the three fasta files of phased haplotypes from step 04 output?
Current Hi-C scaffolders cannot guarantee perfect chromosome assignment results. Some output scaffolds (groups) may originate from the same chromosome, while a single scaffold may contain sequences from different chromosomes. Therefore, it is essential to verify the results in Juicebox first. Consequently, HapHiC does not infer haplotypes, as the process could be problematic and misleading. After manual curation in Juicebox, you can perform either self-alignment or align the assembly to a reference genome to distinguish each haplotype. In addition, it is possible to identify homologous chromosomes based on the diagonally distributed Hi-C links on the contact map. For hap*.p_ctg
, you can also distinguish them according to the contig IDs in each scaffold.
Please note that Hi-C-based scaffolding is chromosome-level phasing, while only trio-based scaffolding achieves true whole-genome phasing (refer to the hifiasm paper).
Thank you @zengxiaofei
Could you please suggest a suitable program to rename/fix contig names of hifiasm generated *gfa
files.
As you suspected, I am getting duplicate contig names error in the mapping step?
The contig IDs for all haplotypes may start with h1tg
. To address this issue, you need to rename h1tg
to h2tg
for haplotype 2 and h3tg
for haplotype 3 using awk.
Thank you again @zengxiaofei,
I have been able to run the pipeline up to the Juicebox manual curation step.
Here’s what I have done so far:
asm.fa
.out_JBAT.hic
, out_JBAT.assembly
, and other files using the juicer.sh
script from the 04.build
directory..hic
and .assembly
files into the Juicebox program and manually curated the contigs and chromosomes.out_JBAT.review.assembly
from Juicebox.However, I am stuck at this point.
Could you please guide me on how to proceed with the following tasks?
Generating FASTA files of the three haplotypes from out_JBAT.review.assembly
.
I tried the following command, which generates out_JBAT.FINAL.agp
and out_JBAT.FINAL.fasta
files.
juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa
While trying to generate three haplotype fasta files, I tried the following command
ragtag_agp2fa.py out_JBAT.FINAL.agp out_JBAT.FINAL.fa
But despite all contigs name present in both agp and fa files, I get sequence 'h2tg001085l' not present
error. But I am not sure if this is the correct way to get three final haplotype files.
Please help me understand.
While trying to generate three haplotype fasta files, I tried the following command
The ragtag_agp2fa.py script is used to convert a contig-level FASTA file into a chromosome-level FASTA file using an AGP file. However, the out_JBAT.FINAL.fa file is already the final chromosome-level FASTA file, so you do not need to use this script. As mentioned earlier:
HapHiC does not infer haplotypes, as the process could be problematic and misleading. After manual curation in Juicebox, you can perform either self-alignment or align the assembly to a reference genome to distinguish each haplotype.
the final output of HapHiC includes only one AGP file and one FASTA file. HapHiC does not generate three haplotypes; you will need to create these yourself. In addition, it is possible to identify homologous chromosomes based on the diagonally distributed Hi-C links on the contact map.
Close this issue as there has been no response for two weeks.
Could you please help me refine my workflow for assembling a phased triploid genome using hifiasm and HapHiC?
Here is the current plan:
*.hic.p_utg.gfa
toasm.fa
. Or should I convert three .hic.hap.p_ctg.gfa files to respective fasta, and concatenate these three to generateasm.fa
? In my assembly, these three *.fasta would require individual purging for duplicates.I would also appreciate your feedback on the following:
hap*.hic
andhap*.assembly
files for manual curation? If now, how to generate those?Thank you for your assistance!