Closed gemmacol closed 1 month ago
My question: Is it possible to keep using the original gfa file to go with haplotype assemblies that have been purged?
The answer is currently no. We designed these checks because many users often input wrong files. Although we can provide an additional parameter to skip these checks, it needs some time for me. Simply, you can just modify the original GFA files to delete the contigs removed by purge_dups in S lines and update the sequence lengths in LN
tags.
Some posts here recommend not to purge
"Purging duplicates" is somewhat a misleading term. These duplicates are derived from the heterozygous sites and this term is defined from the perspective of constructing a primary or a collapsed assembly. Hence, when you are attempting to construct a haplotype-resolved assembly, these duplicates are actually what you preferred and purging them is an incorrect choice.
So far we have followed the RAFT assembly pipeline which includes read fragmentation (making ONT reads more like HiFi in length evenness) and hifiasm assembly with the hic reads (and >50kb ultra-long data subset).
I'm not sure whether this strategy works when assembling a haplotype-resolved assembly. The RAFT pipeline uses hifiasm to correct reads. However, the read correction function in hifiasm is also designed for PacBio HiFi reads. ONT R10.4 reads are still more error-prone than HiFi reads, and the sequencing errors in them are often biasedly distributed. Hifiasm may view these sequencing errors as true heterozygous sites, resulting in incorrect phasing results. The newest ONT reads (median accuracy: Q26~Q28, SQK-LSK114 chemistry + V5.0.0 sup basecaller) may be suitable for hifiasm (https://github.com/chhylp123/hifiasm/issues/573). You may also ask the author of hifiasm for more information.
Thank you very much for your comments. We have taken the suggestions on board and no longer using purged assemblies. The unitigs really are making great contact maps -- it's unclear why the combined hap*.p_ctg would be the preferred option.
In the discussion here: https://github.com/zengxiaofei/HapHiC/issues/21 it's mentioned "If hifiasm is able to generate balanced haplotype assemblies, I recommend using hap*.p_ctg rather than p_utg." and also "While p_utg is used, there is no phasing information in the corresponding GFA file, HapHiC only makes use of the read depth information in the GFA file for contig filtering."
Can you see the following two files for comparison? The unitigs is more clean.
Or maybe this "exclude these short unanchored sequences from the contact map using the --min_len parameter." or "--specified_scaffolds "group1,group2,group3,group4" as is being discussed here: https://github.com/zengxiaofei/HapHiC/issues/48 I guess we would be worried about losing important information from the assembly.
(In the meantime we will run some further QC steps on the scaffolded assembly which could also help decide which version is best)
"Purging duplicates" is somewhat a misleading term. These duplicates are derived from the heterozygous sites and this term is defined from the perspective of constructing a primary or a collapsed assembly. Hence, when you are attempting to construct a haplotype-resolved assembly, these duplicates are actually what you preferred and purging them is an incorrect choice.
From the current perspective, the statement above may be a bit hasty. In trio mode, hifiasm sets -l
to 0
by default, which aligns with my previous understanding. However, in Hi-C mode, hifiasm sets -l
to 3
by default. This may be due to a different algorithm used in Hi-C-based phasing. In this mode, hifiasm relies on determining homologous coverage and a similarity threshold to identify haplotigs. Therefore, we can understand why the authors recommend tuning the --hom-cov
and -s
parameters in such scenarios.
Please refer to the FAQs above if you encounter related issues. But I still would not recommend standalone tools for purging duplicates.
Even if the three phased haplotype assemblies outputted by hifiasm were similar in size, it is still maybe ok to use the unitigs rather than the haplotype assemblies, and let HapHiC do the phasing during scaffolding, no?
Yes, using unitigs is always acceptable. We recommend trying hap*.p_ctg
when haplotypes are balanced simply because that contigs (p_utg) are longer than unitigs (p_ctg).
Can you see the following two files for comparison? The unitigs is more clean.
Yes, I also prefer the unitigs. However, according to the contact map, please be aware that Group 1 may contain unitigs from different haplotypes. This misassignment error is primarily due to the presence of a large collapsed region in this chromosome (please refer to the definition of collapse here and in Fig. 2a of our paper). In addition, there are some scaffolds from the same chromosome. Therefore, you need to manually curate your assembly in Juicebox.
Our understanding is that the contact map with the three hap*.p_ctg combined ("allhaps") is showing lots of shorter scaffolds with the grey lines top and right on the plot. We are thinking of filtering out some of these shorter scaffolds e.g. 10kb filter then mapping hic reads again and putting it back through HapHiC pipeline . Would you recommend this approach? Or maybe this "exclude these short unanchored sequences from the contact map using the --min_len parameter."
Just use unitigs. Since these unanchored contigs often lack Hi-C links, your strategy may not work.
or "--specified_scaffolds "group1,group2,group3,group4" as is being discussed here: https://github.com/zengxiaofei/HapHiC/issues/48
--specified_scaffolds
will not alter your scaffolders, it is simply a method for selecting a subset of scaffolds for visualization.
Close this issue as there has been no response for two weeks.
HapHiC is working very well so far, and now we would like to fine-tune the scaffolding (looking into more of the options, and comparing different input versions).
My question: Is it possible to keep using the original gfa file to go with haplotype assemblies that have been purged? Because they no longer match. It seems that purging (with purge dups, separately after hifiasm) has reduced the length of some contigs. Error message:
While we are leaning towards using the unitigs, we are also looking at the haplotype assemblies combined (before and now after purging as well) for comparisons. i.e. purge the haplotype assemblies separately, see that the assembly stats and busco duplication scores improve, then combine the purged haplotype assemblies for haphic pipeline.
Background info: We are aiming for haplotype-resolved chromosome-level assemblies for hybrid stick insects. They are quite large genomes (~3.2 Gb haploid) as well as highly heterozygous and repetitive due to the hybridisation history.
Data types we are using: ONT (R10), HiC reads, and soon coming Illumina reads for polishing.
So far we have followed the RAFT assembly pipeline which includes read fragmentation (making ONT reads more like HiFi in length evenness) and hifiasm assembly with the hic reads (and >50kb ultra-long data subset). So we are getting the phased haplotype assemblies.
However, for the specimen in question, hap1 was twice the size of hap2, and busco genes highly duplicated. It is supposedly a diploid but there might be three sets of homologous chromosomes but maybe not all of them. Perhaps this is why hifiasm had trouble making hap1 and hap2 the same size. I will run hifiasm again with the triploid option.
Some posts here recommend not to purge, and indeed we are preferring the clean contact map from p_utg. But due to the hybrid nature of the genomes, our busco duplication is quite high and purging brings this down a lot. So the question still stands about how to match the gfa file. One thing that has to be done is rename the headers in the purged fasta so they are the same as in the gfa file, because purging adds and extra
_1
to the end of every header.Many thanks, Gemma