zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License
138 stars 10 forks source link

mis-matched .fa <--> .gfa after purging #64

Closed gemmacol closed 1 month ago

gemmacol commented 1 month ago

HapHiC is working very well so far, and now we would like to fine-tune the scaffolding (looking into more of the options, and comparing different input versions).

My question: Is it possible to keep using the original gfa file to go with haplotype assemblies that have been purged? Because they no longer match. It seems that purging (with purge dups, separately after hifiasm) has reduced the length of some contigs. Error message:

The contig h1tg000013l in gfa file finalasm.hic.hap1.p_ctg.gfa has a different length than the one in the fasta file. Maybe the gfa file does not match the fasta file.

NCHRS=36
haphic pipeline purged_haps_combined.fa HiC.filtered.bam $NCHRS --gfa "${GFA1},${GFA2}" --correct_nrounds 2 --remove_allelic_links 2 --threads 48 --processes 48

While we are leaning towards using the unitigs, we are also looking at the haplotype assemblies combined (before and now after purging as well) for comparisons. i.e. purge the haplotype assemblies separately, see that the assembly stats and busco duplication scores improve, then combine the purged haplotype assemblies for haphic pipeline.

Background info: We are aiming for haplotype-resolved chromosome-level assemblies for hybrid stick insects. They are quite large genomes (~3.2 Gb haploid) as well as highly heterozygous and repetitive due to the hybridisation history.

Data types we are using: ONT (R10), HiC reads, and soon coming Illumina reads for polishing.

So far we have followed the RAFT assembly pipeline which includes read fragmentation (making ONT reads more like HiFi in length evenness) and hifiasm assembly with the hic reads (and >50kb ultra-long data subset). So we are getting the phased haplotype assemblies.

However, for the specimen in question, hap1 was twice the size of hap2, and busco genes highly duplicated. It is supposedly a diploid but there might be three sets of homologous chromosomes but maybe not all of them. Perhaps this is why hifiasm had trouble making hap1 and hap2 the same size. I will run hifiasm again with the triploid option.

Some posts here recommend not to purge, and indeed we are preferring the clean contact map from p_utg. But due to the hybrid nature of the genomes, our busco duplication is quite high and purging brings this down a lot. So the question still stands about how to match the gfa file. One thing that has to be done is rename the headers in the purged fasta so they are the same as in the gfa file, because purging adds and extra _1 to the end of every header.

Many thanks, Gemma

zengxiaofei commented 1 month ago

My question: Is it possible to keep using the original gfa file to go with haplotype assemblies that have been purged?

The answer is currently no. We designed these checks because many users often input wrong files. Although we can provide an additional parameter to skip these checks, it needs some time for me. Simply, you can just modify the original GFA files to delete the contigs removed by purge_dups in S lines and update the sequence lengths in LN tags.

Some posts here recommend not to purge

"Purging duplicates" is somewhat a misleading term. These duplicates are derived from the heterozygous sites and this term is defined from the perspective of constructing a primary or a collapsed assembly. Hence, when you are attempting to construct a haplotype-resolved assembly, these duplicates are actually what you preferred and purging them is an incorrect choice.

So far we have followed the RAFT assembly pipeline which includes read fragmentation (making ONT reads more like HiFi in length evenness) and hifiasm assembly with the hic reads (and >50kb ultra-long data subset).

I'm not sure whether this strategy works when assembling a haplotype-resolved assembly. The RAFT pipeline uses hifiasm to correct reads. However, the read correction function in hifiasm is also designed for PacBio HiFi reads. ONT R10.4 reads are still more error-prone than HiFi reads, and the sequencing errors in them are often biasedly distributed. Hifiasm may view these sequencing errors as true heterozygous sites, resulting in incorrect phasing results. The newest ONT reads (median accuracy: Q26~Q28, SQK-LSK114 chemistry + V5.0.0 sup basecaller) may be suitable for hifiasm (https://github.com/chhylp123/hifiasm/issues/573). You may also ask the author of hifiasm for more information.

gemmacol commented 1 month ago

Thank you very much for your comments. We have taken the suggestions on board and no longer using purged assemblies. The unitigs really are making great contact maps -- it's unclear why the combined hap*.p_ctg would be the preferred option.

In the discussion here: https://github.com/zengxiaofei/HapHiC/issues/21 it's mentioned "If hifiasm is able to generate balanced haplotype assemblies, I recommend using hap*.p_ctg rather than p_utg." and also "While p_utg is used, there is no phasing information in the corresponding GFA file, HapHiC only makes use of the read depth information in the GFA file for contig filtering."

  1. Even if the three phased haplotype assemblies outputted by hifiasm were similar in size, it is still maybe ok to use the unitigs rather than the haplotype assemblies, and let HapHiC do the phasing during scaffolding, no?

Can you see the following two files for comparison? The unitigs is more clean.

  1. Our understanding is that the contact map with the three hap*.p_ctg combined ("allhaps") is showing lots of shorter scaffolds with the grey lines top and right on the plot. We are thinking of filtering out some of these shorter scaffolds e.g. 10kb filter then mapping hic reads again and putting it back through HapHiC pipeline . Would you recommend this approach?

Or maybe this "exclude these short unanchored sequences from the contact map using the --min_len parameter." or "--specified_scaffolds "group1,group2,group3,group4" as is being discussed here: https://github.com/zengxiaofei/HapHiC/issues/48 I guess we would be worried about losing important information from the assembly.

(In the meantime we will run some further QC steps on the scaffolded assembly which could also help decide which version is best)

AXP_p_utg_nchrs54.pdf

AXP_allhaps_nchrs54.pdf

zengxiaofei commented 1 month ago

"Purging duplicates" is somewhat a misleading term. These duplicates are derived from the heterozygous sites and this term is defined from the perspective of constructing a primary or a collapsed assembly. Hence, when you are attempting to construct a haplotype-resolved assembly, these duplicates are actually what you preferred and purging them is an incorrect choice.

From the current perspective, the statement above may be a bit hasty. In trio mode, hifiasm sets -l to 0 by default, which aligns with my previous understanding. However, in Hi-C mode, hifiasm sets -l to 3 by default. This may be due to a different algorithm used in Hi-C-based phasing. In this mode, hifiasm relies on determining homologous coverage and a similarity threshold to identify haplotigs. Therefore, we can understand why the authors recommend tuning the --hom-cov and -s parameters in such scenarios.

Please refer to the FAQs above if you encounter related issues. But I still would not recommend standalone tools for purging duplicates.

Even if the three phased haplotype assemblies outputted by hifiasm were similar in size, it is still maybe ok to use the unitigs rather than the haplotype assemblies, and let HapHiC do the phasing during scaffolding, no?

Yes, using unitigs is always acceptable. We recommend trying hap*.p_ctg when haplotypes are balanced simply because that contigs (p_utg) are longer than unitigs (p_ctg).

Can you see the following two files for comparison? The unitigs is more clean.

Yes, I also prefer the unitigs. However, according to the contact map, please be aware that Group 1 may contain unitigs from different haplotypes. This misassignment error is primarily due to the presence of a large collapsed region in this chromosome (please refer to the definition of collapse here and in Fig. 2a of our paper). In addition, there are some scaffolds from the same chromosome. Therefore, you need to manually curate your assembly in Juicebox.

Our understanding is that the contact map with the three hap*.p_ctg combined ("allhaps") is showing lots of shorter scaffolds with the grey lines top and right on the plot. We are thinking of filtering out some of these shorter scaffolds e.g. 10kb filter then mapping hic reads again and putting it back through HapHiC pipeline . Would you recommend this approach? Or maybe this "exclude these short unanchored sequences from the contact map using the --min_len parameter."

Just use unitigs. Since these unanchored contigs often lack Hi-C links, your strategy may not work.

or "--specified_scaffolds "group1,group2,group3,group4" as is being discussed here: https://github.com/zengxiaofei/HapHiC/issues/48

--specified_scaffolds will not alter your scaffolders, it is simply a method for selecting a subset of scaffolds for visualization.

zengxiaofei commented 1 month ago

Close this issue as there has been no response for two weeks.