zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

Work with hifiasm using haplotype-specific data #21

Closed chenzixi07 closed 5 months ago

chenzixi07 commented 6 months ago

Hello @zengxiaofei , I used hifiasm with HiFi and Hi-C data and generated haplotype-specific assemblies, including p_ctg.fa, p_utg.fa, r_utg.fa, and hap*.p_ctg.fa. Following the tutorial, I want to run the command below:

/path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa"

1) Which fa file should I use as the input FASTA (the allhaps.fa above)? p_utg, p_ctg, or the hap*.p_ctg.fa files concatenated into one file? According to the previous post #16, maybe I should cat the hap*.p_ctg.fa files into one file?

2) Which fa should I use for Hi-C read mapping to generate the filtered BAM (HiC.filtered.bam)?

3) I tested both 46 and 46*3 (138) for scaffolding, as I mentioned in #19. I used p_utg.fa as the only input, without hap*.gfa. I am a beginner in genome assembly. It seems that 46 is better for resolving the triploid genome? But the survey of my species suggests 2n=3x with a genome size of ~2.4 Gb, so the size of the 46*3 result seems better? I am really confused.

The pic below was generated using the command: haphic pipeline p_utg.fa HiC.filtered.bam 138 --gfa p_utg.gfa [image]

The pic below was generated using the command: haphic pipeline p_utg.fa HiC.filtered.bam 46 --gfa p_utg.gfa [image]

Regards, Zixi

zengxiaofei commented 6 months ago

Hi @chenzixi07,

1) In theory, primary unitigs (p_utg) have no phasing errors if generated correctly. However, they are typically much shorter than haplotype-phased contigs (hap*.p_ctg). If hifiasm is able to generate balanced haplotype assemblies, I recommend using hap*.p_ctg rather than p_utg. Unfortunately, hifiasm may output imbalanced haplotype assemblies, especially for autopolyploids. In such cases, you need to use p_utg instead of hap*.p_ctg. If the hap*.p_ctg files are concatenated for downstream scaffolding, you can benefit from the phasing information output by hifiasm via --gfa (a command sketch follows below this list). When p_utg is used, there is no phasing information in the corresponding GFA file; HapHiC only makes use of the read depth information in the GFA file for contig filtering.

2) The answer is the same as for the first question: map the Hi-C reads to whichever FASTA you use for scaffolding.

3) According to the Hi-C heatmaps you provided, nchr=138 is much better than nchr=46, as most of the homologous chromosomes are incorrectly clustered together in the latter. However, there are still some clustering errors for nchr=138. You may alleviate these errors by adding the parameter --remove_allelic_links 3.
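For reference, hedged command sketches of both suggestions (the file names, the ploidy of 3, and nchrs=138 are simply the values already mentioned in this thread; adjust them to your data):

# Option A: concatenate the haplotype assemblies and pass their GFA files for phasing information
# (the Hi-C reads would need to be re-mapped and re-filtered against allhaps.fa first)
cat hap1.p_ctg.fa hap2.p_ctg.fa hap3.p_ctg.fa > allhaps.fa
/path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam 138 \
    --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa,hap3.p_ctg.gfa"

# Option B: keep the current p_utg input and add --remove_allelic_links (ploidy = 3)
haphic pipeline p_utg.fa HiC.filtered.bam 138 --gfa p_utg.gfa --remove_allelic_links 3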

Best, Xiaofei

chenzixi07 commented 6 months ago

Hi @zengxiaofei , 1) Here are the hifiasm assembly statistics; it seems that the hap*.p_ctg assemblies can be considered balanced? [image]

Another question: I hope to assemble a 3x ≈ 2.4 Gb triploid genome (haploid ~800 Mb), but the total length of the hifiasm hap*.p_ctg assemblies is nearly 2.4 Gb * 3. Is it necessary to use tools like purge_dups to deduplicate them before HapHiC scaffolding? I will try the concatenated hap*.p_ctg with HapHiC this week.

3) I did run the pipeline with the parameter --remove_allelic_links 3. I pasted the heatmap below. [image]

Actually, I think 138 may not be the accurate chromosome number of my species. Among the 46 chromosomes, most are triploid, but maybe not all. I am still trying to get the exact number.

Regards, Zixi

zengxiaofei commented 6 months ago

1) Although the second haplotype assembly is slightly shorter, you can try using it.

2) How did you estimate the genome size? You can verify this based on the k-mer histograms output by hifiasm. purge_dups is typically used to purge duplications (which are derived from heterozygous sites) between haplotigs, so it's not preferred in the context of haplotype phasing. Another concern is that your heatmap of p_utg only represents ~2 Gb, but your assembly is ~5 Gb. Note that haphic plot ignores scaffolds shorter than 1 Mb. It would be better to use the parameter --min_len 0 or visualize it in Juicebox (a command sketch follows below this list).

3) The heatmap is clear enough for counting the exact chromosome number or for manual curation in Juicebox.
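A hedged sketch of the --min_len 0 idea, assuming haphic plot takes the scaffold AGP and the filtered BAM as positional arguments and that the AGP is named scaffolds.raw.agp (check haphic plot --help for the exact usage and output names in your HapHiC version):

haphic plot --min_len 0 scaffolds.raw.agp HiC.filtered.bam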

chenzixi07 commented 6 months ago

Thanks for your reply! We estimated the genome size by both FCM (flow cytometry) and a short-read/long-read genome survey (although the result is... weird).

I found that the sequence IDs in the hap*.fa files were all h1tg00xxxxl, but the sequences from the 3 hap*.fa files with the same ID have different lengths. So is it OK to rename the IDs in hap2.fa to h2tg and in hap3.fa to h3tg?

I also noticed that in the hifiasm log there is a line: [M::ha_ft_gen] peak_hom: 27; peak_het: -1. Is something wrong with the result? Should I rerun hifiasm?

zengxiaofei commented 6 months ago

We estimated the genome size by both FCM (flow cytometry) and a short-read/long-read genome survey (although the result is... weird).

What method did you use for the genome survey? Generally, I do not use tools like GenomeScope or GenomeScope2. Instead, I often estimate the genome size by calculating the ratio k-mer count / peak depth, where the k-mer count represents the number of k-mers with a minimum depth of 2 and, for phased assemblies, the peak depth is the k-mer depth of the leftmost peak. In the case of HiFi reads, the formula can be simplified to the ratio sequenced bases / peak depth.
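As a minimal sketch of that calculation (one reading of the formula, assuming histo.txt is a two-column k-mer histogram with "depth  number_of_distinct_kmers" per line, e.g. from jellyfish histo, that the k-mer count means the total number of k-mer occurrences at depth >= 2, and that the leftmost peak sits at a placeholder depth of 27):

# sum depth * count for depths >= 2, then divide by the peak depth
awk -v peak=27 '$1 >= 2 { total += $1 * $2 }
    END { printf "estimated genome size: %.2f Gb\n", total / peak / 1e9 }' histo.txt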

If the assembly size matches the result of the k-mer analysis, the inconsistency between FCM and the k-mer analysis might be explained by (1) a high proportion of organelle DNA; (2) an inappropriate reference used in FCM; (3) genetic variation among individuals; or (4) a chimera.

I found that the sequence IDs in the hap*.fa files were all h1tg00xxxxl, but the sequences from the 3 hap*.fa files with the same ID have different lengths. So is it OK to rename the IDs in hap2.fa to h2tg and in hap3.fa to h3tg?

Yes, it could be a bug in hifiasm. The sequence IDs should be renamed (a sketch of one way to do this is shown below).
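A hedged sketch under the assumptions of this thread (output names like hic.hap2.p_ctg.fa and headers starting with h1tg are assumptions; adjust them to your files). If the renamed FASTA is later used together with --gfa, the segment names inside the corresponding GFA files would presumably need the same renaming:

# prefix the hap2 and hap3 headers so the IDs are unique after concatenation
sed 's/^>h1tg/>h2tg/' hic.hap2.p_ctg.fa > hap2.renamed.fa
sed 's/^>h1tg/>h3tg/' hic.hap3.p_ctg.fa > hap3.renamed.fa
cat hic.hap1.p_ctg.fa hap2.renamed.fa hap3.renamed.fa > allhaps.fa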

I also noticed that in the hifiasm log there is a line: [M::ha_ft_gen] peak_hom: 27; peak_het: -1. Is something wrong with the result? Should I rerun hifiasm?

You can manually specify the homozygous coverage using the --hom-cov parameter, based on the k-mer histogram output by hifiasm (≈ the k-mer depth of the leftmost peak * ploidy).
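For example, a hedged sketch of rerunning hifiasm with this parameter (the read file names, thread count, and the value 81, i.e. a hypothetical leftmost peak at ~27x times ploidy 3, are all placeholders, not recommendations):

hifiasm -o asm -t 36 --n-hap 3 --hom-cov 81 \
    --h1 clean_1P.fastq.gz --h2 clean_2P.fastq.gz \
    hifi_reads.fastq.gz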

chenzixi07 commented 6 months ago

1) I used gce for the genome survey. 3) Yeah, I will try both approaches: using all the hap* FASTA files, and rerunning hifiasm with --hom-cov. Hope to get a better result.

Thanks again!

chenzixi07 commented 6 months ago

Hi @zengxiaofei 1) I tried using the concatenated hap*.p_ctg files as the reference genome. After bwa mapping, the size of the BAM file is 371G, but the filtered BAM is only 2.3G (filter_bam HiC.bam 1 --nm 3). In comparison, when using p_utg as the reference, the size of the BAM file is 336G and the filtered BAM is 36G. Is the 2.3G BAM file too small for scaffolding?

2) When using the pipeline, I set nchrs to 138, but it could only get 81 clusters even with an inflation of 3.0 ("The maximum number of clusters (81) is even less than the expected number of chromosomes (138). You could try higher inflation."). I tried setting --max_inflation to 40.0, but the pipeline still could not produce the clusters I want, and it automatically quit at the clustering step. As I posted above, when I used haphic pipeline p_utg.fa HiC.filtered.bam 138 --gfa p_utg.gfa, although the pipeline could not get 138 clusters, it did not break. I wonder what the difference is, and what am I supposed to do?

Regards, Zixi

zengxiaofei commented 6 months ago

I tried using the concatenated hap*.p_ctg files as the reference genome. After bwa mapping, the size of the BAM file is 371G, but the filtered BAM is only 2.3G (filter_bam HiC.bam 1 --nm 3). In comparison, when using p_utg as the reference, the size of the BAM file is 336G and the filtered BAM is 36G. Is the 2.3G BAM file too small for scaffolding?

It's a little bit weird. Could you please show me the command lines of bwa mapping and bam filtering? We may need to solve this before considering the second problem.

chenzixi07 commented 6 months ago

I used the commands you suggested in the Quick start section; the hic.hapAll.p_ctg.fa file is the concatenated hap*.p_ctg file.

asm=hic.hapAll.p_ctg.fa
samtools faidx $asm
bwa index $asm

bwa mem -5SP -t 36 $asm \
    clean_1P.fastq.gz \
    clean_2P.fastq.gz \
    | samblaster | samtools view - -@ 8 -S -h -b -F 3340 -o HiC.bam

filter_bam HiC.bam 1 --nm 3 --threads 16 | samtools view - -b -@ 16 -o HiC.filtered.bam

Thanks! Zixi

zengxiaofei commented 5 months ago

Maybe the Hi-C mode of hifiasm assembled too many identical sequences that are not in the primary unitigs, which can lead to fewer effective Hi-C reads after filtering. You can try scaffolding each haplotype individually.
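A hedged sketch of the per-haplotype route (the file names, thread counts, and the per-haplotype chromosome number of 46 are assumptions carried over from the commands earlier in this thread):

for hap in hap1 hap2 hap3; do
    asm=hic.${hap}.p_ctg.fa
    samtools faidx $asm
    bwa index $asm
    # map and filter the Hi-C reads against this haplotype only
    bwa mem -5SP -t 36 $asm clean_1P.fastq.gz clean_2P.fastq.gz \
        | samblaster | samtools view - -@ 8 -S -h -b -F 3340 -o ${hap}.HiC.bam
    filter_bam ${hap}.HiC.bam 1 --nm 3 --threads 16 \
        | samtools view - -b -@ 16 -o ${hap}.HiC.filtered.bam
    # run HapHiC in a separate directory per haplotype
    mkdir -p ${hap}_scaffolding
    (cd ${hap}_scaffolding && haphic pipeline ../$asm ../${hap}.HiC.filtered.bam 46 --gfa ../hic.${hap}.p_ctg.gfa)
done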

chenzixi07 commented 5 months ago

Thanks, I will try other approaches. For now, the utg/ctg-based scaffolding results seem to be good enough for further analysis.

Regards, Zixi