zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
https://www.nature.com/articles/s41477-024-01755-3
BSD 3-Clause "New" or "Revised" License

How to improve allotetraploid scaffolding in haphic? #22

Closed RezwanCAAS closed 5 months ago

RezwanCAAS commented 7 months ago

Hi, I used HapHiC to scaffold an allotetraploid genome (2n=4x=44). It produced the 22 groups listed below, whose sizes vary widely, as can also be seen in the Juicebox plot. Could you suggest how to improve this?

group1 335147534
group2 265712237
group3 259566223
group4 256679759
group5 252275747
group6 223506148
group7 216891361
group8 204149503
group9 178470265
group10 169460411
group11 165539612
group12 151015244
group13 150193252
group14 124757396
group15 106023119
group16 103438627
group17 102100592
group18 94194491
group19 71377479
group20 51412018
group21 41958356
group22 41183465

Screenshot 2024-04-17 at 6 04 38 AM
zengxiaofei commented 7 months ago

Hi @RezwanCAAS,

According to your heatmap, it seems that the homologous chromosomes were incorrectly clustered. How did you assemble the genome and which assembly did you use for scaffolding (e.g., p_utg, p_ctg, or hap*.p_ctg)?

Best, Xiaofei

RezwanCAAS commented 7 months ago

Hi @zengxiaofei, I used hifiasm with the following command and got these outputs:

module load hifiasm/0.19.8
hifiasm -o yellow_assembly -t 32 --hom-cov 63 \
 --h1 yellow_1.fastq.gz \
 --h2 yellow_2.fastq.gz \
 reads_cell_*

Output:

-rw-r--r-- 1 tariqr ibex-c2141 44943554304 Mar  2 02:38 yellow_assembly.ec.bin
-rw-r--r-- 1 tariqr ibex-c2141  3020953966 Mar 25 17:29 yellow_assembly.hic.hap1.p_ctg.fasta
-rw-r--r-- 1 tariqr ibex-c2141  3083618349 Mar  2 10:52 yellow_assembly.hic.hap1.p_ctg.gfa
-rw-r--r-- 1 tariqr ibex-c2141    16185036 Mar  2 10:52 yellow_assembly.hic.hap1.p_ctg.lowQ.bed
-rw-r--r-- 1 tariqr ibex-c2141    62763143 Mar  2 10:52 yellow_assembly.hic.hap1.p_ctg.noseq.gfa
-rw-r--r-- 1 tariqr ibex-c2141  3603444541 Mar 25 17:30 yellow_assembly.hic.hap2.p_ctg.fasta
-rw-r--r-- 1 tariqr ibex-c2141  3680868301 Mar  2 10:53 yellow_assembly.hic.hap2.p_ctg.gfa
-rw-r--r-- 1 tariqr ibex-c2141    16712429 Mar  2 10:54 yellow_assembly.hic.hap2.p_ctg.lowQ.bed
-rw-r--r-- 1 tariqr ibex-c2141    77494725 Mar  2 10:53 yellow_assembly.hic.hap2.p_ctg.noseq.gfa
-rw-r--r-- 1 tariqr ibex-c2141  3358681400 Mar  2 10:04 yellow_assembly.hic.lk.bin
-rw-r--r-- 1 tariqr ibex-c2141  3728413366 Mar 25 17:31 yellow_assembly.hic.p_ctg.fasta
-rw-r--r-- 1 tariqr ibex-c2141  3807131425 Mar  2 04:21 yellow_assembly.hic.p_ctg.gfa
-rw-r--r-- 1 tariqr ibex-c2141    16786239 Mar  2 04:22 yellow_assembly.hic.p_ctg.lowQ.bed
-rw-r--r-- 1 tariqr ibex-c2141    78785721 Mar  2 04:21 yellow_assembly.hic.p_ctg.noseq.gfa
-rw-r--r-- 1 tariqr ibex-c2141  7065869776 Mar  2 04:16 yellow_assembly.hic.p_utg.gfa
-rw-r--r-- 1 tariqr ibex-c2141    36327989 Mar  2 04:18 yellow_assembly.hic.p_utg.lowQ.bed
-rw-r--r-- 1 tariqr ibex-c2141   141553288 Mar  2 04:17 yellow_assembly.hic.p_utg.noseq.gfa
-rw-r--r-- 1 tariqr ibex-c2141  8681089843 Mar  2 04:12 yellow_assembly.hic.r_utg.gfa
-rw-r--r-- 1 tariqr ibex-c2141    47969038 Mar  2 04:14 yellow_assembly.hic.r_utg.lowQ.bed
-rw-r--r-- 1 tariqr ibex-c2141   156694833 Mar  2 04:13 yellow_assembly.hic.r_utg.noseq.gfa
-rw-r--r-- 1 tariqr ibex-c2141 50678500976 Mar  2 06:18 yellow_assembly.hic.tlb.bin
-rw-r--r-- 1 tariqr ibex-c2141 29932238864 Mar  2 03:49 yellow_assembly.ovlp.reverse.bin
-rw-r--r-- 1 tariqr ibex-c2141 20184090104 Mar  2 03:02 yellow_assembly.ovlp.source.bin

Later, I used the yellow_assembly.hic.p_ctg.fasta file for scaffolding with HapHiC. Could you suggest how to improve the scaffolding?

zengxiaofei commented 7 months ago

Hi @RezwanCAAS,

You need to concatenate the contigs in the hap*.p_ctg files for scaffolding, rather than using the p_ctg file, because the contigs in p_ctg are not phased. Additionally, the nchrs parameter should be set to 44.

Best, Xiaofei
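
Editor's sketch of the concatenation step suggested above. It is a minimal Python illustration, not part of HapHiC or hifiasm: the function name `concat_fastas` and the combined output file name are hypothetical, and the input file names are taken from the listing earlier in the thread. The duplicate-ID check is an added safeguard, on the assumption that every contig needs a unique name in the combined file.

```python
def concat_fastas(inputs, output):
    """Concatenate FASTA files into `output`, streaming line by line.

    Raises ValueError if the same sequence ID appears twice, since
    downstream tools generally expect unique contig names.
    Returns the number of sequences written.
    """
    seen = set()
    with open(output, "w") as out:
        for path in inputs:
            last = "\n"
            with open(path) as fh:
                for line in fh:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        seq_id = fields[0] if fields else ""
                        if seq_id in seen:
                            raise ValueError(f"duplicate sequence ID: {seq_id!r}")
                        seen.add(seq_id)
                    out.write(line)
                    last = line
            # Make sure the next file's first header starts on its own line.
            if not last.endswith("\n"):
                out.write("\n")
    return len(seen)


# Example call (hypothetical output name; inputs as listed in the thread):
# concat_fastas(
#     ["yellow_assembly.hic.hap1.p_ctg.fasta",
#      "yellow_assembly.hic.hap2.p_ctg.fasta"],
#     "yellow_assembly.hic.hap12.p_ctg.fasta",
# )
```

The Hi-C reads would then need to be aligned to the combined file, not to p_ctg, before running HapHiC with nchrs set to 44.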

RezwanCAAS commented 7 months ago

@zengxiaofei thank you so much for helping. I will let you know soon after getting the results.

RezwanCAAS commented 7 months ago

Hi @zengxiaofei, following the suggestions above, I now have 44 groups, which are shown in the Hi-C plot. What do you suggest here? How can I improve it?


group1  268313361
group2  229185811
group3  213565163
group4  211333994
group5  192355810
group6  178114398
group7  175672271
group8  173070932
group9  167268982
group10 166994227
group11 165224190
group12 163976005
group13 160510958
group14 156551379
group15 154125454
group16 151485858
group17 149388751
group18 149068743
group19 147517207
group20 146026689
group21 145559499
group22 143005682
group23 141123267
group24 137886426
group25 137438397
group26 135923374
group27 130012929
group28 129423837
group29 128388296
group30 127998997
group31 124813519
group32 124052846
group33 118674522
group34 117104620
group35 112908057
group36 108358901
group37 101293320
group38 97290716
group39 90849898
group40 85567886
group41 74996220
group42 72339106
group43 44300534
group44 41067405
Screenshot 2024-04-21 at 9 26 45 PM
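
Editor's sketch for putting a number on how uneven the group sizes are, so that two scaffolding runs can be compared. It is a minimal Python illustration, not a HapHiC feature: the helper names `parse_group_sizes` and `summarize` are hypothetical, and the input is assumed to be whitespace-separated "name size" pairs, as in the lists posted in this thread.

```python
import statistics


def parse_group_sizes(text):
    """Parse whitespace-separated 'name size' pairs into a dict.

    Works for one pair per line as well as for all pairs on one line.
    """
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("expected an even number of tokens (name size pairs)")
    return {name: int(size) for name, size in zip(tokens[0::2], tokens[1::2])}


def summarize(sizes):
    """Return simple spread statistics for a dict of group sizes (bp)."""
    values = list(sizes.values())
    return {
        "n_groups": len(values),
        "total_bp": sum(values),
        "min_bp": min(values),
        "max_bp": max(values),
        "max_min_ratio": max(values) / min(values),
        # Coefficient of variation: population std dev divided by the mean.
        "cv": statistics.pstdev(values) / statistics.mean(values),
    }


# A small subset of the 44-group list above, for illustration only.
example = """
group1  268313361
group2  229185811
group43 44300534
group44 41067405
"""
print(summarize(parse_group_sizes(example)))
```

These statistics only describe the size spread. They do not show whether contigs were assigned to the correct group, which still has to be judged from the contact heatmap.
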
zengxiaofei commented 7 months ago

Hi @RezwanCAAS,

It seems that the heatmap is clear enough. You can manually adjust it in Juicebox after importing the .assembly file with the "balanced" normalization.

Best, Xiaofei

RezwanCAAS commented 7 months ago

@zengxiaofei thank you for your great help. I will add the final figure here after correction with juicebox as reference for other users as well.

zengxiaofei commented 7 months ago

@RezwanCAAS Thanks for sharing!

RezwanCAAS commented 7 months ago

@zengxiaofei Please check this plot having 44 chromosomes. How does this look?

contact_map.pdf

RezwanCAAS commented 7 months ago

One more question: why are the signals circled in red not in contact with the main scaffolds? I tried to curate them, but it didn't work. Is this due to artifacts of homologous regions?

IMG_1921

zengxiaofei commented 6 months ago

Sorry for the delay. I'm quite busy these days. Your contact heatmap shows that there are still many errors in the contig assignment, as well as in the ordering and orientation. The Hi-C signals you highlighted with red circles are signals between homologous chromosomes. They mainly derive from assembly errors, which are normal for haplotype-phased assemblies. You could check out our manuscript for more information about collapsed contigs, chimeric contigs, and switch errors.

You could also have a look at the heatmaps we generated when testing real cases, especially those for haplotype-phased assemblies. These figures are in the Supplementary Information. I believe they will be helpful for curating your assembly.

zengxiaofei commented 6 months ago

Here are two examples of S. spontaneum Np-X (sugarcane) and C. sinensis Tieguanyin (tea plant):

image

image

RezwanCAAS commented 6 months ago

@zengxiaofei Thank you so much for your great support. The shared examples will help me improve my genome plot accordingly.