zengxiaofei / HapHiC

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data
BSD 3-Clause "New" or "Revised" License
112 stars 7 forks source link
3d-dna allhic assembly chromosome genome haplotype hi-c hifiasm lachesis phasing pore-c salsa scaffolding t2t yahs

HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data

HapHiC is an allele-aware scaffolding tool that uses Hi-C data to scaffold haplotype-phased genome assemblies into chromosome-scale pseudomolecules. Unlike ALLHiC, another allele-aware scaffolder, HapHiC can achieve this without the need for reference genomes. Our evaluations indicate that HapHiC outperforms other Hi-C scaffolding tools with higher tolerance to low contig N50, low Hi-C sequencing depth, and various types of assembly errors. Additionally, HapHiC is super-fast and also suitable for haplotype-collapsed diploid and allopolyploid genome assemblies.

Features:

Recent updates:

Terminology: To ensure conciseness and clarity, we use the term "contigs" to refer to the fragmented genome sequences in the input assembly, although they could be either contigs or scaffolds in actuality.

Table of contents

Installation

HapHiC has been tested and validated on servers running Linux, equipped with either Intel Xeon, AMD EPYC, or Hygon C86 CPUs.

# (1) Download HapHiC from GitHub
$ git clone https://github.com/zengxiaofei/HapHiC.git

# (2) Resolve dependencies
# We strongly recommend using conda to install dependencies. If you prefer manual installation, refer to HapHiC/conda_env/create_conda_env_py310.sh
# We have also included additional environments for Python 3.11 and 3.12 in the directory HapHiC/conda_env/
$ conda env create -f HapHiC/conda_env/environment_py310.yml
# Activate the HapHiC conda environment
$ conda activate haphic # or: source /path/to/conda/bin/activate haphic

# (3) Check whether all dependencies are correctly installed
$ /path/to/HapHiC/haphic check

# (4) Show all available commands and help message
$ /path/to/HapHiC/haphic -h

Quick start

Align Hi-C data to the assembly

First, you need to prepare a BAM file by aligning Hi-C data to the assembly. Here is the way that we recommend:

# (1) Align Hi-C data to the assembly, remove PCR duplicates and filter out secondary and supplementary alignments
$ bwa index asm.fa
$ bwa mem -5SP -t 28 asm.fa /path/to/read1_fq.gz /path/to/read2_fq.gz | samblaster | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam

# (2) Filter the alignments with MAPQ 1 (mapping quality ≥ 1) and NM 3 (edit distance < 3)
$ /path/to/HapHiC/utils/filter_bam HiC.bam 1 --nm 3 --threads 14 | samtools view - -b -@ 14 -o HiC.filtered.bam

HapHiC also supports Pore-C data:

# (1) Align Pore-C data (FASTA/FASTQ) to the assembly using minimap2
$ minimap2 -x map-ont -a -t 28 asm.fa /path/to/porec_reads.gz | samtools view - -b -@ 14 -o map.bam

# (2) Convert `map.bam` to a format compatible with HapHiC
# Finally, you will obtain a new BAM file named `porec_paired.bam`, which should be used as the input BAM file for HapHiC.
$ /path/to/HapHiC/utils/prepare_paired_bam_minimap2.py map.bam

Notes:

Run HapHiC scaffolding pipeline

(i) One-line command. HapHiC provides a one-line command haphic pipeline to execute the entire scaffolding pipeline. The required parameters are:

1) asm.fa, your genome assembly file in FASTA format. 2) HiC.filtered.bam, the BAM file prepared in the previous step (the .pairs file output by chromap is also acceptable since version 1.0.3). 3) nchrs, the number of chromosomes present in the assembly, and also the expected number of output scaffolds.

$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs

(ii) Restriction site. The default restriction site is GATC (MboI/DpnII). You can modify this using the --RE parameter. If you are unsure or if your Hi-C library was constructed without restriction enzymes (REs), it is acceptable to leave it as the default.

# For HindIII
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --RE "AAGCTT"
# For Arima two-enzyme chemistry
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --RE "GATC,GANTC"
# For Arima four-enzyme chemistry
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --RE "GATC,GANTC,CTNAG,TTAA"

(iii) Contig correction. To correct input contigs based on Hi-C linking information, use --correct_nrounds to enable assembly correction and set the number of correction rounds. For example:

# Typically, two rounds of assembly correction are enough
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --correct_nrounds 2

(iv) Switch error. If your input assembly is haplotype-phased and has a high switch error rate (often introduced by assemblers when the sequence divergence between haplotypes is very low), use --remove_allelic_links to remove Hi-C links between allelic contigs. The value should be the ploidy of the assembly. For example:

# For haplotype-phased assembles of autotetraploids, set the parameter to 4
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --remove_allelic_links 4

Note: If your input assembly is haplotype-phased and the Hi-C reads are aligned using other methods like chromap, we also recommend including this parameter to mitigate the adverse effects of incorrect mapping.

(v) Performance. Use --threads to set the number of threads for BAM file reading, and --processes to create multiple processes for contig ordering and orientation. For example:

$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --threads 8 --processes 8

Parameters

For more information, run haphic pipeline --help.

Final outputs

Note: Although the one-line command is convenient, the automatic parameter tuning may fail, leading to poor results or even a pipeline interruption in rare cases. If this occurs, we recommend running each step individually with manual parameter tuning or trying the quick view mode described below.

Go through the pipeline step by step

[Step 1]. Clustering

Before clustering, HapHiC performs preprocessing to correct assembly misjoins, filter out short, mis-assembled contigs, and remove allelic Hi-C links. After that, a Markov cluster algorithm (MCL algorithm) is used to cluster the contigs into groups. Unlike agglomerative hierarchical clustering (AHC, used in LACHESIS and ALLHiC), which specifies the number of clusters, the MCL Algorithm implicitly controls it with a parameter called "inflation". The higher the inflation, the more the groups are clustered. The main problem with AHC is that even though the number of clusters is specified, contigs from different chromosomes may also be clustered into the same group. This is common in phased diploid or polyploid genome assemblies. To solve this, HapHiC tries a series of inflations to cluster the contigs (controlled by min_inflation and max_inflation) and recommends a "best" one based on both the expected number of chromosomes nchrs provided and the length distribution of the groups.

$ /path/to/HapHiC/haphic cluster asm.fa HiC.filtered.bam nchrs

Parameters

For more information, run haphic cluster --help.

Main outputs

The "best" inflation

You can find the "best" inflation recommendation in the log file HapHiC_cluster.log, like:

2022-11-07 17:50:08 <HapHiC_cluster.py> [recommend_inflation] You could try inflation from 1.20 (length ratio = 0.75)

In some cases, HapHiC cannot get the "best" one. It could be due to inappropriate parameters or extensive assembly errors. Check whether the parameters used are correct / appropriate and then try to tune the parameters for assembly correction, contig / Hi-C link filtration, or Markov Clustering:

2022-11-19 13:20:38 <HapHiC_cluster.py> [recommend_inflation] It seems that some chromosomes were grouped together (length ratio = 0.5)...

[Step 2] Reassignment

In the previous step, some contigs may have been filtered out before clustering or assigned to incorrect groups. Additionally, the number of final clusters output by Markov clustering may exceed the specified number of chromosomes (nchrs). To address these issues, we added a reassignment step to rescue contigs that are not in any groups, reassign contigs to the correct groups, and perform an additional agglomerative hierarchical clustering to concatenate groups if necessary. The input files full_links.pkl, mcl_inflation_x.clusters.txt, and paired_links.clm are outputs from the clustering step, where x represents the "best" inflation value:

$ /path/to/HapHiC/haphic reassign asm.fa full_links.pkl mcl_inflation_x.clusters.txt paired_links.clm --nclusters nchrs

Note: If assembly correction has been performed, use corrected_asm.fa as input FASTA file instead of asm.fa.

Parameters

For more information, run haphic reassign --help.

Main outputs

[Step 3] Ordering and orientation

The ordering and orientation step in HapHiC is implemented using an integration of algorithms from 3D-DNA and ALLHiC. First, an efficiency-improved 3D-DNA iterative scaffolding algorithm (refered to as "fast sorting") is used to quickly order and orient the contigs. Then, the ordering and orientation of contigs are input as an initial configuration and optimized with the ALLHiC program (a modified version, in which the hot-start optimization has been fixed). The input file HT_links.pkl is the output file from the clustering step; the directory split_clms and the group files final_groups/group*.txt were created in the reassignment step. The optional parameter --processes is used to set the number of processes for the ordering and orientation.

$ /path/to/HapHiC/haphic sort asm.fa HT_links.pkl split_clms final_groups/group*.txt --processes 8

Note: If assembly correction has been performed, use corrected_asm.fa as input FASTA file instead of asm.fa.

Parameters

For more information, run haphic sort --help.

Main outputs

[Step 4] Building pseudomolecules

The final step is to build the scaffolds (pseudomolecules) using the chromosome assignment, ordering and orientation information of contigs from the group*.tour files. By default, the output scaffolds are sorted by scaffold length.

If assembly correction was not performed:

$ /path/to/HapHiC/haphic build asm.fa asm.fa HiC.filtered.bam final_tours/group*.tour

If assembly correction has been performed, use corrected_asm.fa as input FASTA file instead of the first asm.fa. Additionally, specify the corrected contig list corrected_ctgs.txt using the --corrected_ctgs parameter. Otherwise, the YaHS-style scaffolds.raw.agp generated may be incorrect.

$ /path/to/HapHiC/haphic build corrected_asm.fa asm.fa HiC.filtered.bam final_tours/group*.tour --corrected_ctgs corrected_ctgs.txt

Note:

Parameters

For more information, run haphic build --help.

Main outputs

Examples

HapHiC can scaffold most genomes within 1 hour using 8 CPU cores. For large genomes with fragmented contigs, scaffolding typically takes less than half a day. HapHiC has been successfully validated in scaffolding genomes from various taxa, including higher plants, humans, birds, amphibians, fish, insects, mollusks, and annelids. For more examples and detailed information, please refer to the Supplementary Information in our paper.

Species Karyotype Haplotype-resolved Assembly size (Gb) Contig N50 (Mb) Number of contigs Hi-C depth after filtering Wall time (min) Peak RAM (GiB)
Giant Miscanthus 2n=3x=57 Yes 6.11 2.19 5,761 33.58× 115.35 17.10
Potato C88 2n=4x=48 Yes 3.16 18.78 2,490 13.4× 20.15 5.98
Wild sugarcane Np-X 2n=4x=40 Yes 2.76 0.38 15,510 23.7× 78.97 27.02
Alfalfa XinJiangDaYe 2n=4x=32 Yes 3.16 0.46 31,772 10.1× 33.13 7.68
Tea plant Tieguanyin 2n=2x=30 Yes 5.99 0.22 60,345 9.8× 157.53 33.68
Human HG002 2n=2x=46 Yes 6.02 73.40 1,153 4.7× 13.42 11.50
Wheat 2n=6x=42 No 14.0 2.16 12,982 1.5× 58.05 22.98
Ginkgo tree 2n=2x=24 No 9.87 1.58 261,820 54.1× 440.78 135.83
Northern goshawk 2n=2x=80 No 1.40 17.71 638 27.2× 16.95 2.19
Tropical clawed frog 2n=2x=20 No 1.48 0.38 9,631 47.5× 53.80 19.83
Corkwing warsse 2n=2x=46 No 0.64 1.19 1,774 85.3× 25.73 3.13
Chinese oak silkmoth 2n=2x=98 No 0.73 0.17 9,824 70.8× 35.33 10.66
Gray topshell 2n=2x=36 No 1.27 6.20 843 58.3× 27.07 5.06
Humus earthworm 2n=2x=36 No 0.79 0.71 2,261 64.4× 20.32 3.23

Work with hifiasm

When scaffolding a phased hifiasm assembly, you can run HapHiC with the GFA file(s) output by hifiasm. Here, the term "phased hifiasm assembly" refers to the haplotype-resolved primary contigs assembled via the trio binning or Hi-C-based algorithm (*.hap*.p_ctg.gfa), as well as the phased unitigs (*.p_utg.gfa).

HapHiC uses the read depth information in the GFA file(s) to filter out potential collapsed contigs/unitigs before clustering. If more than one GFA file is provided, HapHiC assumes these GFA files are haplotype-specific (*.hap*.p_ctg.gfa), and artificially removes or reduces the Hi-C links between the haplotypes according to this phasing information. Note that the contigs/unitigs in GFA file(s) should match those in FASTA file. Either .gfa or noseq.gfa is acceptable.

# (1) For hifiasm primary unitigs, use the GFA file to filter out potential collapsed unitigs before clustering
$ /path/to/HapHiC/haphic pipeline p_utg.fa HiC.filtered.bam nchrs --gfa p_utg.gfa

# (2) In addition to read depth filtering, HapHiC can also reduce Hi-C links between haplotypes according to phasing information in GFA files for haplotype-resolved primary contigs

# By default, all Hi-C links between haplotypes are completely removed and contigs from different haplotypes will not be clustered into the same group
$ /path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa"

# The weight can be set to 0 to ignore the phasing information. You can also set it between 0 and 1 to run HapHiC as a double check. In the latter case, contigs from different haplotypes might be clustered together
$ /path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam nchrs --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa" ---phasing_weight 0

Quick view mode

You can try the quick view mode in HapHiC when:

  1. The exact number of chromosomes is unknown.
  2. HapHiC cannot provide an acceptable clustering result or encounters a pipeline interruption.
  3. You need a quick view of your assembly (e.g., to identify the type and approximate proportion of assembly errors).
  4. You just want to manually curate your assembly and split chromosomes in Juicebox by yourself.

In quick view mode, HapHiC simply uses the fast sorting to order and orient all contigs without clustering. The result is similar to *.0.hic in 3D-DNA. Most parameters are disabled in this mode, but you can use --correct_nrounds to correct input contigs. When scaffolding a haplotype-resolved hifiasm assembly ( *.hap*.p_ctg.gfa ), you can still partition contigs into different haplotypes with the haplotype-specific GFA files.

# HapHiC will ignore the parameter "nchrs", it can be any integer
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --quick_view
# Correct input contigs before a quick view
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --quick_view --correct_nrounds 2
# Partition contigs into different haplotypes in quick view mode
$ /path/to/HapHiC/haphic pipeline allhaps.fa HiC.filtered.bam nchrs --quick_view --gfa "hap1.p_ctg.gfa,hap2.p_ctg.gfa"

Juicebox curation

There are two ways of generating .assembly and .hic files for visualization and manual curation in Juicebox. You can choose one of them according to your preference.

(1) SALSA-style scaffolds.agp

First, install the dependencies, including (1) 3D-DNA, (2) matlock, (3) Juicebox scripts. Then, generate the .assembly and .hic files by following these steps:

# (1) Generate .mnd file
$ /path/to/matlock bam2 juicer HiC.filtered.bam out.links.mnd
$ sort -k2,2 -k6,6 out.links.mnd > out.sorted.links.mnd

# (2) Generate .assembly file
$ /path/to/juicebox_scripts/agp2assembly.py scaffolds.agp scaffolds.assembly

# (3) Generate .hic file
$ bash /path/to/3d-dna/visualize/run-assembly-visualizer.sh -p false scaffolds.assembly out.sorted.links.mnd

Note: If there are any contigs corrected by HapHiC, you need to re-align Hi-C reads to corrected_asm.fa and re-filter them instead of using the original HiC.filtered.bam . Otherwise, there will not be any Hi-C signals on the corrected contigs in Juicebox. This is because that the IDs of corrected contigs in the SALSA-style scaffolds.agp do not match the contig IDs in the original BAM file.

You can recall these steps on the command line:

$ /path/to/HapHiC/haphic juicer

After manual curation in Juicebox, you will obtain the modified assembly file scaffolds.review.assembly . To generate the final FASTA file for the scaffolds:

# Generate the final FASTA file for the scaffolds
$ /path/to/juicebox_scripts/juicebox_assembly_converter.py -a scaffolds.review.assembly -f asm.fa -s

(2) YaHS-style scaffolds.raw.agp (recommended)

To avoid the necessity of re-aligning Hi-C data, we have incorporated a YaHS-style scaffolds.raw.agp since HapHiC version 1.0.1. In this AGP file, the broken contigs are not assigned new IDs. Instead, their starting and ending coordinates in the raw contigs are displayed in the seventh and eighth columns. By following the approach provided by YaHS, you can generate the .assembly and .hic files without the need for re-aligning.

After constructing the final scaffolds, HapHiC automatically generates a shell script for visualization and curation in Juicebox. Ensure that Java and samtools have been installed and added to $PATH on your system. Then, run the following command:

# bash, not sh
$ bash juicebox.sh

Notes:

After manual curation in Juicebox, you will obtain the modified assembly file out_JBAT.review.assembly . To generate the final FASTA file for the scaffolds:

# Generate the final FASTA file for the scaffolds
$ /path/to/HapHiC/utils/juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp asm.fa

Visualization

Since HapHiC version 1.0.2, we have introduced a haphic plot command to generate highly customizable Hi-C contact maps. This command requires two input files: a filtered BAM file HiC.filtered.bam and a scaffold AGP file containing contig IDs that match those in the BAM file:

# For HapHiC scaffolding result
$ /path/to/HapHiC/haphic plot scaffolds.raw.agp HiC.filtered.bam
# For the AGP file generated after manual curation in Juicebox
$ /path/to/HapHiC/haphic plot out_JBAT.FINAL.agp HiC.filtered.bam

The visualized Hi-C contact map is output as contact_map.pdf . This process may be somewhat slow if the BAM file is large, taking several minutes per 10 GiB of the BAM file. Upon completion, the program will produce a binary file named contact_matrix.pkl . This file can be utilized in place of HiC.filtered.bam for faster visualization (~1 minute).

# Use previously generated `contact_matrix.pkl` for faster visualization
$ /path/to/HapHiC/haphic plot out_JBAT.FINAL.agp contact_matrix.pkl

Note: The input AGP file and the parameters --bin_size and --min_len must remain consistent throughout.

By default, the bin size is set to 500 Kbp and only scaffolds exceeding 1 Mbp in length will be displayed on the contact map. To modify these parameters:

# Set the bin size to 1 Mbp and display only scaffolds longer than 5 Mbp
$ /path/to/HapHiC/haphic plot out_JBAT.FINAL.agp HiC.filtered.bam --bin_size 1000 --min_len 5

Additionally, you can create separate_plots.pdf , which illustrates the contact map for each scaffold individually:

$ /path/to/HapHiC/haphic plot out_JBAT.FINAL.agp HiC.filtered.bam --separate_plots

To change the colormap, origin, border style, and normalization method for the contact maps, refer to the examples provided in the figure above.

Order and orient whole scaffolds using a reference genome

HapHiC has introduced a separate command, haphic refsort, in version 1.0.4 to order and orient whole scaffolds according to a reference genome.

To begin, you should prepare a PAF file by align raw contigs (not scaffolds) to a reference genome using minimap2. The reference genome can be from the same species or a closely related one:

# The preset can be `asm5` if the reference genome is well-assembled from the same species
$ minimap2 -x asm20 ref.fa asm.fa --secondary=no -t 28 -o asm_to_ref.paf
# `haphic refsort` can also be compatible with other aligners, like wfmash
$ wfmash ref.fa asm.fa -m -n 1 -S 1 -t 28 | cut -f 1-6,8- > asm_to_ref.paf

By using the haphic refsort command, you can generate a new AGP file based on the PAF file:

# By default, scaffolds are output based on the alphabetical order of the chromosome IDs of the reference genome
$ haphic refsort 04.build/scaffolds.raw.agp asm_to_ref.paf > scaffolds.refsort.agp
# You can manually specify the order by listing them and separating with commas
$ haphic refsort 04.build/scaffolds.raw.agp asm_to_ref.paf --ref_order chr1,chr2,chr3,chr4,... > scaffolds.refsort.agp

The generated scaffolds.refsort.agp file can be directly used for Juicebox curation and for haphic plot visualization. Please note that this function will NOT alter your scaffolds, it only changes the way of presentation through overall ordering and orientation of the entire scaffolds.

Here is an example of the autotetraploid sugarcane Np-X assembly:

Frequently asked questions (FAQs)

Problems and bug reports

Citing HapHiC

If you have used HapHiC in your work, please cite our paper published in Nature Plants:

Xiaofei Zeng, Zili Yi, Xingtan Zhang, Yuhui Du, Yu Li, Zhiqing Zhou, Sijie Chen, Huijie Zhao, Sai Yang, Yibin Wang, Guoan Chen. Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes. Nature Plants. doi: https://doi.org/10.1038/s41477-024-01755-3

There is also a Research Briefing available in Nature Plants:

Xiaofei Zeng, Guoan Chen. (2024) Achieving de novo scaffolding of chromosome-level haplotypes using Hi-C data. Nature Plants. doi: https://doi.org/10.1038/s41477-024-01756-2

If you have used the optimization function for contig ordering and orientation, please cite ALLHiC as well:

Xingtan Zhang, Shengcheng Zhang, Qian Zhao, Ray Ming, Haibao Tang. (2019) Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nature Plants, 5:833-845. doi: https://doi.org/10.1038/s41477-019-0487-8

Reproducibility

To reproduce the results in our paper, please use the HapHiC code in commit 431b7b6.