wh-xu / Hyper-Gen

HyGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors
MIT License

License bsd-3-clause

HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

HyperGen is a Rust library for sketching genome files and quickly approximating Average Nucleotide Identity (ANI). It combines two algorithms: (1) FracMinHash and (2) hyperdimensional computing (HDC) with random indexing, as shown in the following figure:

HyperGen first samples the k-mer set using FracMinHash. The sampled k-mer hashes are then encoded into hyperdimensional vectors (HVs) using HDC encoding, which gives a better tradeoff among ANI estimation quality, sketch size, and computation speed. Sketches generated by HyperGen are 1.8× to 2.7× smaller than those of Mash and Dashing 2. ANI estimation reduces to highly vectorized vector multiplication, and HyperGen's database search on large-scale datasets is up to 4.3× faster than Dashing 2.
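The encoding and comparison steps described above can be illustrated with a small, self-contained example. It is a minimal illustration under stated assumptions, not HyperGen's actual code: it assumes each sampled hash seeds a pseudo-random ±1 vector, that the sketch is the element-wise sum of those vectors, and that set sizes and intersections are estimated from dot products divided by the dimension. The names `mix64`, `encode_hv`, `dot`, and `estimate_jaccard` are hypothetical.

```rust
// Illustrative only: the ±1 encoding and all names here are assumptions,
// not HyperGen's actual code.

/// SplitMix64-style mixer, used here as a stand-in pseudo-random generator.
fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Encodes a set of k-mer hashes into one `dim`-dimensional hypervector:
/// each hash seeds a pseudo-random ±1 vector, and the vectors are summed.
fn encode_hv(hashes: &[u64], dim: usize) -> Vec<i32> {
    let mut hv = vec![0i32; dim];
    for &h in hashes {
        let (mut state, mut bits) = (h, 0u64);
        for (i, slot) in hv.iter_mut().enumerate() {
            if i % 64 == 0 {
                state = mix64(state); // refill 64 fresh pseudo-random bits
                bits = state;
            }
            *slot += if bits & 1 == 1 { 1 } else { -1 };
            bits >>= 1;
        }
    }
    hv
}

/// Dot product of two hypervectors; this loop is easy to vectorize.
fn dot(a: &[i32], b: &[i32]) -> i64 {
    a.iter().zip(b).map(|(&x, &y)| i64::from(x) * i64::from(y)).sum()
}

/// Estimates the Jaccard index of the two underlying hash sets.
/// dot(a, a) / dim approximates |A|, and dot(a, b) / dim approximates |A ∩ B|.
fn estimate_jaccard(a: &[i32], b: &[i32]) -> f64 {
    assert!(a.len() == b.len() && !a.is_empty());
    let d = a.len() as f64;
    let (size_a, size_b) = (dot(a, a) as f64 / d, dot(b, b) as f64 / d);
    let inter = dot(a, b) as f64 / d;
    let union = size_a + size_b - inter;
    if union <= 0.0 { 0.0 } else { (inter / union).clamp(0.0, 1.0) }
}

fn main() {
    // Toy hash sets standing in for FracMinHash output: 60 hashes each, 30 shared.
    let a: Vec<u64> = (0..60u64).map(mix64).collect();
    let b: Vec<u64> = (30..90u64).map(mix64).collect();
    let (hv_a, hv_b) = (encode_hv(&a, 4096), encode_hv(&b, 4096));
    println!(
        "estimated Jaccard: {:.3} (true value is 30/90)",
        estimate_jaccard(&hv_a, &hv_b)
    );
}
```

Because the sketch has a fixed dimension, its size does not grow with the number of sampled k-mers, and comparing two sketches is a single dot product. The estimate is approximate; its noise shrinks as the dimension grows.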

Quickstart

Installation

Basic Installation

HyperGen requires the Rust toolchain, including Cargo. We recommend installing HyperGen with the following commands:

git clone https://github.com/wh-xu/Hyper-Gen.git
cd Hyper-Gen

# Without GPU acceleration for sketching
cargo install --path .

Install with GPU Support

HyperGen supports GPU acceleration. GPU mode requires an NVIDIA GPU driver. Use nvidia-smi to check that the driver is installed, or nvcc -V to check the installed CUDA compiler version. Then run the command below that matches your GPU architecture to install with GPU support:

# With GPU acceleration for sketching
# RTX 4090 series
cargo install --features cuda-sketch-ada-lovelace --path .
# A100 series
cargo install --features cuda-sketch-ampere --path .
# V100 series
cargo install --features cuda-sketch-volta --path .
# H100 series
cargo install --features cuda-sketch-hopper --path .

Currently, only NVIDIA GPUs are supported. We tested compatibility on a desktop RTX 4090 and a laptop RTX 4060 with CUDA version 12.x.

Usage

The current version supports the following functions:

1. Genome sketching for .fa/.fna/.fasta files

Example:
hyper-gen sketch -p ./data -o ./fna.sketch

Options:
-p, --path <PATH>               Input folder path to sketch
-o, --out <OUT>                 Output path
-t, --thread <THREAD>           Threads used for computation [default: 16]
-C, --canonical <CANONICAL>     Whether to use canonical k-mers [default: true]
-k, --ksize <KSIZE>             k-mer size for sketching [default: 21]
-s, --scaled <SCALED>           Scale factor for FracMinHash [default: 1500]
-d, --hv_d <HD_D>               Dimension of the hypervector [default: 4096]
-D, --device <DEVICE>           Device to run on [default: cpu] [possible values: cpu, gpu]
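To make the roles of -k, -s, and -C more concrete, here is a minimal illustration of FracMinHash sampling. It assumes the common FracMinHash rule of keeping a k-mer hash only when it is at most the maximum hash value divided by the scale factor, so with a scale factor of 1500 roughly one in 1500 distinct k-mers is kept on average. The hash function (a SplitMix64-style mixer), the 2-bit base encoding, and the names `mix64`, `base_code`, `kmer_code`, and `frac_minhash` are hypothetical and are not taken from HyperGen's source.

```rust
// Illustrative only: the hash function and names here are assumptions,
// not HyperGen's internals.

/// SplitMix64-style mixer used as a stand-in k-mer hash function.
fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// 2-bit code for a base (A=0, C=1, G=2, T=3); None for anything else.
fn base_code(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// 2-bit encodes a k-mer; with `canonical`, returns the smaller of the
/// forward and reverse-complement encodings so both strands agree.
fn kmer_code(kmer: &[u8], canonical: bool) -> Option<u64> {
    let (mut fwd, mut rev) = (0u64, 0u64);
    for (i, &b) in kmer.iter().enumerate() {
        let v = base_code(b)?;
        fwd = (fwd << 2) | v;
        rev |= (3 - v) << (2 * i); // complement, placed in reversed order
    }
    Some(if canonical { fwd.min(rev) } else { fwd })
}

/// FracMinHash: keep a k-mer hash only if it is at most u64::MAX / scaled.
fn frac_minhash(seq: &[u8], k: usize, scaled: u64, canonical: bool) -> Vec<u64> {
    assert!((1..=32).contains(&k) && scaled >= 1);
    let threshold = u64::MAX / scaled;
    let mut kept: Vec<u64> = seq
        .windows(k)
        .filter_map(|w| kmer_code(w, canonical)) // skips k-mers with non-ACGT bases
        .map(mix64)
        .filter(|&h| h <= threshold)
        .collect();
    kept.sort_unstable();
    kept.dedup();
    kept
}

fn main() {
    let seq = b"ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC";
    let all = frac_minhash(seq, 5, 1, true);
    let sampled = frac_minhash(seq, 5, 4, true);
    println!(
        "{} distinct canonical 5-mers, {} kept at scaled=4",
        all.len(),
        sampled.len()
    );
}
```

A larger scale factor lowers the threshold, which yields a smaller sample and a cheaper sketch at the cost of a noisier estimate. Canonical mode makes a sequence and its reverse complement produce the same sample.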

2. ANI estimation and database search

Example:
hyper-gen dist -r fna1.sketch -q fna2.sketch -o output.ani

Options:
-r, --path_r <PATH_R>           Path to reference sketch file
-q, --path_q <PATH_Q>           Path to query sketch file
-o, --out <OUT>                 Output path
-t, --thread <THREAD>           Threads used for computation [default: 16]
-a, --ani_th <ANI_TH>           ANI threshold [default: 85.0]
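For intuition about how a similarity estimate becomes an ANI value and how a threshold such as -a might be applied, the following is a minimal illustration. It assumes the Mash-style conversion from a Jaccard index J to ANI, ANI = 1 + ln(2J / (1 + J)) / k, and assumes that pairs below the threshold are dropped from the output; HyperGen's exact estimator and filtering behavior may differ. The names `jaccard`, `ani_percent`, and `filter_hits`, and the genome names and set sizes in `main`, are hypothetical.

```rust
// Illustrative only: assumes the Mash-style Jaccard-to-ANI conversion.

/// Jaccard index from two set sizes and their intersection size.
fn jaccard(size_a: f64, size_b: f64, inter: f64) -> f64 {
    let union = size_a + size_b - inter;
    if union <= 0.0 { 0.0 } else { (inter / union).clamp(0.0, 1.0) }
}

/// ANI in percent from Jaccard index `j` and k-mer size `k`:
/// ANI = 1 + ln(2j / (1 + j)) / k, clamped to [0, 100].
fn ani_percent(j: f64, k: u32) -> f64 {
    if j <= 0.0 {
        return 0.0;
    }
    let ani = 1.0 + (2.0 * j / (1.0 + j)).ln() / f64::from(k);
    (ani * 100.0).clamp(0.0, 100.0)
}

/// Keeps only (name, ANI) pairs at or above the threshold (hypothetical filter).
fn filter_hits<'a>(hits: &[(&'a str, f64)], ani_th: f64) -> Vec<(&'a str, f64)> {
    hits.iter().copied().filter(|&(_, ani)| ani >= ani_th).collect()
}

fn main() {
    let k = 21;
    // Made-up set sizes and intersections for two reference genomes.
    let hits = [
        ("genome_a", ani_percent(jaccard(1000.0, 1000.0, 800.0), k)),
        ("genome_b", ani_percent(jaccard(1000.0, 1000.0, 20.0), k)),
    ];
    for (name, ani) in filter_hits(&hits, 85.0) {
        println!("{name}\t{ani:.2}");
    }
}
```

Because the conversion is logarithmic in J, even a small Jaccard index maps to a fairly high ANI, which is why a threshold is useful for discarding weakly related pairs.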

3. Faster sketching on GPU

HyperGen supports offloading the k-mer hashing and sampling steps to the GPU to speed up sketching. Use the following command to run on a GPU device:

hyper-gen sketch -D gpu -p ./data -o ./fna.sketch

Differences between Mash and HyperGen

Publication

  1. Weihong Xu, Po-kai Hsu, Niema Moshiri, Shimeng Yu, and Tajana Rosing. "HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors." Bioinformatics, 2024.

Contact

For more information, post an issue or send an email to wexu@ucsd.edu.