HyperSpec is a Python library for extremely fast spectra clustering. It uses brain-inspired hyperdimensional (HD) computing to project spectra into a binary hyperdimensional space, which improves both clustering quality and clustering speed. On the draft human proteome dataset with 25 million spectra, HyperSpec shortens the runtime from a few hours to under 15 minutes. The software is available as open source under the BSD license.
HyperSpec requires Python 3.8+ with a CUDA environment and a properly installed GPU. It has been tested on two NVIDIA GPUs on a Linux platform: the GTX 1080Ti and the RTX 3090. Other NVIDIA GPUs should work but have not been tested. We recommend using a high-performance SSD as the storage device for the best performance.
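The requirements above can be sanity-checked before installing. The following is a minimal sketch, not part of HyperSpec: the helper name `check_environment` is hypothetical, and looking for `nvidia-smi` (shipped with the NVIDIA driver) and `nvcc` (shipped with the CUDA toolkit) on the PATH is only a heuristic for "a CUDA environment is present".

```python
import shutil
import sys


def check_environment():
    """Heuristic pre-install check (hypothetical helper, not a HyperSpec API)."""
    return {
        # HyperSpec documents Python 3.8+ as the minimum version.
        "python_ok": sys.version_info >= (3, 8),
        # nvidia-smi is installed with the NVIDIA driver.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        # nvcc is installed with the CUDA toolkit; it may be absent
        # on machines that only use CUDA through Docker.
        "nvcc_found": shutil.which("nvcc") is not None,
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

A "no" for `nvcc_found` does not necessarily mean HyperSpec cannot run, for example when CUDA is provided inside the Docker image.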
Install via Docker
We recommend installing HyperSpec via Docker using the following commands:
docker build --no-cache -f ./docker/Dockerfile -t hyper_spec .
docker run --gpus all -v /ms-dataset/:/dataset/ -it hyper_spec /bin/bash
Install from Source
git clone https://github.com/wh-xu/Hyper-Spec.git
sh install.sh
usage: python src/main.py [-h] [--cpu_core_preprocess CPU_CORE_PREPROCESS] [--cpu_core_cluster CPU_CORE_CLUSTER]
[--batch_size BATCH_SIZE] [--use_gpu_cluster] [--min_peaks MIN_PEAKS]
[--mz_interval MZ_INTERVAL] [--min_mz_range MIN_MZ_RANGE] [--min_mz MIN_MZ] [--max_mz MAX_MZ]
[--remove_precursor_tol REMOVE_PRECURSOR_TOL] [--min_intensity MIN_INTENSITY]
[--max_peaks_used MAX_PEAKS_USED] [--scaling {off,root,log,rank}] [--hd_dim HD_DIM] [--hd_Q HD_Q] [--hd_id_flip_factor HD_ID_FLIP_FACTOR]
[--cluster_charges [CLUSTER_CHARGES ...]]
[--precursor_tol PRECURSOR_TOL PRECURSOR_TOL] [--rt_tol RT_TOL] [--fragment_tol FRAGMENT_TOL] [--eps EPS]
[--cluster_alg {dbscan,hc_single,hc_complete,hc_average}]
[--refine REFINE]
[--checkpoint CHECKPOINT] [--representative_mgf]
input_filepath output_filename
Positional arguments:
input_filepath The path containing the `MGF` files for raw spectra data
output_filename Output CSV file that stores the clustering results.
Optional arguments:
-h, --help Show the help messages
--cpu_core_preprocess The number of CPU cores used for preprocessing. (default: 6)
--cpu_core_cluster The number of CPU cores used for clustering.
Only takes effect when `use_gpu_cluster` is not set. (default: 6)
--batch_size The batch size for HD encoding on GPU. (default: 5000)
--use_gpu_cluster Flag that determines whether to use DBSCAN
on GPU. (default: True)
--hd_dim The HD dimension. (default: 2048)
--hd_Q The HD quantization level. (default: 16)
--cluster_charges The charges to be clustered. (default: 2 3)
--cluster_alg The clustering algorithm for spectra: DBSCAN (dbscan) or hierarchical clustering with single, complete, or average linkage (hc_single, hc_complete, hc_average). (default: hc_complete)
--eps The threshold value `eps` for DBSCAN clustering.
(default: 0.4)
--refine Flag to determine whether to refine the clustering results.
(default: True)
--representative_mgf Flag to determine whether to export the cluster representatives.
(default: False)
--checkpoint The checkpoint filename to save the encoded HVs of spectra (default: None)
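To give a feel for what `--hd_dim` and `--hd_Q` control, here is a small sketch of a generic ID-level hyperdimensional encoding. It is an illustration only and not HyperSpec's actual implementation: the function names, the assumption that `hd_Q` counts intensity levels, the bit-flipping scheme for level hypervectors, and the toy peak lists are all assumptions made for this example. Each peak is encoded by XOR-binding a random hypervector for its m/z bin with a hypervector for its quantized intensity level, and the peaks are bundled by a bitwise majority vote.

```python
import random


def make_level_hvs(dim, q, rng):
    """Return q hypervectors (Python ints) where adjacent levels differ in few bits."""
    order = list(range(dim))
    rng.shuffle(order)
    step = dim // (2 * (q - 1))  # bits flipped between adjacent levels
    hv = rng.getrandbits(dim)
    levels = [hv]
    for k in range(1, q):
        for pos in order[(k - 1) * step : k * step]:
            hv ^= 1 << pos
        levels.append(hv)
    return levels


def encode(peaks, id_hvs, level_hvs, dim):
    """Encode (mz_bin, intensity_level) pairs into one binary hypervector."""
    counts = [0] * dim
    for mz_bin, level in peaks:
        bound = id_hvs[mz_bin] ^ level_hvs[level]  # binding by XOR
        for i in range(dim):
            counts[i] += (bound >> i) & 1
    out = 0
    for i, c in enumerate(counts):
        if 2 * c > len(peaks):  # bundling by bitwise majority vote
            out |= 1 << i
    return out


def hamming(a, b, dim):
    """Normalized Hamming distance between two hypervectors."""
    return bin(a ^ b).count("1") / dim


DIM, Q = 2048, 16  # the documented defaults of --hd_dim and --hd_Q
rng = random.Random(0)
levels = make_level_hvs(DIM, Q, rng)
ids = {b: rng.getrandbits(DIM) for b in range(500)}  # 500 toy m/z bins

# Toy spectra: spec_b differs from spec_a in one intensity level,
# spec_c shares no peaks with spec_a.
spec_a = [(10, 15), (50, 8), (120, 3), (300, 12), (420, 5)]
spec_b = [(10, 15), (50, 8), (120, 4), (300, 12), (420, 5)]
spec_c = [(33, 2), (77, 9), (150, 14), (222, 6), (480, 1)]
hv_a, hv_b, hv_c = (encode(s, ids, levels, DIM) for s in (spec_a, spec_b, spec_c))

print("a vs b:", hamming(hv_a, hv_b, DIM))
print("a vs c:", hamming(hv_a, hv_c, DIM))
```

In this sketch, similar spectra end up at a small Hamming distance and unrelated spectra near 0.5, which is the property a distance threshold such as `--eps` can then exploit during clustering.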
HyperSpec runs from the command line. It takes MGF peak files as input and exports the clustering result as a CSV file, with each MS/MS spectrum and its cluster label on a single line. Here are two examples of running HyperSpec:
python src/main.py ~/dataset/ ./output.csv --cpu_core_preprocess=4 --cluster_alg dbscan --use_gpu_cluster --cluster_charges 2 3 --eps=0.2 --refine
This clusters all MS/MS spectra in the folder ~/dataset/ on GPU and generates the output.csv file. The number of CPU cores for preprocessing is 4. Only Charge 2 and Charge 3 are clustered in this configuration. The DBSCAN clustering threshold is eps=0.2 and post-clustering refinement is enabled.
python src/main.py ~/dataset/ ./output.csv --cpu_core_preprocess=4 --cluster_alg hc_complete --cluster_charges 2 3 --eps=0.25 --refine
This clusters all MS/MS spectra in the folder ~/dataset/ using hierarchical clustering with complete linkage on CPU and generates the output.csv file. The number of CPU cores for preprocessing is 4. Only Charge 2 and Charge 3 are clustered in this configuration. The hierarchical clustering threshold is eps=0.25 and post-clustering refinement is enabled.
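When HyperSpec is launched from a script, for example to sweep over several `eps` values, the command line can be assembled in Python. The sketch below uses only options documented in this README; the helper name `build_command` and the path `/data/mgf` are hypothetical. It leaves `--refine` at its documented default rather than passing it explicitly.

```python
import os
import shlex


def build_command(input_dir, output_csv, cluster_alg="hc_complete", eps=0.25,
                  charges=(2, 3), cpu_core_preprocess=4, use_gpu=False):
    """Assemble a HyperSpec command line (hypothetical helper, not a HyperSpec API)."""
    cmd = [
        "python", "src/main.py",
        # "~" is not expanded when a command runs without a shell, so expand it here.
        os.path.expanduser(input_dir),
        output_csv,
        "--cpu_core_preprocess", str(cpu_core_preprocess),
        "--cluster_alg", cluster_alg,
        "--cluster_charges", *[str(c) for c in charges],
        "--eps", str(eps),
    ]
    if use_gpu:
        cmd.append("--use_gpu_cluster")
    return cmd


# DBSCAN on GPU, mirroring the first example.
gpu_cmd = build_command("/data/mgf", "./output.csv",
                        cluster_alg="dbscan", eps=0.2, use_gpu=True)
print(shlex.join(gpu_cmd))

# Complete-linkage hierarchical clustering on CPU, mirroring the second example.
cpu_cmd = build_command("/data/mgf", "./output.csv")
print(shlex.join(cpu_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.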
The exported metadata for the clustering results is compressed and stored in a Parquet file, which records the bucket, precursor_charge, precursor_mz, identifier, scan, retention_time, cluster, and is_representative fields. The format is as follows:
bucket | precursor_charge | precursor_mz | identifier | scan | retention_time | cluster | is_representative |
---|---|---|---|---|---|---|---|
598 | 2 | 300.148804 | Adult_Gallbladder_bRP_Elite_53_f07 | 338 | 165.133194 | 664 | True |
5384 | 3 | 1796.564697 | Fetal_Ovary_bRP_Velos_41_f18 | 4875 | 2896.885986 | 4455302 | False |
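A result in this layout can be summarized with the Python standard library. The sketch below assumes the result is available as CSV text with a header row naming the columns in the table above. That is an assumption: the Parquet file would need a Parquet reader instead. The two sample rows are the ones shown in the table, and the function name `summarize` is hypothetical.

```python
import csv
import io
from collections import Counter

# The two example rows from the table, written as CSV (assumed layout).
SAMPLE = """bucket,precursor_charge,precursor_mz,identifier,scan,retention_time,cluster,is_representative
598,2,300.148804,Adult_Gallbladder_bRP_Elite_53_f07,338,165.133194,664,True
5384,3,1796.564697,Fetal_Ovary_bRP_Velos_41_f18,4875,2896.885986,4455302,False
"""


def summarize(handle):
    """Count spectra per cluster and collect each cluster's representative."""
    sizes = Counter()
    representatives = {}
    for row in csv.DictReader(handle):
        cluster = int(row["cluster"])
        sizes[cluster] += 1
        if row["is_representative"] == "True":
            representatives[cluster] = (row["identifier"], int(row["scan"]))
    return sizes, representatives


sizes, reps = summarize(io.StringIO(SAMPLE))
print("clusters:", len(sizes))
print("largest clusters:", sizes.most_common(5))
print("representatives:", reps)
```

For a real run, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("output.csv", newline="")`.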
For more information, post an issue or send an email to wexu@ucsd.edu.