PDLLMs: A model-box based on Plant DNA Large Language Models (LLMs)
English | 简体中文
Online prediction of other models and prediction tasks can be found here.
The Anaconda package manager is recommended for building the training environment. For pre-training and fine-tuning models, please ensure that you have an Nvidia GPU and the corresponding drivers installed. For inference, devices without an Nvidia GPU (CPU only, AMD GPU, Apple Silicon, etc.) are also acceptable.
conda create -n llms python=3.11
conda activate llms
If you want to pre-train or fine-tune models, make sure you are using Nvidia GPU(s).
Install the Nvidia driver and the corresponding version of the CUDA toolkit (> 11.0; we used CUDA 12.1).
PyTorch (>=2.0) built against the matching CUDA version should also be installed.
We recommend using `pip` to install the required Python packages. Please take care to install matching CUDA and Torch versions; the CUDA version used in this test environment is 12.1. Please refer to the Official Website for the detailed PyTorch installation tutorial.
pip install 'torch<2.4' --index-url https://download.pytorch.org/whl/cu121
If you just want to use the models for inference (prediction), you can install the PyTorch GPU version (above), or the PyTorch CPU version if your machine has no Nvidia GPU.
pip install 'torch<2.4' --index-url https://download.pytorch.org/whl/cpu
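To confirm that PyTorch was installed with the intended backend, a quick check in the activated environment can help (a minimal sketch):

```python
# Sanity-check the PyTorch installation and backend.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # False is expected on CPU-only installs
```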
Next, install the other required dependencies.
git clone --recursive https://github.com/zhangtaolab/Plant_DNA_LLMs
cd Plant_DNA_LLMs
pip install -r requirements.txt
(Optional) If you want to train a Mamba model, you need to install several extra dependencies; you will also need an Nvidia GPU.
pip install 'causal-conv1d<=1.3'
pip install 'mamba-ssm<2'
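To verify that the Mamba kernels were built correctly, a quick forward pass on the GPU can help (a minimal sketch; the layer size is an arbitrary toy value):

```python
# Check that the Mamba CUDA kernels import and run (requires an Nvidia GPU).
import torch
from mamba_ssm import Mamba

layer = Mamba(d_model=64).to("cuda")        # arbitrary toy dimension
x = torch.randn(1, 16, 64, device="cuda")   # (batch, length, d_model)
print(layer(x).shape)                       # expect torch.Size([1, 16, 64])
```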
`git-lfs` is required for downloading large models and datasets; for installation instructions, please refer to git-lfs install.
If `git-lfs` is installed, running the following command
$ git lfs version
should print a message like this:
git-lfs/3.3.0 (GitHub; linux amd64; go 1.19.8)
To fine-tune the plant DNA LLMs, please first download the desired models from HuggingFace or ModelScope to a local directory. You can use `git clone` (which may require `git-lfs` to be installed) to retrieve a model, or download it directly from the website.
In the activated `llms` Python environment, use the `model_finetune.py` script to fine-tune a model for a downstream task.
The script accepts `.csv` format data (separated by `,`) as input. When preparing the training data, please make sure the file contains a header and at least these two columns:
sequence,label
where `sequence` is the input sequence and `label` is the corresponding label for the sequence.
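For illustration, a valid training file can be written as follows (a minimal sketch; the sequences and labels are made-up placeholders):

```python
# Write a tiny training CSV in the expected format.
import csv

rows = [
    ("ATGCGTACGTTAGCCTAGGA", 1),  # hypothetical positive example
    ("TTTTAAAACCGGAATTCCGG", 0),  # hypothetical negative example
]
with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sequence", "label"])  # required header
    writer.writerows(rows)
```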
We also provide several plant genomic datasets for fine-tuning on HuggingFace and ModelScope.
We use the Plant DNAGPT model as an example to fine-tune a model for active core promoter prediction.
First, download the pre-trained model and the corresponding dataset from HuggingFace or ModelScope:
# prepare a work directory
mkdir LLM_finetune
cd LLM_finetune
# download the pre-trained model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE
# download train dataset
git clone https://huggingface.co/zhangtaolab/plant-multi-species-core-promoters
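Before fine-tuning, it is worth checking that the downloaded dataset matches the expected format (a minimal sketch, assuming pandas is installed):

```python
# Inspect the downloaded training split.
import pandas as pd

df = pd.read_csv("plant-multi-species-core-promoters/train.csv")
print(df.columns.tolist())         # expect at least ['sequence', 'label']
print(df["label"].value_counts())  # class balance of the training data
```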
After preparing the model and dataset, use the following command to fine-tune the model (here, a core promoter prediction example):
python model_finetune.py \
--model_name_or_path plant-dnagpt-BPE \
--train_data plant-multi-species-core-promoters/train.csv \
--test_data plant-multi-species-core-promoters/test.csv \
--eval_data plant-multi-species-core-promoters/dev.csv \
--train_task classification \
--labels 'Not promoter;Core promoter' \
--run_name plant_dnagpt_BPE_promoter \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-5 \
--num_train_epochs 5 \
--load_best_model_at_end \
--metric_for_best_model 'f1' \
--save_strategy epoch \
--logging_strategy epoch \
--evaluation_strategy epoch \
--output_dir plant-dnagpt-BPE-promoter
In this script:
- `--model_name_or_path`: Path to the foundation model you downloaded
- `--train_data`: Path to the training dataset
- `--test_data`: Path to the test dataset; omit it if no test data is available
- `--eval_data`: Path to the validation dataset; omit it if no validation data is available
- `--train_task`: Task type; should be `classification`, `multi-classification` or `regression`
- `--labels`: Labels for the classification task, separated by `;`
- `--run_name`: Name of the fine-tuned model
- `--per_device_train_batch_size`: Batch size for training the model
- `--per_device_eval_batch_size`: Batch size for evaluating the model
- `--learning_rate`: Learning rate for training the model
- `--num_train_epochs`: Number of epochs for training the model (you can also train by steps; in that case, change the save, logging and evaluation strategies accordingly)
- `--load_best_model_at_end`: Whether to load the model with the best performance on the evaluation data at the end of training; default is `True`
- `--metric_for_best_model`: Metric used to determine the best model; default is `loss`, can be `accuracy`, `precision`, `recall`, `f1` or `matthews_correlation` for classification tasks, and `r2` or `spearmanr` for regression tasks
- `--save_strategy`: Strategy for saving the model; can be `epoch` or `steps`
- `--logging_strategy`: Strategy for logging training information; can be `epoch` or `steps`
- `--evaluation_strategy`: Strategy for evaluating the model; can be `epoch` or `steps`
- `--output_dir`: Where to save the fine-tuned model

Detailed descriptions of the arguments can be found here.
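Most of these flags map directly onto Hugging Face `TrainingArguments`. The sketch below illustrates that mapping with the values from the example above; it is an illustration only, not the script's actual internals:

```python
# How the command-line flags above correspond to Hugging Face TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="plant-dnagpt-BPE-promoter",
    run_name="plant_dnagpt_BPE_promoter",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=5,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_strategy="epoch",
    logging_strategy="epoch",
    evaluation_strategy="epoch",  # newer transformers versions rename this to eval_strategy
)
```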
Finally, wait for the progress bar to complete; the fine-tuned model will be saved in the `plant-dnagpt-BPE-promoter` directory. In this directory, there will be a checkpoint directory, a `runs` directory, and the saved fine-tuned model.
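To sanity-check the saved model, it can be loaded back with the standard transformers API (a minimal sketch, assuming the checkpoint follows the usual Hugging Face layout):

```python
# Reload the fine-tuned checkpoint to verify it saved correctly.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "plant-dnagpt-BPE-promoter"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
print(model.config.id2label)  # expect the two promoter labels
```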
To use a fine-tuned model for inference, please first download the desired model from HuggingFace or ModelScope to a local directory, or provide a model trained by yourself.
We use the fine-tuned Plant DNAGPT model as an example to predict active core promoters in plants.
First, download the fine-tuned model and the corresponding dataset from HuggingFace or ModelScope:
# prepare a work directory
mkdir LLM_inference
cd LLM_inference
# download fine-tuned model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE-promoter
# download the dataset
git clone https://huggingface.co/zhangtaolab/plant-multi-species-core-promoters
We provide a script named `model_inference.py` for model inference.
Here is an example that uses the script to predict active core promoters:
# (method 1) Inference with a local model, directly inputting a sequence
python model_inference.py -m ./plant-dnagpt-BPE-promoter -s 'TTACTAAATTTATAACGATTTTTTATCTAACTTTAGCTCATCAATCTTTACCGTGTCAAAATTTAGTGCCAAGAAGCAGACATGGCCCGATGATCTTTTACCCTGTTTTCATAGCTCGCGAGCCGCGACCTGTGTCCAACCTCAACGGTCACTGCAGTCCCAGCACCTCAGCAGCCTGCGCCTGCCATACCCCCTCCCCCACCCACCCACACACACCATCCGGGCCCACGGTGGGACCCAGATGTCATGCGCTGTACGGGCGAGCAACTAGCCCCCACCTCTTCCCAAGAGGCAAAACCT'
# (method 2) Inference with a local model, providing a file containing multiple sequences to predict
python model_inference.py -m ./plant-dnagpt-BPE-promoter -f ./plant-multi-species-core-promoters/test.csv -o promoter_predict_results.txt
# (method 3) Inference with an online model (auto-download the model trained by us from HuggingFace or ModelScope)
python model_inference.py -m zhangtaolab/plant-dnagpt-BPE-promoter -ms huggingface -s 'GGGAAAAAGTGAACTCCATTGTTTTTTCACGCTAAGCAGACCACAATTGCTGCTTGGTACGAAAAGAAAACCGAACCCTTTCACCCACGCACAACTCCATCTCCATTAGCATGGACAGAACACCGTAGATTGAACGCGGGAGGCAACAGGCTAAATCGTCCGTTCAGCCAAAACGGAATCATGGGCTGTTTTTCCAGAAGGCTCCGTGTCGTGTGGTTGTGGTCCAAAAACGAAAAAGAAAGAAAAAAGAAAACCCTTCCCAAGACGTGAAGAAAAGCAATGCGATGCTGATGCACGTTA'
In this script:
- `-m`: Path to the fine-tuned model used for inference
- `-s`: Input DNA sequence; only the nucleotides A, C, G, T, N are acceptable
- `-f`: Input file containing multiple sequences, one per line. If you want to keep more information, a file with a `,` or `\t` separator is acceptable, but a header containing a `sequence` column must be specified.
- `-ms`: Download the model from `huggingface` or `modelscope` if the model is not local. The model name format is `zhangtaolab/model-name`; users can copy the model name here.
The output results contain the original sequence and the input sequence length. If the task type is classification, the predicted label and the probability of each label are provided; if the task type is regression, a predicted score is provided.
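Conceptually, the classification path boils down to tokenize, forward pass, softmax. The sketch below illustrates this with the standard transformers API; it is an illustration under that assumption, not the exact code of `model_inference.py`:

```python
# Sketch of classification inference: tokenize, forward pass, softmax.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "plant-dnagpt-BPE-promoter"  # a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

seq = "ATCGGATCTCGACAGT"  # any A/C/G/T/N sequence
inputs = tokenizer(seq, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
print(model.config.id2label[int(probs.argmax())], probs.tolist())
```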
Environment deployment for LLMs can be an arduous job. To simplify this process, we also provide a Docker version of our model inference code.
The Docker images are available here, and the usage of the Docker implementation is shown below.
For GPU inference (with an Nvidia GPU), please pull the image with the `gpu` tag, and make sure your computer has the NVIDIA Container Toolkit installed.
First, download a fine-tuned model from HuggingFace or ModelScope; here we use the Plant DNAMamba model as an example to predict active core promoters.
# prepare a work directory
mkdir LLM_inference
cd LLM_inference
git clone https://huggingface.co/zhangtaolab/plant-dnamamba-BPE-promoter
Then download the corresponding dataset. If users have their own data, they can also prepare a custom dataset based on the previously mentioned inference data format.
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters
Once the model and dataset are ready, pull our model inference image from Docker Hub and test whether it works.
docker pull zhangtaolab/plant_llms_inference:gpu
docker run --runtime=nvidia --gpus=all -v ./:/home/llms zhangtaolab/plant_llms_inference:gpu -h
usage: inference.py [-h] [-v] -m MODEL [-f FILE] [-s SEQUENCE] [-t THRESHOLD]
[-l MAX_LENGTH] [-bs BATCH_SIZE] [-p SAMPLE] [-seed SEED]
[-d {cpu,gpu,mps,auto}] [-o OUTFILE] [-n]
Script for Plant DNA Large Language Models (LLMs) inference
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-m MODEL Model path (should contain both model and tokenizer)
-f FILE File contains sequences that need to be classified
-s SEQUENCE One sequence that need to be classified
-t THRESHOLD Threshold for defining as True class (Default: 0.5)
-l MAX_LENGTH Max length of tokenized sequence (Default: 512)
-bs BATCH_SIZE Batch size for classification (Default: 1)
-p SAMPLE Subsampling for testing (Default: 1e7)
-seed SEED Random seed for subsampling (Default: None)
-d {cpu,gpu,mps,auto}
Choose CPU or GPU to do inference (require specific
drivers) (Default: auto)
-o OUTFILE Prediction results (Default: stdout)
-n Whether or not save the runtime locally (Default:
False)
Example:
docker run --runtime=nvidia --gpus=all -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -f seqfile.csv -o output.txt
docker run --runtime=nvidia --gpus=all -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -s 'ATCGGATCTCGACAGT' -o output.txt
If the help message above is displayed, the image has been downloaded successfully and the inference script can run normally. Below, inference is performed using the previously prepared model and dataset.
docker run --runtime=nvidia --gpus=all -v ./:/home/llms zhangtaolab/plant_llms_inference:gpu -m /home/llms/plant-dnamamba-BPE-promoter -f /home/llms/plant-multi-species-core-promoters/test.csv -o /home/llms/predict_results.txt
After the inference progress bar completes, the output file `predict_results.txt` will appear in the current local directory; it stores the prediction results for each sequence in the input file.
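For downstream analysis, the results file can be loaded programmatically (a sketch; the tab separator is an assumption, adjust it to match the actual output file):

```python
# Load the prediction results for downstream analysis (requires pandas).
import pandas as pd

preds = pd.read_csv("predict_results.txt", sep="\t")  # separator is an assumption
print(preds.head())
```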
For CPU inference, please pull the image with the `cpu` tag. This image supports computers without an Nvidia GPU, such as CPU-only machines or Apple M-series silicon. (Note that inference with the DNAMamba model is not supported in CPU mode.)
First, download a fine-tuned model from HuggingFace or ModelScope; here we use the Plant DNAGPT model as an example to predict active core promoters.
# prepare a work directory
mkdir LLM_inference
cd LLM_inference
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE-promoter
Then download the corresponding dataset. If users have their own data, they can also prepare a custom dataset based on the previously mentioned inference data format.
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters
Once the model and dataset are ready, pull our model inference image from Docker Hub and test whether it works.
docker pull zhangtaolab/plant_llms_inference:cpu
docker run -v ./:/home/llms zhangtaolab/plant_llms_inference:cpu -h
usage: inference.py [-h] [-v] -m MODEL [-f FILE] [-s SEQUENCE] [-t THRESHOLD]
[-l MAX_LENGTH] [-bs BATCH_SIZE] [-p SAMPLE] [-seed SEED]
[-d {cpu,gpu,mps,auto}] [-o OUTFILE] [-n]
Script for Plant DNA Large Language Models (LLMs) inference
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-m MODEL Model path (should contain both model and tokenizer)
-f FILE File contains sequences that need to be classified
-s SEQUENCE One sequence that need to be classified
-t THRESHOLD Threshold for defining as True class (Default: 0.5)
-l MAX_LENGTH Max length of tokenized sequence (Default: 512)
-bs BATCH_SIZE Batch size for classification (Default: 1)
-p SAMPLE Subsampling for testing (Default: 1e7)
-seed SEED Random seed for subsampling (Default: None)
-d {cpu,gpu,mps,auto}
Choose CPU or GPU to do inference (require specific
drivers) (Default: auto)
-o OUTFILE Prediction results (Default: stdout)
-n Whether or not save the runtime locally (Default:
False)
Example:
docker run -v /local:/container zhangtaolab/plant_llms_inference:cpu -m model_path -f seqfile.csv -o output.txt
docker run -v /local:/container zhangtaolab/plant_llms_inference:cpu -m model_path -s 'ATCGGATCTCGACAGT' -o output.txt
If the help message above is displayed, the image has been downloaded successfully and the inference script can run normally. Below, inference is performed using the previously prepared model and dataset.
docker run -v ./:/home/llms zhangtaolab/plant_llms_inference:cpu -m /home/llms/plant-dnagpt-BPE-promoter -f /home/llms/plant-multi-species-core-promoters/test.csv -o /home/llms/predict_results.txt
After the inference progress bar completes, the output file `predict_results.txt` will appear in the current local directory; it stores the prediction results for each sequence in the input file.
To make it easier for users to run predictions for DNA analysis tasks with our models, we also provide online prediction platforms.
Please refer to the online prediction platform.