
Plant foundation DNA large language models (LLMs) based on different model architectures
https://zhangtaolab.org/Plant_DNA_LLMs/

PDLLMs: A model-box based on Plant DNA Large Language Models (LLMs)

English | 简体中文

0. Demo for plant DNA LLMs prediction

[Demo animation]

Online prediction for other models and prediction tasks can be found here.

1. Environment

Anaconda package manager is recommended for building the training environment. For pre-training and fine-tuning models, please ensure that you have an Nvidia GPU and the corresponding drivers installed. For inference, devices without an Nvidia GPU (CPU only, AMD GPU, Apple Silicon, etc.) are also acceptable.

1.1 Download and install Anaconda package manager

1.2 Create environment (we trained the models with Python 3.11)

conda create -n llms python=3.11
conda activate llms

1.3 Install dependencies

If you want to pre-train or fine-tune models, make sure you are using Nvidia GPU(s).
Install the Nvidia driver and a corresponding version of the CUDA driver (> 11.0; we used CUDA 12.1).

PyTorch (>=2.0) built against the matching CUDA version must also be installed.
We recommend using pip to install the required Python packages. Please take care to install matching CUDA and PyTorch versions; the CUDA version used in this test environment is 12.1. Please refer to the official website for a detailed PyTorch installation tutorial.

pip install 'torch<2.4' --index-url https://download.pytorch.org/whl/cu121

If you just want to use models for inference (prediction), you can install the PyTorch GPU version (above), or the PyTorch CPU version if your machine has no Nvidia GPU.

pip install 'torch<2.4' --index-url https://download.pytorch.org/whl/cpu
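
To verify the installation, you can run a quick check in the llms environment (a minimal sketch; torch.cuda.is_available() is expected to return False with the CPU-only build):

# check the installed PyTorch version and whether CUDA is usable
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True only with the GPU build and working Nvidia drivers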

Next install other required dependencies.

git clone --recursive https://github.com/zhangtaolab/Plant_DNA_LLMs
cd Plant_DNA_LLMs
pip install -r requirements.txt

(Optional) If you want to train a Mamba model, you need to install several extra dependencies; you will also need an Nvidia GPU.

pip install 'causal-conv1d<=1.3'
pip install 'mamba-ssm<2'
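
As a quick sanity check that the extra dependencies were built correctly, the following imports should succeed (a minimal sketch; both packages require an Nvidia GPU environment to build and run):

# verify that the Mamba-specific dependencies import correctly
import causal_conv1d
from mamba_ssm import Mamba

print("mamba dependencies OK")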

1.4 Install git-lfs

git-lfs is required for downloading large models and datasets; for installation instructions, please refer to git-lfs install.

Once git-lfs is installed, running the following command

$ git lfs version

should print a message like this:

git-lfs/3.3.0 (GitHub; linux amd64; go 1.19.8)

2. Fine-tune

To fine-tune the plant DNA LLMs, please first download the desired models from HuggingFace or ModelScope to a local directory. You can use git clone (which may require git-lfs to be installed) to retrieve a model, or download it directly from the website.

In the activated llms python environment, use the model_finetune.py script to fine-tune a model for a downstream task.

Our script accepts data in .csv format (separated by ,) as input. When preparing the training data, please make sure the file contains a header and at least these two columns:

sequence,label

Where sequence is the input sequence, and label is the corresponding label for the sequence.
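
For example, a minimal training file for a binary classification task might look like this (the sequences and labels below are hypothetical placeholders; check the provided datasets for the exact label encoding):

sequence,label
ATCGTACGATCGGCTAGCTAGCATCGATCG,1
TTGCAAGCTAGCTAGCATCGAACGTAGCTA,0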

We also provide several plant genomic datasets for fine-tuning on HuggingFace and ModelScope.

We use the Plant DNAGPT model as an example to fine-tune a model for active core promoter prediction.

First, download a pretrained model and the corresponding dataset from HuggingFace or ModelScope:

# prepare a work directory
mkdir LLM_finetune
cd LLM_finetune
# download the pretrained model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE
# download the training dataset
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters

After preparing the model and dataset, use the following command to fine-tune the model (here, a promoter prediction example):

python model_finetune.py \
    --model_name_or_path plant-dnagpt-BPE \
    --train_data plant-multi-species-core-promoters/train.csv \
    --test_data plant-multi-species-core-promoters/test.csv \
    --eval_data plant-multi-species-core-promoters/dev.csv \
    --train_task classification \
    --labels 'Not promoter;Core promoter' \
    --run_name plant_dnagpt_BPE_promoter \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 5 \
    --load_best_model_at_end \
    --metric_for_best_model 'f1' \
    --save_strategy epoch \
    --logging_strategy epoch \
    --evaluation_strategy epoch \
    --output_dir plant-dnagpt-BPE-promoter

In this script:

  1. --model_name_or_path: Path to the foundation model you downloaded
  2. --train_data: Path to the training dataset
  3. --test_data: Path to the test dataset; omit it if no test data is available
  4. --eval_data: Path to the validation dataset; omit it if no validation data is available
  5. --train_task: The task type; should be classification, multi-classification or regression
  6. --labels: The labels for a classification task, separated by ;
  7. --run_name: Name of the fine-tuned model
  8. --per_device_train_batch_size: Batch size for training the model
  9. --per_device_eval_batch_size: Batch size for evaluating the model
  10. --learning_rate: Learning rate for training the model
  11. --num_train_epochs: Number of epochs for training the model (you can also train by steps, in which case you should change the save, logging and evaluation strategies accordingly)
  12. --load_best_model_at_end: Whether to load the model with the best performance on the evaluation data at the end of training; default is True
  13. --metric_for_best_model: Metric used to determine the best model; default is loss, and can be accuracy, precision, recall, f1 or matthews_correlation for classification tasks, and r2 or spearmanr for regression tasks
  14. --save_strategy: Strategy for saving the model; can be epoch or steps
  15. --logging_strategy: Strategy for logging training information; can be epoch or steps
  16. --evaluation_strategy: Strategy for evaluating the model; can be epoch or steps
  17. --output_dir: Where to save the fine-tuned model

Detailed descriptions of the arguments can be found here.

Finally, wait for the progress bar to complete; the fine-tuned model will be saved in the plant-dnagpt-BPE-promoter directory. This directory will contain a checkpoint directory, a runs directory, and the saved fine-tuned model.
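
As a quick sanity check, the saved model can be loaded like a standard Hugging Face checkpoint (a minimal sketch, assuming the fine-tuned model is a standard transformers sequence-classification checkpoint; the model_inference.py script described in the next section is the supported way to run predictions):

# minimal sketch: load the fine-tuned checkpoint with the transformers library
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "plant-dnagpt-BPE-promoter"  # the --output_dir from the fine-tuning step
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
print(model.config.num_labels)  # 2 for the binary promoter task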

3. Inference

To use a fine-tuned model for inference, please first download the desired models from HuggingFace or ModelScope to a local directory, or provide a model trained by yourself.

We use the Plant DNAGPT model as an example to predict active core promoters in plants.

First, download a fine-tuned model and the corresponding dataset from HuggingFace or ModelScope:

# prepare a work directory
mkdir LLM_inference
cd LLM_inference
# download the fine-tuned model
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE-promoter
# download the dataset
git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters

We provide a script named model_inference.py for model inference.
Here is an example that uses the script to predict active core promoters:

# (method 1) Inference with local model, directly input a sequence
python model_inference.py -m ./plant-dnagpt-BPE-promoter -s 'TTACTAAATTTATAACGATTTTTTATCTAACTTTAGCTCATCAATCTTTACCGTGTCAAAATTTAGTGCCAAGAAGCAGACATGGCCCGATGATCTTTTACCCTGTTTTCATAGCTCGCGAGCCGCGACCTGTGTCCAACCTCAACGGTCACTGCAGTCCCAGCACCTCAGCAGCCTGCGCCTGCCATACCCCCTCCCCCACCCACCCACACACACCATCCGGGCCCACGGTGGGACCCAGATGTCATGCGCTGTACGGGCGAGCAACTAGCCCCCACCTCTTCCCAAGAGGCAAAACCT'

# (method 2) Inference with local model, provide a file contains multiple sequences to predict
python model_inference.py -m ./plant-dnagpt-BPE-promoter -f ./plant-multi-species-core-promoters/test.csv -o promoter_predict_results.txt

# (method 3) Inference with an online model (automatically downloads a model trained by us from HuggingFace or ModelScope)
python model_inference.py -m zhangtaolab/plant-dnagpt-BPE-promoter -ms huggingface -s 'GGGAAAAAGTGAACTCCATTGTTTTTTCACGCTAAGCAGACCACAATTGCTGCTTGGTACGAAAAGAAAACCGAACCCTTTCACCCACGCACAACTCCATCTCCATTAGCATGGACAGAACACCGTAGATTGAACGCGGGAGGCAACAGGCTAAATCGTCCGTTCAGCCAAAACGGAATCATGGGCTGTTTTTCCAGAAGGCTCCGTGTCGTGTGGTTGTGGTCCAAAAACGAAAAAGAAAGAAAAAAGAAAACCCTTCCCAAGACGTGAAGAAAAGCAATGCGATGCTGATGCACGTTA'

In this script:

  1. -m: Path to the fine-tuned model used for inference
  2. -s: Input DNA sequence; only the nucleotides A, C, G, T and N are acceptable
  3. -f: Input file containing multiple sequences, one per line. If you want to keep more information, a file with a , or \t separator is acceptable, but it must include a header with a sequence column (see the sample file after this list)
  4. -ms: Download the model from huggingface or modelscope if the model is not local. The format of the model name is zhangtaolab/model-name; users can copy the model name here.
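
For reference, a multi-sequence input file might look like this (hypothetical sequences; the extra id column is only for illustration, and a plain file with one sequence per line and no header also works):

sequence,id
ATCGTACGATCGGCTAGCTAGCATCGATCG,seq_1
TTGCAAGCTAGCTAGCATCGAACGTAGCTA,seq_2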

The output contains the original sequence and the input sequence length. If the task type is classification, the predicted label and the probability of each label are provided; if the task type is regression, a predicted score is provided.
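
For a classification task, the result for each sequence therefore looks roughly like this (a purely hypothetical illustration; the exact column names and layout may differ):

sequence    length    label    probability
ATCGTACGATCGGCTAGCTAGCATCGATCG    30    Not promoter    0.83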

4. Docker implementation for model inference

Environment deployment for LLMs can be an arduous job. To simplify this process, we also provide a Docker version of our model inference code.

The images for the Docker version are here, and the usage of the Docker implementation is shown below.

Inference using GPU

For GPU inference (with an Nvidia GPU), please pull the image with the gpu tag, and make sure your computer has the Nvidia Container Toolkit installed.

First, download a fine-tuned model from HuggingFace or ModelScope; here we use the Plant DNAMamba model as an example to predict active core promoters.

# prepare a work directory
mkdir LLM_inference
cd LLM_inference
git clone https://huggingface.co/zhangtaolab/plant-dnamamba-BPE-promoter

Then download the corresponding dataset. If you have your own data, you can also prepare a custom dataset based on the previously mentioned inference data format.

git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters

Once the model and dataset are ready, pull our model inference image from Docker Hub and test whether it works.

docker pull zhangtaolab/plant_llms_inference:gpu
docker run --runtime=nvidia --gpus=all -v ./:/home/llms zhangtaolab/plant_llms_inference:gpu -h
usage: inference.py [-h] [-v] -m MODEL [-f FILE] [-s SEQUENCE] [-t THRESHOLD]
                    [-l MAX_LENGTH] [-bs BATCH_SIZE] [-p SAMPLE] [-seed SEED]
                    [-d {cpu,gpu,mps,auto}] [-o OUTFILE] [-n]

Script for Plant DNA Large Language Models (LLMs) inference

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -m MODEL              Model path (should contain both model and tokenizer)
  -f FILE               File contains sequences that need to be classified
  -s SEQUENCE           One sequence that need to be classified
  -t THRESHOLD          Threshold for defining as True class (Default: 0.5)
  -l MAX_LENGTH         Max length of tokenized sequence (Default: 512)
  -bs BATCH_SIZE        Batch size for classification (Default: 1)
  -p SAMPLE             Subsampling for testing (Default: 1e7)
  -seed SEED            Random seed for subsampling (Default: None)
  -d {cpu,gpu,mps,auto}
                        Choose CPU or GPU to do inference (require specific
                        drivers) (Default: auto)
  -o OUTFILE            Prediction results (Default: stdout)
  -n                    Whether or not save the runtime locally (Default:
                        False)

Example:
  docker run --runtime=nvidia --gpus=all -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -f seqfile.csv -o output.txt
  docker run --runtime=nvidia --gpus=all -v /local:/container zhangtaolab/plant_llms_inference:gpu -m model_path -s 'ATCGGATCTCGACAGT' -o output.txt

If the help message above is displayed, the image has been downloaded and the inference script can run normally. Below, inference is performed using the previously prepared model and dataset.

docker run --runtime=nvidia --gpus=all -v ./:/home/llms zhangtaolab/plant_llms_inference:gpu -m /home/llms/plant-dnamamba-BPE-promoter -f /home/llms/plant-multi-species-core-promoters/test.csv -o /home/llms/predict_results.txt

After the inference progress bar completes, check the output file predict_results.txt in the current local directory; it stores the prediction result for each sequence in the input file.

Inference using CPU

For CPU inference, please pull the image with the cpu tag; this image supports machines without an Nvidia GPU, such as CPU-only machines or Apple M-series silicon. (Note that inference with the DNAMamba model is not supported in CPU mode.)

First, download a fine-tuned model from HuggingFace or ModelScope; here we use the Plant DNAGPT model as an example to predict active core promoters.

# prepare a work directory
mkdir LLM_inference
cd LLM_inference
git clone https://huggingface.co/zhangtaolab/plant-dnagpt-BPE-promoter

Then download the corresponding dataset. If you have your own data, you can also prepare a custom dataset based on the previously mentioned inference data format.

git clone https://huggingface.co/datasets/zhangtaolab/plant-multi-species-core-promoters

Once the model and dataset are ready, pull our model inference image from Docker Hub and test whether it works.

docker pull zhangtaolab/plant_llms_inference:cpu
docker run -v ./:/home/llms zhangtaolab/plant_llms_inference:cpu -h
usage: inference.py [-h] [-v] -m MODEL [-f FILE] [-s SEQUENCE] [-t THRESHOLD]
                    [-l MAX_LENGTH] [-bs BATCH_SIZE] [-p SAMPLE] [-seed SEED]
                    [-d {cpu,gpu,mps,auto}] [-o OUTFILE] [-n]

Script for Plant DNA Large Language Models (LLMs) inference

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -m MODEL              Model path (should contain both model and tokenizer)
  -f FILE               File contains sequences that need to be classified
  -s SEQUENCE           One sequence that need to be classified
  -t THRESHOLD          Threshold for defining as True class (Default: 0.5)
  -l MAX_LENGTH         Max length of tokenized sequence (Default: 512)
  -bs BATCH_SIZE        Batch size for classification (Default: 1)
  -p SAMPLE             Subsampling for testing (Default: 1e7)
  -seed SEED            Random seed for subsampling (Default: None)
  -d {cpu,gpu,mps,auto}
                        Choose CPU or GPU to do inference (require specific
                        drivers) (Default: auto)
  -o OUTFILE            Prediction results (Default: stdout)
  -n                    Whether or not save the runtime locally (Default:
                        False)

Example:
  docker run -v /local:/container zhangtaolab/plant_llms_inference:cpu -m model_path -f seqfile.csv -o output.txt
  docker run -v /local:/container zhangtaolab/plant_llms_inference:cpu -m model_path -s 'ATCGGATCTCGACAGT' -o output.txt

If the help message above is displayed, the image has been downloaded and the inference script can run normally. Below, inference is performed using the previously prepared model and dataset.

docker run -v ./:/home/llms zhangtaolab/plant_llms_inference:cpu -m /home/llms/plant-dnagpt-BPE-promoter -f /home/llms/plant-multi-species-core-promoters/test.csv -o /home/llms/predict_results.txt

After the inference progress bar completes, check the output file predict_results.txt in the current local directory; it stores the prediction result for each sequence in the input file.
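
The results file can then be loaded for downstream analysis, for example with pandas (a minimal sketch, assuming the output is a delimited text file with a header; adjust sep if your file uses a different separator):

# minimal sketch: load the prediction results for downstream analysis
import pandas as pd

df = pd.read_csv("predict_results.txt", sep="\t")  # assumption: tab-separated with a header
print(df.head())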

Online prediction platform

To make it easier for users to run DNA analysis prediction tasks with our models, we also provide online prediction platforms.

Please refer to the online prediction platform.