Paper • Model • Space
2024-2 We've released ChatCell, a new paradigm that leverages natural language to make single-cell analysis more accessible and intuitive. Please visit our homepage and GitHub page for more information.
2024-1 Our paper Domain-Agnostic Molecular Generation with Chemical Feedback has been accepted by ICLR 2024.
2024-1 Our paper Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models has been accepted by ICLR 2024.
2023-10 We open-source MolGen-7b, which now supports de novo molecule generation!
2023-6 We open-source KnowLM, a knowledgeable LLM framework with pre-training and instruction fine-tuning code (supports multi-machine multi-GPU setup).
2023-6 We release Mol-Instructions, a large-scale biomolecule instruction dataset for large language models.
2023-5 We propose Knowledge graph-enhanced molecular contrAstive learning with fuNctional prOmpt (KANO), published in Nature Machine Intelligence, exploiting fundamental domain knowledge in both pre-training and fine-tuning.
2023-4 We provide an NLP-for-science paper list at https://github.com/zjunlp/NLP4Science_Papers.
2023-3 We release our pre-trained and fine-tuned models on Hugging Face: MolGen-large and MolGen-large-opt.
2023-2 We provide a demo on Hugging Face Space.

To run the code, you can configure dependencies by restoring our environment:
conda env create -f MolGen/environment.yml -n $Your_env_name$
and then:
conda activate $Your_env_name$
You can download the pre-trained and fine-tuned models via Hugging Face: MolGen-large and MolGen-large-opt.
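As a quick check that a downloaded checkpoint works, here is a minimal loading sketch using the Hugging Face transformers library (it assumes the zjunlp/MolGen-large repository id from the model page; swap in MolGen-large-opt or a local path as needed):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# MolGen is a sequence-to-sequence (BART-style) model that operates on SELFIES tokens
tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-large")
model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen-large")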
Moreover, the dataset used for downstream tasks can be found here.
The expected structure of files is:
moldata
├── checkpoint
│   ├── molgen.pkl            # pre-trained model
│   ├── syn_qed_model.pkl     # fine-tuned model for QED optimization on synthetic data
│   ├── syn_plogp_model.pkl   # fine-tuned model for p-logP optimization on synthetic data
│   ├── np_qed_model.pkl      # fine-tuned model for QED optimization on natural product data
│   └── np_plogp_model.pkl    # fine-tuned model for p-logP optimization on natural product data
├── finetune
│   ├── np_test.csv           # natural product test data
│   ├── np_train.csv          # natural product train data
│   ├── plogp_test.csv        # synthetic test data for p-logP optimization
│   ├── qed_test.csv          # synthetic test data for QED optimization
│   └── zinc250k.csv          # synthetic train data
├── generate                  # generated molecules
├── output                    # molecule candidates
└── vocab_list
    └── zinc.npy              # SELFIES alphabet
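A minimal sketch of inspecting these files in Python (paths as in the tree above; no column names are assumed, the sketch simply prints them):

import numpy as np
import pandas as pd

# SELFIES alphabet used to build the vocabulary
alphabet = np.load("moldata/vocab_list/zinc.npy", allow_pickle=True)

# Synthetic (ZINC250k) training split and natural-product test split
zinc = pd.read_csv("moldata/finetune/zinc250k.csv")
np_test = pd.read_csv("moldata/finetune/np_test.csv")

print(len(alphabet), "SELFIES tokens in the alphabet")
print("zinc250k columns:", list(zinc.columns))
print("np_test columns:", list(np_test.columns))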
First, preprocess the fine-tuning dataset; the processed candidates are saved in the output folder:

cd MolGen
bash preprocess.sh

Then fine-tune the model; the fine-tuned checkpoints are saved in the checkpoint folder:

bash finetune.sh
To generate molecules, run the script below and set checkpoint_path to select either the pre-trained model or a fine-tuned model.
cd MolGen
bash generate.sh
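Alternatively, molecules can be generated directly from the Hugging Face checkpoint in Python. Below is a minimal sketch, with the tokenizer and model loaded as in the earlier snippet; the SELFIES prompt and generation hyperparameters are illustrative, not the settings used in our experiments:

# Example SELFIES prompt (benzene); MolGen reads and writes SELFIES strings
sf_input = tokenizer("[C][=C][C][=C][C][=C][Ring1][=Branch1]", return_tensors="pt")

# Beam search, returning several candidate molecules
outputs = model.generate(
    input_ids=sf_input["input_ids"],
    attention_mask=sf_input["attention_mask"],
    max_length=20,
    min_length=5,
    num_beams=5,
    num_return_sequences=5,
)
candidates = [tokenizer.decode(ids, skip_special_tokens=True).replace(" ", "") for ids in outputs]
print(candidates)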
We conduct experiments on well-known benchmarks to confirm MolGen's optimization capabilities, encompassing penalized logP, QED, and molecular docking properties. For detailed experimental settings and analysis, please refer to our paper.
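For reference, the QED and logP parts of these objectives can be computed with RDKit as in the sketch below. Penalized logP additionally subtracts a synthetic-accessibility score and a long-ring penalty, which we omit here because the SA scorer ships as an RDKit contrib module rather than in the core API; the example molecule is arbitrary.

from rdkit import Chem
from rdkit.Chem import QED, Crippen

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an arbitrary example

qed_score = QED.qed(mol)      # drug-likeness, in [0, 1]
logp = Crippen.MolLogP(mol)   # Crippen estimate of the octanol-water partition coefficient

print(f"QED = {qed_score:.3f}, logP = {logp:.3f}")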
If you use or extend our work, please cite the paper as follows:
@inproceedings{fang2023domain,
author = {Yin Fang and
Ningyu Zhang and
Zhuo Chen and
Xiaohui Fan and
Huajun Chen},
  title = {Domain-Agnostic Molecular Generation with Chemical Feedback},
booktitle = {{ICLR}},
publisher = {OpenReview.net},
year = {2024},
url = {https://openreview.net/pdf?id=9rPyHyjfwP}
}