zjunlp / MolGen

[ICLR 2024] Domain-Agnostic Molecular Generation with Chemical Feedback
https://huggingface.co/spaces/zjunlp/MolGen
MIT License

About the generation process #11

Closed tszslovewanpu closed 7 months ago

tszslovewanpu commented 7 months ago

Hello, and great job!

  1. When generating the 10K molecules in Table 1, Table 2, or Table 3, should we input some molecules? Are they taken from ZINC250K or MOSES?
  2. MolGen generates better molecules when given inputs, so the generation process is actually an optimization process. Am I right?

Thank you very much!

ZJU-Fangyin commented 7 months ago

Hi,

  1. Yes, all the experiments in the paper require input molecules, since the base model is BART. (A sketch of this input-conditioned usage follows below.)
  2. Your understanding is completely correct; this is a work on molecular optimization.
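
As a concrete illustration of input-conditioned generation with the BART-based model, here is a minimal sketch. It assumes the checkpoint is published as zjunlp/MolGen-large on the Hugging Face Hub, and the sampling settings are assumptions rather than the paper's configuration:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-large")
model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen-large")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Input molecule as a SELFIES string (benzene here), e.g. drawn from ZINC250K or MOSES.
sf_input = tokenizer("[C][=C][C][=C][C][=C][Ring1][=Branch1]", return_tensors="pt").to(device)

# Sampling settings are illustrative, not the paper's exact configuration.
candidates = model.generate(input_ids=sf_input["input_ids"],
                            attention_mask=sf_input["attention_mask"],
                            do_sample=True,
                            max_length=64,
                            num_return_sequences=4)
print([tokenizer.decode(g, skip_special_tokens=True,
                        clean_up_tokenization_spaces=True).replace(" ", "")
       for g in candidates])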
tszslovewanpu commented 7 months ago

Thank you!

  3. And how does MolGen 7B generate molecules? Is there any prompt given to the trained model to start the generation process?
  4. Is MolGen 7B designed for the 'generation from scratch' task (generate molecules, estimate the whole distribution, and compare it with the training set), or can it also handle the optimization task?

Thanks again very much!

ZJU-Fangyin commented 7 months ago
  3. MolGen 7B is capable of generating molecules from scratch. You can input a bos_token, or input an incomplete structure for the model to complete.

De novo molecule generation example:

from transformers import AutoTokenizer, LlamaForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-7b")
model = LlamaForCausalLM.from_pretrained("zjunlp/MolGen-7b")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Start generation from the BOS token alone.
sf_input = tokenizer(tokenizer.bos_token, return_tensors="pt").to(device)

# Sampling settings here are illustrative, not prescribed values.
molecules = model.generate(input_ids=sf_input["input_ids"],
                           do_sample=True,
                           max_new_tokens=30,
                           num_return_sequences=4)
sf_output = [tokenizer.decode(g, skip_special_tokens=True,
                              clean_up_tokenization_spaces=True).replace(" ", "")
             for g in molecules]

Molecular completion example:

from transformers import AutoTokenizer, LlamaForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-7b")
model = LlamaForCausalLM.from_pretrained("zjunlp/MolGen-7b")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Provide an incomplete SELFIES fragment for the model to extend.
sf_input = tokenizer("[C][N][O]", return_tensors="pt").to(device)

# Sampling settings here are illustrative, not prescribed values.
molecules = model.generate(input_ids=sf_input["input_ids"],
                           do_sample=True,
                           max_new_tokens=30,
                           num_return_sequences=4)
sf_output = [tokenizer.decode(g, skip_special_tokens=True,
                              clean_up_tokenization_spaces=True).replace(" ", "")
             for g in molecules]
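
Both examples yield SELFIES strings in sf_output. If you want SMILES instead, a quick conversion, assuming the selfies package (pip install selfies) is available:

import selfies

# Decode each generated SELFIES string into a SMILES string.
smiles = [selfies.decoder(s) for s in sf_output]
print(smiles)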
  4. MolGen 7B is primarily designed for de novo molecule generation and for completing molecular structures. However, with appropriate modifications to the model's generate call, it can also accept molecular embeddings as input and be used for optimization tasks (see the sketch below).
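
For the embedding route, here is a minimal hedged sketch. It assumes a recent transformers version in which generate accepts inputs_embeds for decoder-only models; the embedding handling is illustrative, not MolGen's actual optimization code:

from transformers import AutoTokenizer, LlamaForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-7b")
model = LlamaForCausalLM.from_pretrained("zjunlp/MolGen-7b")

sf_input = tokenizer("[C][N][O]", return_tensors="pt")
# Map token ids to continuous embeddings; an optimizer could perturb
# these vectors directly rather than editing discrete tokens.
embeds = model.get_input_embeddings()(sf_input["input_ids"])

# With inputs_embeds, generate returns only the newly generated token ids.
molecules = model.generate(inputs_embeds=embeds,
                           attention_mask=sf_input["attention_mask"],
                           do_sample=True,
                           max_new_tokens=30)
print(tokenizer.decode(molecules[0], skip_special_tokens=True).replace(" ", ""))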
tszslovewanpu commented 7 months ago

Got it!