yoshida-lab / XenonPy

XenonPy is a Python Software for Materials Informatics
http://xenonpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

RuntimeError: Min sample_order must not be smaller than max sample_order #253

Closed by krk-krk-krk 2 years ago

krk-krk-krk commented 2 years ago

Hello,

First of all, thank you very much for all your work. It's really interesting. While learning from this tutorial, I ran into an error that I could not solve by myself, so I would like to ask a question: https://github.com/yoshida-lab/XenonPy/blob/master/samples/iQSPR.ipynb

I got a strange error from NGram.fit in this tutorial. The code is below, and the CSV used for training in the code is attached as iQSPR_sample_data-100.csv. Could you suggest a workaround for this error?

Code

# basic uses
from tools import *
import os
import random
import typing

import numpy as np
import pandas as pd
import pickle as pk
from rdkit import Chem

import xenonpy
# N-gram library in XenonPy-iQSPR
from xenonpy.inverse.iqspr import NGram
"""
This Tutorial
https://github.com/yoshida-lab/XenonPy/blob/master/samples/iQSPR.ipynb
"""

# extract a subset from the full data set
# load in-house data
data = pd.read_csv("./iQSPR_sample_data-100.csv")

# Method 2: expand n-gram training set with randomly reordered SMILES
# (we show one of the many possible ways of doing it)
np.random.seed(201906) # fix the random seed
n_reorder = 15 # pick a fixed number of re-ordering

# convert the SMILES to canonical SMILES in RDKit (not necessary in general)
cans = []
for smi in data['SMILES']:
    # remove some molecules in the full SMILES list that may lead to an error
    try:
        cans.append(Chem.MolToSmiles(Chem.MolFromSmiles(smi)))
    except:
        print(smi)
        pass

mols = [Chem.MolFromSmiles(smi) for smi in cans]
smi_reorder_all = []
smi_reorder = []
for mol in mols:
    idx = list(range(mol.GetNumAtoms()))
    tmp = [Chem.MolToSmiles(mol,rootedAtAtom=x) for x in range(len(idx))]
    smi_reorder_all.append(np.array(list(set(tmp))))
    smi_reorder.append(np.random.choice(smi_reorder_all[-1], n_reorder,
                                        replace=(len(smi_reorder_all[-1]) < n_reorder)))

flat_list = [item for sublist in smi_reorder for item in sublist]
print(flat_list[:10])

# first train up to order 10
n_gram_reorder1 = NGram(sample_order=(1, 10), del_range=(1, 10), min_len=1, max_len=1000, reorder_prob=0.5)
n_gram_reorder1.fit(flat_list, train_order=(1, 10))
# save results
with open('ngram_pubchem_ikebata_reO15_01toO10.obj', 'wb') as f:
    pk.dump(n_gram_reorder1, f)

# Then, train up from order 11 to 20
n_gram_reorder2 = NGram(sample_order=(1, 10), del_range=(1, 10), min_len=1, max_len=1000, reorder_prob=0.5)
n_gram_reorder2.fit(flat_list, train_order=(11, 20))
# save results
with open('ngram_pubchem_ikebata_reO15_O11to20.obj', 'wb') as f:
    pk.dump(n_gram_reorder2, f)
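As a side note on the reordering step above: `np.random.choice(..., replace=(len(...) < n_reorder))` samples with replacement only when a molecule has fewer unique reordered SMILES than `n_reorder`. That selection rule can be sketched without RDKit or NumPy (this is an illustrative toy, `pick_reorderings` is not a XenonPy function):

```python
import random

def pick_reorderings(unique_variants, n_pick, rng=random.Random(201906)):
    """Pick n_pick variants; allow repeats only when there are fewer
    unique variants than requested -- mirrors the
    replace=(len(smi_reorder_all[-1]) < n_reorder) logic above."""
    if len(unique_variants) < n_pick:
        return rng.choices(unique_variants, k=n_pick)  # with replacement
    return rng.sample(unique_variants, n_pick)         # without replacement

few = pick_reorderings(["CCO", "OCC"], 3)              # only 2 variants, need 3
many = pick_reorderings(["CCO", "OCC", "C(O)C", "C(C)O"], 2)
print(len(few), len(many))  # 3 2
```

Either way every molecule contributes exactly `n_pick` training strings, which keeps the expanded n-gram training set balanced across molecules.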

AnacondaEnv

condalist_XenonPy-v052.txt

ErrorText

(XenonPy-v052) C:\Users\taishi_kuroki\Desktop\Script\Python\XenonPy\v052>python 04-iQSPR-err.py
['C(C(=O)[O-])C(CN+(C)C)OC(C)=O', 'C(=O)([O-])CC(CN+(C)C)OC(C)=O', 'CC(=O)OC(CC(=O)[O-])CN+(C)C', 'CN+(C)CC(CC(=O)[O-])OC(C)=O', 'CC(=O)OC(CC(=O)[O-])CN+(C)C', 'C(C)(=O)OC(CC(=O)[O-])CN+(C)C', 'C(CC(=O)[O-])(CN+(C)C)OC(C)=O', 'CN+(C)CC(CC(=O)[O-])OC(C)=O', 'CN+(C)CC(CC(=O)[O-])OC(C)=O', 'C(C)(=O)OC(CC(=O)[O-])CN+(C)C']
100%|███████████████████████████████████████████████████████████████████████████████████████| 1485/1485 [01:00<00:00, 24.39it/s]
C:\Users\taishi_kuroki\anaconda3\envs\XenonPy-v052\lib\site-packages\xenonpy\inverse\iqspr\modifier.py:175: RuntimeWarning: min : 1 is smaller than min : 11, min will be increased to min
  (self.sample_order[0], self._train_order[0]), RuntimeWarning)
Traceback (most recent call last):
  File "04-iQSPR-err.py", line 55, in <module>
    n_gram_reorder2.fit(flat_list, train_order=(11, 20))
  File "C:\Users\taishi_kuroki\anaconda3\envs\XenonPy-v052\lib\site-packages\xenonpy\utils\usefulcls.py", line 102, in fn
    rt = fn(self, *args, **kwargs)
  File "C:\Users\taishi_kuroki\anaconda3\envs\XenonPy-v052\lib\site-packages\xenonpy\inverse\iqspr\modifier.py", line 434, in fit
    self._fit_sample_order()
  File "C:\Users\taishi_kuroki\anaconda3\envs\XenonPy-v052\lib\site-packages\xenonpy\utils\usefulcls.py", line 102, in fn
    rt = fn(self, *args, **kwargs)
  File "C:\Users\taishi_kuroki\anaconda3\envs\XenonPy-v052\lib\site-packages\xenonpy\inverse\iqspr\modifier.py", line 176, in _fit_sample_order
    self.sample_order = (self._train_order[0], self.sample_order[1])
  File "C:\Users\taishi_kuroki\anaconda3\envs\XenonPy-v052\lib\site-packages\xenonpy\inverse\iqspr\modifier.py", line 112, in sample_order
    raise RuntimeError('Min sample_order must not be smaller than max sample_order')
RuntimeError: Min sample_order must not be smaller than max sample_order

stewu5 commented 2 years ago

Thank you for posting this issue. I have been updating the tutorial to be compatible with the new version of XenonPy (v.06). I will upload a new script and also renew the ngram files by the end of this week.

To answer why this error happens: it comes from the automatic check that raises the minimum sample_order up to the minimum train_order, and that check is not robust enough. A quick fix is to replace the line that creates n_gram_reorder2 with: n_gram_reorder2 = NGram(reorder_prob=0.5, sample_order=(11, 20)) [note that sample_order is now initialized to be consistent with train_order=(11, 20), which avoids the error]
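For intuition, here is a toy sketch of that consistency check (this is not XenonPy's actual code, just an illustration of the failure mode): with sample_order=(1, 10) and train_order=(11, 20), the minimum is bumped to 11, which then exceeds the maximum of 10.

```python
def adjust_sample_order(sample_order, train_order):
    """Toy version of the automatic check: raise the sampling minimum
    to the training minimum, then verify min <= max."""
    lo, hi = sample_order
    if lo < train_order[0]:
        lo = train_order[0]  # (1, 10) becomes (11, 10) when training (11, 20)
    if lo > hi:
        raise RuntimeError('Min sample_order must not be smaller than max sample_order')
    return (lo, hi)

print(adjust_sample_order((11, 20), (11, 20)))  # (11, 20): consistent, no error
# adjust_sample_order((1, 10), (11, 20)) would raise the RuntimeError above
```

Initializing sample_order=(11, 20) up front means the adjustment step never produces an inverted range.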

krk-krk-krk commented 2 years ago

@stewu5 Thank you! I think this workaround is something I wanted to know.

Are sample_order and train_order commonly used terms in n-gram modeling (natural language processing), or are they, like reorder_prob, specific to this NGram implementation?

stewu5 commented 2 years ago

These terms are not standard. For example, the reordering probability is specific to using NGram with SMILES, because SMILES strings are non-unique and we only modify the string from right to left. For sample_order and train_order, I use the word "order" to refer to the number of characters remembered in the NGram table, which I think is fairly standard (you can find the same description in the Wikipedia article on n-grams).
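The "order = number of characters remembered" idea can be shown with a minimal character-level n-gram table (a generic illustration, not XenonPy's NGram class):

```python
from collections import Counter, defaultdict

def char_ngram_table(strings, order):
    """For each length-`order` context, count which character follows it.
    `order` is the number of characters the model remembers."""
    table = defaultdict(Counter)
    for s in strings:
        for i in range(len(s) - order):
            table[s[i:i + order]][s[i + order]] += 1
    return table

table = char_ngram_table(["CCO", "CCN", "CCC"], order=2)
print(dict(table["CC"]))  # {'O': 1, 'N': 1, 'C': 1}
```

A higher order remembers a longer context, so the table grows quickly with order, which is also why very high-order tables get large (see the memory caveat below).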

krk-krk-krk commented 2 years ago

Thank you! I learned what I wanted to know. I am looking forward to the release of the next version of XenonPy.

stewu5 commented 2 years ago

@krk-krk-krk We have updated the tutorial and the n_gram files. Please go to the samples folder and get the latest iQSPR.ipynb file as the new version of the tutorial. Let me know if you still run into problems.

krk-krk-krk commented 2 years ago

Hi. I immediately tried iQSPR in the released XenonPy-v0.6.3 and confirmed that the error I posted no longer occurs. Thank you!

stewu5 commented 2 years ago

As a side note, there is an ongoing issue with the NGram object becoming unnecessarily large, due to the inefficient data format of the current NGram table implementation. Our code warns users about this memory issue when training an NGram of extremely high order, and we are working to resolve it. Any other comments or feedback on our iQSPR module would be very welcome.

krk-krk-krk commented 2 years ago

@stewu5

Any other comment or feedback on our iQSPR module would be highly welcomed.

In this part of the iQSPR tutorial, there is a step that adds measurement noise:

# include basic measurement noise
c_var = {'E': 650, 'HOMO-LUMO gap': 0.5}

May I ask two questions about this step?

  1. Is the purpose of adding noise to improve the generality of the model (in image-based deep learning, noise is added to improve both generality and interpretability)?
  2. Is there a guideline for choosing the noise value (for example, adding about 10% of the average property value of the data set as noise)?

stewu5 commented 2 years ago

@krk-krk-krk Sorry for the lack of explanation. The conceptual reason is that there is always noise in the data, on top of the model uncertainty. The technical reason is that, with a bootstrapping-based uncertainty estimate, the uncertainty can be extremely small for most candidates, leading to numerical issues or difficulty in resampling the relatively good candidates. In that sense, 10% of the average property value is a reasonable starting point, but it could be smaller if you have a more stable uncertainty estimator.
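The failure mode described above can be sketched in a few lines (a generic toy, not XenonPy's estimator): when all bootstrap resamples agree, the estimated variance collapses to zero, and adding a fixed measurement-noise variance keeps the total uncertainty usable.

```python
import random
import statistics

def bootstrap_std(preds, noise_var, n_boot=200, rng=random.Random(0)):
    """Std of bootstrap-resampled means, plus a fixed measurement-noise
    variance so the total uncertainty never collapses to ~0."""
    means = [statistics.mean(rng.choices(preds, k=len(preds)))
             for _ in range(n_boot)]
    return (statistics.pvariance(means) + noise_var) ** 0.5

# identical ensemble predictions -> bootstrap variance is exactly 0,
# but the noise term keeps the std strictly positive
print(bootstrap_std([3.2, 3.2, 3.2], noise_var=0.25))  # 0.5
```

Without the `noise_var` term, candidates with degenerate bootstrap estimates would get near-zero spread, which is exactly the numerical/resampling problem the answer describes.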