zadorlab / KinBot

Automated reaction pathway search for gas-phase molecules

NaN Values in Sella ApproximateHessian and ASE SQLite Unique ID Conflict in Kinbot #80

Open IliasChair opened 2 months ago

IliasChair commented 2 months ago

Hello all,

I'm encountering an error when trying to run Kinbot with my own NN PES, as mentioned in issue #79. The error seems to occur in the database-handling code of the ASE library.

Since the problem appears to be linked to ASE, I'm unsure whether it would be more appropriate to report this on the ASE forums, but I believe it's worth raising here as well.

The traceback looks as follows:

Traceback (most recent call last):
  File "home/BA/kinbot_dir/KinBot/kinbot/kb/micromamba/envs/fair-chem/bin/kinbot", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "home/BA/kinbot_dir/KinBot/kinbot/kb.py", line 100, in main
    err, well0.geom = qc.get_qc_geom(str(well0.chemid) + '_well',
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/BA/kinbot_dir/KinBot/kinbot/qc.py", line 947, in get_qc_geom
    check = self.check_qc(job)
            ^^^^^^^^^^^^^^^^^^
  File "home/BA/kinbot_dir/KinBot/kinbot/qc.py", line 1232, in check_qc
    if self.is_in_database(job):
       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/BA/kinbot_dir/KinBot/kinbot/qc.py", line 1182, in is_in_database
    for row in rows:
  File "home/micromamba/envs/fair-chem/lib/python3.11/site-packages/ase/parallel.py", line 275, in new_generator
    for result in generator(*args, **kwargs):
  File "home/micromamba/envs/fair-chem/lib/python3.11/site-packages/ase/db/core.py", line 486, in select
    for row in self._select(keys, cmps, explain=explain,
  File "home/micromamba/envs/fair-chem/lib/python3.11/site-packages/ase/db/sqlite.py", line 698, in _select
    yield self._convert_tuple_to_row(tuple(values))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/fair-chem/lib/python3.11/site-packages/ase/db/sqlite.py", line 488, in _convert_tuple_to_row
    dct['forces'] = deblob(values[19], shape=(-1, 3))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/fair-chem/lib/python3.11/site-packages/ase/db/sqlite.py", line 165, in deblob
    array = np.frombuffer(buf, dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: buffer size must be a multiple of element size

The error is triggered in the deblob() function of ASE's sqlite.py, when np.frombuffer() decodes the binary buffer stored in the SQLite database into an array, using float64 by default. The issue appears to be a data-type mismatch: ASE decodes the stored buffer as float64, while the actual data is in float32 format. For now, I have worked around this by editing the deblob() function in sqlite.py. Here’s the change I made:

try:
    array = np.frombuffer(buf, dtype)
    if not np.little_endian:
        array = array.byteswap()
except ValueError:
    # Fallback: the blob was written in single precision, so decode it
    # as float32 instead of the default float64.
    array = np.frombuffer(buf, dtype=np.float32)
    if not np.little_endian:
        array = array.byteswap()

This approach allows the code to proceed normally in most cases, using dtype=float64. If the error is triggered, the code attempts to decode the buffer as float32, which resolves the issue for my NN PES.
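
For illustration, here is a standalone snippet (not taken from KinBot or ASE) that reproduces the same ValueError with a float32 buffer:

import numpy as np

# Forces stored as float32 take 4 bytes per value, so a blob of, e.g.,
# 9 values is 36 bytes, which is not a multiple of float64's 8 bytes.
forces = np.zeros((3, 3), dtype=np.float32)
buf = forces.tobytes()                      # 36 bytes

try:
    np.frombuffer(buf, dtype=np.float64)    # raises ValueError
except ValueError as err:
    print(err)                              # buffer size must be a multiple of element size

recovered = np.frombuffer(buf, dtype=np.float32).reshape(-1, 3)
print(recovered.shape)                      # (3, 3)

Note that when the float32 blob happens to be a multiple of 8 bytes (an even number of values), np.frombuffer raises no error and silently returns wrong values, so the try/except fallback only catches some of the mismatches.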

I am unsure whether this error originates from the Kinbot code, which might assume that calculators always return values as float64, or from the ASE SQLite backend, which might assume that the database only contains float64 values.
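
If it is the former, a cleaner place to apply the cast than inside ASE might be the calculator wrapper itself, converting results to float64 before they are ever written to the database. A rough sketch of that idea (the class name and structure below are my own and not part of KinBot's nn_pes.py):

import numpy as np
from ase.calculators.calculator import Calculator, all_changes

class Float64Wrapper(Calculator):
    """Hypothetical wrapper: delegates to an inner ASE calculator and
    casts its results to float64 so that blobs written to the ASE SQLite
    database decode with the default dtype."""

    implemented_properties = ['energy', 'forces']

    def __init__(self, inner, **kwargs):
        super().__init__(**kwargs)
        self.inner = inner

    def calculate(self, atoms=None, properties=('energy',),
                  system_changes=all_changes):
        super().calculate(atoms, properties, system_changes)
        self.results['energy'] = float(self.inner.get_potential_energy(atoms))
        self.results['forces'] = np.asarray(self.inner.get_forces(atoms),
                                            dtype=np.float64)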

The above temporary solution fixed this issue for me for now, but I think it should still be investigated where the error originally stems from.

Thank you for the quick responses and helpful attitude.

Best Ilias

cmartia commented 2 months ago

This is a weird problem. It doesn't seem to be KinBot's fault, but to be 100 % sure maybe you can send me the files necessary to replicate the problem. Something I just realized is that the numpy folks released version 2.0 less than a month ago, and this breaks a lot of things, which might include ASE. Do you know which version of numpy you are using? If it's >=2.0 you might want to try with 1.x. For reference, I have 1.25 on my machine.

IliasChair commented 2 months ago

Hi, I am using ASE 3.23.0 and Numpy 1.26.4. So this is probably not the source of the error.

I've modified nn_pes.py so that the Nn_surr class wraps the OCPCalculator class (an ASE calculator). See attached files. However, reproducing the error would require installing the fair-chem/OCP package and running a pretrained model.

Since this issue seems pretty niche, the work needed to dig into it might not be worth it. My temporary fix has solved the problem, so I think we can consider this issue closed. If anyone else runs into it (and ASE has not fixed it yet), they can use my fix or re-open this issue.

Best Ilias

nn_pes.py.txt installed_packages.txt

IliasChair commented 3 weeks ago

Hello, I'm continuing to work on integrating Kinbot with my NN PES and have run into two new problems that I can't resolve on my own. The first involves an issue with the Sella optimizer, specifically with subtracting an ApproximateHessian object containing NaNs. The second problem concerns the ASE database interface to SQLite, similar to what I encountered earlier.

Problem 1: Sella Optimizer and ApproximateHessian NaN Values

When running a well exploration using Kinbot for some species, I encounter the following error:

Traceback (most recent call last):
  File "home/BA/kinbot_dir/rundir_system/output/130130000000000000002_well.py", line 81, in <module>
    converged = opt.run(fmax=fmax, steps=steps)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/ase/optimize/optimize.py", line 269, in run
    return Dynamics.run(self)
           ^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/ase/optimize/optimize.py", line 156, in run
    for converged in Dynamics.irun(self):
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/ase/optimize/optimize.py", line 135, in irun
    self.step()
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/optimize/optimize.py", line 240, in step
    s, smag = self._predict_step()
              ^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/optimize/optimize.py", line 229, in _predict_step
    s, smag = self.rs(
              ^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/optimize/restricted_step.py", line 161, in __init__
    BaseRestrictedStep.__init__(self, pes, *args, **kwargs)
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/optimize/restricted_step.py", line 54, in __init__
    self.pes.get_HL().project(self.P.T),
    ^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/peswrapper.py", line 133, in get_HL
    return self.get_H() - self.get_Hc()
           ~~~~~~~~~~~~~^~~~~~~~~~~~~~~
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/scipy/sparse/linalg/_interface.py", line 522, in __sub__
    return self.__add__(-x)
           ^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/linalg.py", line 236, in __add__
    return ApproximateHessian(
           ^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/linalg.py", line 155, in __init__
    self.set_B(B0)
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/sella/linalg.py", line 170, in set_B
    self.evals, self.evecs = eigh(self.B)
                             ^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/scipy/linalg/_decomp.py", line 458, in eigh
    a1 = _asarray_validated(a, check_finite=check_finite)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/scipy/_lib/_util.py", line 321, in _asarray_validated
    a = toarray(a)
        ^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/numpy/lib/function_base.py", line 627, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

After some investigation, I found that in the Sella peswrapper module, the PES.get_Hc() method returns a 6x6 array of NaNs for the Hessian of the constraints. This method calls self.cons.hessian().ldot(self.curr['L']). (For reference, calling self.cons.hessian().asarray() instead returns an array of six 2D arrays consisting entirely of zeros or NaNs.) The error then occurs in PES.get_HL(), when this NaN-filled constraint Hessian from PES.get_Hc() is subtracted from the Hessian returned by PES.get_H().
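
Here is a standalone snippet (not Sella code) showing that once the constraint Hessian contains NaNs, the subtraction in get_HL() propagates them and scipy's eigh raises exactly this error:

import numpy as np
from scipy.linalg import eigh

H = np.eye(6)                      # stand-in for PES.get_H()
Hc = np.full((6, 6), np.nan)       # stand-in for the NaN constraint Hessian

try:
    eigh(H - Hc)                   # check_finite=True by default
except ValueError as err:
    print(err)                     # array must not contain infs or NaNs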

I’ve already tested my NN PES thoroughly in its ability to predict accurate forces and energies, so I doubt the errors stem from there. Is this expected behavior in the Sella optimizer? If not, do you have any ideas on what might be causing it?

I have also noticed that this error often occurs for systems of only one or two atoms. For systems with only one atom, my NN_PES looks up the energy in a lookup table and sets the forces to zero. Could the zero forces cause this issue?

This issue might also be related to two other issues on the Sella GitHub: "Minimal example of N2 with EMT or LJ leads to a crash" and "Constrained optimization diverging during hessian update".

Problem 2: ASE Database Issue

This second issue may or may not be related. Here is the error traceback:

Traceback (most recent call last):
  File "home/BA/kinbot_dir/rundir_system/output/130130000000000000002_well.py", line 114, in <module>
    db.write(mol, name='130130000000000000002_well', data=data)
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/ase/parallel.py", line 244, in new_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/ase/db/core.py", line 203, in new_method
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/ase/db/core.py", line 349, in write
    id = self._write(atoms, kvp, data, id)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "home/micromamba/envs/Kinbot_env/lib/python3.11/site-packages/ase/db/sqlite.py", line 339, in _write
    cur.execute('INSERT INTO systems VALUES ({})'.format(q),
sqlite3.IntegrityError: UNIQUE constraint failed: systems.unique_id

It occurs because there's already an entry in the database with the same unique_id. As far as I can tell, the unique_id is generated from data in the Atoms object. I suspect two possible causes: either two different Atoms objects are incorrectly mapping to the same unique_id, or Kinbot is attempting to re-run a failed simulation with an incomplete entry already present in the DB. Upon inspecting the database, I noticed some entries are missing key data, like energies and forces.
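
To narrow this down, a small diagnostic along these lines (the database file name 'kinbot.db' is just a placeholder for whatever the run produces) can list the incomplete rows together with their unique_id:

from ase.db import connect

db = connect('kinbot.db')
for row in db.select():
    # rows written before the calculation finished have no stored results
    if row.get('energy') is None or row.get('forces') is None:
        print(row.id, row.get('name', ''), row.unique_id)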

Although the error appears in the ASE DB code, I believe it may stem from either Kinbot or Sella. I've spent a significant amount of time troubleshooting this and now believe it's beyond my ability to resolve alone.

I’ve attached a well exploration script generated by Kinbot that triggers these errors, as well as a subset of the database in CSV format for reference.

As a first step, could you check if this error occurs for you as well?

Any help or guidance would be greatly appreciated.

db.csv 130130000000000000002_well.txt

IliasChair commented 3 weeks ago

Hello again,

Here’s a little update. I believe I managed to fix the first issue regarding the NaN values by merging the pull request on the Sella GitHub page into my local branch, which removes the deprecated force_consistent parameter. I have not encountered the crash since then.

Regarding the second issue:

I’ve discovered that the problem lies in how the unique_id is generated in the ASE database code. In the ase.db.row module, the unique_id is created using:

'unique_id': '%x' % randint(16**31, 16**32 - 1)

This generates a random integer within a specific range using random.randint and then converts it to a hexadecimal string.

However, when I run multiple well explorations in Kinbot — for example, for 390870650390040000002_well and 330490170000000000002_well — the random integer generator returns the same number for the unique_id. I’ve also confirmed that the generated unique_id remains the same across different executions of the same code.

I still believe this issue is somehow tied to Kinbot, because when I save the ASE Atoms objects from the well exploration scripts and then create the unique_id separately in another script, the random number generator produces different numbers. This suggests that the random number generation algorithm may be influenced by something in Kinbot, causing the same unique_id to be generated for different well explorations and runs.
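
As an illustration of that hypothesis (the explicit seed below is purely hypothetical; I have not found such a call, but anything that seeds Python's global random module would have this effect):

import random

random.seed(42)                                      # hypothetical fixed seed
id_run1 = '%x' % random.randint(16**31, 16**32 - 1)

random.seed(42)                                      # same seed in a later run
id_run2 = '%x' % random.randint(16**31, 16**32 - 1)

print(id_run1 == id_run2)   # True -> the second db.write() hits the UNIQUE constraint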

However, this behavior may be intentional, as we want Kinbot to access the output from previous computations stored in the database using the unique_id, allowing it to skip re-running calculations. That said, this does not explain why two different Atoms objects are generating the same unique_id.

Any assistance or insights regarding the cause of this behavior, as well as suggestions on how to best tackle the issue, would be greatly appreciated.

Best Ilias