zincware / MDSuite

A post-processing engine for particle simulations
https://mdsuite.readthedocs.io/
Eclipse Public License 2.0
36 stars 7 forks source link

Batching untested #383

Open christophlohrmann opened 2 years ago

christophlohrmann commented 2 years ago

I am pretty sure batching is broken somewhere, because I get random test failures depending on hardware and hardware usage. We need either

SamTov commented 2 years ago

In what way does it depend on hardware and what are the errors?

christophlohrmann commented 2 years ago

In github CI everything passes.

On my laptop, one of the calculator tests fails reproducibly (I do not remember which one, but I can find out this evening) It is a value-failure, so the calculation runs but produces wrong numerical values.

On my ICP machine, I can sometimes produce errors by using pytest -n 20 [pytest-xdist running tests in parallel with many runners]. Here I get an IndexError: list index out of range in a minibatch loop

SamTov commented 2 years ago

In terms of the laptop fail please let me know tonight. There was a very recent change to a test that has now been removed and so if the repo was not updated this would fail but we can go through that.

As for the parallel nature of things #145 may the issue that we bring up here. I could imagine something funny happening if you had serious changes in memory availability but I don't have an exact idea of where. It is also just possible that some bug made it fail. But the fact that it fails when you run jobs in parallel is possible. When you run the parallel jobs, does it start 20 kernels and begin 20 projects or does it start one project, open a class and then run the tests on different threads? If there are shared variables in the parent but in a kernel something memory related gets changed then it could go wrong. But this could be corrected in the code by having dynamic, or as it is written in the issue, intelligent, memory management.

Ahh as I say that the mini batching may just not be working correctly. I would need to see an example and run it locally to confirm. But this atom-wise mini-batching may have been impacted by the restructure and now when you run 20 shells it invokes this mini-batching and so fails. But this would only happen on dynamic properties, which calculator failed?

christophlohrmann commented 2 years ago

This time I got lucky and had one value error and two index errors:

FAILED integration_tests/calculators/test_radial_distribution_function.py::test_project - AssertionError: 

FAILED integration_tests/calculators/test_kirkwood_buff_integrals.py::test_project - tensorflow.python.framework.errors_impl.UnknownError: IndexError: list index out of range

FAILED integration_tests/calculators/test_radial_distribution_function.py::test_experiment - tensorflow.python.framework.errors_impl.UnknownError: IndexError: list index out of range
SamTov commented 2 years ago

Okay so rdf fails twice probably because it is the same call and then KB fails maybe because the RDF data does not exist