Open christophlohrmann opened 2 years ago
In what way does it depend on hardware and what are the errors?
In github CI everything passes.
On my laptop, one of the calculator tests fails reproducibly (I do not remember which one, but I can find out this evening) It is a value-failure, so the calculation runs but produces wrong numerical values.
On my ICP machine, I can sometimes produce errors by using pytest -n 20 [pytest-xdist running tests in parallel with many runners]. Here I get an IndexError: list index out of range
in a minibatch loop
In terms of the laptop fail please let me know tonight. There was a very recent change to a test that has now been removed and so if the repo was not updated this would fail but we can go through that.
As for the parallel nature of things #145 may the issue that we bring up here. I could imagine something funny happening if you had serious changes in memory availability but I don't have an exact idea of where. It is also just possible that some bug made it fail. But the fact that it fails when you run jobs in parallel is possible. When you run the parallel jobs, does it start 20 kernels and begin 20 projects or does it start one project, open a class and then run the tests on different threads? If there are shared variables in the parent but in a kernel something memory related gets changed then it could go wrong. But this could be corrected in the code by having dynamic, or as it is written in the issue, intelligent, memory management.
Ahh as I say that the mini batching may just not be working correctly. I would need to see an example and run it locally to confirm. But this atom-wise mini-batching may have been impacted by the restructure and now when you run 20 shells it invokes this mini-batching and so fails. But this would only happen on dynamic properties, which calculator failed?
This time I got lucky and had one value error and two index errors:
FAILED integration_tests/calculators/test_radial_distribution_function.py::test_project - AssertionError:
FAILED integration_tests/calculators/test_kirkwood_buff_integrals.py::test_project - tensorflow.python.framework.errors_impl.UnknownError: IndexError: list index out of range
FAILED integration_tests/calculators/test_radial_distribution_function.py::test_experiment - tensorflow.python.framework.errors_impl.UnknownError: IndexError: list index out of range
Okay so rdf fails twice probably because it is the same call and then KB fails maybe because the RDF data does not exist
I am pretty sure batching is broken somewhere, because I get random test failures depending on hardware and hardware usage. We need either