Breaking Mechanisms in missforest.py

Thank you for making MissForest available for Python.

I have some questions about the mechanisms for breaking in missforest.py.

I wonder whether missforest.py lines 480-483 should be unindented one level.

These lines save a copy of the current dataframe, split into categorical and numerical features. The reason why it saves a copy of the current dataframe is to compare with the previous copy to see whether the imputation is diverging, and if it is, stop.

I think what should happen is it should save the current dataframe at the end of each iteration, then compare this iteration with the previous iteration.

What I think is happening is that it is storing the current dataframe at the end of each feature, then at the end of the iteration comparing this copy with the previous copy, which only differ by the imputation of the last feature.

This mis-indentation has two deleterious effects, that (1) it is only considering changes to the last feature, and (2) it is storing many copies of the dataframe, one for each feature in each iteration, which for my work took 40 GB of memory and caused the process to crash, when it should have taken 1-2 GB for my dataframe.

Actually, is it necessary to store all of the iterations, when it only uses the current and previous?

In lines 488-490, currently with the default of max_iter==5, the process will break when n_iter==6, after iterations 0, 1, 2, 3, 4, and 5, which is six iterations, not five. There may be a similar issue in lines 493 and 501. Perhaps move line 488 to after line 506? Or change line 489 to "if n_iter >= self.max_iter:" if the "n_iter >= 2 and" in lines 493 and 501 is working as intended.
In lines 492-506, if the imputation has started to diverge, should the process return the current iteration or the previous iteration? I actually don't know which would be better.

Thank you for making this code available.

Brad Burkman bradburkman@gmail.com

yuenshingyan / MissForest

Breaking Mechanisms in missforest.py #47