yuenshingyan / MissForest

Arguably the best missing values imputation method.
MIT License
49 stars 5 forks source link

Breaking Mechanisms in missforest.py #47

Closed bburkman closed 1 month ago

bburkman commented 1 month ago

Thank you for making MissForest available for Python.

I have some questions about the mechanisms for breaking in missforest.py.

  1. I wonder whether missforest.py lines 480-483 should be unindented one level.

These lines save a copy of the current dataframe, split into categorical and numerical features. The reason why it saves a copy of the current dataframe is to compare with the previous copy to see whether the imputation is diverging, and if it is, stop.

I think what should happen is it should save the current dataframe at the end of each iteration, then compare this iteration with the previous iteration.

What I think is happening is that it is storing the current dataframe at the end of each feature, then at the end of the iteration comparing this copy with the previous copy, which only differ by the imputation of the last feature.

This mis-indentation has two deleterious effects, that (1) it is only considering changes to the last feature, and (2) it is storing many copies of the dataframe, one for each feature in each iteration, which for my work took 40 GB of memory and caused the process to crash, when it should have taken 1-2 GB for my dataframe.

Actually, is it necessary to store all of the iterations, when it only uses the current and previous?

  1. In lines 488-490, currently with the default of max_iter==5, the process will break when n_iter==6, after iterations 0, 1, 2, 3, 4, and 5, which is six iterations, not five. There may be a similar issue in lines 493 and 501. Perhaps move line 488 to after line 506? Or change line 489 to "if n_iter >= self.max_iter:" if the "n_iter >= 2 and" in lines 493 and 501 is working as intended.

  2. In lines 492-506, if the imputation has started to diverge, should the process return the current iteration or the previous iteration? I actually don't know which would be better.

Thank you for making this code available.

Brad Burkman bradburkman@gmail.com

yuenshingyan commented 1 month ago

@bburkman Thank you for your kind words. Your advices are very helpful.

I have just published MissForest 3.0.0: https://pypi.org/project/MissForest/ For issue 1) I added code to remove the first element in the list which stores imputation matrix x whenever it has the length of 2, right before storing another imputation matrix. 2) I replaced while loop with for loop. 3) According to https://cran.r-project.org/web/packages/missForest/missForest.pdf, the before last imputation matrix will be returned if stopping criterion is triggered.

I ran tests and there is nothing weird so far. Please let me know me if you found something weird.

Hindy Yuen