yuenshingyan / MissForest

Arguably the best missing values imputation method.
MIT License

"None of [Index([382], dtype='int64')] are in the [index]" #38

Closed nnagururu closed 6 months ago

nnagururu commented 6 months ago

Hi, I'm getting the following error and have been unable to debug it.

Thanks in advance!

kf = KFold(n_splits=5, shuffle=True, random_state=seed)

for fold, (train_index, test_index) in enumerate(kf.split(df)):
    print(f"Processing fold {fold + 1}")
    X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
    y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()

    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

    clf = RandomForestClassifier(n_jobs=-1) #for categorical
    rgr = RandomForestRegressor(n_jobs=-1) #for numerical
    imputer = MissForest(clf,rgr)

    X_train_imputed = imputer.fit_transform(X_train, categorical = cat_col)
    X_test_imputed = imputer.transform(X_test)

    # Save imputed datasets
    train = pd.concat([X_train_imputed, y_train], axis=1)
    test = pd.concat([X_test_imputed, y_test], axis=1)
    train.to_feather(f'Data/Imputed/RFI_fold{fold + 1}_train.feather')
    test.to_feather(f'Data/Imputed/RFI_fold{fold + 1}_test.feather')

KeyError                                  Traceback (most recent call last)
Cell In[9], line 15
     12 imputer = MissForest(clf,rgr)
     14 X_train_imputed = imputer.fit_transform(X_train, categorical = cat_col)
---> 15 X_test_imputed = imputer.transform(X_test)
     17 # Save imputed datasets
     18 train = pd.concat([X_train_imputed, y_train], axis=1)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\missforest\missforest.py:475, in MissForest.transform(self, x)
    473 # Predict the missing column with the trained estimator
    474 miss_index = self._missing_row[c]
--> 475 x_missing = x_imp.loc[miss_index]
    476 x_missing = x_missing.drop(c, axis=1)
    477 y_pred = estimator.predict(x_missing)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py:1192, in _LocationIndexer.__getitem__(self, key)
   1190 maybe_callable = com.apply_if_callable(key, self.obj)
   1191 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1192 return self._getitem_axis(maybe_callable, axis=axis)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py:1421, in _LocIndexer._getitem_axis(self, key, axis)
   1418     if hasattr(key, "ndim") and key.ndim > 1:
   1419         raise ValueError("Cannot index with multidimensional key")
-> 1421     return self._getitem_iterable(key, axis=axis)
...
-> 6248         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6250     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6251     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([382], dtype='int64')] are in the [index]"

Sep905 commented 6 months ago

Hi, I'm facing the same problem. I think the error occurs because the transform method looks up the row indices at which a given feature was missing in the training set. When transform is then applied to the test set DataFrame, those indices are not found there, since they belong to the training set DataFrame.
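The lookup failure described above can be reproduced with pandas alone. This is a minimal sketch, not the library's code: the frame x_test, its column name, and its index labels are made up for illustration, and only the label 382 is taken from the traceback.

```python
import pandas as pd

# Hypothetical test-fold frame whose index labels are 10 and 11.
x_test = pd.DataFrame({"a": [1.0, 2.0]}, index=[10, 11])

# A row label that was recorded from a different (training) frame.
stale_index = [382]

try:
    # .loc with a list of labels raises KeyError when none of them exist.
    x_test.loc[stale_index]
except KeyError as err:
    message = str(err)
    print(message)
```

Because KFold with shuffle=True yields disjoint train and test row sets, a label such as 382 that sits in the training fold cannot also be in the test fold's index.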

Sep905 commented 6 months ago

I figured out what is happening, at least in my case. Whenever the transform method is called, the _get_missing_rows(x) method is also called. The latter populates the _missing_row dictionary, in which each key is a feature and each value is the list of row indices of the input DataFrame where that feature is missing.

When transform is called to impute a test or validation set after fit_transform has been applied to a training set, the _missing_row dictionary is updated rather than overwritten, so row indices recorded from the training set are still in it.

I think the solution may be simply to insert:

self._missing_row = {}

at the beginning of the _get_missing_rows method's definition.

I tried it and it seems to work.
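To illustrate why resetting the dictionary helps, here is a toy sketch in plain Python. The ToyImputer class, its record_missing method, and its reset flag are hypothetical and are not MissForest's actual implementation; only the attribute name _missing_row comes from the traceback.

```python
class ToyImputer:
    """Toy stand-in that only tracks which rows have missing values."""

    def __init__(self, reset):
        self.reset = reset          # hypothetical switch: apply the proposed fix or not
        self._missing_row = {}

    def record_missing(self, rows):
        # rows maps a row label to a {feature: value} dict; None means missing.
        if self.reset:
            self._missing_row = {}  # the proposed one-line fix
        features = {f for row in rows.values() for f in row}
        for feature in sorted(features):
            labels = [label for label, row in rows.items() if row[feature] is None]
            if labels:
                self._missing_row[feature] = labels
        return dict(self._missing_row)


train = {382: {"age": None, "height": 1.8}}
test = {10: {"age": 30, "height": None}}

buggy = ToyImputer(reset=False)
buggy.record_missing(train)
stale = buggy.record_missing(test)   # still holds the training label for "age"

fixed = ToyImputer(reset=True)
fixed.record_missing(train)
fresh = fixed.record_missing(test)   # only holds labels from the test rows

print(stale)
print(fresh)
```

Without the reset, the entry for "age" still points at training row 382 when the test rows are processed, which is the kind of label a later .loc lookup on the test frame cannot find.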

cvraut commented 6 months ago

@Sep905 you are absolutely correct, thank you for figuring that out! I'll let you open the pull request for that change and receive the glory.

But in the meantime, I realized that we can use Python's unenforced access to private attributes as a short-term workaround for this issue:

imputer = MissForest()
imputer.fit(X_train)
train_imputed = imputer.transform(X_train)
# reset the _missing_row attribute of the imputer object
imputer._missing_row = {}
test_imputed = imputer.transform(X_test)

This worked for me without having to modify the library code.

yuenshingyan commented 6 months ago

Closes #39