yuenshingyan / MissForest

Arguably the best missing values imputation method.
MIT License

ValueError: Input data must be 2 dimensional and non empty. #23

Closed Sehjbir closed 10 months ago

Sehjbir commented 11 months ago

Running fit_transform raises the following error: ValueError: Input data must be 2 dimensional and non empty. The input data, however, is 2-dimensional and non-empty.

Code to reproduce:

    import numpy as np
    import pandas as pd
    from scipy.stats import binom, norm
    from missforest.missforest import MissForest  # module path as shown in the traceback

    # seed to follow along
    np.random.seed(1234)

    # generate 1000 data points
    N = np.arange(1000)

    # helper function for this data
    vary = lambda v: np.random.choice(np.arange(v))

    # create correlated, random variables
    a = 2
    b = 1/2
    eps = np.array([norm(0, vary(50)).rvs() for n in N])
    y = (a + b*N + eps) / 100
    x = (N + norm(10, vary(250)).rvs(len(N))) / 100

    # add missing values (only y gets NaNs; x stays complete)
    y[binom(1, 0.4).rvs(len(N)) == 1] = np.nan

    # convert to dataframe
    df = pd.DataFrame({"y": y, "x": x})
    df.head()

    mf = MissForest()
    df_imputed = mf.fit_transform(df)

Error:


    ValueError                                Traceback (most recent call last)
    Cell In[87], line 3
          1 from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
          2 mf = MissForest()
    ----> 3 df_imputed = mf.fit_transform(df)

    File /opt/conda/lib/python3.10/site-packages/missforest/missforest.py:531, in MissForest.fit_transform(self, X, categorical)
        512 """
        513 Class method 'fit_transform' calls class method 'fit' and 'transform'
        514 on 'X'.
       (...)
        527 Imputed dataset (features only).
        528 """
        530 self.fit(X, categorical)
    --> 531 X = self.transform(X)
        533 return X

    File /opt/conda/lib/python3.10/site-packages/missforest/missforest.py:457, in MissForest.transform(self, X)
        455 X_missing = X_imp.loc[miss_index]
        456 X_missing = X_missing.drop(c, axis=1)
    --> 457 y_pred = estimator.predict(X_missing)
        458 y_pred = pd.Series(y_pred)
        459 y_pred.index = self._miss_row[c]

    File /opt/conda/lib/python3.10/site-packages/lightgbm/sklearn.py:918, in LGBMModel.predict(self, X, raw_score, start_iteration, num_iteration, pred_leaf, pred_contrib, validate_features, **kwargs)
        915 predict_params = _choose_param_value("num_threads", predict_params, self.n_jobs)
        916 predict_params["num_threads"] = self._process_n_jobs(predict_params["num_threads"])
    --> 918 return self._Booster.predict(  # type: ignore[union-attr]
        919     X, raw_score=raw_score, start_iteration=start_iteration, num_iteration=num_iteration,
        920     pred_leaf=pred_leaf, pred_contrib=pred_contrib, validate_features=validate_features,
        921     **predict_params
        922 )

    File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:4220, in Booster.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, validate_features, **kwargs)
       4218 else:
       4219     num_iteration = -1
    -> 4220 return predictor.predict(
       4221     data=data,
       4222     start_iteration=start_iteration,
       4223     num_iteration=num_iteration,
       4224     raw_score=raw_score,
       4225     pred_leaf=pred_leaf,
       4226     pred_contrib=pred_contrib,
       4227     data_has_header=data_has_header,
       4228     validate_features=validate_features
       4229 )

    File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:1004, in _InnerPredictor.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, validate_features)
        995 _safe_call(
        996     _LIB.LGBM_BoosterValidateFeatureNames(
        997         self._handle,
       (...)
       1000     )
       1001 )
       1003 if isinstance(data, pd_DataFrame):
    -> 1004     data = _data_from_pandas(
       1005         data=data,
       1006         feature_name="auto",
       1007         categorical_feature="auto",
       1008         pandas_categorical=self.pandas_categorical
       1009     )[0]
       1011 predict_type = _C_API_PREDICT_NORMAL
       1012 if raw_score:

    File /opt/conda/lib/python3.10/site-packages/lightgbm/basic.py:677, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
        670 def _data_from_pandas(
        671     data: pd_DataFrame,
        672     feature_name: _LGBM_FeatureNameConfiguration,
        673     categorical_feature: _LGBM_CategoricalFeatureConfiguration,
        674     pandas_categorical: Optional[List[List]]
        675 ) -> Tuple[np.ndarray, List[str], List[str], List[List]]:
        676     if len(data.shape) != 2 or data.shape[0] < 1:
    --> 677         raise ValueError('Input data must be 2 dimensional and non empty.')
        679     # determine feature names
        680     if feature_name == 'auto':

    ValueError: Input data must be 2 dimensional and non empty.
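
The check quoted in the last frame (len(data.shape) != 2 or data.shape[0] < 1) also rejects a 2-dimensional frame that has zero rows. One plausible reading of the traceback, not confirmed anywhere in this thread, is that transform reached a column without missing values (x in the reproduction), so the selection X_imp.loc[miss_index] came back empty. A minimal sketch using pandas only: passes_shape_check is a hypothetical helper that mirrors the quoted condition, and miss_index is rebuilt here purely for illustration.

```python
import numpy as np
import pandas as pd


def passes_shape_check(data: pd.DataFrame) -> bool:
    """Hypothetical helper: negation of the condition quoted in the traceback."""
    return not (len(data.shape) != 2 or data.shape[0] < 1)


# Small stand-in for the reporter's frame: "y" has gaps, "x" is complete.
df = pd.DataFrame({"y": [0.1, np.nan, 0.3, np.nan], "x": [1.0, 2.0, 3.0, 4.0]})

results = {}
for c in df.columns:
    # Rows where column c is missing (assumed analogue of miss_index).
    miss_index = df.index[df[c].isna()]
    # Same two steps as lines 455-456 of the traceback.
    X_missing = df.loc[miss_index].drop(c, axis=1)
    results[c] = (X_missing.shape, passes_shape_check(X_missing))
    print(c, X_missing.shape, results[c][1])
```

Under this assumption the frame passed to predict for x is still 2-dimensional but has zero rows, which would explain why the reporter's own data looked valid while the error still fired.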

yuenshingyan commented 11 months ago

Hi, I have reproduced the bug and fixed it. So far, I no longer see the problem on my side. https://pypi.org/project/MissForest/
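
The reply does not say what was changed. As an illustration only, and not the maintainer's patch, a per-column prediction step can skip empty selections before calling predict. Everything named below is hypothetical: ConstantEstimator stands in for a fitted regressor and raises the same message on empty input, and predict_missing is an illustrative wrapper.

```python
import numpy as np
import pandas as pd


class ConstantEstimator:
    """Hypothetical stand-in for a fitted regressor; predicts one constant."""

    def __init__(self, value: float):
        self.value = value

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        # Reject empty input the same way the quoted check does.
        if len(X.shape) != 2 or X.shape[0] < 1:
            raise ValueError("Input data must be 2 dimensional and non empty.")
        return np.full(X.shape[0], self.value)


def predict_missing(estimator, X_imp: pd.DataFrame, c: str, miss_index) -> pd.Series:
    """Predict column c for the rows in miss_index, skipping empty selections."""
    if len(miss_index) == 0:
        # Nothing to impute for this column, so predict is never called.
        return pd.Series(dtype=float)
    X_missing = X_imp.loc[miss_index].drop(c, axis=1)
    return pd.Series(estimator.predict(X_missing), index=miss_index)


X_imp = pd.DataFrame({"y": [0.1, 0.2, 0.3], "x": [1.0, 2.0, 3.0]})
filled = predict_missing(ConstantEstimator(0.5), X_imp, "y", [1])
skipped = predict_missing(ConstantEstimator(0.5), X_imp, "x", [])
```

Here filled holds one prediction for row 1, while skipped is an empty Series and no exception is raised.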