yuenshingyan / MissForest

Arguably the best missing values imputation method.

MIT License

54 stars, 5 forks

ValueError: at least one array or dtype is required #1

Closed. khanwa closed this issue 2 years ago.

khanwa commented 2 years ago

Thanks for sharing the implementation. I am getting the error "ValueError: at least one array or dtype is required" when I run mfe = mfe.impute(data, rfc, rfr). It works fine when I read fish = pd.read_csv('Fish.csv').

But when I read another file, it raises the error, even though my DataFrame looks fine ("[699 rows x 10 columns]", type "DataFrame"). Could you please check?
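The thread does not say what caused the error, so the following is only a diagnostic sketch. It assumes a hypothetical helper, `summarize_for_imputation`, and a made-up two-column frame in place of the real data. It reports which columns are numeric, which are not, and how many values are missing in each, which is the kind of information that helps narrow down why one file works and another does not. The last part shows that NumPy produces the same message when `np.result_type` is called with no inputs; whether that is what happened here is an assumption.

```python
import numpy as np
import pandas as pd


def summarize_for_imputation(df: pd.DataFrame) -> dict:
    """Hypothetical helper: describe the column groups an imputer would see."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "shape": df.shape,
        "numeric": numeric,
        "non_numeric": non_numeric,
        "missing": df.isnull().sum().to_dict(),
    }


# Made-up data standing in for the file that triggers the error.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
report = summarize_for_imputation(df)
print(report)

# NumPy raises the same message when asked for a result type of nothing,
# for example when an empty selection of columns reaches it.
try:
    np.result_type()
except ValueError as exc:
    message = str(exc)
    print(message)
```

If one of the column groups in the report is empty, that is a reasonable place to start looking.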

yuenshingyan commented 2 years ago

Hi, would you like to try this script instead?

# Import dependencies
import numpy as np
import pandas as pd
from MissForest import MissForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

# Read our toy dataset
fish = pd.read_csv('Fish.csv')

# Set missing values
fish.iloc[1, 0] = np.nan
fish.iloc[155, 0] = np.nan
fish.iloc[1, 2] = np.nan
fish.iloc[155, 2] = np.nan

# Instantiate our imputer
mf = MissForest()
fish = mf.impute(x=fish, classifier=RandomForestClassifier(), regressor=RandomForestRegressor())

print(fish)

It seems you are assigning the result of mfe.impute(data, rfc, rfr) back to mfe, which overwrites the imputer itself, and the order of the classifier and regressor arguments is wrong.

mfe = mfe.impute(data, rfc, rfr)
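Both points can be illustrated with a small sketch. `ToyImputer` below is a hypothetical stand-in, not the MissForest class: it simply mean-fills numeric columns and ignores the estimators it is given. The only purpose is to show keyword arguments and a separate name for the result.

```python
import numpy as np
import pandas as pd


class ToyImputer:
    """Hypothetical stand-in for an imputer; not the MissForest class."""

    def impute(self, x, classifier=None, regressor=None):
        # Mean-fill numeric columns; the estimators are ignored in this toy.
        return x.fillna(x.mean(numeric_only=True))


imputer = ToyImputer()
data = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Keyword arguments make the classifier/regressor roles explicit, and a
# separate name for the result keeps the imputer object available afterwards.
imputed = imputer.impute(x=data, classifier=object(), regressor=object())
print(type(imputer).__name__, imputed["a"].tolist())
```

With keywords, swapping the two estimators by accident is no longer possible, and `imputer` can still be reused on another DataFrame.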

khanwa commented 2 years ago

Actually, it is the same. https://colab.research.google.com/drive/1olzHObF3eSYk5fYf0-3tsJBUlGD_VuGx?usp=sharing

yuenshingyan commented 2 years ago

Could you send me your data? Thank you.

khanwa commented 2 years ago

Thank you so much. Here it is.

yuenshingyan commented 2 years ago

I fixed the bug and tried it with the data you provided. It works fine so far.

from missforest.miss_forest import MissForest
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

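# Read the training set, set its last column aside as the label,
# then drop the 'class' column from the features
data_train = pd.read_csv('cancer_train_1.csv')
train_label = data_train.iloc[:, -1:]
data_train.drop('class', axis=1, inplace=True)

# Same steps for the test set
data_testt = pd.read_csv('cancer_test_10_1.csv')
testt_label = data_testt.iloc[:, -1:]
data_testt.drop('class', axis=1, inplace=True)

# Stack the labels and the features of both sets
label_all = pd.concat([train_label, testt_label], ignore_index=True)
data = pd.concat([data_train, data_testt], ignore_index=True)

# Missing values per column before imputation
print(data.isnull().sum())

# Instantiate our imputer and impute the missing values
mf = MissForest()
data = mf.fit_transform(X=data)

# Missing values per column after imputation
print(data.isnull().sum())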

a    28
b    17
c    32
d    32
e    31
f    43
g    35
h    30
i    22
dtype: int64
a    0
b    0
c    0
d    0
e    0
f    0
g    0
h    0
i    0
dtype: int64
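Since the script stacks the training and test rows before imputing, the imputed frame usually has to be split back afterwards. The sketch below shows one way to do that by row count. It uses made-up frames instead of the CSV files from the thread, and a median fill as a stand-in for the imputation step, so it illustrates only the concatenate-then-split bookkeeping.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the training and test files.
train = pd.DataFrame({"a": [1.0, np.nan], "class": [0, 1]})
test = pd.DataFrame({"a": [3.0, 5.0, np.nan], "class": [1, 0, 1]})

# Set the labels aside and remember how many training rows there are.
train_label = train.pop("class")
test_label = test.pop("class")
n_train = len(train)

combined = pd.concat([train, test], ignore_index=True)

# Stand-in for the imputation step (assumption: any imputer that
# preserves row order could be used here instead).
imputed = combined.fillna(combined.median())

# Split the imputed rows back into their original sets by position.
train_imputed = imputed.iloc[:n_train]
test_imputed = imputed.iloc[n_train:].reset_index(drop=True)

print(len(train_imputed), len(test_imputed))
print(int(imputed.isnull().sum().sum()))
```

This relies on `ignore_index=True` keeping the training rows first and the test rows after them, in their original order.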

khanwa commented 2 years ago

Thank you very much.