youbao88 / KPrototypes_plus

A JIT optimized K-Prototype algorithm
MIT License

Problem when using fit_predict #1

Open cnmoro opened 3 years ago

cnmoro commented 3 years ago

I am having the following issue (on both Python 3.6 and 3.8). Any ideas on how to fix this? Thanks!

Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 2d, F)
During: typing of argument at /home/cnmoro/miniconda3/envs/py36/lib/python3.6/site-packages/kpplus/kpplus.py (167)

File "../../miniconda3/envs/py36/lib/python3.6/site-packages/kpplus/kpplus.py", line 167:
def mean_std(data, types):
    std = 0
    ^

thearcanist commented 3 years ago

Yep, me too. @cnmoro did you find any workarounds for this? @youbao88 please help!

cnmoro commented 3 years ago

@thearcanist I remember solving it, but not how. It was still way too slow for my use, so I ended up using regular k-means instead: I applied MCA (Prince package) to the categorical variables and normalized the numerical ones with MinMaxScaler (based on the min and max values from MCA).

ori-katz100 commented 3 years ago

It has something to do with the gamma. Once you specify gamma yourself the problem is solved.

youbao88 commented 2 years ago

Sorry for such a late reply. The problem seems to be a conflict with numba's nopython mode. I will try to fix this in the next release. In the meantime, could you please verify whether it can be solved by using @ori-katz100's suggestion?
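For context, here is a minimal sketch of what the error message most likely refers to. Numba reports `array(pyobject, ...)` when it receives a NumPy array of dtype object, which is what a table mixing numbers and strings usually becomes. The example uses only NumPy and a made-up three-row table; the rows and variable names are illustrative assumptions and are not taken from kpplus.

```python
import numpy as np

# Hypothetical mixed-type table: two numeric columns and one categorical column.
rows = [(1.0, 10.0, "red"), (2.0, 20.0, "blue"), (3.0, 30.0, "red")]

# Numbers and strings stored together end up in an object-dtype array,
# the kind of array that nopython mode cannot assign a precise type to.
mixed = np.array(rows, dtype=object)
print(mixed.dtype)

# Selecting only the numeric columns and casting them gives a typed array.
numeric = mixed[:, :2].astype(np.float64)
print(numeric.dtype)
```

Checking `dtype` on whatever array reaches a jitted function is a quick way to see whether this is the cause in a given setup.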

cnmoro commented 2 years ago

@thearcanist @ori-katz100 @youbao88

The fix was to calculate the gamma value instead of passing it as None.

I modified the original mean_std function and ended up with the following code:

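import numpy as np

# 1 marks a categorical column, 0 a numeric one
# ("categorics" is the list of categorical column names, "data" is the DataFrame)
categorical = [1 if x in categorics else 0 for x in data.columns]

def mean_std(data, types):
    # Average standard deviation over the numeric columns
    std = 0
    count_num_column = 0
    for col_index in range(len(types)):
        if types[col_index] == 0:
            count_num_column += 1
            std += np.std(data.iloc[:, col_index])
    return std / count_num_column

custom_gamma = mean_std(data, np.array(categorical))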

Then pass "custom_gamma" as the gamma parameter. The mean_std function only worked after I added ".iloc" to the np.std call; otherwise it threw an error related to slicing the data.

KPrototypes_plus(n_clusters=k, n_init = 16, n_jobs = -1, gamma = custom_gamma)
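For anyone who wants to try the workaround on toy data first, here is a self-contained sketch of the same idea: average the population standard deviation over the numeric columns and use the result as gamma. The helper name `mean_numeric_std`, the toy DataFrame, and its column names are assumptions made for illustration; they are not part of kpplus.

```python
import numpy as np
import pandas as pd


def mean_numeric_std(df, categorical_columns):
    """Mean of the per-column population std over the non-categorical columns."""
    numeric = df.drop(columns=list(categorical_columns))
    if numeric.shape[1] == 0:
        raise ValueError("at least one numeric column is required")
    # Cast to float64 so no object-dtype array is involved in the computation.
    values = numeric.to_numpy(dtype=np.float64)
    return float(values.std(axis=0).mean())


# Hypothetical toy data: two numeric columns and one categorical column.
toy = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [10.0, 10.0, 30.0, 30.0],
    "color": ["red", "blue", "red", "blue"],
})

custom_gamma = mean_numeric_std(toy, ["color"])
print(custom_gamma)
```

The resulting float can then be passed as the gamma argument in the same way as in the call above.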

It is still way too slow for me. My dataset has 547930 rows, 4 numeric columns and 7 categorical columns, and it takes more than two hours to run the model with n_clusters=2. I can't even plot the elbow curve (which would require running from k=2 to k=15) :(

youbao88 commented 2 years ago

Thank you @cnmoro for your kind reply.

Yes, I have now noticed this issue and it will be fixed in the next minor release.

Regarding the performance issue, it will be improved in the next major release.

Thank you again for your comments.

youbao88 commented 2 years ago

@cnmoro can you verify whether it is fixed in the new release v0.0.3? Thank you!

cnmoro commented 2 years ago

Unfortunately, I still get the error in 0.0.3:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 2d, F)

167:
def mean_std(data, types):
    std = 0
    ^