ymattu / MlBayesOpt

R package to tune parameters for machine learning (Support Vector Machine, Random Forest, and Xgboost), using Bayesian optimization with a Gaussian process

Error: Invalid parameter format for num_class expect int but value = 'NA' #55

Open yongfanbeta opened 6 years ago

yongfanbeta commented 6 years ago

hello,

When I use MlBayesOpt to optimize an xgboost model for a regression problem, such as predicting house prices, I choose objectfun = "reg:linear". This is not a classification problem, so there is no classes parameter, but it seems I have to give a num_class? Roughly, my call looks like the sketch below.
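A minimal sketch of the kind of call I mean (using mtcars as a stand-in for my actual house price data, with mpg as the numeric target):

library(MlBayesOpt)

# mtcars stands in for my real data; mpg is the numeric column to predict
res <- xgb_cv_opt(data = mtcars,
                  label = mpg,
                  objectfun = "reg:linear",
                  evalmetric = "rmse",
                  n_folds = 5)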

Hoping for your reply!

shakfu commented 6 years ago

I had exactly the same problem. Is linear regression not supported in this case?

ymattu commented 6 years ago

@Victorfy @shakfu

Thank you for using MlBayesOpt. I reproduced the same error. For now, this is a bug in the package, so I will fix it in the next version.

I'm very sorry... Please wait a little while until I fix it, or I welcome your pull request.

Edward-Aidi commented 6 years ago

I also encountered the same problem! Looking forward to your next update! Thanks!

Error in xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj) :
  Invalid Parameter format for num_class expect int but value='NA'
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 11.87 12.9 26.54

SimonTopp commented 5 years ago

Thanks for the great package! Any update on this issue or workarounds for running xgb_opt with reg:linear?

TwZhou0 commented 5 years ago

I also encountered the same problem! Looking forward to your next update! Thanks!

Error in xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj) :
  Invalid Parameter format for num_class expect int but value='NA'
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 11.87 12.9 26.54

I also encountered the same problem when I tried fitting a regression model. Have you figured out how to fix it?

carolinart commented 4 years ago

The same problem still remains for reg:linear.

msmith01 commented 4 years ago

You need to comment out num_class = num_classes in the else branch of the "#about classes" section of the xgb_cv_opt function. The section begins:

  if (grepl("logi", objectfun) == TRUE){
    xgb_cv <- function(object_fun,
                       eval_met,
                       num_classes,

So if the objective function is binary:logistic, the if branch is used and works correctly. However, when the objective function does not match "logi", the else branch is used, and that branch also passes the num_classes object to xgb.cv as num_class, even though reg:linear does not use a num_class parameter.

The num_classes object appears in both the if and the else part of the code. I opened a pull request to highlight where the error is occurring. However, I still get a warning message on an unrelated issue.

Running the following should solve the issue (however, I have only checked it on the iris dataset):

xgb_cv_opt <- function(data,
                       label,
                       objectfun,
                       evalmetric,
                       n_folds,
                       eta_range = c(0.1, 1L),
                       max_depth_range = c(4L, 6L),
                       nrounds_range = c(70, 160L),
                       subsample_range = c(0.1, 1L),
                       bytree_range = c(0.4, 1L),
                       init_points = 4,
                       n_iter = 10,
                       acq = "ei",
                       kappa = 2.576,
                       eps = 0.0,
                       optkernel = list(type = "exponential", power = 2),
                       classes = NULL,
                       seed = 0
)
{
  if(class(data)[1] == "dgCMatrix")
  {dtrain <- xgb.DMatrix(data,
                         label = label)
  xg_watchlist <- list(msr = dtrain)

  cv_folds <- KFold(label, nfolds = n_folds,
                    stratified = TRUE, seed = seed)
  }
  else
  {
    quolabel <- enquo(label)
    datalabel <- (data %>% select(!! quolabel))[[1]]

    mx <- sparse.model.matrix(datalabel ~ ., data)

    if (class(datalabel) == "factor"){
      dtrain <- xgb.DMatrix(mx, label = as.integer(datalabel) - 1)
    } else{
      dtrain <- xgb.DMatrix(mx, label = datalabel)
      }

    xg_watchlist <- list(msr = dtrain)

    cv_folds <- KFold(datalabel, nfolds = n_folds,
                      stratified = TRUE, seed = seed)
  }

  #about classes
  if (grepl("logi", objectfun) == TRUE){
    xgb_cv <- function(object_fun,
                       eval_met,
                       num_classes,
                       eta_opt,
                       max_depth_opt,
                       nrounds_opt,
                       subsample_opt,
                       bytree_opt) {

      object_fun <- objectfun
      eval_met <- evalmetric

      cv <- xgb.cv(params = list(booster = "gbtree",
                                 nthread = 1,
                                 objective = object_fun,
                                 eval_metric = eval_met,
                                 eta = eta_opt,
                                 max_depth = max_depth_opt,
                                 subsample = subsample_opt,
                                 colsample_bytree = bytree_opt,
                                 lambda = 1, alpha = 0),
                   data = dtrain, folds = cv_folds,
                   watchlist = xg_watchlist,
                   prediction = TRUE, showsd = TRUE,
                   early_stopping_rounds = 5, maximize = TRUE, verbose = 0,
                   nrounds = nrounds_opt)

      if (eval_met %in% c("auc", "ndcg", "map")) {
        s <- max(cv$evaluation_log[, 4])
      } else {
        s <- max(-(cv$evaluation_log[, 4]))
      }
      list(Score = s,
           Pred = cv$pred)
    }
  } else{
    xgb_cv <- function(object_fun,
                       eval_met,
                       num_classes,
                       eta_opt,
                       max_depth_opt,
                       nrounds_opt,
                       subsample_opt,
                       bytree_opt) {

      object_fun <- objectfun
      eval_met <- evalmetric

      num_classes <- classes

      cv <- xgb.cv(params = list(booster = "gbtree",
                                 nthread = 1,
                                 objective = object_fun,
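                                 # num_class is intentionally commented out below:
                                 # reg:linear does not take a num_class parameter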
                                 #num_class = num_classes,
                                 eval_metric = eval_met,
                                 eta = eta_opt,
                                 max_depth = max_depth_opt,
                                 subsample = subsample_opt,
                                 colsample_bytree = bytree_opt,
                                 lambda = 1, alpha = 0),
                   data = dtrain, folds = cv_folds,
                   watchlist = xg_watchlist,
                   prediction = TRUE, showsd = TRUE,
                   early_stopping_rounds = 5, maximize = TRUE, verbose = 0,
                   nrounds = nrounds_opt)

      if (eval_met %in% c("auc", "ndcg", "map")) {
        s <- max(cv$evaluation_log[, 4])
      } else {
        s <- max(-(cv$evaluation_log[, 4]))
      }
      list(Score = s,
           Pred = cv$pred)
    }
  }

  opt_res <- BayesianOptimization(xgb_cv,
                                  bounds = list(eta_opt = eta_range,
                                                max_depth_opt = max_depth_range,
                                                nrounds_opt = nrounds_range,
                                                subsample_opt = subsample_range,
                                                bytree_opt = bytree_range),
                                  init_points,
                                  init_grid_dt = NULL,
                                  n_iter,
                                  acq,
                                  kappa,
                                  eps,
                                  optkernel,
                                  verbose = TRUE)

  return(opt_res)

}

library(MlBayesOpt)
library(dplyr)
library(Matrix)
library(xgboost)
library(rBayesianOptimization)
df <- iris
label_Species <- iris$Species
xgb_cv_opt(data = df,
           label = label_Species,
           objectfun = "reg:linear", evalmetric = "rmse", n_folds = 2, eta_range = c(0.1, 1L),
           max_depth_range = c(4L, 6L), nrounds_range = c(70, 160L),
           subsample_range = c(0.1, 1L), bytree_range = c(0.4, 1L),
           init_points = 4, n_iter = 10, acq = "ucb", kappa = 2.576, eps = 0,
           optkernel = list(type = "exponential", power = 2), classes = NULL,
           seed = 0)

I get the following warning message:

Warning messages:
1: In matrix(c(sample(index), rep(NA, NA_how_many)), ncol = nfolds) :
  data length [15] is not a sub-multiple or multiple of the number of rows [8]
2: In matrix(c(sample(index), rep(NA, NA_how_many)), ncol = nfolds) :
  data length [43] is not a sub-multiple or multiple of the number of rows [22]
3: In matrix(c(sample(index), rep(NA, NA_how_many)), ncol = nfolds) :
  data length [109] is not a sub-multiple or multiple of the number of rows [55]
4: In matrix(c(sample(index), rep(NA, NA_how_many)), ncol = nfolds) :
  data length [107] is not a sub-multiple or multiple of the number of rows [54]
5: In matrix(c(sample(index), rep(NA, NA_how_many)), ncol = nfolds) :
  data length [133] is not a sub-multiple or multiple of the number of rows [67]

I have traced these warnings to this part of the code:

    cv_folds <- KFold(datalabel, nfolds = n_folds,
                      stratified = TRUE, seed = seed)

I had this solved but lost the unsaved changes when I switched projects in R. If I recall correctly, I converted the datalabel or label to a new numeric vector or Matrix; a rough sketch of what that might look like is below.
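A rough, untested sketch of the kind of change I mean, inside xgb_cv_opt before the folds are built (assumptions: the warnings come from KFold() stratifying on a label that is not a plain numeric vector, and stratification can be dropped for regression):

  # make sure the label is a plain numeric vector
  datalabel <- as.numeric(datalabel)

  # stratified folds are not meaningful for a continuous regression target,
  # so turning stratification off should also silence the fold-size warnings
  cv_folds <- KFold(datalabel, nfolds = n_folds,
                    stratified = FALSE, seed = seed)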