zachmayer / caretEnsemble

caret models all the way down :turtle:

caretList produces incorrect resample data #168

Open farbodr opened 8 years ago

farbodr commented 8 years ago

I'm using the following code to train multiple caret models, and it looks like caretList is duplicating rows in the resample data.

fitControl3 <- trainControl(
  method='cv',
  number=5,
  savePredictions=TRUE,
  classProbs=TRUE,
  index=createResample(train_sub$target, 5),
  summaryFunction=twoClassSummary
)

model.list3 <- caretList(
  train_sub$target ~ ., 
  preProcess=NULL,
  data = train_sub,
  metric='ROC',
  trControl= fitControl3,
  tuneList=list(
    glmBoost=caretModelSpec(method='glmboost', tuneGrid=expand.grid(mstop=seq(1900, 2000, by=100),prune=c('no'))),
    glm=caretModelSpec(method='glm'),
    pls=caretModelSpec(method='pls',  tuneGrid=expand.grid(ncomp=c(20))),
    xgbtree=caretModelSpec(method='xgbTree', tuneGrid=expand.grid(eta=c(0.01), 
                                                                  max_depth=c(9), 
                                                                  nrounds=c(3000))),
    rf1=caretModelSpec(method='parRF',  ntree=100, tuneGrid=expand.grid(mtry=c(12, 14, 18)))
  )
)
save(model.list3, file='xgb_rf_glmb_glm_pls_cv_5_all.RData')
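One thing worth noting about the trainControl above: `createResample()` produces bootstrap indices, each the full size of the data, rather than CV folds, which is why the printed "Summary of sample sizes" below shows 7262 five times. A small stand-alone sketch (toy data, not the original `train_sub`) contrasting it with `createFolds()`:

```r
library(caret)  # assumes caret is installed
set.seed(1)
y <- factor(sample(c("no", "yes"), 100, replace = TRUE))

# Bootstrap indices: each resample is a full-sized sample with replacement
boot_idx <- createResample(y, times = 5)

# CV training indices: each fold's training set excludes one fold
cv_idx <- createFolds(y, k = 5, returnTrain = TRUE)

sapply(boot_idx, length)  # each 100
sapply(cv_idx, length)    # each about 80
```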

If I load the R object 'xgb_rf_glmb_glm_pls_cv_5_all.RData', here is what I see for the glmBoost model vs. glm (and all the other models in the list):

> model.list3[1]$glmBoost
Boosted Generalized Linear Model 

7262 samples
 333 predictors
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 7262, 7262, 7262, 7262, 7262 
Resampling results across tuning parameters:

  mstop  ROC        Sens       Spec       ROC SD       Sens SD      Spec SD   
  1900   0.7188322  0.9542160  0.1867477  0.009241549  0.008581122  0.01179033
  2000   0.7187255  0.9536293  0.1877144  0.009214250  0.008817720  0.01259558

Tuning parameter 'prune' was held constant at a value of no
ROC was used to select the optimal model using  the largest value.
The final values used for the model were mstop = 1900 and prune = no. 
> model.list3[1]$glmBoost$resample
         ROC      Sens      Spec  Resample
1  0.7224517 0.9657869 0.1810897 Resample1
2  0.7224517 0.9657869 0.1810897 Resample1
3  0.7330123 0.9598236 0.1977671 Resample2
4  0.7330123 0.9598236 0.1977671 Resample2
5  0.7079159 0.9531936 0.1803543 Resample3
6  0.7079159 0.9531936 0.1803543 Resample3
7  0.7190849 0.9419862 0.2019386 Resample4
8  0.7190849 0.9419862 0.2019386 Resample4
9  0.7116962 0.9502896 0.1725888 Resample5
10 0.7116962 0.9502896 0.1725888 Resample5
> model.list3[2]$glm$resample
        ROC      Sens      Spec  Resample
1 0.6972258 0.9130010 0.2948718 Resample1
2 0.7146589 0.9255267 0.2599681 Resample2
3 0.6906046 0.9107752 0.2753623 Resample3
4 0.7061589 0.9006883 0.2714055 Resample4
5 0.7074587 0.9107143 0.2774958 Resample5
> 

Obviously I can't run the caretEnsemble method with model.list3. It (understandably) gives this error:

Error in check_bestpreds_resamples(modelLibrary) : 
  Component models do not have the same re-sampling strategies
jknowles commented 8 years ago

Can you try the same procedure with the same data, but use separate X and Y vectors instead of the formula interface? Looking at this I have a suspicion it is something with the formula interface, which we admittedly don't test very well in our unit tests.
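A rough sketch of what that looks like with the X/Y interface, using a toy stand-in for `train_sub` since the real data isn't posted in the thread (the column name `target` is taken from the original post):

```r
library(caret)
library(caretEnsemble)  # assumes caretEnsemble is installed

# Toy stand-in for train_sub
set.seed(1)
train_sub <- data.frame(
  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200),
  target = factor(sample(c("no", "yes"), 200, replace = TRUE))
)

# Split predictors and outcome explicitly instead of using target ~ .
X <- train_sub[, setdiff(names(train_sub), "target")]
Y <- train_sub$target

ctrl <- trainControl(
  method = "cv", number = 5,
  savePredictions = TRUE, classProbs = TRUE,
  index = createResample(Y, 5),
  summaryFunction = twoClassSummary
)

model.list <- caretList(
  x = X, y = Y,
  metric = "ROC", trControl = ctrl,
  tuneList = list(glm = caretModelSpec(method = "glm"))
)

# Each resample should appear exactly once here
table(model.list$glm$resample$Resample)
```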

zachmayer commented 8 years ago

Good catch @jknowles. @farbodr the formula interface is really sub-optimal. Try the X/Y interface instead.

zachmayer commented 8 years ago

@farbodr Does this issue occur if you use the X/Y interface and caretEnsemble 2.0.0 from CRAN?

farbodr commented 8 years ago

I haven't but will give it a try this weekend.

farbodr commented 8 years ago

I couldn't find my original example, so I used another one, and X/Y still produces the same error. The interesting thing is that if I remove glmboost from the model list, the problem goes away. I can put something together with a smaller data set and post it here if that helps.

zachmayer commented 8 years ago

Try a caret::train model on your data, using method='glmboost'.

I've had problems with that model in the past.
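A minimal sketch of that isolation test, on toy data rather than the original set (`method = "glmboost"` also requires the mboost package): if the duplicated resample rows show up here too, the problem is in caret/glmboost rather than in caretList.

```r
library(caret)  # mboost must also be installed for method = "glmboost"
set.seed(1)
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
Y <- factor(sample(c("no", "yes"), 200, replace = TRUE))

# Train glmboost on its own, outside caretList
fit <- train(
  x = X, y = Y, method = "glmboost",
  trControl = trainControl(method = "cv", number = 5)
)

# Each resample should appear exactly once for the chosen tune
table(fit$resample$Resample)
```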

JasonCEC commented 8 years ago

I am also getting this bug, but switching to X/Y instead of the formula interface breaks random forest with this error: Error in predict.randomForest(modelFit, newdata, type = "prob") : missing values in newdata.

caretEnsemble running only an rf model works fine through the formula interface.

zachmayer commented 8 years ago

Run anyNA(X) and anyNA(Y)
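A base-R sketch of that check, on a toy frame: NAs in the predictors are a common cause of the "missing values in newdata" error from predict.randomForest.

```r
# Toy data with one NA in the predictors
X <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))
Y <- factor(c("no", "yes", "no"))

anyNA(X)  # TRUE for this toy frame; should be FALSE before training
anyNA(Y)  # FALSE

# One way to proceed if NAs turn up: drop incomplete rows from both X and Y
complete <- complete.cases(X)
X_clean <- X[complete, , drop = FALSE]
Y_clean <- Y[complete]
```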


jashshah commented 6 years ago

I am facing the same issue. Any updates?