Closed stijnjas closed 6 years ago
You seem to have deleted the issue template when posting your issue. I re-added it. I put your session info output in the correct part of the issue template. Please reformat the rest of your question to fit the template, in particular:
the iris dataset? Both of these things are explained in more detail in the template.
Be sure to update the checklist too!
Thanks Zach, I have reinstalled the package and removed the old versions, and it works again. I am sorry for the inconvenience. Stijn
No worries! That's why I have the checklist 😄
devtools::install_github("zachmayer/caretEnsemble")
update.packages(oldPkgs="caret", ask=FALSE)
sessionInfo()
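After running the checklist commands above, a quick way to confirm that the updated versions are actually the ones loaded (standard base-R calls; a mismatch between an old caret and a newer caretEnsemble was the cause in this thread):

```r
# Confirm the installed versions after updating/reinstalling.
packageVersion("caret")
packageVersion("caretEnsemble")
```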
Minimal, reproducible example:
Text and example code modified from the R FAQ on stackoverflow
Minimal dataset:
If you have some data that would be too difficult to construct using caret::twoClassSim or caret::SLC14_1, then you can always make a subset of your original data, using e.g. head(), subset() or the indices. Then use e.g. dput() to give us something that can be put in R immediately, e.g. dput(head(iris, 4)).
If you must use dput(head()), please first remove any columns from your dataset that are not necessary to reproduce the error. If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level: dput(droplevels(head(iris, 4)))
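To illustrate the point above (uses only the built-in iris data, so it runs anywhere):

```r
# Without droplevels(), dput() still lists all three Species levels,
# even though only "setosa" appears in the first four rows:
dput(head(iris, 4))
# With droplevels(), the unused factor levels are removed, giving compact output:
dput(droplevels(head(iris, 4)))
```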
Minimal, runnable code:
Session Info:
#############################################
Original Post
#############################################
Dear Zach, please find below a reproducible example of a prediction error I am getting after having created a stacked ensemble. The error was not there a week ago, so I have no idea what went wrong.
sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] parallel splines grid stats graphics grDevices utils datasets methods base
other attached packages: [1] plyr_1.8.4 gbm_2.1.3 survival_2.40-1 randomForest_4.6-12 DMwR_0.4.1 kernlab_0.9-25
[7] caretEnsemble_2.0.0 topicmodels_0.2-6 caret_6.0-77 ggplot2_2.2.1 lattice_0.20-34 e1071_1.6-8
[13] tm_0.7-1 NLP_0.1-11
loaded via a namespace (and not attached): [1] ddalpha_1.3.1 sfsmisc_1.1-1 foreach_1.4.3 prodlim_1.6.1 gtools_3.5.0 assertthat_0.2.0
[7] TTR_0.23-1 stats4_3.3.3 DRR_0.0.2 robustbase_0.92-7 slam_0.1-40 ipred_0.9-6
[13] glue_1.1.1 digest_0.6.12 colorspace_1.3-2 recipes_0.1.0 Matrix_1.2-8 timeDate_3012.100
[19] pkgconfig_2.0.1 CVST_0.2-1 purrr_0.2.3 scales_0.5.0 gdata_2.18.0 gower_0.1.2
[25] lava_1.5.1 tibble_1.3.4 withr_2.0.0 ROCR_1.0-7 pbapply_1.3-3 nnet_7.3-12
[31] lazyeval_0.2.0 quantmod_0.4-11 magrittr_1.5 nlme_3.1-131 SnowballC_0.5.1 MASS_7.3-45
[37] gplots_3.0.1 xts_0.9-7 dimRed_0.1.0 class_7.3-14 tools_3.3.3 data.table_1.10.4-2
[43] stringr_1.2.0 munsell_0.4.3 bindrcpp_0.2 compiler_3.3.3 RcppRoll_0.2.2 caTools_1.17.1
[49] rlang_0.1.2.9000 iterators_1.0.8 bitops_1.0-6 gtable_0.2.0 ModelMetrics_1.1.0 codetools_0.2-15
[55] abind_1.4-5 reshape2_1.4.2 R6_2.2.2 gridExtra_2.3 zoo_1.8-0 lubridate_1.6.0
[61] dplyr_0.7.4 bindr_0.1 KernSmooth_2.23-15 modeltools_0.2-21 stringi_1.1.5 Rcpp_0.12.13
[67] rpart_4.1-10 DEoptimR_1.0-8
Example
Load packages
library(tm)
library(e1071)
library(caret)
library(topicmodels)
library(caretEnsemble)
abstracts_final <- read.csv("use_as_train.csv")
------------------------------------------------------------
Prepare data to be used as input to classification methods
------------------------------------------------------------
Create the corpus that will be used
review_corpus = VCorpus(VectorSource(abstracts_final$Abstract))
review_corpus = tm_map(review_corpus, content_transformer(tolower)) # no uppercase
review_corpus = tm_map(review_corpus, removeNumbers) # no numerical values
review_corpus = tm_map(review_corpus, removePunctuation) # remove punctuation
review_corpus = tm_map(review_corpus, removeWords, c("the", "and", stopwords("english"))) # remove stopwords
review_corpus = tm_map(review_corpus, stripWhitespace) # remove whitespace
review_corpus = tm_map(review_corpus, stemDocument, language = "english") # stemming (reduce words to a common base)
Term - Document matrices
1. just using the frequencies of the words
review_dtm <- DocumentTermMatrix(review_corpus)
review_dtm = removeSparseTerms(review_dtm, 0.98) # To reduce the dimension of the DTM, remove the less frequent terms so that each remaining term has sparsity at most 0.98
use_dtm<-review_dtm
abstracts_final$Abstract = NULL
names(abstracts_final)
abstracts_final = cbind(abstracts_final, as.matrix(use_dtm)) # select the desired term-document matrix
abstracts_final$Indicator <- as.factor(abstracts_final$Indicator)
abstracts_final <- abstracts_final[, c('Indicator', attr(as.matrix(use_dtm), "dimnames")$Terms)]
names(abstracts_final) = make.names(names(abstracts_final))
levels(abstracts_final$Indicator) <- c("irrelevant", "relevant")
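A small self-contained sketch of what removeSparseTerms() does in the step above (toy corpus invented for illustration; the 0.5 threshold is arbitrary and unrelated to the 0.98 used in the issue):

```r
library(tm)
# Toy corpus: "apple" appears in every document, the other terms in one each.
docs <- VCorpus(VectorSource(c("apple banana", "apple cherry", "apple grape")))
dtm <- DocumentTermMatrix(docs)
# Keep only terms present in at least 50% of documents:
# "banana", "cherry", "grape" are each 2/3 sparse, so only "apple" survives.
dim(removeSparseTerms(dtm, 0.5))
```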
Select training and test set
set.seed(1988)
splitIndex <- createDataPartition(abstracts_final$Indicator, p = .10, list = FALSE, times = 1)
trainSplit <- abstracts_final[splitIndex, ]
testSplit <- abstracts_final[-splitIndex, ]
###########################################
Fitting the different models
###########################################
Support Vector Machines
Linear
svm_grid_Linear <- expand.grid(C = c(0.00001, 0.0001, 0.001, 0.01, 1, 10))
set.seed(5627)
svm_Linear_orig <- train(Indicator ~ ., data = trainSplit, method = "svmLinear", metric = "Kappa",
                         tuneGrid = svm_grid_Linear,
                         trControl = trainControl(method = "cv", number = 3, sampling = NULL,
                                                  savePredictions = "final", classProbs = TRUE,
                                                  index = createResample(trainSplit$Indicator, 3)))
set.seed(5627)
svm_Linear_smote <- train(Indicator ~ ., data = trainSplit, method = "svmLinear", metric = "Kappa",
                          tuneGrid = svm_grid_Linear,
                          trControl = trainControl(method = "cv", number = 3, sampling = "smote",
                                                   savePredictions = "final", classProbs = TRUE,
                                                   index = createResample(trainSplit$Indicator, 3)))
fitted_models <- list(SVM_Linear_Original = svm_Linear_orig, SVM_Linear_Smote = svm_Linear_smote
)
class(fitted_models) <- "caretList"
stackControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE, classProbs = TRUE, sampling = "smote")
set.seed(1988)
garbage <- capture.output(stack.rf <- caretStack(fitted_models, method = "gbm", metric = "Kappa", trControl = stackControl))
predict(stack.rf, testSplit)
Thanks in advance, Stijn