In the following example I create a simple custom caret model so that I can inspect the x, y and wts values being passed to the model. The easiest way to do this is to add browser() inside the custom model, but I am using print statements instead, which are enough to illustrate the problem.
In the example below my weights ascend from 0 to 1 in steps of 0.01. In theory this should have an essentially random effect on the prediction. However, y gets sorted before being passed to the model and wts does not, so the wts no longer align with the x and y rows. Even more pernicious, in the case below the misalignment causes the larger y values to be weighted more heavily, which strongly distorts the apparent weighted mean of the series.
library(caret)
library(caretEnsemble)

# Custom caret method, used here to inspect the x, y and wts passed to fit
CaretMean <- list(
  library = c("dplyr"),
  type = "Regression",
  parameters = data.frame(parameter = c("None"),
                          class = c("character"),
                          label = c("None")),
  grid = function(x, y, len = NULL, search = "grid") { data.frame(None = "") },
  fit = function(x, y, wts, param, lev, last, weights = NA, classProbs = NA, ...) {
    RetVal <- list()
    if (is.null(wts))
      wts <- rep(1, length(y))
    # x and y arrive re-sorted so that y is in ascending order, but wts is not reordered.
    # The weights therefore no longer correspond to the correct x and y rows. In this example
    # the weights are also ascending, so the weighted mean of y comes out much higher than
    # the unweighted mean.
    print(sprintf("Unweighted Mean y: %0.2f", mean(y)))
    print(sprintf("Weighted Mean y: %0.2f", sum(y * wts) / sum(wts)))
    # browser()
    class(RetVal) <- "CaretMean"
    return(RetVal)
  },
  predict = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    # Row-wise mean of the predictors; as.matrix() handles both data frames and matrices
    unname(rowMeans(as.matrix(newdata)))
  },
  prob = NULL,
  tags = c("Simple"),
  label = "Mean"
)

# df is not defined in this snippet; it is expected to have columns x, y and w
models <- caretList(
  y ~ x,
  data = df,
  weights = df$w,
  trControl = trainControl(method = "cv", savePredictions = "final", allowParallel = FALSE),
  methodList = c("glm", "gbm", "svmRadialCost", "knn")
)
ensemble <- caretStack(
  models,
  method = CaretMean,
  weights = df$w,
  trControl = trainControl(method = "cv", savePredictions = "final", allowParallel = FALSE)
)
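The misalignment described above can also be reproduced in isolation, without caret. The following is a minimal base R sketch under stated assumptions: the data frame is a hypothetical stand-in (only the column names x, y and w and the ascending weights come from this report; the values, the seed and the weighted_mean helper are made up for illustration). It compares the weighted mean of y with the weights paired to their own rows against the weighted mean after y has been sorted while the weights stay in their original order.

```r
# Hypothetical stand-in data: 101 rows, weights ascending from 0 to 1 in steps of 0.01.
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- data.frame(x = rnorm(101), w = seq(0, 1, by = 0.01))
df$y <- 2 * df$x + rnorm(101)  # assumed relationship, for illustration only

# Helper (not part of caret): same weighted mean formula as in the fit function.
weighted_mean <- function(y, w) sum(y * w) / sum(w)

# Each weight paired with its own row.
aligned <- weighted_mean(df$y, df$w)

# y sorted ascending while the weights are left in their original order.
misaligned <- weighted_mean(sort(df$y), df$w)

print(sprintf("Unweighted mean y:         %0.2f", mean(df$y)))
print(sprintf("Weighted mean, aligned:    %0.2f", aligned))
print(sprintf("Weighted mean, misaligned: %0.2f", misaligned))
```

Because the sorted y values and the ascending weights increase together, the misaligned weighted mean cannot be lower than the aligned one, and it sits above the unweighted mean. This is the same distortion the print statements in the custom fit function report.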