zachmayer / caretEnsemble

caret models all the way down :turtle:
http://zachmayer.github.io/caretEnsemble/

added foreach to predict.caretEnsemble #153

Closed: terrytangyuan closed this 8 years ago

terrytangyuan commented 9 years ago

I've done some tests, but feel free to do more. If this is OK, we can apply foreach in other places too.

zachmayer commented 9 years ago

Tests on Travis are failing.

To fix them:

  1. Run devtools::use_package('foreach') to add foreach to the package imports.
  2. Add @importFrom foreach foreach and @importFrom foreach %dopar% to the roxygen header for predict.caretEnsemble (see the sketch after this list).
  3. Add unit tests to tests/testthat/ that make sure your new functionality works and that the same output is produced when running in parallel as when running in serial.
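For illustration, here is a minimal sketch of what step 2 might look like in predict.caretEnsemble's roxygen header. The @importFrom tags follow the instructions above; the function body and object names are hypothetical, not the actual PR diff:

#' @export
#' @importFrom foreach foreach
#' @importFrom foreach %dopar%
predict.caretEnsemble <- function(object, newdata = NULL, ...) {
  # hypothetical sketch: predict with each component model in parallel,
  # then cbind the resulting prediction vectors into a matrix
  preds <- foreach(model = object$models, .combine = cbind) %dopar% {
    predict(model, newdata = newdata)
  }
  # ... then weight and combine the component predictions as before ...
  preds
}
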
terrytangyuan commented 9 years ago

Should I replace those removed lines with devtools::use_package('foreach')?

zachmayer commented 9 years ago

Nope-- That just needs to be run once

zachmayer commented 9 years ago

It's a helper function that adds foreach to the dependencies/imports for the caretEnsemble package.

zachmayer commented 9 years ago

Once you've made the changes, you can run devtools::test() to run the unit tests. If the unit tests pass, run devtools::check() to make sure R CMD check passes.
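Spelled out as a snippet, the local workflow is just the following (devtools::document() is an extra step I'm assuming here because the package uses roxygen headers; it isn't mentioned above):

devtools::document()  # regenerate NAMESPACE and man/ from the roxygen tags (assumed step)
devtools::test()      # run the testthat unit tests
devtools::check()     # run R CMD check once the tests pass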

terrytangyuan commented 9 years ago

Thanks Zach. It looks like one of them failed. I saw the following during devtools::test():

Code is high quality and lint free : S
Test optimizer passing to caretEnsemble correctly : ...y
Test more difficult cases : zA
RMSE : 
Error in setkeyv(value, key) : 
  4 arguments passed to .Internal(nchar) which requires 3

and the following during devtools::check():

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Quitting from lines 107-109 (caretEnsemble-intro.Rmd) 
Error: processing vignette 'caretEnsemble-intro.Rmd' failed with diagnostics:
4 arguments passed to .Internal(nchar) which requires 3
Execution halted
Error: Command failed (1)

Did I make the change correctly? The errors seem very similar to the ones I had earlier (see my earlier comment regarding the errors from predict.caretEnsemble(), link: https://github.com/zachmayer/caretEnsemble/issues/60#issuecomment-124237220).

zachmayer commented 9 years ago

So that error is a weird data.table error and isn't your fault. It has to do with how data.table is loaded in a package vs. interactively.

Try library(data.table), then devtools::load_all() and devtools::test().
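As a snippet, that suggested workaround is:

library(data.table)   # attach data.table so its methods are found when testing interactively
devtools::load_all()  # reload the package code from source
devtools::test()      # re-run the testthat suite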

If that fails, push the commit back out to github and Travis will run the tests and I'll help you debug.

terrytangyuan commented 9 years ago

Thanks. It still didn't pass. In the Travis log (https://travis-ci.org/zachmayer/caretEnsemble/jobs/72385390), line 3505, it looks like the problem is caused by the foreach package. I've pushed the commit to my GitHub so you can jump in to help. Much appreciated.

zachmayer commented 9 years ago

You've got an extraneous devtools call in your code. Remove it and I think you'll be good.

Generally, you don't call devtools directly inside packages. It's mostly a collection of "helper" functions to make your life easier when developing packages. It's there to automate things that used to be really tedious, e.g. adding a new package as a dependency.

terrytangyuan commented 9 years ago

I see, but I still got the same error. Here's what I did: I removed that line from the code, set the working dir to /caretEnsemble, and then ran:

devtools::load_all(".")
devtools::test()
devtools::check()

Anything else I did wrong?

zachmayer commented 9 years ago

Push the commit up to Travis (with the devtools::use_package('foreach') line removed) and let me see what happens.

terrytangyuan commented 9 years ago

Looks like it still failed for the same reason. I found this post: http://stackoverflow.com/questions/13085481/namespace-dependencies-not-required. I don't quite get it, but do we need to change anything in the DESCRIPTION file?

zachmayer commented 9 years ago

Yup, you need to add foreach to the Imports section of the DESCRIPTION file. The easiest way to do this is by running devtools::use_package('foreach') (just once).

This SHOULD update the DESCRIPTION file. If it doesn't, manually update it and add foreach to the Imports section.
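For reference, the relevant part of DESCRIPTION would then look roughly like this (the other entries shown are placeholders for whatever caretEnsemble already imports, not an exact copy of the file):

Imports:
    caret,
    data.table,
    foreach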

zachmayer commented 9 years ago

@jknowles You should also review this PR, as it touches a bunch of code you wrote. (Though if tests pass, it should be fine).

terrytangyuan commented 9 years ago

I added that to the DESCRIPTION file but it still failed. I also tried devtools::use_package('foreach') and then ran the tests. Would you be able to clone my forked repo and try it on your end if you have some time? Maybe my local environment is messed up.

zachmayer commented 9 years ago

Check in your commit with the edit to the DESCRIPTION file, and we'll let Travis test it.

Also try devtools::install_github('hadley/devtools') to get the latest devtools version.

zachmayer commented 9 years ago

btw— I just broke tests on master (whoops), so this build may fail. I'll read through the log and let you know if there's anything you need to fix.

terrytangyuan commented 9 years ago

Ok thanks a lot!

zachmayer commented 9 years ago

Ok, I fixed tests and they're re-running now.

I have one final request for you: Make a file in tests/testthat called test-parallel.R. Inside that file, put the following code:

context("Parallelization works")
test_that("predict.caretEnsemble works in parallel", {
  skip_on_cran()
  X.reg <- model.matrix(~ ., iris[, -1])
  X.reg.big <- do.call(rbind, lapply(1:100, function(x) X.reg))
  Y.reg <- iris[, 1]
  ens.reg <- caretEnsemble(caretList(X.reg, Y.reg, methodList=c("lm", "glm")))

  #Basic
  pred.reg <- predict(ens.reg, newdata = X.reg)
  pred.reg2 <- predict(ens.reg, newdata = X.reg.big)
  expect_equal(pred.reg, pred.reg2[1:length(pred.reg)])

  #Don't keep NAs
  pred.reg <- predict(ens.reg, newdata = X.reg, keepNA = FALSE)
  pred.reg2 <- predict(ens.reg, newdata = X.reg.big, keepNA = FALSE)
  expect_equal(pred.reg, pred.reg2[1:length(pred.reg)])

  #Return se
  pred.reg <- predict(ens.reg, newdata = X.reg, se = TRUE)
  pred.reg2 <- predict(ens.reg, newdata = X.reg.big, se = TRUE)
  expect_equal(pred.reg, pred.reg2[1:nrow(pred.reg), ])

  #Return weights
  pred.reg <- predict(ens.reg, newdata = X.reg, se = TRUE, return_weights = TRUE)
  pred.reg2 <- predict(
    ens.reg, newdata = X.reg.big, se = TRUE, return_weights = TRUE
    )
  expect_equal(pred.reg$preds, pred.reg2$preds[1:nrow(pred.reg$preds),])

  #Don't keep NAs, return se
  pred.reg <- predict(ens.reg, newdata = X.reg, keepNA = FALSE, se = TRUE)
  pred.reg2 <- predict(ens.reg, newdata = X.reg.big, keepNA = FALSE, se = TRUE)
  expect_equal(pred.reg, pred.reg2[1:nrow(pred.reg), ])

  #Don't keep NAs, return se, return weights
  pred.reg <- predict(
    ens.reg, newdata = X.reg, keepNA = FALSE, se = TRUE, return_weights = TRUE)
  pred.reg2 <- predict(
    ens.reg, newdata = X.reg.big, keepNA = FALSE, se = TRUE,
    return_weights = TRUE)
  expect_equal(pred.reg$preds, pred.reg2$preds[1:nrow(pred.reg$preds),])
})

In the future, you should write tests yourself, but since this is your first contribution, I wrote them for you.
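One practical note on that test: foreach's %dopar% only runs in parallel if a backend has been registered; with no backend it falls back to sequential execution and emits a warning. A minimal sketch for exercising the parallel path, assuming the doParallel package is installed locally (it is not a dependency added by this PR):

library(doParallel)                  # assumption: doParallel is available locally
cl <- makeCluster(2)                 # start two worker processes
registerDoParallel(cl)               # register them as the foreach backend
devtools::test(filter = "parallel")  # run only test-parallel.R
stopCluster(cl)                      # shut the workers down afterwards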

zachmayer commented 9 years ago

nvm, I put that test on master. For your next PR, write some unit tests for your new code. In this case I did it for you. =D

terrytangyuan commented 9 years ago

Great! Thanks for the unit tests :-). I'll make sure to do that next time. So besides moving those comments, is there anything else I should do to get this pull request closed?

Also, what changes did you make to fix the tests?

zachmayer commented 9 years ago

I flagged two minor nitpicks that were causing the lintr tests to fail. If you fix those, we should be good to go.

terrytangyuan commented 9 years ago

Great. I've pushed new code. Let me know if you need anything else.

terrytangyuan commented 9 years ago

I don't know why it still failed. Can you merge this PR?

zachmayer commented 9 years ago

I think the build failed. I'll re-run it.

terrytangyuan commented 9 years ago

From this build log (https://travis-ci.org/zachmayer/caretEnsemble/builds/72499945), it looks like the build will succeed when NOT_CRAN='false'. Am I correct? Have you had a chance to re-run it yet?

jknowles commented 9 years ago

I think the only issue remaining is a bit of formatting in the testthat.R file. This looks good, but we probably need to include some documentation changes so the user knows how to specify the number of cores, and that this function will try to go parallel once a certain size is reached.
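For that documentation point: with foreach, the number of cores is whatever the user registers as a backend before calling predict, so the docs could show something like the sketch below (assuming doParallel, which this PR does not itself require; ens.reg and X.reg.big are the objects from the test file quoted earlier):

library(doParallel)
registerDoParallel(cores = 2)                   # the user picks the number of workers here
preds <- predict(ens.reg, newdata = X.reg.big)  # predict.caretEnsemble's %dopar% loop then uses them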

I am realizing the wtd.sd stuff needs some love too, since that was a bit of a lazy calculation on my part. There is some promising development in caret for prediction intervals. I'm waiting on that.

terrytangyuan commented 9 years ago

If this code is OK, could you merge it first? Right now no parallelism is supported, but we can certainly add it in the future. The parallelization is easy to achieve and I can help with that (I'll submit a new pull request once this is closed). The only remaining issue, I guess, is getting Travis CI to pass.

zachmayer commented 9 years ago

Hi all-- I'm on vacation for the rest of the week and can't help out with this.

The lingering Travis error when not_cran = TRUE is bothering me.  I'll sort it out next week and merge the PR.

terrytangyuan commented 9 years ago

Thanks Zach. No rush. Please enjoy your vacation!

Best, Yuan

terrytangyuan commented 9 years ago

Is this one still active?

zachmayer commented 9 years ago

Yes, I still need to dig into why the build is stalling.

terrytangyuan commented 9 years ago

@zachmayer Okay, no rush. I am just cleaning up some pending PRs on my GitHub.

jknowles commented 9 years ago

@terrytangyuan @zachmayer I'm afraid that the changes I made to predict.caretEnsemble in PR #167 may impact this PR. I think we'll want to decide which to pull in first, and the other can rebase against that. It should still be possible to perform the same foreach calculations with the modifications I made, but the way the output is determined by user-specified parameters has changed, and so have the unit tests.

jknowles commented 8 years ago

@terrytangyuan Can you rebase onto master and then resolve any conflicts? I think this will then pass checks, and the prediction method will be fast and robust to some of the bugs we resolved with #173 and #167.

terrytangyuan commented 8 years ago

@jknowles Thanks. But I still couldn't make it pass :-(

jknowles commented 8 years ago

@terrytangyuan

That's not your fault -- I dramatically changed the weighted standard deviation function. It now takes different arguments:

#' @title Calculate a weighted standard deviation
#' @description Used to weight deviations among ensembled model predictions
#'
#' @param x a vector of numerics
#' @param w a vector of weights equal to length of x
#' @param na.rm a logical indicating how to handle missing values, default = FALSE
wtd.sd <- function(x, w = NULL, na.rm = FALSE) {
  if (na.rm) {
    w <- w[i <- !is.na(x)]
    x <- x[i]
  }
  n <- length(w)
  xWbar <- weighted.mean(x, w, na.rm = na.rm)
  wbar <- mean(w)
  out <- n / ((n - 1) * sum(w)^2) *
    (sum((w * x - wbar * xWbar)^2) -
       2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar)) +
       xWbar^2 * sum((w - wbar)^2))
  return(out)
}

So instead of apply(preds, 1, FUN = wtd.sd, weights = object$weights, normwt = TRUE)

you need something like

apply(preds, 1, FUN = wtd.sd, w = object$weights)

That should fix the error you are seeing.
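For reference, a quick toy call with the new signature (the numbers are made up, purely illustrative):

x <- c(1.2, 0.9, 1.1)  # predictions for one observation from three ensembled models
w <- c(0.5, 0.3, 0.2)  # the corresponding model weights
wtd.sd(x, w = w)       # weighted spread of those predictions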

terrytangyuan commented 8 years ago

OK, thanks. I'll try it when I get home.

terrytangyuan commented 8 years ago

FYI: if this function is used externally, you might want to let users know that the old interface is deprecated. :-)

jknowles commented 8 years ago

@terrytangyuan Thanks. It shouldn't be used externally because we never included it as a namespace export, so users would only be able to reach it via the caretEnsemble:::wtd.sd() construction, which isn't best practice. I did this intentionally because this little function is just not something I'm 100% sure of, and in the documentation we caution users against putting too much stock in our "standard errors". It's something that more work needs to be done on, most likely in the caret package itself.

zachmayer commented 8 years ago

In general, if a function is not explicitly exported in the NAMESPACE, there are no promises that it will work the same or stay around from version to version of the package.

terrytangyuan commented 8 years ago

Got it. Thanks guys. I've tested it locally so it should pass this time.

terrytangyuan commented 8 years ago

@zachmayer @jknowles I double-checked that the tests ran successfully locally. Any ideas?

zachmayer commented 8 years ago

Test errors aren't super informative:

Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  16. Error: We can ensemble models of different predictors 
  17. Error: Checks generate errors 
  18. Error: Test that optFUN does not take random values 
  19. Error: safe and greedy optimizers get same result in the limit 
  20. Error: Warnings and fallbacks in degenerate cases 
  21. Error: Missing values are ignored in optimization 
  22. Error: Test for NA handling - regression 
  23. Error: predict.caretEnsemble works in parallel 
  ...

Are there perhaps any packages your tests use that aren't in the DESCRIPTION file?

zachmayer commented 8 years ago

Looking into the logs:

1. Error: We can fit models with a mix of methodList and tuneList --------------
x$control$savePredictions is not TRUE
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_is(caretEnsemble(test), "caretEnsemble") at test-caretList.R:153
5: expect_that(object, is_a(class), info, label)
6: condition(object)
7: paste0(class(x), collapse = ", ")
8: caretEnsemble(test)
9: makePredObsMatrix(all.models)
10: extractBestPreds(list_of_models)
11: lapply(list_of_models, bestPreds)
12: FUN(X[[i]], ...)
13: stopifnot(x$control$savePredictions)
14: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), ch), call. = FALSE, 
       domain = NA)

Make sure you have savePredictions=TRUE in all your tests.
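For anyone following along, savePredictions is an argument to caret's trainControl; a minimal sketch of the kind of setup the tests rely on (illustrative only, not the repo's actual test helpers):

library(caret)
library(caretEnsemble)
ctrl <- trainControl(method = "cv", number = 5, savePredictions = TRUE)  # keep the out-of-fold predictions
models <- caretList(Sepal.Length ~ ., data = iris, trControl = ctrl, methodList = c("lm", "glm"))
ens <- caretEnsemble(models)  # caretEnsemble needs each model's saved predictions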

zachmayer commented 8 years ago

Yeah, there are 23 testthat errors:

testthat results ================================================================
OK: 599 SKIPPED: 2 FAILED: 23
1. Error: We can fit models with a mix of methodList and tuneList 
2. Error: CV works with method=boot 
3. Error: CV works with method=adaptive_boot 
4. Error: CV works with method=cv 
5. Error: CV works with method=repeatedcv 
6. Error: CV works with method=adaptive_cv 
7. Error: CV works with method=LGOCV 
8. Error: CV works with method=adaptive_LGOCV 
9. Error: Classification models 
10. Error: Longer tests for Classification models 
11. Error: Test that caretList preserves user specified error functions 
12. Error: Users can pass a custom tuneList 
13. Error: User tuneTest parameters are respected and model is ensembled 
14. Error: User tuneTest parameters are respected and model is ensembled 
15. Error: Regression Models 
16. Error: We can ensemble models of different predictors 
17. Error: Checks generate errors 
18. Error: Test that optFUN does not take random values 
19. Error: safe and greedy optimizers get same result in the limit 
20. Error: Warnings and fallbacks in degenerate cases 
21. Error: Missing values are ignored in optimization 
22. Error: Test for NA handling - regression 
23. Error: predict.caretEnsemble works in parallel 
...
Error: testthat unit tests failed
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

You might need to rebase your branch off master (or merge master into your branch) and then re-run the local tests. A rebase is preferable if you know how to do it.

terrytangyuan commented 8 years ago

There's only one foreach entry, which is added to the Imports section in DESCRIPTION.

Re: savePredictions=TRUE, I never changed those things. All tests should remain untouched from the most recent commit on master. How come the tests pass locally, though?

zachmayer commented 8 years ago

@terrytangyuan Tests would pass locally because we've added commits to master since you made your branch. If you merge master into your branch or rebase your branch off master (a rebase is better), you should be able to reproduce the errors locally.

zachmayer commented 8 years ago

Hmmm that actually might not be it.

terrytangyuan commented 8 years ago

Yeah, that's not it, since I've already pulled the changes and fixed the conflicts.