zachmayer / caretEnsemble

caret models all the way down :turtle:

Fixing bug in positive class selection used by makePredObsMatrix #190

Closed: eric-czech closed this 8 years ago

eric-czech commented 8 years ago

This commit contains a fix for an error in makePredObsMatrix (line 241), where the following code was used to select the positive class for classification problems:

positive <- as.character(unique(modelLibrary$obs)[2]) #IMPROVE THIS!

The problem is that unique returns values in order of appearance, so a potentially different positive class was being selected for each resample (which seems like a rather large bug, if I'm not mistaken). For example:

unique(factor(c('positive', 'negative')))[2] # = 'negative'
unique(factor(c('negative', 'positive')))[2] # = 'positive'
# Given the above, the choice of positive class would be dependent on the order of the response data itself

I made a change to choose that class like this instead: positive <- levels(modelLibrary$obs)[2]
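
Since levels returns the level set in a fixed (alphabetical by default) order rather than order of appearance, the same class is selected no matter how the rows are ordered:

levels(factor(c('positive', 'negative')))[2] # = 'positive'
levels(factor(c('negative', 'positive')))[2] # = 'positive'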

From a broader perspective, it would be nice if caretEnsemble maintained consistency with caret, where the first level of a binary outcome is treated as the positive class by default. I see hard-coded selections of the second class's probability predictions in predict.{caretList, caretStack}, which might not be too hard to override with a more explicit selection of the positive class (or to change to the first class), but I'd be curious to hear your thoughts on that. Perhaps it could be some sort of global configuration option rather than a parameter that needs to be passed in multiple places? I dunno. I'm happy to help with any of it if you have suggestions.
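
To illustrate what a more explicit selection might look like (a hypothetical sketch with made-up variable names, not the actual package code):

probs <- predict(model, newdata = new_data, type = "prob") # one column per class level
positive <- levels(obs)[2] # pick the positive class explicitly
pred <- probs[, positive] # select that class's probability column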

zachmayer commented 8 years ago

From a broader perspective, it would be nice if caretEnsemble maintained consistency with caret, where the first level of a binary outcome is treated as the positive class by default.

Overall, this seems like a good idea. A few comments:

  1. Why don't you change predict.caretList and predict.caretStack to use the first level as the positive level?
  2. Please add some unit tests to ensure that makePredObsMatrix and the predict functions use the correct positive levels (see the sketch after this list).
  3. Please fix the lint that the lintr bot found.
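
For example, a minimal sketch of such a test (the real tests should exercise makePredObsMatrix and the predict methods directly):

library(testthat)
test_that("positive class selection is independent of row order", {
  obs1 <- factor(c("positive", "negative"))
  obs2 <- factor(c("negative", "positive"))
  # levels() is stable, so both orderings select the same positive class
  expect_identical(levels(obs1)[2], levels(obs2)[2])
  expect_identical(levels(obs2)[2], "positive")
})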

Other than that, this seems like a good idea. That todo has been on the list for a while...

eric-czech commented 8 years ago

Alright, sorry for the build errors; I didn't realize lintr wasn't running on my machine when I ran the unit tests (I had to install it first, whoops).

Anyways, the changes to caretList and caretEnsemble were pretty minor, and the only existing test they conflicted with was in test-ensemble.R, where an expected probability prediction is now for the other class (i.e., the new value is approximately 1 minus the old value).

Let me know if you see any issues with those changes or with the new unit test (I'd be happy to move it into an existing test if you don't think it makes sense on its own).

eric-czech commented 8 years ago

@zachmayer What would you think of making the above configurable with a global option, e.g. options(caret.ensemble.def.bin.level = 1) # or 2

Now that it's clear what needs to change to make the default one or the other, it would be easy to add that switch. But I don't want to add it unless you're OK with the changes so far and you don't know of a better way to handle global configuration settings.
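
Roughly, the option would just be read wherever the positive class is chosen, something like:

# Minimal sketch: fall back to the current behavior (level 2) when unset
target <- getOption("caret.ensemble.def.bin.level", default = 2L)
positive <- levels(modelLibrary$obs)[target]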

zachmayer commented 8 years ago

@eric-czech Sure, making it a global option works for me. Make the default the same one the caret package uses by default.

zachmayer commented 8 years ago

@eric-czech Actually, I'm happy with the PR as-is. No need to add a global option, but I'd like you to consider adding one more test.

eric-czech commented 8 years ago

@zachmayer No problem, I added the configuration option as well as some extra tests. I didn't see any reason to test problems with alphabetical ordering at first, but I'm glad you suggested it: I ultimately found that while the alphabetical ordering of the level names doesn't affect the results of caret or caretEnsemble models directly, it does change results indirectly through the stratified CV splits caret creates (they ARE dependent on the factor names, which is annoying).

In other words, if you let the models use the caret::createFolds function, they produce different results depending on what names you give the factor levels, because the stratification is based on the output of base::table, which orders the factor names alphabetically. Anyways, the results initially differed because of that (not by much), but by creating the CV splits manually I was able to make sure the factor names don't affect the results in the tests.
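
In the tests, pinning the folds looks roughly like this (a minimal sketch; the names are illustrative):

set.seed(42)
y <- factor(sample(c("yes", "no"), 100, replace = TRUE))
# Build the training indices once, so every relabeling of y trains on
# exactly the same rows regardless of the factor names
folds <- caret::createFolds(y, k = 5, returnTrain = TRUE)
ctrl <- caret::trainControl(method = "cv", index = folds)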

zachmayer commented 8 years ago

I like getBinaryLevel, thanks for writing that. What do you think about an exported function called setBinaryLevel that's a friendly way to set the option in caretEnsemble? This function could coerce the input to an integer (with a warning if it's not already an integer), and then check that it's equal to 1 or 2.
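
Something like this, maybe (just a sketch of the idea, using the option name from earlier in the thread):

setBinaryLevel <- function(level) {
  if (!is.integer(level)) {
    warning("coercing level to an integer")
    level <- as.integer(level)
  }
  stopifnot(level %in% c(1L, 2L)) # only binary levels 1 and 2 make sense
  options(caret.ensemble.def.bin.level = level)
  invisible(level)
}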

zachmayer commented 8 years ago

Once we have this merged in here, it might actually make sense to add this option and logic to the caret package itself. I really like the approach.

zachmayer commented 8 years ago

A couple more comments, but this is looking really good.

Do you know how to squash git commits? If so, you should squash this PR into one commit. If not, don't worry about it =D.

eric-czech commented 8 years ago

OK @zachmayer, I added a getter and setter for the target level and added integer coercion for the argument (along with a few other things, like docs and using the getter in the test cases to reset the target level).

The stratified sampling issue is up there too, so we'll see if anyone has suggestions on it.

I also gave rebasing a try but kind of shot myself in the foot by changing a lot of the same things throughout those commits (err, I also kind of screwed it up by repeating some commits... my git-fu is weak). Anyways, I have another feature enhancement I was going to submit (for using custom models), and assuming that turns into multiple commits as well, I'll be careful to rebase them more frequently. My bad.

zachmayer commented 8 years ago

No worries! This PR is fine. In the future, you can "squash" multiple commits into a single commit using an "interactive rebase": http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html

zachmayer commented 8 years ago

Also note that after rebasing, you ALWAYS have to git push -f to replace your branch on GitHub with the rewritten one.
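
For example (the commit count is illustrative):

git rebase -i HEAD~3 # mark all but the first commit as "squash" in the editor
git push --force # replace the remote branch with the rewritten history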

eric-czech commented 8 years ago

Cool, I think that's where I went wrong... I couldn't figure out how to recover from not having force-pushed, and doing another rebase resulted in tons of conflicts.

zachmayer commented 8 years ago

@eric-czech I fixed the rebase on my branch, and then force pushed to my own master.

Unfortunately, this kinda messed up your fork of my repo. You'll have to replace your master branch with mine: http://scribu.net/blog/resetting-your-github-fork.html

git remote add upstream git@github.com:zachmayer/caretEnsemble.git # point at the original repo
git fetch upstream
git branch backup # keep a backup of your current master, just in case
git checkout upstream/master -B master # reset local master to upstream's master
git push --force # overwrite your fork's master on GitHub

eric-czech commented 8 years ago

Thanks! Looks much better now and that reset on my fork seems to have worked well.

zachmayer commented 8 years ago

:+1: