Closed nguyentr17 closed 6 years ago
The literature *What Are We Weighting For?* points out that we don't necessarily need to use weights if we are not producing summary tables.
I ran weighted and unweighted models ("rpart" and "random forest") without tuning parameters; the results are more or less the same.
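For reference, a minimal sketch of fitting the weighted and unweighted variants. The data frame `train`, outcome column `outcome`, and weight column `totwgt` are placeholder names (not from the original post), and I use `ranger` for the weighted forest since `randomForest()` itself has no per-case weight argument:

```r
library(rpart)
library(ranger)

# hypothetical data/column names: train, outcome, totwgt
fit_unw <- rpart(outcome ~ ., data = train, method = "class")
fit_w   <- rpart(outcome ~ ., data = train, method = "class",
                 weights = train$totwgt)

rf_unw <- ranger(outcome ~ ., data = train, importance = "impurity")
rf_w   <- ranger(outcome ~ ., data = train, importance = "impurity",
                 case.weights = train$totwgt)
```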
Just for the record:

rpart - unweighted
```
Confusion Matrix and Statistics

test_pred   0   1
        0  64  34
        1  87 211

               Accuracy : 0.6944
                 95% CI : (0.6465, 0.7395)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 0.0009901

                  Kappa : 0.3056
 Mcnemar's Test P-Value : 2.276e-06

            Sensitivity : 0.4238
            Specificity : 0.8612
         Pos Pred Value : 0.6531
         Neg Pred Value : 0.7081
             Prevalence : 0.3813
         Detection Rate : 0.1616
   Detection Prevalence : 0.2475
      Balanced Accuracy : 0.6425

       'Positive' Class : 0
```
rpart - weighted
```
Confusion Matrix and Statistics

test_pred   0   1
        0  64  31
        1  87 214

               Accuracy : 0.702
                 95% CI : (0.6543, 0.7467)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 0.0003197

                  Kappa : 0.3201
 Mcnemar's Test P-Value : 4.124e-07

            Sensitivity : 0.4238
            Specificity : 0.8735
         Pos Pred Value : 0.6737
         Neg Pred Value : 0.7110
             Prevalence : 0.3813
         Detection Rate : 0.1616
   Detection Prevalence : 0.2399
      Balanced Accuracy : 0.6487

       'Positive' Class : 0
```
random forest - unweighted
```
Confusion Matrix and Statistics

test_pred   0   1
        0  90  43
        1  61 202

               Accuracy : 0.7374
                 95% CI : (0.6911, 0.7801)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 4.056e-07

                  Kappa : 0.4304
 Mcnemar's Test P-Value : 0.09552

            Sensitivity : 0.5960
            Specificity : 0.8245
         Pos Pred Value : 0.6767
         Neg Pred Value : 0.7681
             Prevalence : 0.3813
         Detection Rate : 0.2273
   Detection Prevalence : 0.3359
      Balanced Accuracy : 0.7103

       'Positive' Class : 0
```
random forest - weighted
```
Confusion Matrix and Statistics

test_pred_weight   0   1
               0  90  41
               1  61 204

               Accuracy : 0.7424
                 95% CI : (0.6964, 0.7848)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 1.285e-07

                  Kappa : 0.4399
 Mcnemar's Test P-Value : 0.05993

            Sensitivity : 0.5960
            Specificity : 0.8327
         Pos Pred Value : 0.6870
         Neg Pred Value : 0.7698
             Prevalence : 0.3813
         Detection Rate : 0.2273
   Detection Prevalence : 0.3308
      Balanced Accuracy : 0.7143

       'Positive' Class : 0
```
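Output in this shape typically comes from caret's `confusionMatrix()`. A minimal sketch with made-up prediction/label vectors (the actual test set is not shown in this thread), using `positive = "0"` to match the `'Positive' Class : 0` line above:

```r
library(caret)

# hypothetical predictions and reference labels
test_pred  <- factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1))
test_truth <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))

confusionMatrix(test_pred, test_truth, positive = "0")
```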
Also, I compared the top 20 important variables from the two random forest models; they are more or less the same:
```
> intersect(top20_varName, top20_varName_weight)
 [1] "asbr07d" "asbr07e" "asbr07f" "asbg04"  "asbr07a" "atbr04"  "asbr07c" "atbr03a" "atbg08a" "asbr07b" "atbg07f"
[12] "asbr06c" "asbg09c" "asbr02c" "atbg01"  "asbg11c" "asbg11d" "asbg07a" "asbg07b"
> setdiff(top20_varName, top20_varName_weight)
[1] "asbr04"
```
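One way such top-20 name vectors can be built from two fitted forests. `rf` and `rf_weight` are hypothetical `randomForest` objects standing in for the unweighted and weighted fits; `importance()` here returns a matrix whose rows are variable names:

```r
library(randomForest)

# hypothetical fitted models: rf (unweighted), rf_weight (weighted)
imp   <- importance(rf)
imp_w <- importance(rf_weight)

# rank variables by the first importance column and keep the top 20 names
top20_varName        <- rownames(imp)[order(imp[, 1],   decreasing = TRUE)][1:20]
top20_varName_weight <- rownames(imp_w)[order(imp_w[, 1], decreasing = TRUE)][1:20]

intersect(top20_varName, top20_varName_weight)
setdiff(top20_varName, top20_varName_weight)
```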
Additional questions @nguyentr17
Additionally, just a note: if we do use weights, use tchwgt instead of totwgt.

Decided: do not use weights.