Closed nguyentr17 closed 6 years ago
The literature *What Are We Weighting For?* points out that we don't necessarily need to use weights if we are not producing summary tables.
I ran weighted and unweighted models ("rpart" and "random forest") without tuning parameters; the results are more or less the same.
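For reference, a minimal sketch of fitting the weighted and unweighted variants. The data frame `train`, outcome column `outcome`, and weight column `totwgt` are placeholder names (not from the original post), and I use `ranger` for the weighted forest since `randomForest()` itself has no per-case weight argument:

```r
library(rpart)
library(ranger)

# hypothetical data/column names: train, outcome, totwgt
fit_unw <- rpart(outcome ~ ., data = train, method = "class")
fit_w   <- rpart(outcome ~ ., data = train, method = "class",
                 weights = train$totwgt)

rf_unw <- ranger(outcome ~ ., data = train, importance = "impurity")
rf_w   <- ranger(outcome ~ ., data = train, importance = "impurity",
                 case.weights = train$totwgt)
```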
Just for the record:

rpart - unweighted
```
Confusion Matrix and Statistics

test_pred   0   1
        0  64  34
        1  87 211

               Accuracy : 0.6944
                 95% CI : (0.6465, 0.7395)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 0.0009901

                  Kappa : 0.3056
 Mcnemar's Test P-Value : 2.276e-06

            Sensitivity : 0.4238
            Specificity : 0.8612
         Pos Pred Value : 0.6531
         Neg Pred Value : 0.7081
             Prevalence : 0.3813
         Detection Rate : 0.1616
   Detection Prevalence : 0.2475
      Balanced Accuracy : 0.6425

       'Positive' Class : 0
```
rpart - weighted
```
Confusion Matrix and Statistics

test_pred   0   1
        0  64  31
        1  87 214

               Accuracy : 0.702
                 95% CI : (0.6543, 0.7467)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 0.0003197

                  Kappa : 0.3201
 Mcnemar's Test P-Value : 4.124e-07

            Sensitivity : 0.4238
            Specificity : 0.8735
         Pos Pred Value : 0.6737
         Neg Pred Value : 0.7110
             Prevalence : 0.3813
         Detection Rate : 0.1616
   Detection Prevalence : 0.2399
      Balanced Accuracy : 0.6487

       'Positive' Class : 0
```
random forest - unweighted
```
Confusion Matrix and Statistics

test_pred   0   1
        0  90  43
        1  61 202

               Accuracy : 0.7374
                 95% CI : (0.6911, 0.7801)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 4.056e-07

                  Kappa : 0.4304
 Mcnemar's Test P-Value : 0.09552

            Sensitivity : 0.5960
            Specificity : 0.8245
         Pos Pred Value : 0.6767
         Neg Pred Value : 0.7681
             Prevalence : 0.3813
         Detection Rate : 0.2273
   Detection Prevalence : 0.3359
      Balanced Accuracy : 0.7103

       'Positive' Class : 0
```
random forest - weighted
```
Confusion Matrix and Statistics

test_pred_weight   0   1
               0  90  41
               1  61 204

               Accuracy : 0.7424
                 95% CI : (0.6964, 0.7848)
    No Information Rate : 0.6187
    P-Value [Acc > NIR] : 1.285e-07

                  Kappa : 0.4399
 Mcnemar's Test P-Value : 0.05993

            Sensitivity : 0.5960
            Specificity : 0.8327
         Pos Pred Value : 0.6870
         Neg Pred Value : 0.7698
             Prevalence : 0.3813
         Detection Rate : 0.2273
   Detection Prevalence : 0.3308
      Balanced Accuracy : 0.7143

       'Positive' Class : 0
```
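Output in this shape typically comes from caret's `confusionMatrix()`. A minimal sketch with made-up prediction/label vectors (the actual test set is not shown in this thread), using `positive = "0"` to match the `'Positive' Class : 0` line above:

```r
library(caret)

# hypothetical predictions and reference labels
test_pred  <- factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1))
test_truth <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))

confusionMatrix(test_pred, test_truth, positive = "0")
```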
Also, I compared the top 20 important variables from the two random forest models; they are more or less the same:
```
> intersect(top20_varName, top20_varName_weight)
 [1] "asbr07d" "asbr07e" "asbr07f" "asbg04"  "asbr07a" "atbr04"  "asbr07c" "atbr03a" "atbg08a" "asbr07b" "atbg07f"
[12] "asbr06c" "asbg09c" "asbr02c" "atbg01"  "asbg11c" "asbg11d" "asbg07a" "asbg07b"
> setdiff(top20_varName, top20_varName_weight)
[1] "asbr04"
```
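One way such top-20 name vectors can be built from two fitted forests. `rf` and `rf_weight` are hypothetical `randomForest` objects standing in for the unweighted and weighted fits; `importance()` here returns a matrix whose rows are variable names:

```r
library(randomForest)

# hypothetical fitted models: rf (unweighted), rf_weight (weighted)
imp   <- importance(rf)
imp_w <- importance(rf_weight)

# rank variables by the first importance column and keep the top 20 names
top20_varName        <- rownames(imp)[order(imp[, 1],   decreasing = TRUE)][1:20]
top20_varName_weight <- rownames(imp_w)[order(imp_w[, 1], decreasing = TRUE)][1:20]

intersect(top20_varName, top20_varName_weight)
setdiff(top20_varName, top20_varName_weight)
```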
Additional questions @nguyentr17
Additionally, just a note: if we do use weights, use tchwgt instead of totwgt.

Decided: do not use weights.