Results from toying with the texas dataset:
`honesty` seems to be the key parameter that moves around the coefficient estimates in the texas results. When I compare estimates from tuning all parameters (`sample.fraction`, `mtry`, `min.node.size`, `honesty.fraction`, `honesty.prune.leaves`, `alpha`, `imbalance.penalty`) against estimates from tuning just the `honesty.fraction` and `honesty.prune.leaves` parameters, they are not much different. Using the `tune_regression_forest` function, I find that the best `honesty.fraction` is consistently 0.7.
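For reference, a minimal sketch of that comparison, where `X` and `Y` stand in for the texas covariate matrix and outcome (not shown here) and the argument names assume a recent grf (>= 1.0):

```r
library(grf)

# Tune every parameter grf exposes
rf_all <- regression_forest(
  X, Y,
  tune.parameters = c("sample.fraction", "mtry", "min.node.size",
                      "honesty.fraction", "honesty.prune.leaves",
                      "alpha", "imbalance.penalty")
)

# Tune only the honesty-related parameters
rf_honesty <- regression_forest(
  X, Y,
  tune.parameters = c("honesty.fraction", "honesty.prune.leaves")
)

# Compare the out-of-bag predictions from the two tuning strategies
summary(predict(rf_all)$predictions - predict(rf_honesty)$predictions)
```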
I also timed `dml` runs with just `dml_n = 5`. Keeping `honesty = TRUE` (or `honesty.fraction = 0.7`) cuts the runtime down a bit compared to `honesty = FALSE`. So in the texas case specifically, it seems we should set `honesty.fraction` to 0.7 and keep `tune.parameters` off to speed up the function.
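A rough timing sketch of that comparison - the `dml`/`dml_n` wrapper itself isn't shown, just the underlying grf fits under the three honesty settings (again with placeholder `X`, `Y`):

```r
library(grf)

# Fit the same forest under each honesty setting and record elapsed seconds
time_fit <- function(...) {
  system.time(regression_forest(X, Y, ...))[["elapsed"]]
}

c(
  honesty_off     = time_fit(honesty = FALSE),
  honesty_default = time_fit(honesty = TRUE),  # default honesty.fraction = 0.5
  honesty_0.7     = time_fit(honesty = TRUE, honesty.fraction = 0.7)
)
```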
For users in general, the best approach is probably: if you have a small dataset, figure out the best `honesty.fraction` by running `tune_regression_forest`, and perhaps keep `tune.parameters` turned off if speed is a concern.
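In other words, something like the following workflow (the `tunable.params` element name is my assumption about how grf stores the tuned values; `X`, `Y` are placeholders):

```r
library(grf)

# Tune the honesty parameters once on the small dataset
tuned <- regression_forest(
  X, Y,
  tune.parameters = c("honesty.fraction", "honesty.prune.leaves")
)
tuned$tunable.params  # inspect the tuned values (element name assumed)

# Then refit with the chosen honesty.fraction and tuning switched off for speed
final <- regression_forest(
  X, Y,
  honesty.fraction = 0.7,       # value the tuning step settled on for texas
  tune.parameters  = "none"
)
```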
Reading through what the `grf` people have to say about the `honesty` parameter in small datasets, the trade-off we're making is that `honesty` should lead to less biased estimates, but with small datasets, further splitting the data means there might not be enough information for the function to even determine what good splits are. But switching `honesty` on or off causes big swings in both the size and the point estimates of the coefficients - can the `tune.parameters` argument fix these swings and show us the "right" way to do things?
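One way to probe that would be to refit under the three settings and see how much the fits move before rerunning `dml` - a sketch, with `X`, `Y` as placeholders:

```r
library(grf)

# Refit under honesty off / honesty at 0.7 / fully tuned
fits <- list(
  honesty_off = regression_forest(X, Y, honesty = FALSE),
  honesty_0.7 = regression_forest(X, Y, honesty = TRUE, honesty.fraction = 0.7),
  tuned_all   = regression_forest(X, Y, tune.parameters = "all")
)

# Quick check: out-of-bag RMSE under each setting, before plugging each
# forest back into the dml step to see how the coefficients respond
sapply(fits, function(f) sqrt(mean((Y - predict(f)$predictions)^2)))
```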