yixinsun1216 / crossfit

Implementation of Double/Debiased Machine Learning approach

Extending dml to use lasso #2

Closed yixinsun1216 closed 3 years ago

yixinsun1216 commented 4 years ago

After creating the estimation dataset in dml_step(), we want to be able to choose either a regression forest or the lasso as our ML method for calculating gamma and delta. Instead of running regression_forest2 and predict_rf2 directly in dml_step(), all of the ML methods are now housed in the ml_functions script. dml_step() calls run_ml(), which dispatches to the chosen method for estimation and prediction.
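A rough sketch of what the run_ml() dispatcher could look like (the argument names and internals here are guesses; crossfit's actual implementation may differ):

```r
# Hypothetical sketch of run_ml() in ml_functions: fit on the training
# fold, predict on the held-out fold, for each supported ml method.
run_ml <- function(ml, formula, train, test, ...) {
  switch(ml,
    "regression_forest" = {
      # regression forest via grf; drop the intercept column from the
      # model matrix before fitting
      fit <- grf::regression_forest(
        X = model.matrix(formula, train)[, -1],
        Y = model.response(model.frame(formula, train))
      )
      predict(fit, model.matrix(formula, test)[, -1])$predictions
    },
    "lasso" = {
      # cross-validated lasso via glmnetUtils' formula interface
      fit <- glmnetUtils::cv.glmnet(formula, data = train, ...)
      as.numeric(predict(fit, newdata = test, s = "lambda.min"))
    },
    stop("unknown ml method: ", ml)
  )
}
```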

Currently, lasso is implemented using the formula interface to cv.glmnet() from the glmnetUtils package. Two additional arguments must be passed to the master dml() function:

  1. ml, specifying the machine learning method we want to use
  2. poly_degree, specifying the polynomial degree to which the original formula is expanded. The lasso regression then estimates coefficients on this expanded formula.
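As a rough illustration of what a poly_degree = 2 expansion fed into cv.glmnet() might look like (variable names are borrowed from the bonus regression later in this thread, and the actual expansion code in crossfit may differ):

```r
# Illustrative sketch only: a poly_degree = 2 expansion of the numeric
# controls, passed to glmnetUtils' formula interface for cv.glmnet().
# reg_data is assumed to exist as defined elsewhere in this thread.
library(glmnetUtils)

f <- BonusPerAcre ~ Auction + poly(Acres, 2, raw = TRUE) +
  poly(Term, 2, raw = TRUE) + poly(RoyaltyRate, 2, raw = TRUE)

fit  <- cv.glmnet(f, data = reg_data, family = "gaussian")
pred <- predict(fit, newdata = reg_data, s = "lambda.min")
```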

Next steps:

yixinsun1216 commented 4 years ago

Notes to self:

yixinsun1216 commented 4 years ago

Running the following dml bonus regression with lasso vs. regression forest (n_dml = 101), lasso is MUCH faster (6 minutes vs. an hour).

reg_data <-
  final_leases %>%
  filter(InSample) %>%
  mutate(Private = NParcels15 > 0 | Type == "RAL")

base_controls <-
  "Auction + bs(Acres, df = 7) + Term + RoyaltyRate"

tic("lasso")
m_lasso <-
  paste("BonusPerAcre", base_controls, sep = " ~ ") %>%
  paste("CentLat + CentLong + EffDate", sep = " | ") %>%
  as.formula %>%
  dml(reg_data, psi_plr, psi_plr_grad, psi_plr_op,
      n = 101, ml = "lasso", dml_seed = 123, family = "gaussian")
toc()

tic("regression forest")
m_RF <-
  paste("BonusPerAcre", base_controls, sep = " ~ ") %>%
  paste("CentLat + CentLong + EffDate", sep = " | ") %>%
  as.formula %>%
  dml(reg_data, psi_plr, psi_plr_grad, psi_plr_op,
      n = 101, ml = "regression_forest", dml_seed = 123)
toc()
yixinsun1216 commented 4 years ago

rlasso (from the hdm package) is about the same speed as lasso (~6 minutes).
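For reference, hdm's rlasso() has a formula interface that drops in much like cv.glmnet() (a sketch only, with reg_data and the control names borrowed from the regression above):

```r
# Sketch: theory-driven (plug-in penalty) lasso via hdm::rlasso() as an
# alternative to cross-validated glmnet. post = TRUE refits OLS on the
# selected variables (post-lasso).
library(hdm)

fit  <- rlasso(BonusPerAcre ~ Auction + Term + RoyaltyRate,
               data = reg_data, post = TRUE)
pred <- predict(fit, newdata = reg_data)
```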

Note that Chernozhukov et al. use a polynomial degree of 2 in their examples.

# =================================
LASSO

Call:
dml(f = ., d = reg_data, model = "linear", n = 101, dml_seed = 123, 
    ml = "lasso", family = "gaussian")

Coefficients:
                   Estimate Std. Error
Auction             1.54322      0.150
bs_Acres__df___7_1 -1.10692      0.442
bs_Acres__df___7_2 -0.51727      0.428
bs_Acres__df___7_3 -1.54689      0.423
bs_Acres__df___7_4  0.19787      0.442
bs_Acres__df___7_5 -1.85382      0.623
bs_Acres__df___7_6  0.42136      1.037
bs_Acres__df___7_7 -0.01575      0.601
Term               -0.42882      0.068
RoyaltyRate         0.04427      4.389

Number of Observations: 1,274

# =================================
RLASSO

Call:
dml(f = ., d = reg_data, model = "linear", n = 101, dml_seed = 123, 
    ml = "rlasso")

Coefficients:
                   Estimate Std. Error
Auction             1.32909      0.180
bs_Acres__df___7_1  0.00518      0.408
bs_Acres__df___7_2  0.31682      0.310
bs_Acres__df___7_3 -0.42648      0.405
bs_Acres__df___7_4  1.47701      0.378
bs_Acres__df___7_5 -0.14635      0.604
bs_Acres__df___7_6 -0.22703      1.098
bs_Acres__df___7_7  1.28221      0.580
Term               -0.40010      0.064
RoyaltyRate         0.02453      4.310

Number of Observations: 1,274

# =================================
RANDOM FOREST

Call:
dml(f = ., d = reg_data, psi = psi_plr, psi_grad = psi_plr_grad, 
    psi_op = psi_plr_op, n = 101, dml_seed = 123, ml = "regression_forest")

Coefficients:
                    Estimate Std. Error
Auction             0.917897      0.126
bs_Acres__df___7_1 -0.237388      0.283
bs_Acres__df___7_2  0.004222      0.262
bs_Acres__df___7_3 -0.426565      0.252
bs_Acres__df___7_4  0.218114      0.261
bs_Acres__df___7_5 -0.214611      0.371
bs_Acres__df___7_6 -0.139620      0.623
bs_Acres__df___7_7  0.454794      0.481
Term               -0.034175      0.036
RoyaltyRate         0.007012      1.367
Number of Observations: 1,274
tcovert commented 4 years ago

@rlsweeney you see this? Both of the lasso implementations are a full 2 SEs bigger than the RF point estimates
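Using the Auction estimates and standard errors printed above, the gap works out as follows:

```r
# Standardized gaps on the Auction coefficient, taken from the console
# output above (lasso, rlasso, and regression forest runs).
lasso_est  <- 1.54322; lasso_se  <- 0.150
rlasso_est <- 1.32909; rlasso_se <- 0.180
rf_est     <- 0.91790

(lasso_est  - rf_est) / lasso_se   # ~4.2 lasso SEs above the RF estimate
(rlasso_est - rf_est) / rlasso_se  # ~2.3 rlasso SEs above the RF estimate
```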

yixinsun1216 commented 4 years ago

Comparing Results from Texas Paper and DML

Bonus Regressions

[image]

[image]


Output Linear Regressions

[image]

[image]


Output Pseudo-Poisson Regression

[image]

[image]

tcovert commented 4 years ago

thanks @yixinsun1216. I think the middle section is off because you aren't doing filter(!Censored), right?

tcovert commented 4 years ago

also, crazy how much larger the other DML approaches are (lasso, which I assume is glmnet with default settings in a DML setup, and similarly for hdm)