yixinsun1216 / crossfit

Implementation of Double/Debiased Machine Learning approach

Extending dml to use lasso #2

Closed yixinsun1216 closed 3 years ago

yixinsun1216 commented 4 years ago

After creating the estimation dataset in dml_step(), we want to be able to choose either a regression forest or the lasso as our ML method for calculating gamma and delta. Instead of running regression_forest2 and predict_rf2 directly in dml_step(), all of the ML methods are now housed in the ml_functions script. dml_step() calls run_ml(), which dispatches to the chosen method for estimation and prediction.
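A rough sketch of what the run_ml() dispatcher could look like (the argument names and internals here are guesses; crossfit's actual implementation may differ):

```r
# Hypothetical sketch of run_ml() in ml_functions: fit on the training
# fold, predict on the held-out fold, for each supported ml method.
run_ml <- function(ml, formula, train, test, ...) {
  switch(ml,
    "regression_forest" = {
      # regression forest via grf; drop the intercept column from the
      # model matrix before fitting
      fit <- grf::regression_forest(
        X = model.matrix(formula, train)[, -1],
        Y = model.response(model.frame(formula, train))
      )
      predict(fit, model.matrix(formula, test)[, -1])$predictions
    },
    "lasso" = {
      # cross-validated lasso via glmnetUtils' formula interface
      fit <- glmnetUtils::cv.glmnet(formula, data = train, ...)
      as.numeric(predict(fit, newdata = test, s = "lambda.min"))
    },
    stop("unknown ml method: ", ml)
  )
}
```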

Currently, lasso is implemented using the formula interface to cv.glmnet() from the glmnetUtils package. Two additional arguments must be passed to the master dml() function:

  1. ml, specifying the machine learning method we want to use
  2. poly_degree, specifying the polynomial degree to which the original formula is expanded. The lasso regression then estimates coefficients on this expanded formula.
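As a rough illustration of what a poly_degree = 2 expansion fed into cv.glmnet() might look like (variable names are borrowed from the bonus regression later in this thread, and the actual expansion code in crossfit may differ):

```r
# Illustrative sketch only: a poly_degree = 2 expansion of the numeric
# controls, passed to glmnetUtils' formula interface for cv.glmnet().
# reg_data is assumed to exist as defined elsewhere in this thread.
library(glmnetUtils)

f <- BonusPerAcre ~ Auction + poly(Acres, 2, raw = TRUE) +
  poly(Term, 2, raw = TRUE) + poly(RoyaltyRate, 2, raw = TRUE)

fit  <- cv.glmnet(f, data = reg_data, family = "gaussian")
pred <- predict(fit, newdata = reg_data, s = "lambda.min")
```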

Next steps:

yixinsun1216 commented 4 years ago

Notes to self:

yixinsun1216 commented 4 years ago

Running the following dml bonus regression with lasso vs. regression forest (n_dml = 101), lasso is MUCH faster (6 minutes vs. an hour).

reg_data <-
  final_leases %>%
  filter(InSample) %>%
  mutate(Private = NParcels15 > 0 | Type == "RAL")

base_controls <-
  "Auction + bs(Acres, df = 7) + Term + RoyaltyRate"

tic("lasso")
m_lasso <-
  paste("BonusPerAcre", base_controls, sep = " ~ ") %>%
  paste("CentLat + CentLong + EffDate", sep = " | ") %>%
  as.formula %>%
  dml(reg_data, psi_plr, psi_plr_grad, psi_plr_op,
      n = 101, ml = "lasso", dml_seed = 123, family = "gaussian")
toc()

tic("regression forest")
m_RF <-
  paste("BonusPerAcre", base_controls, sep = " ~ ") %>%
  paste("CentLat + CentLong + EffDate", sep = " | ") %>%
  as.formula %>%
  dml(reg_data, psi_plr, psi_plr_grad, psi_plr_op,
      n = 101, ml = "regression_forest", dml_seed = 123)
toc()
yixinsun1216 commented 4 years ago

rlasso (from the hdm package) is about the same speed as lasso (~6 minutes).
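For reference, hdm's rlasso() has a formula interface that drops in much like cv.glmnet() (a sketch only, with reg_data and the control names borrowed from the regression above):

```r
# Sketch: theory-driven (plug-in penalty) lasso via hdm::rlasso() as an
# alternative to cross-validated glmnet. post = TRUE refits OLS on the
# selected variables (post-lasso).
library(hdm)

fit  <- rlasso(BonusPerAcre ~ Auction + Term + RoyaltyRate,
               data = reg_data, post = TRUE)
pred <- predict(fit, newdata = reg_data)
```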

Note that Chernozhukov et al. use a polynomial degree of 2 in their examples.

# =================================
LASSO

Call:
dml(f = ., d = reg_data, model = "linear", n = 101, dml_seed = 123, 
    ml = "lasso", family = "gaussian")

Coefficients:
                   Estimate Std. Error
Auction             1.54322      0.150
bs_Acres__df___7_1 -1.10692      0.442
bs_Acres__df___7_2 -0.51727      0.428
bs_Acres__df___7_3 -1.54689      0.423
bs_Acres__df___7_4  0.19787      0.442
bs_Acres__df___7_5 -1.85382      0.623
bs_Acres__df___7_6  0.42136      1.037
bs_Acres__df___7_7 -0.01575      0.601
Term               -0.42882      0.068
RoyaltyRate         0.04427      4.389

Number of Observations: 1,274

# =================================
RLASSO

Call:
dml(f = ., d = reg_data, model = "linear", n = 101, dml_seed = 123, 
    ml = "rlasso")

Coefficients:
                   Estimate Std. Error
Auction             1.32909      0.180
bs_Acres__df___7_1  0.00518      0.408
bs_Acres__df___7_2  0.31682      0.310
bs_Acres__df___7_3 -0.42648      0.405
bs_Acres__df___7_4  1.47701      0.378
bs_Acres__df___7_5 -0.14635      0.604
bs_Acres__df___7_6 -0.22703      1.098
bs_Acres__df___7_7  1.28221      0.580
Term               -0.40010      0.064
RoyaltyRate         0.02453      4.310

Number of Observations: 1,274

# =================================
RANDOM FOREST

Call:
dml(f = ., d = reg_data, psi = psi_plr, psi_grad = psi_plr_grad, 
    psi_op = psi_plr_op, n = 101, dml_seed = 123, ml = "regression_forest")

Coefficients:
                    Estimate Std. Error
Auction             0.917897      0.126
bs_Acres__df___7_1 -0.237388      0.283
bs_Acres__df___7_2  0.004222      0.262
bs_Acres__df___7_3 -0.426565      0.252
bs_Acres__df___7_4  0.218114      0.261
bs_Acres__df___7_5 -0.214611      0.371
bs_Acres__df___7_6 -0.139620      0.623
bs_Acres__df___7_7  0.454794      0.481
Term               -0.034175      0.036
RoyaltyRate         0.007012      1.367
Number of Observations: 1,274
tcovert commented 4 years ago

@rlsweeney you see this? Both of the lasso implementations are a full 2 SEs bigger than the RF point estimates
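Using the Auction estimates and standard errors printed above, the gap works out as follows:

```r
# Standardized gaps on the Auction coefficient, taken from the console
# output above (lasso, rlasso, and regression forest runs).
lasso_est  <- 1.54322; lasso_se  <- 0.150
rlasso_est <- 1.32909; rlasso_se <- 0.180
rf_est     <- 0.91790

(lasso_est  - rf_est) / lasso_se   # ~4.2 lasso SEs above the RF estimate
(rlasso_est - rf_est) / rlasso_se  # ~2.3 rlasso SEs above the RF estimate
```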

yixinsun1216 commented 4 years ago

Comparing Results from Texas Paper and DML

Bonus Regressions

[image]

[image]


Output Linear Regressions

[image]

[image]


Output Pseudo-Poisson Regression

[image]

[image]

tcovert commented 4 years ago

thanks @yixinsun1216. I think the middle section is off because you aren't doing filter(!Censored), right?

tcovert commented 4 years ago

also, crazy how much larger the other DML approaches are (lasso, which I assume is glmnet with default settings in a DML setup, and similarly for hdm)