Closed yixinsun1216 closed 3 years ago
hi @yixinsun1216 finally got to looking at this. I think steps 1-3 (and your score computations) in the "implementing in R" above are correct, but I think in step 4, you want to make sure beta
is fixed at the value you solved for in step 2, right? beta
is part of the nuisance parameter set, and the goal is to estimate that on one part of the splits, and estimate theta based on that value from the other parts, right?
to answer your questions:
@tcovert I wrote up the rest of the procedures for implementing the two different approaches (finite vs concentrate out) and using lasso vs regression forest. Let me know if you see anything wrong in the implementation steps.
Remaining issues:
DBOEPerAcre
estimation, it can only implement with poly_degree =1
. Anything higher throws an error. Code below if you want to see how it fails
library(dplyr)
library(crossfit)
load("C:/Users/Yixin Sun/Dropbox (Personal)/EPIC/texas/generated_data/final_leases.Rda")
reg_data <- final_leases %>% filter(InSample) %>% mutate(Private = NParcels15 > 0 | Type == "RAL")
dboe_formula <- paste("DBOEPerAcre", "Auction", sep = " ~ ") %>% paste("Term + RoyaltyRate + Acres + CentLat + CentLong + EffDate", sep = " | ") %>% as.formula
dboe_data <- filter(reg_data, !Censored)
dboe_p_lasso <- dml(dboe_formula, dboe_data, model = "poisson", n = 11, worker = 4, dml_seed = NULL, ml = "lasso", poly_degree = 3, drop_na = FALSE, score = "finite") print(summary(dboe_p_lasso))
2. Thinking more how you can't get the nuisance parameters using regression forest, I definitely don't think I'm implementing the finite nuisance parameter method correctly with RF. My current process is:
1. Use lm/glm and run y on D and X -> use this to get s = X*B.
2. Use regression forest to find weights in the case of Poisson
3. Use regression forest of D on X to find m = E[D|X], using weights from step 2. Then plug s and m into the moment and continue as usual
Maybe you were right that we should keep it simple, and we can just use lasso for the finite nuisance parameter approach and user can choose between lasso or rf for the concentrating out approach. Otherwise not really sure how to implement this - what do you think?
3. For `rlasso`, have you figured out how to force `D` to be included in the lasso
You were probably right that we could just drop `rlasso` now since it only has limited use to us - what do you think?
Regarding 2: I'm pretty sure there is no kosher way to do FN with RF. The reason is that using the RF regression technology, the nuisance parameter is infinite dimensional, by construction: there is no finite parameter $\beta$ that a random forest spits out after you run it.
On top of this, the FN approach requires a first step calculation of the "beta" parameter in an lm or glm: the part of Y that is explained by X, holding D fixed. This is reasonably well defined in a regression technology that actually gives you a finite parameter as an answer (i.e., LASSO), because you can just give lasso a separate set of columns for D and a separate set for X and whatever interactions you want in it. There's no way to get this separate "additive" structure in RF: in every branch of every tree in an RF, the effects of D and X are always tangled up together.
The DML paper gives two ways to calculate the Neyman orthogonal scores. In our first attempt, we use the concentrating out approach for both the linear and poisson models. However, the concentrating out approach only works for poisson if the variable of interest is an indicator variable. So now we want to use the original method of constructing moments, so that our DML function can be used for any type of treatment variable.
Note details on Stata's version of the implementation can be found here
Constructing the Neyman orthogonal scores for Poisson:
Implementing in R
Y ~ D*theta + X*beta
. This gives us our estimates of theta_hat and beta_hat(1/N)sum_{i = 1}{N}(y_i - exp(d_i*theta' + x_i*beta'))(d_i - x_i*mu_hat')'
-(1/N)sum_{i = 1}{N}(exp(d_i*theta' + x_i*beta')(d_i - x_i*mu_hat')'*d
s^2 = (J_0^{-1})E[psi*psi'](J_0^{-1})
, where J_0 is the gradient of psi wrt thetan
times and find the median theta value as the final estimateQuestions
** DML cookbook for constructing scores