Open williamdiebel opened 3 months ago
On the issue of only achieving 20% overlap between the two datasets
The data in data_essay2_robustness_v2.rds is global, while my CSO data focuses on North America. If I restrict data_essay2_robustness_v2.rds to US and CA firms, the overlap rises to 90%. However, that file includes only 275 firms from North America, which seems too low. The file rrpanel_comp_fs_fortune_cdp.rds has 3,854 firms from North America, which seems more accurate.
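For reference, here is a minimal sketch of the overlap check described above, written in plain Python for illustration (the same logic carries over to R). It assumes overlap means the share of distinct panel firms that also appear in the CSO data, which is consistent with the share rising once the panel is restricted to US and CA. The field names `gvkey` and `country`, the country codes, and the toy rows are all hypothetical.

```python
def overlap_share(panel_rows, cso_gvkeys, countries=None):
    """Share of distinct panel firms (optionally filtered by country)
    whose gvkey also appears in the CSO data."""
    firms = {
        row["gvkey"]
        for row in panel_rows
        if countries is None or row["country"] in countries
    }
    if not firms:
        return 0.0
    return len(firms & set(cso_gvkeys)) / len(firms)


# Toy data only: the real panel would be read from the .rds files.
panel = [
    {"gvkey": "001", "country": "US"},
    {"gvkey": "001", "country": "US"},  # same firm, another year
    {"gvkey": "002", "country": "CA"},
    {"gvkey": "003", "country": "DE"},
    {"gvkey": "004", "country": "JP"},
]
cso = {"001", "002"}

global_share = overlap_share(panel, cso)
na_share = overlap_share(panel, cso, countries={"US", "CA"})
```

Counting distinct firms rather than firm-years keeps long panels from dominating the share.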
The data description says that data_essay2_robustness_v2.rds is "panel data for a subset of firms." I am assuming that you first identified the CDP firms, then used CEM to identify controls, and then saved those firms as a subset. Is that correct? If so, we might need to repeat the CEM step on the full sample, with firms that appoint a CSO as the treated group. Does that make sense, or am I missing something?
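To make the proposed CEM step concrete, here is a rough sketch of the core idea in plain Python: coarsen each covariate into bins, group firms by their bin signature, and keep only strata that contain both treated (CSO-appointing) and control firms. This is not the existing matching script. The covariates `size` and `roa`, the cutpoints, the `cso` flag, and the toy firms are hypothetical, and the sketch omits the stratum weights a full CEM produces. In R this would normally go through a package such as `cem` or `MatchIt`.

```python
from bisect import bisect_right
from collections import defaultdict


def coarsen(value, cutpoints):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(cutpoints, value)


def cem_match(firms, cutpoints):
    """Group firms by coarsened covariates and keep the strata that
    contain at least one treated and one control firm."""
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for firm in firms:
        # Covariates are processed in sorted name order so that the
        # signature tuple is stable across firms.
        signature = tuple(
            coarsen(firm[var], cuts) for var, cuts in sorted(cutpoints.items())
        )
        group = "treated" if firm["cso"] else "control"
        strata[signature][group].append(firm["gvkey"])
    return {
        sig: groups
        for sig, groups in strata.items()
        if groups["treated"] and groups["control"]
    }


# Hypothetical covariates and cutpoints, for illustration only.
cutpoints = {"size": [5.0, 8.0], "roa": [0.0, 0.1]}
firms = [
    {"gvkey": "A", "cso": True, "size": 6.2, "roa": 0.05},
    {"gvkey": "B", "cso": False, "size": 7.0, "roa": 0.08},
    {"gvkey": "C", "cso": False, "size": 9.5, "roa": 0.02},
    {"gvkey": "D", "cso": True, "size": 3.0, "roa": -0.10},
]
matched = cem_match(firms, cutpoints)
```

In the toy data, firms A and B share a stratum and are kept, while C and D fall into strata with no counterpart and are dropped.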
Ahhh, makes sense about the CSO data being focused on NA firms! I'm sorry I hadn't noticed that.
Your interpretation of my data is correct. If we define treatment based on CSO appointments, it makes sense to start from rrpanel_comp_fs_fortune_cdp.rds and then reimplement the CEM matching.
One small potential issue to note with the BoardEx data is that there appear to be a handful of duplicate gvkey-year combinations. I attempted to resolve the duplicates in the existing script. It's probably not worth much additional time, but it might be worth double-checking how I handled those.
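A quick way to double-check the duplicates is to count rows per gvkey-year key and list every key that occurs more than once. The sketch below does that in plain Python. The field names `gvkey` and `year` and the toy rows are assumptions, and it only flags duplicates. It does not reproduce how the existing script resolves them.

```python
from collections import Counter


def duplicate_keys(rows):
    """Return {(gvkey, year): count} for keys that appear more than once."""
    counts = Counter((row["gvkey"], row["year"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Toy BoardEx-style rows, for illustration only.
boardex = [
    {"gvkey": "001", "year": 2015},
    {"gvkey": "001", "year": 2015},  # duplicate firm-year
    {"gvkey": "001", "year": 2016},
    {"gvkey": "002", "year": 2015},
]
dupes = duplicate_keys(boardex)
```

Running the same check before and after the de-duplication step shows whether any gvkey-year pairs slipped through.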
Also, so far, I've only looked at generating matches between the BoardEx data and my existing panel using gvkey-year. I noticed that the BoardEx data also has cusip values, so that's probably worth investigating further as a next step. My data also have cusip values from Compustat and RepRisk. Other potential ID variables to match on are ISIN and firm name. cusip and ISIN are relatable: for US and Canadian securities, the ISIN is the two-letter country code, followed by the 9-character CUSIP, followed by a check digit. See the variable descriptions in the updated Read Me for more info on the potential ID variables to use for data merging improvements.
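Since CUSIP and ISIN can be mapped to each other for US and Canadian securities, here is a small sketch of the conversion in plain Python. It assumes 9-character alphanumeric CUSIPs that already include the CUSIP check digit. Some sources carry shorter CUSIPs, which would need handling first. The function names are hypothetical. The ISIN check digit uses the standard scheme: letters become numbers (A=10 through Z=35), then a Luhn checksum is taken over the resulting digit string.

```python
def isin_check_digit(body):
    """Compute the ISIN check digit for an 11-character body
    (country code + 9-character national identifier)."""
    # Letters map to 10..35, which is what base-36 parsing yields.
    digits = "".join(str(int(ch, 36)) for ch in body)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:  # double every other digit, starting at the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def cusip_to_isin(cusip, country="US"):
    """Build an ISIN from a 9-character CUSIP and a country code."""
    cusip = cusip.strip().upper()
    if len(cusip) != 9:
        raise ValueError("expected a 9-character CUSIP")
    body = country + cusip
    return body + isin_check_digit(body)


def isin_to_cusip(isin):
    """Extract the CUSIP from a US or CA ISIN, else return None."""
    isin = isin.strip().upper()
    if len(isin) == 12 and isin[:2] in ("US", "CA"):
        return isin[2:11]
    return None


apple_isin = cusip_to_isin("037833100")
```

Going from ISIN to CUSIP is likely the safer direction for merging, because it is a substring operation and does not need the country to be known in advance.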