williamdiebel / Finn-Will-Project


Merging data #1

Open williamdiebel opened 3 months ago

williamdiebel commented 3 months ago

One small potential issue with the BoardEx data: there appear to be a handful of duplicate gvkey-year combinations. I attempted to resolve the duplicates in the existing script. It's probably not worth much additional time, but it might be worth double-checking how I handled them.
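For anyone double-checking the duplicate handling, a minimal sketch of the check is below. It is written in plain Python rather than the project's own script, the rows and the `n_directors` field are made up for illustration, and the keep-first rule is only a placeholder since the rule the existing script applies is not described here.

```python
from collections import Counter

# Hypothetical rows standing in for the BoardEx extract; field names are assumed.
rows = [
    {"gvkey": "001004", "year": 2015, "n_directors": 9},
    {"gvkey": "001004", "year": 2015, "n_directors": 10},
    {"gvkey": "001004", "year": 2016, "n_directors": 10},
    {"gvkey": "001045", "year": 2015, "n_directors": 12},
]

def duplicate_keys(rows, key_fields=("gvkey", "year")):
    """Return each key combination that occurs more than once, with its count."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

def keep_first(rows, key_fields=("gvkey", "year")):
    """Illustrative rule only: keep the first row seen for each key combination."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

dups = duplicate_keys(rows)
deduped = keep_first(rows)
print(dups)          # key combinations that need a decision
print(len(deduped))  # number of rows left after the placeholder rule
```

Listing the duplicated keys before dropping anything makes it easier to see whether the duplicates are exact copies or conflicting records, which is what determines whether keep-first is acceptable.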

Also, so far I've only looked at generating matches between the BoardEx data and my existing panel using gvkey-year. I noticed that the BoardEx data also has cusip values, so that's probably worth investigating as a next step. My data also have cusip values from Compustat and RepRisk. Other potential ID variables to match on are ISIN (cusip and ISIN are relatable) and firm name. See the variable descriptions in the updated Read Me for more info on the potential ID variables to use for improving the data merge.
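On how cusip and ISIN relate: for US and Canadian securities, the 12-character ISIN is the two-letter country code, then the 9-character CUSIP, then one ISIN check digit. The sketch below recovers a CUSIP from such an ISIN. The function names are hypothetical, and whether each source stores 8- or 9-character CUSIPs is an assumption that should be checked against the actual files.

```python
def cusip_from_isin(isin):
    """Return the embedded 9-character CUSIP for a US or CA ISIN, else None."""
    isin = isin.strip().upper()
    if len(isin) != 12 or isin[:2] not in {"US", "CA"}:
        return None  # other countries do not embed a CUSIP in the ISIN
    return isin[2:11]  # drop the country prefix and the trailing ISIN check digit

def cusip8(cusip):
    """Truncate to the 8-character form (issuer + issue, no CUSIP check digit)."""
    return cusip[:8]

example = cusip_from_isin("US0378331005")  # Apple Inc. common stock
print(example)          # 037833100
print(cusip8(example))  # 03783310
```

If the sources mix 8- and 9-character CUSIPs, truncating both sides to 8 characters before merging is one way to put them on a common key.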

fpetersen13 commented 3 months ago

On the issue of only achieving 20% overlap between the two datasets:

The data in data_essay2_robustness_v2.rds is global while my CSO data was focused on North America. If I restrict the data in data_essay2_robustness_v2.rds to US and CA, then we have 90% overlap. However, the data only includes 275 firms from North America, which seems too low. The file rrpanel_comp_fs_fortune_cdp.rds has 3,854 firms from North America, which seems more accurate.
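The overlap comparison can be sketched as follows. This assumes overlap means the share of panel gvkey-year keys that also appear in the CSO data, which may not be the exact denominator used above. The rows, the `country` field, and its codes are invented for illustration, so the numbers do not reproduce the 20% and 90% figures.

```python
# Hypothetical panel rows; the real panel's country variable may be coded differently.
panel_rows = [
    {"gvkey": "A", "year": 2015, "country": "US"},
    {"gvkey": "B", "year": 2015, "country": "CA"},
    {"gvkey": "C", "year": 2015, "country": "DE"},
    {"gvkey": "D", "year": 2015, "country": "JP"},
    {"gvkey": "E", "year": 2015, "country": "US"},
]
# Hypothetical gvkey-year keys present in the North America-focused CSO data.
cso_keys = {("A", 2015), ("B", 2015)}

def overlap_share(panel_rows, cso_keys, countries=None):
    """Share of panel gvkey-year keys that are also found in cso_keys."""
    keys = {
        (r["gvkey"], r["year"])
        for r in panel_rows
        if countries is None or r["country"] in countries
    }
    if not keys:
        return 0.0
    return len(keys & cso_keys) / len(keys)

global_share = overlap_share(panel_rows, cso_keys)
na_share = overlap_share(panel_rows, cso_keys, countries={"US", "CA"})
print(global_share, na_share)
```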

The data description reads that data_essay2_robustness_v2.rds is "panel data for a subset of firms." I am assuming that you first identified the CDP firms, then used CEM to identify controls, and then saved those firms as a subset. Is that correct? If so, we might need to repeat the CEM step on the full sample, with firms that appoint a CSO as the treated group. Does that make sense, or am I missing something?

williamdiebel commented 3 months ago

Ahhh, makes sense about the CSO data being focused on NA firms! I'm sorry I hadn't noticed that.

Your interpretation of my data is correct. If we define treatment based on CSO appointments, it makes sense to start from rrpanel_comp_fs_fortune_cdp.rds and then reimplement the CEM matching.
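As a rough illustration of what reimplementing the CEM step involves, here is a minimal sketch of coarsened exact matching with CSO appointment as the treatment. The covariates (`size`, `roa`), the cutpoints, and the firm records are all hypothetical, and this is not the original matching specification. It bins each covariate, groups firms by their bin signature, and keeps only strata that contain both treated and control firms.

```python
from collections import defaultdict

def coarsen(value, cutpoints):
    """Return the index of the bin that value falls into (0 = below all cutpoints)."""
    return sum(value >= c for c in cutpoints)

def cem_match(firms, cutpoints):
    """Group firms into strata of coarsened covariates; keep strata with both groups."""
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for f in firms:
        signature = tuple(coarsen(f[var], cuts) for var, cuts in cutpoints.items())
        group = "treated" if f["cso"] else "control"
        strata[signature][group].append(f["gvkey"])
    return {s: g for s, g in strata.items() if g["treated"] and g["control"]}

# Hypothetical firms and cutpoints, for illustration only.
firms = [
    {"gvkey": "A", "cso": True, "size": 5.0, "roa": 0.04},
    {"gvkey": "B", "cso": False, "size": 6.0, "roa": 0.03},
    {"gvkey": "C", "cso": False, "size": 9.0, "roa": 0.10},
    {"gvkey": "D", "cso": True, "size": 2.0, "roa": -0.02},
]
cutpoints = {"size": [4.0, 8.0], "roa": [0.0, 0.05]}

matched = cem_match(firms, cutpoints)
print(matched)
```

Firms in strata without a counterpart (C and D in this toy data) are dropped, which is why the choice of cutpoints drives how much of the full sample survives matching. A full implementation would also compute stratum weights for the retained controls.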