wpinvestigative / arcos

https://wpinvestigative.github.io/arcos/

Full ARCOS data is incomplete #10

Open jr-free opened 3 years ago

jr-free commented 3 years ago

Not a direct issue with the R or Python APIs, but the full ARCOS dataset is incomplete. Both the links on the WaPo landing page and this repo only contain data for 2006-2012. The API functions, while documented, also do not always return what one would expect. It seems some of the county queries will only return TAB data for 2006-2012.

Using the web API, it is possible to pull county data by drug for the period 2006-2014. I have not been able to do this with either the R or Python API. It also seems the wrapper for the county drug query is broken.

jeffcsauer commented 3 years ago

@unoriginaluid thanks for posting and highlighting this. Could you please post some more information about the issue?

For example, what county returns 'full' data via the web API but not via the wrapper?

There are known issues where some of the data are so large that they will not work with the wrapper.

jr-free commented 3 years ago

There are a few counties that return complete data via the web API. As a note, we're focusing on Florida in our work, so I can only speak to FL counties. I was able to pull the "full" data for the following counties using the web API (as a procedural point, I used the county_data_drug query on the web API to pull data for these):

'Clay', 'Duval', 'Baker', 'Saint Johns', 'Flagler', 'Putnam', 'Columbia', 'Bradford',
'Union', 'Lake', 'Seminole', 'Marion', 'Alachua', 'Gilchrist', 'Nassau'

Using the county_raw() wrapper on Clay and Duval, I was only able to get 2006-2012.

Re: the point of a broken function, drug_county_raw() doesn't work at all.

jeffcsauer commented 3 years ago

@unoriginaluid this is the same issue raised here.

The 2013 and 2014 data was not part of the original 2006-2012 data dump, and so it is likely that the API has not been comprehensively updated to access this data quite yet. The issue is on the radar!

jr-free commented 3 years ago

Thanks for following up on this, Jeff. I greatly appreciate the assistance.

andrewbtran commented 3 years ago

Alright, I've updated the API and R package so large files should no longer time out. I'm currently running scripts to update the data that these functions pull from, and will replace it on our server so we have everything through 2014. It should take about a week to run and swap everything out.

jeffcsauer commented 3 years ago

Amazing, thanks so much!

MLSun-A commented 3 years ago

Hi, I have recently been working with the full ARCOS dataset (downloaded from this link https://wpinvestigative.github.io/arcos/#download-the-raw-data) as well. However, I cannot find the year in this data, and it only shows 42 columns. I was curious whether this is because I only opened the first few thousand rows, or whether there is another raw dataset that provides information such as year, county, and drug name. Would you mind pointing me to the full dataset?

Thanks for your time and help!

jeffcsauer commented 3 years ago

@MLSun-A

The year can be derived from the TRANSACTION_DATE column. For example, you could create a Year column with:

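# After loading your data (or a subset of it) into a data frame called temp:
library(stringr)  # provides str_sub()
# Take the last four characters of TRANSACTION_DATE as the year
temp$Year <- as.numeric(str_sub(temp$TRANSACTION_DATE, -4, -1))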
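
For anyone working in Python instead of R, a rough equivalent is sketched below. It assumes, like the R line above, that the year is the last four characters of TRANSACTION_DATE. The helper name `transaction_year` and the sample rows are made up for illustration and are not part of any ARCOS package.

```python
def transaction_year(value):
    """Return the year from a TRANSACTION_DATE-style value.

    Assumption: the year is the last four digits of the value, which is
    what the R snippet relies on. Taking the last four characters also
    works when the column was read as a number and lost a leading zero.
    """
    text = str(value).strip()
    year = text[-4:]
    if len(year) != 4 or not year.isdigit():
        raise ValueError(f"cannot read a year from {value!r}")
    return int(year)


# Hypothetical sample rows, not real ARCOS records.
rows = [
    {"TRANSACTION_DATE": "01052006"},
    {"TRANSACTION_DATE": 12312012},  # column read as an integer
    {"TRANSACTION_DATE": 3152014},   # integer with its leading zero lost
]

years = [transaction_year(row["TRANSACTION_DATE"]) for row in rows]
print(years)
```

Raising on values that do not end in four digits makes malformed rows visible early instead of silently producing a wrong year.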

Got your email - responding soon!

MLSun-A commented 3 years ago

@jeffcsauer Thanks for your quick response and helpful reply! Appreciate your help.

accessarcos commented 2 years ago

@andrewbtran Is it possible for you to post the file size of the FULL ARCOS data set? I would like to make sure that we are using the correct data set, and I am having trouble verifying the size. Also, do you know of any updates from the courts on whether they will be releasing more years soon? Or does a motion have to be filed for them to do so? I know that many have moved on (between COVID, etc.), but this data is vital for so many studies that are being done. I work with jr-free and have spoken with Jeff Sauer. We're all academics. Thanks! ~ Mischa

andrewbtran commented 2 years ago

The file has been updated to include 2013 and 2014: https://d2ty8gaf6rmowa.cloudfront.net/dea-pain-pill-database/bulk/arcos_all.tsv.gz
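
One way to check that a downloaded copy really covers 2006 through 2014 is to tally rows by year without loading the whole file into memory. The sketch below uses only the Python standard library. It assumes the file is a gzip-compressed, tab-separated table with a TRANSACTION_DATE column whose last four characters are the year, and that it decodes as UTF-8. The function names and the sample rows (including the EXAMPLE_ID column) are hypothetical.

```python
import csv
import gzip
from collections import Counter


def count_rows_by_year(lines, date_column="TRANSACTION_DATE"):
    """Tally rows per year from an iterable of tab-separated text lines.

    Assumption: the last four characters of the date column are the year.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row[date_column][-4:]] += 1
    return counts


def count_rows_by_year_in_file(path):
    """Stream a local .tsv.gz file and tally its rows per year."""
    # Text mode decompresses as it reads, so rows are processed one at a time.
    with gzip.open(path, mode="rt", encoding="utf-8",
                   errors="replace", newline="") as handle:
        return count_rows_by_year(handle)


# Hypothetical in-memory sample standing in for the real file.
sample = [
    "EXAMPLE_ID\tTRANSACTION_DATE\n",
    "a\t01052006\n",
    "b\t11302006\n",
    "c\t03152014\n",
]
counts = count_rows_by_year(sample)
print(sorted(counts.items()))
```

After downloading the file, calling `count_rows_by_year_in_file("arcos_all.tsv.gz")` on the local copy should, if the assumptions hold, return a key for every year from 2006 to 2014. A missing year would suggest an older or truncated download.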