Is there a way to define a dataset that requires arbitrary code to clean up?

weecology / retriever

Quickly download, clean up, and install public datasets into a database management system

http://data-retriever.org


Is there a way to define a dataset that requires arbitrary code to clean up? #1476

Closed Kodiologist closed 3 years ago

Kodiologist commented 4 years ago

The retriever-recipes repository seems to consist entirely of DataPackage JSON "scripts", and I'm sure that there are some things you might want to do to clean up a dataset that aren't supported in this format. Can you write a Python function to retrieve and clean up the data, for example?

Relatedly, is it possible to define parts of a dataset, so the user can choose which part to retrieve (e.g., a single year of a dataset spanning multiple years)?

henrykironde commented 4 years ago

@Kodiologist sorry I missed this. Thanks for the issue. You can create custom data packages in Python with specific reprocessing procedures. The Python files in the scripts folder are custom data packages. Here is one example: https://github.com/weecology/retriever-recipes/blob/master/scripts/fao_global_capture_product.py

You can also request that a data package be added by creating an issue. Let me know if you need help.

Kodiologist commented 4 years ago

@henrykironde Great, thank you. My group at Mount Sinai does a lot of work on models to predict air and pollution at high spatiotemporal resolutions for use in epidemiology (e.g., https://zenodo.org/record/3891475). My interest in Retriever is mostly in that it could be a convenient way to distribute some of our prediction products. I'll reach out if we decide to do that and have trouble.

By the way, https://www.data-retriever.org has a link "Adding datasets to the Data Retriever", to https://retriever.readthedocs.io/en/latest/scripts.html, but it's dead.

mauwazahmed commented 3 years ago

Some basic tasks could be automated, like checking the datatype of the values in a particular column, picking the appropriate datatype by majority vote, and replacing the values that don't match it with Null.

For strings in a column that are the same except for case (upper or lower), converting them to a single case would enable proper label encoding.

The same goes for the length of the data points in each cell, and for removing outlier values.

Boxplots can help in removing outliers or in transforming distributions to get a closer fit to a Gaussian distribution.
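
As a rough illustration of the first two tasks described above, here is a minimal sketch in plain Python. It is not Retriever code: the function names `infer_kind`, `clean_column`, and `normalize_case` are hypothetical, and the column is assumed to arrive as a list of raw strings.

```python
from collections import Counter


def infer_kind(value):
    """Classify one raw string as 'int', 'float', or 'str'."""
    for kind, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return kind
        except ValueError:
            continue
    return "str"


def clean_column(values):
    """Pick the majority type of a column and null out values that do not fit it."""
    if not values:
        return "str", []
    kinds = [infer_kind(v) for v in values]
    majority = Counter(kinds).most_common(1)[0][0]
    casts = {"int": int, "float": float, "str": str}
    cleaned = []
    for value, kind in zip(values, kinds):
        # An int is acceptable in a float column; anything else that
        # disagrees with the majority type becomes None (Null).
        fits = kind == majority or (majority == "float" and kind == "int")
        cleaned.append(casts[majority](value) if fits else None)
    return majority, cleaned


def normalize_case(values):
    """Strip and lower-case strings so 'Oak', 'OAK ', and 'oak' become one label."""
    return [v.strip().lower() if isinstance(v, str) else v for v in values]
```

For example, `clean_column(["1", "2", "x", "4"])` would treat the column as integers and turn `"x"` into `None`. Note that Python's `float` also accepts strings such as `"nan"` and `"inf"`, so a real cleanup step might want to handle those separately.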

henrykironde commented 3 years ago

@mauwazahmed is the data public? I could take a look at the data and see what I can do. I could help you to tailor a script that can do most of that.

use the correct datatype, check for values, and replace them with Null
single case
removing outlier values (we can say these are nulls too). The retriever can do this, but we generally don't remove these since that should be part of the analysis. However, I could add that if it is a special case.

mauwazahmed commented 3 years ago

@henrykironde I was just suggesting some cleaning tasks that could be part of the automation.

ethanwhite commented 3 years ago

Closing since this was more of a discussion topic. Feel free to open specific requests in their own issues.