Closed Kodiologist closed 3 years ago
@Kodiologist sorry I missed this. Thanks for the issue. You can create custom data packages in Python with specific reprocessing procedures. The Python files in the scripts folder are custom data packages. Here is one example: https://github.com/weecology/retriever-recipes/blob/master/scripts/fao_global_capture_product.py
You can also request that a data package be added by creating an issue. Let me know if you need help.
@henrykironde Great, thank you. My group at Mount Sinai does a lot of work on models to predict air and pollution at high spatiotemporal resolutions for use in epidemiology (e.g., https://zenodo.org/record/3891475). My interest in Retriever is mostly in that it could be a convenient way to distribute some of our prediction products. I'll reach out if we decide to do that and have trouble.
By the way, https://www.data-retriever.org has a link "Adding datasets to the Data Retriever", to https://retriever.readthedocs.io/en/latest/scripts.html, but it's dead.
Some basic tasks could be automated, such as checking the datatype of the values in a particular column, picking the appropriate datatype by majority vote, and replacing the values that don't match it with Null.
For strings in a column that are the same apart from case (upper or lower), convert them to a single case to enable proper label encoding.
The same goes for the length of the data points in each cell, and for removing outlier values.
Boxplots can help with removing outliers or with transforming distributions so that they better fit a Gaussian distribution.
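For what it's worth, the checks listed above could be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions, not something Retriever provides: the function names (`infer_kind`, `clean_by_majority`, `normalize_case`, `drop_outliers`) are hypothetical, missing values are represented as `None`, and the 1.5 × IQR cutoff is the usual boxplot whisker convention rather than anything specified in this thread.

```python
from collections import Counter
from statistics import quantiles


def infer_kind(value):
    """Classify one raw cell as 'int', 'float', or 'str'."""
    text = str(value).strip()
    for kind, cast in (("int", int), ("float", float)):
        try:
            cast(text)
            return kind
        except ValueError:
            continue
    return "str"


def clean_by_majority(column):
    """Pick the most common kind in a column; cells of another kind become None."""
    if not column:
        return "str", []
    kinds = [infer_kind(v) for v in column]
    majority = Counter(kinds).most_common(1)[0][0]
    cast = {"int": int, "float": float, "str": str}[majority]
    cleaned = []
    for value, kind in zip(column, kinds):
        # An integer cell is still a valid value in a float column.
        compatible = kind == majority or (majority == "float" and kind == "int")
        cleaned.append(cast(str(value).strip()) if compatible else None)
    return majority, cleaned


def normalize_case(column):
    """Lower-case and trim strings so 'Male', 'MALE', and 'male ' share one label."""
    return [v.strip().lower() if isinstance(v, str) else v for v in column]


def drop_outliers(values, k=1.5):
    """Keep values inside the boxplot whiskers: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


if __name__ == "__main__":
    print(clean_by_majority(["1", "2", "x", "4"]))
    print(normalize_case(["Male", "MALE", "male "]))
    print(drop_outliers([10, 12, 11, 13, 12, 11, 100]))
```

Each function takes a single column as a list, so they can be chained in whatever order suits a given dataset. Whether a mismatched or extreme value should really be dropped is a per-dataset decision, which is why the threshold `k` is left as a parameter.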
@mauwazahmed is the data public? I could take a look at the data and see what I can do. I could help you to tailor a script that can do most of that.
@henrykironde I'm just suggesting some cleaning tasks that could be part of the automation.
Closing since this was more of a discussion topic. Feel free to open specific requests in their own issues.
The retriever-recipes repository seems to consist entirely of DataPackage JSON "scripts", and I'm sure that there are some things you might want to do to clean up a dataset that aren't supported in this format. Can you write a Python function to retrieve and clean up the data, for example?
Relatedly, is it possible to define parts of a dataset, so the user can choose which part to retrieve (e.g., a single year of a dataset spanning multiple years)?