Talking about this issue reminds me of a feature I had thought of last GSoC season. Dataset sizes have been on the rise, and the use of cloud services is common for training/inference/analysis. What do you think about a feature that would let users choose their preferred cloud provider, with a means to easily fetch, update, and upload datasets? This might enable people to use custom datasets and upload/commit changes to the dataset and their model. @ethanwhite @henrykironde
I may be wrong, but I think we do have what you mentioned in the Python interface but not in the command line. Users can create and fetch datasets from their own repositories using get_script_upstream, but we have not tested this with any repo other than the retriever-recipes repo. You could add an issue and we can improve this.
Thanks for opening this issue @DumbMachine. I definitely think getting some Kaggle datasets into the retriever would be useful and I think setting up API key support is a generally useful way for us to start incorporating datasets requiring some sort of authorization. A couple of quick thoughts related to your points above:
To download the dataset, we would have to call the kaggle cli.
The good news is that this is a Python package, so packaging should be OK, but if possible I think it would be good to call the package directly using its Python interface (we've found calling things from the CLI to be fragile to maintain across different systems).
We won't be able to provide support for all database systems, as many files on Kaggle can be images/videos/other formats (like pretrained vectors).
For now I think we would want to hand-pick datasets to include, with a focus on tabular or spatial data. While we do have "download only" functionality for things like images and videos, our real strength at the moment is in processing the other two data types.
After some searching around and reading the Kaggle API code, I found that we can download files without having to use the CLI. The code for it is as follows:
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
# Authentication uses the access token in the ~/.kaggle folder if present;
# otherwise we can provide the location of that file.
api.authenticate()

# Downloads the dataset at https://www.kaggle.com/tunguz/movietweetings
api.dataset_download_files("tunguz/movietweetings")
It is pretty straightforward: we call dataset_download_files or competition_download_files according to the origin of the particular dataset. If this seems fine, I'll go ahead and open a PR for it.
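For reference, a minimal sketch of how that dispatch could look (the download_from_kaggle helper and the owner/slug naming convention are assumptions for illustration, not existing retriever code):

from kaggle.api.kaggle_api_extended import KaggleApi

def download_from_kaggle(identifier, path="."):
    # Hypothetical helper: dataset identifiers look like "owner/dataset-name",
    # while competition identifiers are a single slug like "titanic".
    api = KaggleApi()
    api.authenticate()  # reads ~/.kaggle/kaggle.json or KAGGLE_USERNAME/KAGGLE_KEY
    if "/" in identifier:
        api.dataset_download_files(identifier, path=path, unzip=True)
    else:
        api.competition_download_files(identifier, path=path)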
Sounds good to me @DumbMachine. Ideally I think this will end up being two PRs. The first into this repo implementing the Kaggle API related code and any associated checks of the dataset recipe (i.e., to determine that the dataset is a Kaggle dataset). Ideally this PR would also include a test for this functionality, but to do this we'll need to figure out how to pass an api key securely on the continuous integration systems (we do this already, but would need to figure out the details for a key that needs to be stored in a file). The second PR would be to the https://github.com/weecology/retriever-recipes/ repo with a script for a tabular dataset from Kaggle that we can test the code against.
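One possible approach for the CI piece (just a sketch, assuming the credentials are exposed to the build as KAGGLE_USERNAME and KAGGLE_KEY secrets) would be a small setup step that writes the key file the Kaggle client expects:

import json
import os
from pathlib import Path

# Materialize ~/.kaggle/kaggle.json from secret environment variables
kaggle_dir = Path.home() / ".kaggle"
kaggle_dir.mkdir(exist_ok=True)
cred_file = kaggle_dir / "kaggle.json"
cred_file.write_text(json.dumps({
    "username": os.environ["KAGGLE_USERNAME"],
    "key": os.environ["KAGGLE_KEY"],
}))
cred_file.chmod(0o600)  # the Kaggle client warns if the file is readable by others

The kaggle package can also read KAGGLE_USERNAME/KAGGLE_KEY directly from the environment (as in the export lines in the original issue below), which might avoid needing the file at all.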
@ethanwhite, I have thought of the following flow: first, the retriever will search for the dataset in the current scripts; if it is present, it proceeds as normal. If not, it calls a Kaggle API function to check whether Kaggle has the dataset. If it does, that is set as the script name and returned; if not, an error message is displayed. I'm thinking something along the lines of:
# reference https://github.com/weecology/retriever/blob/647441104d87979738104fe0dab9048f0c85c5be/retriever/lib/scripts.py#L170

# in the imports section
from kaggle.api.kaggle_api_extended import KaggleApi

...
...

if read_script:
    return [read_script]

# Since the dataset was not found in scripts, it might be a Kaggle dataset
api = KaggleApi()
api.authenticate()
kgl_search_results = api.dataset_list(search=arg)
# dataset_list returns dataset objects, not strings, so compare against their refs
if any(str(dataset) == arg for dataset in kgl_search_results):
    return [arg]
Make a separate engine for Kaggle, with most functions as dummies, since we only need to make a single function call to download a dataset:
api.dataset_download_files(<dataset_name>, <path>)
If this works, then I'll start working on the PR. Also, can you please tell me the data type of data_sets_scripts? ref
Thanks @DumbMachine. I'm envisioning something a little different, I think. The core benefit of the retriever is that it doesn't just download files; it does things with them (loads them into Python, databases, etc.). To support this we need information about each dataset, which is stored in the JSON recipes. So, I'm envisioning the following:
1. Add a new field to the JSON recipe indicating that the dataset comes from Kaggle.
2. When installing, check that field to determine whether the dataset is a Kaggle dataset.
3. Check for an API key in ~/.kaggle and, if it is not present, send an error message to the user that they need an API key for the dataset, with a link to a description of how to get one.
4. Use KaggleAPI.dataset_download_files to download the files instead of our standard download code.
So, the design idea is that instead of just providing a wrapper for a function in the Kaggle API package, we want to use that function to download files that then get used in our pipeline. Does that make sense?
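To make that concrete, a rough sketch of the flow (the download_kaggle_dataset helper, the script.kaggle field, and the use of script.name as the Kaggle identifier are assumptions for illustration, not existing retriever code):

from pathlib import Path
from kaggle.api.kaggle_api_extended import KaggleApi

def download_kaggle_dataset(script, target_dir):
    # (2) the JSON recipe carries a field marking this as a Kaggle dataset
    if not getattr(script, "kaggle", False):
        raise ValueError("Not a Kaggle recipe; use the standard download path.")
    # (3) require an API key, otherwise point the user at the docs
    if not (Path.home() / ".kaggle" / "kaggle.json").exists():
        raise RuntimeError(
            "This dataset requires a Kaggle API key. "
            "See https://www.kaggle.com/docs/api for how to create one."
        )
    # (4) let the Kaggle client fetch the files; the rest of the normal
    # pipeline (table creation, loading into engines) then runs on them
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(script.name, path=target_dir, unzip=True)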
Yes, this does make sense. Should this implementation be done using a new KaggleEngine?
The engines are different output formats, not different input formats. I think the key logic of (2-4) will go in download_file and download_files_from_archive, which are both in engine.
To obtain the new field from the JSON recipe (1), it's possible that a change to load_json.py might be necessary, but it may also just show up by default (that would take a little closer reading).
@henrykironde - does this all sound right to you?
Yes, that is right. We shall add a key:value in the script, i.e. "kaggle": "true" or "apitype": "kaggle". In the download function, we shall check whether this key:value is set, indicating that a Kaggle data package is being downloaded. If it is a Kaggle dataset, use the KaggleAPI to download the data.
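As a rough sketch (the field name and the shape of the parsed recipe below are illustrative only; where the check lives is still open):

# Stand-in for a parsed JSON recipe with the proposed flag
recipe = {
    "name": "movietweetings",
    "apitype": "kaggle",  # or "kaggle": "true", as suggested above
}

if recipe.get("apitype") == "kaggle" or recipe.get("kaggle") == "true":
    # route the download through the Kaggle API
    ...
else:
    # fall back to the standard URL-based download code
    ...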
Kaggle is home to numerous datasets and pre-trained models. Adding support for the Kaggle API would make it easy to download and upload datasets. To interact with the Kaggle API, a user requires two things: the kaggle Python package and API credentials, which can be set, for example, as environment variables:
export KAGGLE_USERNAME=datadinosaur
export KAGGLE_KEY=xxxxxxxxxxxxxx
Once access via the above is ensured, the kaggle CLI commands can be used. We only have to take care of the competitions and datasets commands for inclusion into the retriever, as those are the commands used to download Kaggle competition datasets and other datasets uploaded by Kaggle users. To download a dataset, we would have to call the kaggle CLI; this could be done with ease.
We won't be able to provide support for all database systems, as many files on Kaggle can be images/videos/other formats (like pretrained vectors). But this shouldn't be a problem for the initial release.
TL;DR: we check for Kaggle API credentials, then execute the appropriate kaggle CLI commands.