weecology / retriever

Quickly download, clean up, and install public datasets into a database management system
http://data-retriever.org

Add support for Kaggle Datasets #1407

Closed DumbMachine closed 3 years ago

DumbMachine commented 4 years ago

Kaggle is home to numerous datasets and pre-trained models. Adding support for the Kaggle API would make it easy to download and upload datasets. To interact with the Kaggle API, a user requires two things:

- a Kaggle account
- an API token (the kaggle.json credentials file, placed in the ~/.kaggle folder)

Once access to the above is ensured, the following commands can be used:

kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}

We only need to handle the competitions & datasets commands for inclusion into retriever, as those are the commands used to download Kaggle competition datasets and other datasets uploaded by Kaggle users.

To download a dataset, we would have to call the kaggle CLI. This could be done with ease:

from subprocess import PIPE, Popen

# Shell out to the kaggle CLI to download a dataset by its "owner/name" slug
process = Popen(['kaggle', 'datasets', 'download', '-d', 'alessiocorrado99/animals10'],
                stdout=PIPE, stderr=PIPE)
stdout, stderr = process.communicate()

We won't be able to provide support for all database systems, as many files on Kaggle are images/videos/other formats (like pretrained vectors). But this shouldn't be a problem for the initial release.

TL;DR: we check for Kaggle API credentials and then execute the appropriate kaggle CLI commands.
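As a rough illustration, the credentials check could look something like this (a sketch; the exact integration point in retriever is still open):

import os

def kaggle_credentials_present():
    """Check for Kaggle API credentials in the places the kaggle package reads.

    The kaggle package looks for ~/.kaggle/kaggle.json, and can also read the
    KAGGLE_USERNAME/KAGGLE_KEY environment variables.
    """
    token_path = os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")
    if os.path.isfile(token_path):
        return True
    return bool(os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"))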

DumbMachine commented 4 years ago

Talking about this issue reminds me of a feature I thought of last GSoC season. Dataset sizes have been on the rise, and using cloud services for training/inference/analysis is common. What do you think about a feature that would let users choose their preferred cloud provider, with us providing a means to easily fetch, update, and upload datasets? This might enable people to use custom datasets and upload/commit changes to their datasets and models @ethanwhite @henrykironde

henrykironde commented 4 years ago

I may be wrong, but I think we have what you mentioned in the Python interface, though not in the command line. Users can create and fetch datasets from their own repositories using get_script_upstream, but we have not tested this with any repo other than the retriever-recipes repo. You could open an issue and we can improve this.

ethanwhite commented 4 years ago

Thanks for opening this issue @DumbMachine. I definitely think getting some Kaggle datasets into the retriever would be useful and I think setting up API key support is a generally useful way for us to start incorporating datasets requiring some sort of authorization. A couple of quick thoughts related to your points above:

To download a dataset, we would have to call the kaggle CLI.

The good news is that this is a Python package, so packaging should be OK, but if possible I think it would be good to call the package directly using its Python interface (we've found calling things from the CLI to be fragile to maintain across different systems).

We won't be able to provide support for all database systems, as many files on Kaggle are images/videos/other formats (like pretrained vectors).

For now I think we would want to hand pick datasets to include with a focus on tabular or spatial data. While we do have "download only" functionality for things like images and videos our real strength is in processing the other two data types at the moment.

DumbMachine commented 4 years ago

After some searching around and reading the Kaggle API code, I found that we can download files without having to use the CLI. The code is as follows:

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
# Authentication uses the token in ~/.kaggle automatically; alternatively,
# the location of the credentials file can be supplied via configuration
api.authenticate()
# Downloads the dataset at https://www.kaggle.com/tunguz/movietweetings
api.dataset_download_files("tunguz/movietweetings")

It is pretty straightforward: we call dataset_download_files or competition_download_files depending on the origin of the particular dataset. If this seems fine, I'll go ahead and open a PR for it.
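Competition files can be fetched analogously (a sketch; the public Titanic competition is just a stand-in example, and competition downloads additionally require accepting the competition rules on kaggle.com):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
# Hypothetical example: competition files are fetched by competition slug
api.competition_download_files("titanic", path="raw_data/")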

ethanwhite commented 4 years ago

Sounds good to me @DumbMachine. Ideally I think this will end up being two PRs. The first into this repo implementing the Kaggle API related code and any associated checks of the dataset recipe (i.e., to determine that the dataset is a Kaggle dataset). Ideally this PR would also include a test for this functionality, but to do this we'll need to figure out how to pass an api key securely on the continuous integration systems (we do this already, but would need to figure out the details for a key that needs to be stored in a file). The second PR would be to the https://github.com/weecology/retriever-recipes/ repo with a script for a tabular dataset from Kaggle that we can test the code against.
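For the CI piece, one possible approach (a sketch, not a settled design) is to store the credentials as ordinary CI secret environment variables and write out the file the kaggle package expects at test time; the package can also read KAGGLE_USERNAME/KAGGLE_KEY from the environment directly:

import json
import os
import stat

def write_kaggle_token_from_env():
    """Materialize ~/.kaggle/kaggle.json from CI secret environment variables.

    Assumes KAGGLE_USERNAME and KAGGLE_KEY are configured as secrets on the
    CI service.
    """
    kaggle_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)
    token_path = os.path.join(kaggle_dir, "kaggle.json")
    with open(token_path, "w") as token_file:
        json.dump({"username": os.environ["KAGGLE_USERNAME"],
                   "key": os.environ["KAGGLE_KEY"]}, token_file)
    # The kaggle client warns unless the token is readable only by its owner
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)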

DumbMachine commented 4 years ago

@ethanwhite, I have thought of the following flow: first, retriever searches for the dataset in the current scripts; if it is present, it proceeds as normal. If not, it calls a Kaggle API function to check whether Kaggle has the dataset. If yes, that name is set as the script name and returned; if not, an error message is displayed. I'm thinking something along the lines of:

# reference https://github.com/weecology/retriever/blob/647441104d87979738104fe0dab9048f0c85c5be/retriever/lib/scripts.py#L170

# in the imports section
from kaggle.api.kaggle_api_extended import KaggleApi
...
... 

    if read_script:
        return [read_script]

    # Since the dataset was not found in scripts, it might be a Kaggle dataset
    api = KaggleApi()
    api.authenticate()
    kgl_search_results = api.dataset_list(search=arg)
    # dataset_list returns dataset objects; compare against their "owner/name" refs
    if arg in [str(result) for result in kgl_search_results]:
        return [arg]

Make a separate engine for Kaggle, with most functions as dummies, since we only need a single function call to download a dataset:

api.dataset_download_files(<dataset_name>, <path>)
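Concretely, something like this (a sketch; the dataset ref and path are placeholders):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
# Fetch and unpack one dataset into a local directory (placeholder path)
api.dataset_download_files("tunguz/movietweetings", path="raw_data/", unzip=True)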

If this works, I'll start working on the PR. Also, could you please tell me the data type of data_sets_scripts? ref

ethanwhite commented 4 years ago

Thanks @DumbMachine. I'm envisioning something a little different I think. The core benefit of the retriever is that it doesn't just download files, it does things with them (loads them into Python, databases, etc.). To support this we need information about each dataset, which is stored in the JSON recipes. So, I'm envisioning the following:

  1. Add a JSON field to indicate that the dataset is on Kaggle (we can expand this to other data sources with API keys later)
  2. Check to see if that field is present and indicates that the dataset is a Kaggle dataset
  3. If so, check whether an API key is present in ~/.kaggle; if not, send the user an error message saying that they need an API key for the dataset, with a link to a description of how to get one.
  4. If the key is present, then use KaggleAPI.dataset_download_files to download the files instead of our standard download code.

So, the design idea is that instead of just providing a wrapper for a function in the Kaggle api package, we want to use that function to download files that then get used in our pipeline. Does that make sense?
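A minimal sketch of steps 2-4, assuming a hypothetical recipe field and helper name (nothing here is settled API):

import os

from kaggle.api.kaggle_api_extended import KaggleApi

KAGGLE_TOKEN = os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")

def download_kaggle_dataset(dataset_ref, path):
    """Fetch a Kaggle dataset's files so the normal retriever pipeline can process them.

    `dataset_ref` is the "owner/name" ref that a (hypothetical) recipe field
    such as "apitype": "kaggle" would point us to.
    """
    if not os.path.isfile(KAGGLE_TOKEN):
        raise RuntimeError(
            "This dataset requires a Kaggle API key in ~/.kaggle/kaggle.json. "
            "See https://www.kaggle.com/docs/api for how to create one.")
    api = KaggleApi()
    api.authenticate()
    # Download and unpack the raw files; retriever's standard processing
    # takes over from `path`
    api.dataset_download_files(dataset_ref, path=path, unzip=True)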

DumbMachine commented 4 years ago

Yes, this does make sense. Should this implementation be done using a new KaggleEngine?

ethanwhite commented 4 years ago

The engines are different output formats, not different input formats. I think the key logic of (2-4) will go in download_file and download_files_from_archive, which are both in engine.py.

To obtain the new field from the JSON recipe (1), it's possible that a change to load_json.py might be necessary, but it may also just show up by default (that would take a little closer reading).

@henrykironde - does this all sound right to you?

henrykironde commented 4 years ago

Yes, that is right. We shall add a key:value pair in the script, i.e. "kaggle": "true" or "apitype": "kaggle". In the download function, we shall check whether this key:value is set, indicating that a Kaggle datapackage is being downloaded. If it is a Kaggle dataset, use the KaggleAPI to download the data.
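For illustration, a hypothetical recipe fragment using the second suggested field name (the kaggle_ref field is also hypothetical, just to show where the "owner/name" ref could live):

{
    "name": "movietweetings",
    "apitype": "kaggle",
    "kaggle_ref": "tunguz/movietweetings"
}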