Added Socrata API to retriever, Added tests for the socrata functions

Aakash3101 commented 3 years ago

Added the Socrata API

Total number of datasets supported: 85,244 out of 213,965

- Added the `-socrata` flag to the `retriever ls` command. Use `retriever ls -socrata <name>` to display a list of suggested dataset names matching the name you are looking for. Then select a dataset name, and the details of the selected dataset are displayed.
- Added the `--socrata` flag to the `retriever install` and `retriever download` commands. Use `retriever install --socrata <engine> <socrata-id>` to install the Socrata dataset identified by that socrata-id: it first downloads the raw data, creates a script for the data, updates the script accordingly, and then installs the script into the engine. Use `retriever download --socrata <engine> <socrata-id>` to download the raw data files of the Socrata dataset identified by its socrata-id.
- Added tests for all the Socrata functions.

Notes:

- `retriever ls -socrata` with an empty input displays the `retriever ls -h -socrata` output. An input with a spelling error returns a list with 0 results.
- `retriever download --socrata <id>` with an incorrectly formatted id displays `Id <id> is incorrectly formatted and will not identify any asset.`
- `retriever download --socrata <id>` with an id that identifies an unsupported dataset displays `Dataset cannot be downloaded using Socrata API because <datatype> type datasets are not supported`
- retriever currently supports only tabular type datasets, which excludes `{map: tabular}` datasets.
- Tabular datasets that have no column names in their metadata trigger a separate message from the `retriever install` and `retriever download` commands.
- If a Socrata dataset has already been downloaded, the data is not downloaded again; likewise, if its script already exists, the script creation process is not repeated.
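The id format check mentioned in the notes can be sketched as follows. This is a minimal illustration, not retriever's actual validation code: the regular expression assumes that a four-by-four id is two groups of four lowercase alphanumeric characters joined by a hyphen, and the helper names `is_four_by_four` and `format_error` are hypothetical.

```python
import re

# Assumption: a Socrata "four-by-four" id looks like 35s3-nmpm, i.e. two
# groups of four lowercase alphanumeric characters separated by a hyphen.
FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")


def is_four_by_four(socrata_id: str) -> bool:
    """Return True if the whole string matches the assumed id pattern."""
    return FOUR_BY_FOUR.fullmatch(socrata_id) is not None


def format_error(socrata_id: str) -> str:
    """Build the message quoted in the notes for a badly formatted id."""
    return f"Id {socrata_id} is incorrectly formatted and will not identify any asset."


for candidate in ("35s3-nmpm", "35s3nmpm"):
    if not is_four_by_four(candidate):
        print(format_error(candidate))
```

Checking the format locally avoids a network request for ids that cannot identify any asset.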

Aakash3101 commented 3 years ago

The tests fail because the build fails with the following error: `Service 'python_retriever' failed to build : The command '/bin/sh -c pip install psycopg2-binary -U' returned a non-zero code: 1`

Aakash3101 commented 3 years ago

Updates

- Changed the `retriever ls -socrata` command to `retriever ls -s`.
- Removed the user prompt to name the script for the Socrata dataset. The script name now has the format `socrata-<socrata_id>`, e.g. `socrata-35s3-nmpm`.
- Moved the `inquirer` module import in `defaults.py` into a try-except block. To run `retriever ls -s`, the user must install the `inquirer` module; an error message is shown if `inquirer` is not installed when the command runs. Added `KeyboardInterrupt` handling for the Ctrl+C event while selecting names.
- Downloading and installing Socrata datasets has changed from the previous `--socrata` method to `retriever download socrata-35s3-nmpm` and `retriever install postgres socrata-35s3-nmpm`. The dataset name format is `socrata-<dataset-id>`.
- Renamed `install_socrata_dataset` to `create_socrata_dataset`.
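The `socrata-<dataset-id>` naming convention can be illustrated with a short sketch. The helper names below are hypothetical and are not part of retriever; they only show how a dataset name maps to a Socrata id and back.

```python
from typing import Optional

SOCRATA_PREFIX = "socrata-"


def dataset_name_from_id(socrata_id: str) -> str:
    """Build the retriever dataset name for a Socrata id."""
    return SOCRATA_PREFIX + socrata_id


def id_from_dataset_name(name: str) -> Optional[str]:
    """Return the Socrata id if the name follows the convention, else None."""
    if name.startswith(SOCRATA_PREFIX):
        return name[len(SOCRATA_PREFIX):]
    return None


print(dataset_name_from_id("35s3-nmpm"))
print(id_from_dataset_name("socrata-35s3-nmpm"))
```

A prefix check of this kind is one way a command such as `retriever download <name>` could tell Socrata datasets apart from regular ones without a separate flag.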

Interface examples

1. `socrata_autocomplete_search()`
The input argument should be a list of strings. The function returns a list of strings, which are the autocompleted names.
```python
>>> import retriever as rt
>>> names = rt.socrata_autocomplete_search(['clinic', '2015', '2016'])
>>> for name in names:
...     print(name)
...
2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More Service Providers
2015 - 2016 Clinical Quality Comparison (>=5 Providers) by Geography
2016 & 2015 Clinic Quality Comparisons for Clinics with Fewer than Five Service Providers
```

2. `socrata_dataset_info()`
The input argument should be a string (a valid dataset name returned by `socrata_autocomplete_search()`). The function returns a list of dicts, because multiple datasets on Socrata can share the same name (e.g. Building Permits).
```python
>>> import retriever as rt
>>> resource = rt.socrata_dataset_info('2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More Service Providers')
>>> from pprint import pprint
>>> pprint(resource)
[{'description': 'This data set includes comparative information for clinics '
                 'with five or more physicians for medical claims in 2015 - '
                 '2016. \r\n'
                 '\r\n'
                 'This data set was calculated by the Utah Department of '
                 'Health, Office of Healthcare Statistics (OHCS) using Utah’s '
                 'All Payer Claims Database (APCD).',
  'domain': 'opendata.utah.gov',
  'id': '35s3-nmpm',
  'link': 'https://opendata.utah.gov/Health/2016-2015-Clinic-Quality-Comparisons-for-Clinics-w/35s3-nmpm',
  'name': '2016 & 2015 Clinic Quality Comparisons for Clinics with Five or '
          'More Service Providers',
  'type': {'dataset': 'tabular'}}]
```

3. `find_socrata_dataset_by_id()`
The input argument should be the four-by-four Socrata dataset identifier (e.g. 35s3-nmpm). The function returns a dict that contains metadata about the dataset.
```python
>>> import retriever as rt
>>> from pprint import pprint
>>> resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
>>> pprint(resource)
{'datatype': 'tabular',
 'description': 'This data set includes comparative information for clinics '
                'with five or more physicians for medical claims in 2015 - '
                '2016. \r\n'
                '\r\n'
                'This data set was calculated by the Utah Department of '
                'Health, Office of Healthcare Statistics (OHCS) using Utah’s '
                'All Payer Claims Database (APCD).',
 'domain': 'opendata.utah.gov',
 'homepage': 'https://opendata.utah.gov/Health/2016-2015-Clinic-Quality-Comparisons-for-Clinics-w/35s3-nmpm',
 'id': '35s3-nmpm',
 'keywords': ['socrata'],
 'name': '2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More '
         'Service Providers'}
```

4. `update_socrata_contents(json_file, script_name, url, resource)`
This function updates the contents of the Socrata script created by `create_socrata_dataset()`. The input arguments are:
- json_file = The content of the script created
- script_name = The name of the script
- url = The url through which the dataset is downloaded
- resource = The object returned by `find_socrata_dataset_by_id()`

The function returns `True, json_file` if the resource dict is correct; otherwise, it returns `False, None`.
```python
import json

import retriever as rt
from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
filename = resource["id"] + '.csv'
url = 'https://' + resource["domain"] + '/resource/' + filename
script_name = 'socrata-35s3-nmpm'
script_path = SOCRATA_SCRIPT_WRITE_PATH
# Script file names are assumed to use underscores in place of hyphens,
# e.g. socrata_35s3_nmpm.json.
script_filename = script_name.replace("-", "_") + ".json"
# Load the existing script contents; the with block closes the file.
with open(f"{script_path}/{script_filename}", "r") as f:
    json_file = json.load(f)
result, json_file = rt.update_socrata_contents(json_file, script_name, url, resource)
```
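The snippet above assembles the file name, download URL, and script name by hand from the resource dict. A small helper can collect these in one place. This is only a sketch: `socrata_script_inputs` is a hypothetical name, and it assumes the resource dict carries the `id` and `domain` keys shown in the `find_socrata_dataset_by_id()` output.

```python
def socrata_script_inputs(resource: dict) -> dict:
    """Derive the file name, CSV URL, and script name from a resource dict."""
    filename = resource["id"] + ".csv"
    return {
        "filename": filename,
        "url": "https://" + resource["domain"] + "/resource/" + filename,
        "script_name": "socrata-" + resource["id"],
    }


# Values taken from the example dataset used throughout this thread.
inputs = socrata_script_inputs({"id": "35s3-nmpm", "domain": "opendata.utah.gov"})
print(inputs["url"])  # https://opendata.utah.gov/resource/35s3-nmpm.csv
```

The resulting values correspond to the `filename`, `url`, and `script_name` arguments passed to `update_socrata_contents()` and `update_socrata_script()`.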

5. `update_socrata_script(script_name, filename, url, resource, script_path)`
This function renames the script, calls `update_socrata_contents()`, and then writes the new content returned by `update_socrata_contents()`.
```python
import retriever as rt
from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

script_path = SOCRATA_SCRIPT_WRITE_PATH
resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
filename = resource["id"] + '.csv'
url = 'https://' + resource["domain"] + '/resource/' + filename
script_name = 'socrata-35s3-nmpm'
rt.update_socrata_script(script_name, filename, url, resource, script_path)
```

6. `create_socrata_dataset(engine, name, resource, script_path=None)`
This function creates Socrata dataset scripts for retriever. It downloads the raw data, creates the script, updates it, and finally installs the dataset using that script.
```python
import retriever as rt
from retriever.engines import choose_engine
from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

# engine = choose_engine({'command':'install', 'engine':'postgres'})
# OR
engine = choose_engine({'command': 'download'})
script_path = SOCRATA_SCRIPT_WRITE_PATH
resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
name = 'socrata-35s3-nmpm'
rt.create_socrata_dataset(engine, name, resource, script_path)
```

Downloading a socrata dataset

```python
import retriever as rt
rt.download('socrata-35s3-nmpm')
```

Installing a socrata dataset

```python
import retriever as rt
rt.install_postgres('socrata-35s3-nmpm')
```

henrykironde commented 3 years ago

Looks like a style error 🙂, so says flakes. We have many of those; for example, `if args.dataset is not None:` can be shortened to `if args.dataset`.

Aakash3101 commented 3 years ago

> Looks like a style error, so says flakes. we have many of those like, if args.dataset is not None: >>> if args.dataset

Sorry, my bad @henrykironde

Aakash3101 commented 3 years ago

@henrykironde I just realised that `rt.download()` won't download the raw data files for Socrata datasets whose script has not been generated yet.

Update

I have added the download functionality to the `retriever/lib/download.py` file, so `rt.download()` now works correctly for these datasets.
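The "generate the script only when it is missing" behaviour can be sketched as below. This is a hypothetical illustration, not the code added to `retriever/lib/download.py`: `ensure_script` and `fake_build` are made-up names, and the underscore-based file name is an assumption.

```python
import json
import tempfile
from pathlib import Path


def ensure_script(script_dir: Path, script_name: str, build_script) -> Path:
    """Create the script file only if it does not exist yet; return its path."""
    # Assumption: script files use underscores in place of hyphens.
    path = script_dir / (script_name.replace("-", "_") + ".json")
    if not path.exists():
        path.write_text(json.dumps(build_script(script_name)))
    return path


calls = []


def fake_build(name):
    """Stand-in for the real script generation step; records each call."""
    calls.append(name)
    return {"name": name}


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_script(Path(tmp), "socrata-35s3-nmpm", fake_build)
    second = ensure_script(Path(tmp), "socrata-35s3-nmpm", fake_build)

print(len(calls))  # 1: the second call reuses the existing script
```

With a check like this in front of the download step, a dataset whose script has not been generated yet gets one first, and repeated calls do not regenerate it.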

Aakash3101 commented 3 years ago

Refactoring is complete. Do let me know if you want any other changes @henrykironde

henrykironde commented 3 years ago

Why do we get these results?

 ✗ retriever ls -socrata     
Autocomplete suggestions : Total 0 results

(testenv) ➜  retriever git:(test) ✗

Aakash3101 commented 3 years ago

> Why do we get these results?
>
> ✗ retriever ls -socrata
> Autocomplete suggestions : Total 0 results
>
> (testenv) ➜  retriever git:(test) ✗

I just noticed that you are using `retriever ls -socrata`. I have changed the command to `retriever ls -s`.


╰─ retriever ls -s 
usage: retriever ls [-h] [-l L [L ...]] [-k K [K ...]] [-v V [V ...]]
                    [-s S [S ...]]
retriever ls: error: argument -s: expected at least one argument

╰─ retriever ls -socrata 
Autocomplete suggestions : Total 0 results

Here is the update log comment link
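A likely explanation for the `Total 0 results` output, assuming the CLI is built on `argparse` (which the usage message format suggests): a short option accepts a value attached directly to it, so `-socrata` is read as `-s` with the value `ocrata`, and a search for `ocrata` finds nothing. The parser below is a minimal stand-in that mirrors only the `-s` option from the usage text; it is not retriever's actual parser.

```python
import argparse

# Stand-in parser with a single -s option that takes one or more values.
parser = argparse.ArgumentParser(prog="retriever ls")
parser.add_argument("-s", nargs="+", help="search terms (hypothetical help text)")

# "-socrata" is parsed as the short option -s with the attached value "ocrata".
attached = parser.parse_args(["-socrata"])
print(attached.s)  # ['ocrata']

# The intended usage passes the search terms as separate arguments.
separate = parser.parse_args(["-s", "clinic", "2015"])
print(separate.s)  # ['clinic', '2015']
```

This also matches the error shown for a bare `retriever ls -s`: with `nargs="+"`, `argparse` requires at least one value after the option.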

henrykironde commented 3 years ago

@Aakash3101 let us test this together. Whenever you are ready, ping me here and on Gitter.

Aakash3101 commented 3 years ago

> @Aakash3101 lets us test this together. Whenever you ready ping me here and on gitter.

Yeah let's test it now

henrykironde commented 3 years ago

Thank you for the work done on this issue @Aakash3101

weecology / retriever