weecology / retriever

Quickly download, clean up, and install public datasets into a database management system
http://data-retriever.org

Added Socrata API to retriever, Added tests for the socrata functions #1600

Closed Aakash3101 closed 3 years ago

Aakash3101 commented 3 years ago

Added the Socrata API

Total number of datasets supported: 85,244 out of 213,965 (about 40%)

Added tests for all the Socrata functions.


Aakash3101 commented 3 years ago

The tests fail because the Docker build breaks with this error: Service 'python_retriever' failed to build : The command '/bin/sh -c pip install psycopg2-binary -U' returned a non-zero code: 1

Aakash3101 commented 3 years ago

Updates

Interface examples

  1. socrata_autocomplete_search() The input argument should be a list of strings. The function returns a list of strings: the autocompleted dataset names.
    
    >>> import retriever as rt
    >>> names = rt.socrata_autocomplete_search(['clinic', '2015', '2016'])
    >>> for name in names:
    ...     print(name)
    ... 
    2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More Service Providers
    2015 - 2016 Clinical Quality Comparison (>=5 Providers) by Geography
    2016 & 2015 Clinic Quality Comparisons for Clinics with Fewer than Five Service Providers
  2. socrata_dataset_info() The input argument should be a string: a valid dataset name returned by socrata_autocomplete_search(). The function returns a list of dicts, because several datasets on Socrata can share the same name (e.g. Building Permits).

    >>> import retriever as rt
    >>> from pprint import pprint
    >>> resource = rt.socrata_dataset_info('2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More Service Providers')
    >>> pprint(resource)
    [{'description': 'This data set includes comparative information for clinics '
                     'with five or more physicians for medical claims in 2015 - '
                     '2016. \r\n'
                     '\r\n'
                     'This data set was calculated by the Utah Department of '
                     'Health, Office of Healthcare Statistics (OHCS) using Utah’s '
                     'All Payer Claims Database (APCD).',
      'domain': 'opendata.utah.gov',
      'id': '35s3-nmpm',
      'link': 'https://opendata.utah.gov/Health/2016-2015-Clinic-Quality-Comparisons-for-Clinics-w/35s3-nmpm',
      'name': '2016 & 2015 Clinic Quality Comparisons for Clinics with Five or '
              'More Service Providers',
      'type': {'dataset': 'tabular'}}]
  3. find_socrata_dataset_by_id() The input argument should be the four-by-four Socrata dataset identifier (e.g. 35s3-nmpm). The function returns a dict containing metadata about the dataset.

    >>> import retriever as rt
    >>> from pprint import pprint
    >>> resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
    >>> pprint(resource)
    {'datatype': 'tabular',
     'description': 'This data set includes comparative information for clinics '
                    'with five or more physicians for medical claims in 2015 - '
                    '2016. \r\n'
                    '\r\n'
                    'This data set was calculated by the Utah Department of '
                    'Health, Office of Healthcare Statistics (OHCS) using Utah’s '
                    'All Payer Claims Database (APCD).',
     'domain': 'opendata.utah.gov',
     'homepage': 'https://opendata.utah.gov/Health/2016-2015-Clinic-Quality-Comparisons-for-Clinics-w/35s3-nmpm',
     'id': '35s3-nmpm',
     'keywords': ['socrata'],
     'name': '2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More '
             'Service Providers'}
  4. update_socrata_contents(json_file, script_name, url, resource) This function updates the contents of the Socrata script created by create_socrata_dataset(). It returns (True, json_file) if the resource dict is correct; otherwise it returns (False, None). The input arguments are:
    • json_file = The content of the script created
    • script_name = The name of the script
    • url = The URL from which the dataset is downloaded
    • resource = The dict returned by find_socrata_dataset_by_id()

    import json
    import retriever as rt
    from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

    resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
    filename = resource["id"] + '.csv'
    url = 'https://' + resource["domain"] + '/resource/' + filename
    script_name = 'socrata-35s3-nmpm'
    script_path = SOCRATA_SCRIPT_WRITE_PATH
    # The script file name uses underscores in place of hyphens
    script_filename = script_name.replace("-", "_") + ".json"
    # Load the script previously written by create_socrata_dataset()
    with open(f"{script_path}/{script_filename}", "r") as f:
        json_file = json.load(f)
    result, json_file = rt.update_socrata_contents(json_file, script_name, url, resource)
  5. update_socrata_script(script_name, filename, url, resource, script_path) This function renames the script, calls update_socrata_contents(), and then writes the new content that update_socrata_contents() returns.

    import retriever as rt
    from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

    script_path = SOCRATA_SCRIPT_WRITE_PATH
    resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
    filename = resource["id"] + '.csv'
    url = 'https://' + resource["domain"] + '/resource/' + filename
    script_name = 'socrata-35s3-nmpm'
    rt.update_socrata_script(script_name, filename, url, resource, script_path)
  6. create_socrata_dataset(engine, name, resource, script_path=None) This function creates a Socrata dataset script for retriever: it downloads the raw data, creates the script, updates it, and finally installs the dataset using that script.

    import retriever as rt
    from retriever.engines import choose_engine
    from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

    # engine = choose_engine({'command':'install', 'engine':'postgres'})
    # OR
    engine = choose_engine({'command': 'download'})
    script_path = SOCRATA_SCRIPT_WRITE_PATH
    resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
    name = 'socrata-35s3-nmpm'
    rt.create_socrata_dataset(engine, name, resource, script_path)
  7. Downloading a Socrata dataset

    import retriever as rt
    rt.download('socrata-35s3-nmpm')
  8. Installing a Socrata dataset

    import retriever as rt
    rt.install_postgres('socrata-35s3-nmpm')

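
Several of the examples above repeat the same string handling: they take the dict returned by find_socrata_dataset_by_id() and derive the CSV filename, the download URL, and the script name from it. The sketch below collects that logic in one place so it can be tried without a network connection or a retriever install. The helper names `is_four_by_four` and `socrata_names` are hypothetical and not part of retriever, and the identifier pattern (two groups of four lowercase letters or digits) is an assumption based on the `35s3-nmpm` example.

```python
import re

# Assumption: a four-by-four identifier is two groups of four lowercase
# letters or digits joined by a hyphen, e.g. "35s3-nmpm".
FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")


def is_four_by_four(dataset_id):
    """Return True if dataset_id looks like a four-by-four identifier."""
    return (
        isinstance(dataset_id, str)
        and FOUR_BY_FOUR.fullmatch(dataset_id) is not None
    )


def socrata_names(resource):
    """Derive filename, URL, and script name from a resource dict.

    `resource` needs "id" and "domain" keys, like the dict shown in the
    find_socrata_dataset_by_id() example.
    """
    if not is_four_by_four(resource["id"]):
        raise ValueError(f"not a four-by-four identifier: {resource['id']!r}")
    filename = resource["id"] + ".csv"
    url = "https://" + resource["domain"] + "/resource/" + filename
    script_name = "socrata-" + resource["id"]
    return {"filename": filename, "url": url, "script_name": script_name}


if __name__ == "__main__":
    # Hand-written stand-in for the dict returned by the real lookup.
    resource = {"id": "35s3-nmpm", "domain": "opendata.utah.gov"}
    print(socrata_names(resource))
```

Validating the identifier before building the URL means a mistyped id fails early with a clear message instead of producing a URL that cannot be downloaded.
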
henrykironde commented 3 years ago

Looks like a style error 🙂, or so flakes says. We have many of those, for example: if args.dataset is not None: should become if args.dataset:
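
One caveat on that kind of rewrite, sketched here with a hypothetical `dataset` value standing in for args.dataset: the two checks agree for None and for non-empty values, but they differ for empty values such as an empty list or an empty string. Whether that difference matters for retriever's arguments is not stated in this thread.

```python
def explicit_check(dataset):
    """The longer form: only None counts as 'not provided'."""
    return dataset is not None


def truthy_check(dataset):
    """The shorter form: None and any empty value count as 'not provided'."""
    return bool(dataset)


if __name__ == "__main__":
    # Compare both checks on None, two empty values, and a non-empty value.
    for value in (None, [], "", ["socrata-35s3-nmpm"]):
        print(repr(value), explicit_check(value), truthy_check(value))
```
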

Aakash3101 commented 3 years ago

> Looks like a style error 🙂, or so flakes says. We have many of those, for example: if args.dataset is not None: should become if args.dataset:

Sorry, my bad @henrykironde

Aakash3101 commented 3 years ago

@henrykironde I just realised that rt.download() won't download the raw data files for Socrata datasets whose script has not been generated yet.


Update

I have added this download functionality in retriever/lib/download.py, so rt.download() now works correctly.

Aakash3101 commented 3 years ago

Refactoring is complete. Do let me know if you want any other changes, @henrykironde

henrykironde commented 3 years ago

Why do we get these results?

 ✗ retriever ls -socrata
Autocomplete suggestions : Total 0 results

Aakash3101 commented 3 years ago

> Why do we get these results?
>
>      ✗ retriever ls -socrata
>     Autocomplete suggestions : Total 0 results

I just noticed that you are using retriever ls -socrata. I have updated the flag to retriever ls -s, which expects at least one search term:


╰─ retriever ls -s 
usage: retriever ls [-h] [-l L [L ...]] [-k K [K ...]] [-v V [V ...]]
                    [-s S [S ...]]
retriever ls: error: argument -s: expected at least one argument

╰─ retriever ls -socrata 
Autocomplete suggestions : Total 0 results
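
A possible explanation for the "Total 0 results" output, not confirmed in this thread: if the CLI is built on Python's argparse and `-s` takes one or more values (as the usage text `-s S [S ...]` suggests), then `-socrata` is read as `-s` with the attached value `ocrata`, so the search runs on the term "ocrata". The sketch below reproduces only that parsing step with a stand-in parser; the parser setup and help text are assumptions, not retriever's actual code.

```python
import argparse


def build_parser():
    """Stand-in parser with a single -s option taking one or more values."""
    parser = argparse.ArgumentParser(prog="retriever ls")
    # Assumed from the usage text in the thread: -s S [S ...]
    parser.add_argument("-s", nargs="+", help="search terms (assumed meaning)")
    return parser


parser = build_parser()

if __name__ == "__main__":
    # Separate values after -s are collected into a list.
    print(parser.parse_args(["-s", "clinic", "2015"]).s)  # ['clinic', '2015']
    # A single-dash word starting with "s" is split into -s plus a value.
    print(parser.parse_args(["-socrata"]).s)  # ['ocrata']
```

Calling the parser with a bare `-s` exits with the same "expected at least one argument" error shown in the output above.
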

Here is the update log comment link

henrykironde commented 3 years ago

@Aakash3101 let us test this together. Whenever you are ready, ping me here and on Gitter.

Aakash3101 commented 3 years ago

> @Aakash3101 let us test this together. Whenever you are ready, ping me here and on Gitter.

Yeah let's test it now

henrykironde commented 3 years ago

Thank you for the work done on this issue, @Aakash3101.