Closed Aakash3101 closed 3 years ago
`Service 'python_retriever' failed to build : The command '/bin/sh -c pip install psycopg2-binary -U' returned a non-zero code: 1`
causes the tests to fail.
Changed the `retriever ls -socrata` command to the `retriever ls -s` command.
Removed the user prompt to name the script for the socrata dataset. The script name has the format `socrata-<socrata_id>`, e.g. `socrata-35s3-nmpm`.
Moved the `inquirer` module import in `defaults.py` into a try-except block. To run the `retriever ls -s` command, the user must install the `inquirer` module; an error message is shown if `inquirer` is not installed when `retriever ls -s` is run. Added `KeyboardInterrupt` handling for the `Ctrl+C` event while selecting names.
Downloading and installing socrata datasets has changed from the previous method to `retriever download socrata-35s3-nmpm` and `retriever install postgres socrata-35s3-nmpm`. The format for the dataset name is `socrata-<dataset-id>`.
`install_socrata_dataset` changed to `create_socrata_dataset`.
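The optional-import change described above (moving the `inquirer` import into a try-except block and handling `Ctrl+C`) follows a common pattern. The sketch below illustrates that pattern only; the helper names `load_optional` and `choose_name` and the printed message are assumptions made for this example, not the code added to `defaults.py`.

```python
import importlib


def load_optional(module_name):
    """Import a module by name, returning None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def choose_name(names, prompt_module):
    """Ask the user to pick one of `names`; return None if that is not possible."""
    if prompt_module is None:
        print("Please install the inquirer module to select a dataset name.")
        return None
    question = prompt_module.List("name", message="Select a dataset", choices=names)
    try:
        answers = prompt_module.prompt([question])
    except KeyboardInterrupt:
        # Ctrl+C while the selection menu is open.
        return None
    # prompt() may also return None when the selection is cancelled.
    return answers["name"] if answers else None


# None when inquirer is missing, the imported module otherwise.
inquirer = load_optional("inquirer")
```

Passing the module in as an argument keeps the missing-dependency path easy to exercise without uninstalling anything.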
1. `socrata_autocomplete_search()`
The input argument should be a list of strings. The return value is also a list of strings, which are the autocompleted names.
```python
>>> import retriever as rt
>>> names = rt.socrata_autocomplete_search(['clinic', '2015', '2016'])
>>> for name in names:
...     print(name)
...
2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More Service Providers
2015 - 2016 Clinical Quality Comparison (>=5 Providers) by Geography
2016 & 2015 Clinic Quality Comparisons for Clinics with Fewer than Five Service Providers
```
2. `socrata_dataset_info()`
The input argument should be a string (a valid dataset name returned by `socrata_autocomplete_search`). The function returns a list of dicts, because there can be multiple datasets on socrata with the same name (e.g. Building Permits).
```python
>>> import retriever as rt
>>> resource = rt.socrata_dataset_info('2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More Service Providers')
>>> from pprint import pprint
>>> pprint(resource)
[{'description': 'This data set includes comparative information for clinics '
                 'with five or more physicians for medical claims in 2015 - '
                 '2016. \r\n'
                 '\r\n'
                 'This data set was calculated by the Utah Department of '
                 'Health, Office of Healthcare Statistics (OHCS) using Utah’s '
                 'All Payer Claims Database (APCD).',
  'domain': 'opendata.utah.gov',
  'id': '35s3-nmpm',
  'link': 'https://opendata.utah.gov/Health/2016-2015-Clinic-Quality-Comparisons-for-Clinics-w/35s3-nmpm',
  'name': '2016 & 2015 Clinic Quality Comparisons for Clinics with Five or '
          'More Service Providers',
  'type': {'dataset': 'tabular'}}]
```
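Because several datasets can share one name, a caller usually has to pick among the returned entries. The sketch below keeps only the entries whose `type` is `{'dataset': 'tabular'}`, the shape shown in the output above. The helper name `tabular_ids` and the second sample entry are hypothetical, made up for this illustration.

```python
def tabular_ids(resources):
    """Return the ids of entries typed as plain tabular datasets."""
    return [r["id"] for r in resources if r.get("type") == {"dataset": "tabular"}]


# Trimmed-down sample data; only the first entry comes from the output above.
sample = [
    {"id": "35s3-nmpm", "domain": "opendata.utah.gov", "type": {"dataset": "tabular"}},
    {"id": "aaaa-0000", "domain": "example.org", "type": {"map": "tabular"}},  # made-up entry
]

print(tabular_ids(sample))  # ['35s3-nmpm']
```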
3. `find_socrata_dataset_by_id()`
The input argument of the function should be the four-by-four socrata dataset identifier (e.g. `35s3-nmpm`). The function returns a dict which contains metadata about the dataset.
```python
>>> import retriever as rt
>>> from pprint import pprint
>>> resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
>>> pprint(resource)
{'datatype': 'tabular',
 'description': 'This data set includes comparative information for clinics '
                'with five or more physicians for medical claims in 2015 - '
                '2016. \r\n'
                '\r\n'
                'This data set was calculated by the Utah Department of '
                'Health, Office of Healthcare Statistics (OHCS) using Utah’s '
                'All Payer Claims Database (APCD).',
 'domain': 'opendata.utah.gov',
 'homepage': 'https://opendata.utah.gov/Health/2016-2015-Clinic-Quality-Comparisons-for-Clinics-w/35s3-nmpm',
 'id': '35s3-nmpm',
 'keywords': ['socrata'],
 'name': '2016 & 2015 Clinic Quality Comparisons for Clinics with Five or More '
         'Service Providers'}
```
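As a rough illustration of how the four-by-four identifier relates to the `socrata-<dataset-id>` script name, the sketch below builds one from the other. The helper name `socrata_script_name` and the regular expression are assumptions made for this example (the pattern simply encodes the shape of ids such as `35s3-nmpm`); neither is part of retriever's API.

```python
import re

# Assumed shape of a four-by-four id: two groups of four lowercase
# letters or digits joined by a hyphen, e.g. 35s3-nmpm.
FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")


def socrata_script_name(socrata_id):
    """Return the script name for a dataset id (hypothetical helper)."""
    if not FOUR_BY_FOUR.fullmatch(socrata_id):
        raise ValueError(f"{socrata_id!r} does not look like a four-by-four id")
    return f"socrata-{socrata_id}"


print(socrata_script_name("35s3-nmpm"))  # socrata-35s3-nmpm
```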
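4. `update_socrata_contents(json_file, script_name, url, resource)`
This function updates the contents of the socrata script created by `create_socrata_dataset()`. The input arguments are:
- `json_file`: the contents of the script's JSON file
- `script_name`: the name of the script, e.g. `socrata-35s3-nmpm`
- `url`: the URL of the dataset's CSV resource
- `resource`: the dict returned by `find_socrata_dataset_by_id()`

The function returns `True, json_file` if the resource dict is correct; otherwise, it returns `False, None`.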
```python
import json

import retriever as rt
from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
filename = resource["id"] + '.csv'
url = 'https://' + resource["domain"] + '/resource/' + filename
script_name = 'socrata-35s3-nmpm'
script_path = SOCRATA_SCRIPT_WRITE_PATH
# Script file name: hyphens in the script name replaced with underscores.
script_filename = script_name.replace("-", "_") + ".json"

# The with-block closes the file, so no explicit f.close() is needed.
with open(f"{script_path}/{script_filename}", "r") as f:
    json_file = json.load(f)

result, json_file = rt.update_socrata_contents(json_file, script_name, url, resource)
```
5. `update_socrata_script(script_name, filename, url, resource, script_path)`
This function renames the script, calls `update_socrata_contents()`, and then writes the new content returned by `update_socrata_contents()`.
```python
import retriever as rt
from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

script_path = SOCRATA_SCRIPT_WRITE_PATH
resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
filename = resource["id"] + '.csv'
url = 'https://' + resource["domain"] + '/resource/' + filename
script_name = 'socrata-35s3-nmpm'
rt.update_socrata_script(script_name, filename, url, resource, script_path)
```
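The file name and URL in the snippets above follow one simple pattern, which can be wrapped in a small helper. This is a sketch only: `csv_resource_url` is a hypothetical name, and the function does nothing beyond the string concatenation already shown.

```python
def csv_resource_url(resource):
    """Build (filename, url) for a resource dict with 'id' and 'domain' keys."""
    filename = resource["id"] + ".csv"
    url = "https://" + resource["domain"] + "/resource/" + filename
    return filename, url


filename, url = csv_resource_url({"id": "35s3-nmpm", "domain": "opendata.utah.gov"})
print(url)  # https://opendata.utah.gov/resource/35s3-nmpm.csv
```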
6. `create_socrata_dataset(engine, name, resource, script_path=None)`
This function creates socrata dataset scripts for retriever. It downloads the raw data, creates the script, updates it, and finally installs the dataset using that script.
```python
import retriever as rt
from retriever.engines import choose_engine
from retriever.lib.defaults import SOCRATA_SCRIPT_WRITE_PATH

# engine = choose_engine({'command':'install', 'engine':'postgres'})
# OR
engine = choose_engine({'command': 'download'})
script_path = SOCRATA_SCRIPT_WRITE_PATH
resource = rt.find_socrata_dataset_by_id('35s3-nmpm')
name = 'socrata-35s3-nmpm'
rt.create_socrata_dataset(engine, name, resource, script_path)
```
Downloading a socrata dataset
```python
import retriever as rt
rt.download('socrata-35s3-nmpm')
```
Installing a socrata dataset
```python
import retriever as rt
rt.install_postgres('socrata-35s3-nmpm')
```
Looks like a style error 🙂, so says flakes.
We have many of those, like `if args.dataset is not None:` >>> `if args.dataset`
Sorry my bad @henrykironde
@henrykironde I just realised that `rt.download()` won't download raw data files for socrata datasets whose script has not yet been generated.
I have added the download functionality in the `retriever/lib/download.py` file, and `rt.download()` now works correctly.
Refactoring is complete, do let me know if you want to make any other changes @henrykironde
Why do we get these results ?
✗ retriever ls -socrata
Autocomplete suggestions : Total 0 results
(testenv) ➜ retriever git:(test) ✗
I just found out that you are using `retriever ls -socrata`. I have updated it to `retriever ls -s`.
╰─ retriever ls -s
usage: retriever ls [-h] [-l L [L ...]] [-k K [K ...]] [-v V [V ...]]
[-s S [S ...]]
retriever ls: error: argument -s: expected at least one argument
╰─ retriever ls -socrata
Autocomplete suggestions : Total 0 results
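One plausible reading of the two outputs above is standard `argparse` behaviour: when `-s` is a short option that takes values, `-socrata` is parsed as `-s` with the attached value `ocrata`, so the search would run on the string `ocrata`. The parser below is a stand-in written for this note (only the `-s` option and its one-or-more values are taken from the usage message); it is not retriever's actual parser.

```python
import argparse

# Stand-in for the `retriever ls` sub-parser; -s takes one or more values.
parser = argparse.ArgumentParser(prog="retriever ls")
parser.add_argument("-s", nargs="+")

# A single-dash argument starting with "s" is read as -s plus an attached value.
args = parser.parse_args(["-socrata"])
print(args.s)  # ['ocrata']

# A normal search passes the terms as separate values.
args = parser.parse_args(["-s", "clinic", "2015"])
print(args.s)  # ['clinic', '2015']
```

Calling the parser with a bare `-s` exits with the same "expected at least one argument" error shown in the first output.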
Here is the update log comment link
@Aakash3101 let us test this together. Whenever you are ready, ping me here and on gitter.
Yeah let's test it now
Thank you for the work done on this issue @Aakash3101
Added the Socrata API
Total number of datasets supported : 85,244 out of 213,965
Added `-socrata` flag to the `retriever ls` command. Use `retriever ls -socrata <name>` to display a list of suggestions for the dataset name you are looking for. Then select the dataset name, and the details of the selected dataset will be displayed. Example:
Added `--socrata` flag to the `retriever install` and `retriever download` commands. Use `retriever install --socrata <engine> <socrata-id>` to install the socrata dataset identified by that `socrata-id`: it first downloads the raw data, creates a script for the data, updates the script accordingly, and then installs the script into the engine. Use `retriever download --socrata <engine> <socrata-id>` to download the raw data files of the socrata dataset identified by its `socrata-id`. Example:
`retriever download`
`retriever install`
Added tests for all the Socrata functions.
Notes:
`retriever ls -socrata` with an empty input will display the `retriever ls -h -socrata` output. An input with a spelling error will return a list with 0 results.
`retriever download --socrata <id>` with an incorrectly formatted id will display `Id <id> is incorrectly formatted and will not identify any asset.`
`retriever download --socrata <id>` with an `id` that identifies an unsupported dataset will display `Dataset cannot be downloaded using Socrata API because <datatype> type datasets are not supported`
retriever currently supports only `tabular` type datasets, which excludes `{map: tabular}` datasets.
Datasets that are `tabular` and do not have any column names in the metadata will display the following for the `retriever install/download` command:
If a Socrata dataset is already downloaded, the data is not downloaded again; if its script already exists, the script creation process is not repeated. Example:
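Separately from retriever's own output, the skip-if-present rule in the last note can be sketched in a few lines. This is an illustration under stated assumptions: `ensure_file` and `write_placeholder` are hypothetical helpers, and the placeholder writer stands in for the real download or script-creation step.

```python
import os
import tempfile


def ensure_file(path, create):
    """Call create(path) only if path does not exist; report whether it ran."""
    if os.path.exists(path):
        return False
    create(path)
    return True


def write_placeholder(path):
    """Stand-in for the real download or script-creation step."""
    with open(path, "w") as f:
        f.write("placeholder\n")


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "35s3-nmpm.csv")
    print(ensure_file(raw, write_placeholder))  # True: file was created
    print(ensure_file(raw, write_placeholder))  # False: already present, skipped
```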