weecology / retriever

Quickly download, clean up, and install public datasets into a database management system
http://data-retriever.org

Update Portal script to get all datasets from the GitHub repository #786

Closed henrykironde closed 7 years ago

henrykironde commented 7 years ago

The home page of the portal dataset has more data that need to be included.

ethanwhite commented 7 years ago

In addition to getting all datasets (not just the rodents) let's also update to use the GitHub repository:

https://github.com/weecology/PortalData

The main portal script should use the latest release, and we should add a separate portal-dev (or something equivalent) that uses master. Access to master will be useful for our Portal Forecasting project (e.g., https://github.com/weecology/portalPredictions/pull/35/).

kvnamipara commented 7 years ago

@henrykironde @ethanwhite Do we need to create separate JSON files for the ant, plant, and precipitation datasets, or should we add them by modifying portal.json?

ethanwhite commented 7 years ago

Let's modify the current json file to get it all at once.

kvnamipara commented 7 years ago

@ethanwhite @henrykironde The plant data is split by category (annual, perennial, summer, winter) and also by year. The ant and precipitation data are also split by category.

I updated portal.json for the Portal_plant_summer_annual_19831988 data.

Do we want a separate table for each dataset (e.g., portal_plant_winter_annual_19831988), or should we add all of the data to a single plant table?

Also, the species table for the plant dataset is not available in .csv format, although the table can be seen on the metadata web page for the plant data. What should we do about this?

ethanwhite commented 7 years ago

@kvnamipara we should actually be getting the Portal Data from https://github.com/weecology/PortalData (see the 1st comment on this issue). Sorry for the confusion. We want the data from the most recent release, so we should use links of the form:

https://raw.githubusercontent.com/weecology/PortalData/v1.0.1/Rodents/Portal_rodent.csv

That is the form to use if we're doing this with a JSON file.
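As a rough illustration of that link form, the sketch below builds raw URLs for a given release tag. The helper name `raw_url` is hypothetical and not part of the Data Retriever; only the v1.0.1 rodent path comes from the link above.

```python
RAW_BASE = "https://raw.githubusercontent.com"


def raw_url(ref, path, owner="weecology", repo="PortalData"):
    """Return the raw-file URL for `path` at a tag or branch `ref`.

    Hypothetical helper for illustration only.
    """
    return "/".join([RAW_BASE, owner, repo, ref, path.lstrip("/")])


# Release-pinned link, matching the form quoted above.
print(raw_url("v1.0.1", "Rodents/Portal_rodent.csv"))
```

Passing `"master"` as `ref` instead of a tag would give the kind of link a separate portal-dev script could use.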

kvnamipara commented 7 years ago

@ethanwhite Got it. But the Portal data at https://github.com/weecology/PortalData includes many other datasets (such as ants and weather). Should I include those in this portal dataset?

Which is better, JSON or a Python script, and why?

ethanwhite commented 7 years ago

We want all of the CSV files in the Ants, Plants, Rodents, Weather, and SiteandMethods directories. (I have filed an issue asking why the main plant table is split in half: https://github.com/weecology/PortalData/issues/54.)

In general JSON is better for cases where it will work. It is part of a standard and simpler to create and understand. The only reason to use a Python script here would be that we could then download all of the files at once using the archive available on GitHub.
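For the Python-script route, a minimal sketch of the "download everything at once" idea might look like the following. It assumes GitHub serves a zip archive at `https://github.com/<owner>/<repo>/archive/<ref>.zip` with a single top-level folder, which should be checked against the repository page. The function names are hypothetical and this is not the Data Retriever's own script API.

```python
import io
import zipfile
from urllib.request import urlopen

# Directories named in the comment above.
WANTED_DIRS = ("Ants", "Plants", "Rodents", "Weather", "SiteandMethods")


def archive_url(ref, owner="weecology", repo="PortalData"):
    # Assumed archive link form; verify before relying on it.
    return f"https://github.com/{owner}/{repo}/archive/{ref}.zip"


def wanted_csvs(names, dirs=WANTED_DIRS):
    """Keep CSV members whose directory just below the archive root is wanted."""
    keep = []
    for name in names:
        parts = name.split("/")
        # parts[0] is the archive's wrapper folder, parts[1] the data directory.
        if len(parts) >= 3 and parts[1] in dirs and name.lower().endswith(".csv"):
            keep.append(name)
    return keep


def fetch_csvs(ref):
    """Download the archive for `ref` and return {member_name: raw_bytes}."""
    with urlopen(archive_url(ref)) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    return {name: archive.read(name) for name in wanted_csvs(archive.namelist())}
```

Calling `fetch_csvs("v1.0.1")` would need network access; `wanted_csvs` can be tried on any list of member names without it.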

kvnamipara commented 7 years ago

@ethanwhite @henrykironde
As mentioned in the PortalData issue, they are going to add the cover column, with null values, to the file that contains the data from before 2015, but they have not made that change yet. Should I create the JSON file from the old data or from the updated file containing cover?

Datasets depend on the organisations or people who collected the data. If they update the data, the dataset files we download from may change. If they change the schema (as in this case, where they are going to add cover), won't this cause problems for us?

ethanwhite commented 7 years ago

> Datasets depend on the organisations or people who collected the data. If they update the data, the dataset files we download from may change. If they change the schema (as in this case, where they are going to add cover), won't this cause problems for us?

Yes, and that certainly happens, but one of the strengths of the Data Retriever's approach is that a single change to the Data Retriever script will get things working again for anyone who uses our system. We are developing a system to regularly check if the underlying data has changed in a problematic way so that we can quickly update the scripts.
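One simple form such a check could take is comparing a file's header against the columns a script expects. The sketch below is only an illustration of that idea, not the Data Retriever's actual checking system; the function name and the column names are made up.

```python
import csv
import io


def header_changes(csv_text, expected_columns):
    """Compare a CSV header with the columns a script expects.

    Returns (missing, added): expected columns the file lacks, and
    columns in the file that the script does not know about.
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [c for c in expected_columns if c not in header]
    added = [c for c in header if c not in expected_columns]
    return missing, added


# Hypothetical example: upstream added a `cover` column.
sample = "year,plot,species,cover\n2016,1,xyz,5\n"
print(header_changes(sample, ["year", "plot", "species"]))  # ([], ['cover'])
```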

> As mentioned in the PortalData issue, they are going to add the cover column, with null values, to the file that contains the data from before 2015, but they have not made that change yet. Should I create the JSON file from the old data or from the updated file containing cover?

The end result for the Portal data will be a single file instead of two, with the cover column covering the data from both. One option would be to fork that repo, combine those two files, and submit a PR. We can probably get it merged quickly, and then you can just use that file. The other option is to build the script using the existing file that has the cover column. Then, when the new file is created, all we'll have to do is change the name of a single file.
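For the first option, combining the two files could be sketched roughly as below. It assumes the older file has the same columns as the newer one apart from cover; the function name and the sample rows are hypothetical, and the real files may need more care.

```python
import csv
import io


def combine_with_cover(old_text, new_text, cover_column="cover"):
    """Concatenate two CSV tables, leaving `cover` empty where it is absent.

    Assumes every column in the old table also exists in the new one.
    """
    old_rows = list(csv.DictReader(io.StringIO(old_text)))
    new_reader = csv.DictReader(io.StringIO(new_text))
    new_rows = list(new_reader)
    fieldnames = list(new_reader.fieldnames or [])
    if cover_column not in fieldnames:
        fieldnames.append(cover_column)
    out = io.StringIO()
    # restval fills columns that a row lacks, here the cover of older records.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(old_rows + new_rows)
    return out.getvalue()


# Made-up rows, only to show the shape of the result.
old = "year,species\n1990,abc\n"
new = "year,species,cover\n2016,xyz,5\n"
print(combine_with_cover(old, new))
```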

kvnamipara commented 7 years ago

@ethanwhite @henrykironde Should we treat unknown values as missing values? One column in particular contained both unknown and null values.

ethanwhite commented 7 years ago

> Should we treat unknown values as missing values? One column in particular contained both unknown and null values.

What file/column is this occurring in, and how are unknown values being identified?

kvnamipara commented 7 years ago

@ethanwhite Portal_plant_census_dates.csv in the plant data contains both 'unknown' and 'none' values, and I have also seen many other files containing the same.
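To pin down which columns are affected, one could count placeholder strings per column before deciding how to treat them. This is a minimal sketch; the function name, the placeholder set, and the sample rows are assumptions, and it does not settle whether 'unknown' should be treated as missing.

```python
import csv
import io
from collections import Counter

# Candidate placeholder strings, compared case-insensitively.
PLACEHOLDERS = frozenset({"unknown", "none"})


def placeholder_counts(csv_text, placeholders=PLACEHOLDERS):
    """Count, per column, the cells whose value is a placeholder string."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            # Skip short rows (None) and overflow cells (lists).
            if isinstance(value, str) and value.strip().lower() in placeholders:
                counts.setdefault(column, Counter())[value.strip().lower()] += 1
    return counts


# Made-up rows, not the real census file.
sample_rows = "year,censused\n1990,unknown\n1991,none\n1992,yes\n"
print(placeholder_counts(sample_rows))
```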