Closed henrykironde closed 7 years ago
In addition to getting all datasets (not just the rodents) let's also update to use the GitHub repository:
https://github.com/weecology/PortalData
The main portal script should use the latest release, and we should add a separate portal-dev (or something equivalent) that uses master. Access to master will be useful for our Portal Forecasting project (e.g., https://github.com/weecology/portalPredictions/pull/35/).
@henrykironde @ethanwhite Do we need to create separate JSON files for the ant, plant, and precipitation datasets, or do we want to add them by modifying portal.json?
Let's modify the current json file to get it all at once.
@ethanwhite @henrykironde The plant dataset is split by category (annual, perennial, summer, winter) and also by year. The ant and precipitation datasets are also split by category.
I updated portal.json for the Portal_plant_summer_annual_19831988 data.
Do we want to make a separate table for each dataset (like portal_plant_winter_annual_19831988, etc.), or do we want to add all the data to a single plant table?
Also, the species table for the plant dataset is not available in .csv format, although the table can be seen on the metadata web page for the plant data. What should we do about this?
@kvnamipara we should actually be getting the Portal Data from https://github.com/weecology/PortalData (see the 1st comment on this issue). Sorry for the confusion. We want the data from the most recent release, so if we're doing this using a JSON file we should use links of the form:
https://raw.githubusercontent.com/weecology/PortalData/v1.0.1/Rodents/Portal_rodent.csv
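To make the link pattern concrete, here is a minimal Python sketch that builds raw links of this form for either a release tag or a branch. The helper name `raw_url` is hypothetical and not part of the Data Retriever; only the v1.0.1 rodent link comes from this thread, and the `master` link simply assumes the same file path exists on that branch.

```python
# Sketch: build raw.githubusercontent.com links for a given git ref.
# `raw_url` is a hypothetical helper, not a Data Retriever function.
BASE = "https://raw.githubusercontent.com/weecology/PortalData"


def raw_url(ref, path):
    """Return the raw-file URL for `path` at git ref `ref` (tag or branch)."""
    return "{}/{}/{}".format(BASE, ref, path.lstrip("/"))


# Release link for the main portal script, branch link for a portal-dev variant.
release_url = raw_url("v1.0.1", "Rodents/Portal_rodent.csv")
dev_url = raw_url("master", "Rodents/Portal_rodent.csv")
print(release_url)
print(dev_url)
```

Keeping the ref as a parameter means a release-based script and a master-based script could differ only in that one value.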
@ethanwhite I got this, but the Portal data at https://github.com/weecology/PortalData includes many other datasets (like ants and weather). Should I include those in this portal dataset?
Which is better, JSON or a Python script, and why?
We want all of the csv files in the Ants, Plants, Rodents, Weather, and SiteandMethods directories. (I have opened an issue asking why the main plant table is split in half: https://github.com/weecology/PortalData/issues/54).
In general, JSON is better in cases where it will work. It is part of a standard and is simpler to create and understand. The only reason to use a Python script here would be that we could then download all of the files at once using the archive available on GitHub.
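For illustration, the archive route might look roughly like the sketch below. It assumes GitHub's usual `/archive/<ref>.zip` link form and its single top-level folder inside the zip. The function names and the demo archive contents are made up for this example, and nothing here is Data Retriever code. The demo uses an in-memory zip so it runs without network access.

```python
import io
import zipfile

# Directories named in this thread; everything else in the archive is skipped.
WANTED_DIRS = {"Ants", "Plants", "Rodents", "Weather", "SiteandMethods"}


def archive_url(ref):
    """Archive link for a tag or branch (assumes GitHub's usual URL form)."""
    return "https://github.com/weecology/PortalData/archive/{}.zip".format(ref)


def select_csv_members(zf):
    """Return archive member names that are CSV files in the wanted directories."""
    selected = []
    for name in zf.namelist():
        parts = name.split("/")
        # GitHub archives wrap everything in one top-level folder, so the
        # data directory is the second path component.
        in_wanted_dir = len(parts) >= 3 and parts[1] in WANTED_DIRS
        if in_wanted_dir and name.lower().endswith(".csv"):
            selected.append(name)
    return sorted(selected)


def make_demo_archive():
    """Build a tiny in-memory zip imitating the archive layout (made-up files)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("PortalData-1.0.1/Rodents/Portal_rodent.csv", "a,b\n1,2\n")
        zf.writestr("PortalData-1.0.1/Weather/example.csv", "a,b\n1,2\n")
        zf.writestr("PortalData-1.0.1/README.md", "readme")
        zf.writestr("PortalData-1.0.1/Other/notes.csv", "x\n")
    buf.seek(0)
    return zipfile.ZipFile(buf)


demo_members = select_csv_members(make_demo_archive())
print(archive_url("v1.0.1"))
print(demo_members)
```

With a real download, the bytes fetched from `archive_url(...)` would take the place of the demo archive.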
@ethanwhite @henrykironde
As mentioned in the PortalData issue, they are going to add the cover column, filled with null values, to the file that contains the data from before 2015, but they have not made the change yet. Should I create the JSON file with the old data or with the updated file containing cover?
Datasets depend on the organisation or people who collected the data. If they update the data, the dataset file we download may change. If they change the schema (as in this case, where they are going to add cover), won't this cause problems for us?
Yes, and that certainly happens, but one of the strengths of the Data Retriever's approach is that a single change to the Data Retriever script will get things working again for anyone who uses our system. We are developing a system to regularly check if the underlying data has changed in a problematic way so that we can quickly update the scripts.
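As a rough idea of what such a check could involve (this is only a sketch, not the system being developed), one simple test is to compare a file's header row against the columns a script expects. The function name and the column names below are invented for the example.

```python
import csv
import io


def header_changes(csv_text, expected_columns):
    """Compare the header row of `csv_text` with the expected column names.

    Returns (missing, added): expected columns absent from the file, and
    columns in the file that the script does not know about.
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [c for c in expected_columns if c not in header]
    added = [c for c in header if c not in expected_columns]
    return missing, added


# Made-up example: the script expects three columns, upstream added `cover`.
expected = ["year", "plot", "species"]
current_file = "year,plot,species,cover\n1983,1,xx,\n"
missing, added = header_changes(current_file, expected)
print("missing:", missing)
print("added:", added)
```

A non-empty `missing` or `added` list would be the signal that a script needs updating.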
The end result for the Portal data will be a single file, instead of two files, with the cover column present for the data from both. One option would be to fork that repo, combine those two files, and submit a PR. We can probably get it merged quickly, and then you can just use that file. The other option is to build the script using the existing file that has the cover column. Then, when the new file is created, all we'll have to do is change the name of a single file.
@ethanwhite @henrykironde Should we consider unknown values as missing values? One column in particular contained both unknown and null values.
What file/column is this occurring in, and how are unknown values being identified?
@ethanwhite Portal_plant_census_dates.csv in the plant data contains both 'unknown' and 'none' values, and I have also seen many other files containing the same.
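One way to pin down which columns are affected is to count placeholder-like values per column, as in the sketch below. The sample rows and column names are made up rather than taken from Portal_plant_census_dates.csv. The placeholder set is an assumption, and whether these values should be treated as missing is still the open question above.

```python
import csv
import io
from collections import Counter

# Candidate placeholder strings (assumed); "" stands for an empty/null cell.
PLACEHOLDERS = {"unknown", "none", ""}


def placeholder_counts(csv_text):
    """Count placeholder-like values, keyed by (column, lowercased value)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column is None:
                # Extra cells beyond the header; ignored in this sketch.
                continue
            token = (value or "").strip().lower()
            if token in PLACEHOLDERS:
                counts[(column, token)] += 1
    return counts


# Made-up rows for illustration only.
sample = (
    "year,season,censused\n"
    "1983,summer,unknown\n"
    "1984,winter,none\n"
    "1985,summer,\n"
    "1986,winter,yes\n"
)
counts = placeholder_counts(sample)
for (column, token), n in sorted(counts.items()):
    print(column, repr(token), n)
```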
The home page of the portal dataset has more data that need to be included.