Closed davharris closed 7 years ago
The other option is to use the eBird Reference Dataset, which includes landcover etc. and is more compact, but which is only available by special request and isn't updated as frequently. Apparently, they had problems with people selling the data or something.
Thanks, this is great!
We looked into adding eBird about a year ago and the conclusion at the time was to avoid adding datasets which require account creation to download. We could automate account creation and signing in pretty easily, but I'm sure some data owners won't appreciate that.
It's really too bad not to include eBird since it's such a high-value dataset. Now that the Retriever is being used by more than a handful of people, maybe it's time to revisit that conversation. I'm thinking we can just require users to download the raw data files themselves, and then automate everything else. @ethanwhite, do you have any thoughts?
Thanks Dave!
Ben - I like this general idea. An ideal solution might be that when someone starts to install a dataset that requires either account creation or filling out a form we have the Retriever notify them that they need to do so and then automatically open a browser at the appropriate point in the process. The user then fills out the appropriate form or creates an account and downloads the file(s) to the appropriate destination, at which point the Retriever takes over again. If this gets a bit complicated to implement then simply notifying the user that they need to download the file(s) to the appropriate directory and then install the dataset again should work just fine (and might be the easier thing to start with).
Makes sense to me. I've made some minor changes to the code above, BTW.
When I talked to the eBird DBA about the Reference dataset (which will probably be more useful to most people than the Basic dataset I used here), he said that they were looking into ways to make it more accessible again (i.e. only a login required rather than a special request). Importing the reference dataset would probably be much more work, though.
Making useful data that's a lot of work to import easy to deal with is pretty much the Retriever's goal in life, so if they make general access to it possible we'll definitely tackle the Reference dataset.
I will be using eBird frequently next year - along with some members of my new lab. Anything to make getting and using the data easier would be fantastic!
On Sat, Jun 29, 2013 at 4:58 PM, Ben Morris notifications@github.comwrote:
Thanks, this is great!
We looked into adding eBird about a year ago and the conclusion at the time was to avoid adding datasets which require account creation to download. We could automate account creation and signing in pretty easily, but I'm sure some data owners won't appreciate that.
It's really too bad not to include eBird since it's such a high-value dataset. Now that the Retriever is being used by more than a handful of people, maybe it's time to revisit that conversation. I'm thinking we can just require users to download the raw data files themselves, and then automate everything else. @ethanwhite https://github.com/ethanwhite, do you have any thoughts?
— Reply to this email directly or view it on GitHubhttps://github.com/weecology/retriever/issues/90#issuecomment-20237952 .
Wow this is really cool, thanks Dave!
If someone would like to try setting up a Retriever script for this it should be really easy. I'd be happy to help coach someone through the process.
BTW, @mbjones reminded me that I might have some other useful stuff, so I thought I'd post it here.
Here's a set of criteria we used to determine which records to include. Lots of arbitrary decisions, but possibly a useful starting point. It's loosely based on the criteria used in Fink et al.'s paper on their Spatio-Temporal Ecological Model (or whatever STEM stands for).
CREATE VIEW validrows AS
SELECT * FROM eBird
WHERE
COUNTRY_CODE='US' AND
ALL_SPECIES_REPORTED = 1 AND
effort_distance_km < 8.1 AND
duration_minutes < 180 AND
PROTOCOL_TYPE='eBird - Traveling Count' AND
EXTRACT('year' FROM observation_date) >= 2002 AND
EXTRACT('year' FROM observation_date) <= 2012 AND
EXTRACT('month' FROM observation_date) >= 5 AND
EXTRACT('month' FROM observation_date) <= 7 AND
EXTRACT('hour' FROM time_observations_started) >= 5 AND
EXTRACT('hour' FROM time_observations_started) < 20
;
Where did this end up? I'm looking for this solution with the eBird Reference Dataset and it seems like this would work. Any pointers to the easiest way to get this going would be appreciated.
Hi @tomauer - I just downloaded the eBird Reference Dataset to take a look. I downloaded it without a login from http://ebird.org/downloads/erd/, but in the past this hasn't been an openly available dataset. Do you know if this has changed or is this link just accidentally live? (I just emailed eBird to ask) If this is intentional it eliminates one of the things getting in the way of us supporting eBird, which was a need to build functionality (or tell folks how to hack around our current limitations) to work with data that wasn't available at a stable url.
The reference dataset appears to be structured quite differently from the one that @davharris was working with above, and is a great example of a dataset that the Retriever would be beneficial for working with (i.e., you need to combine a dozen files, from an archive, all of which are cross-tab structured for some reason). This is complex enough that it will need to be a Python script instead of one of the simple text scripts. I'd be happy to work on this after the holidays or even happier to help someone else work on it. This is a dataset that a lot of people, including my group, are interested in, so if it's really publicly available now it's definitely worth some effort.
Thanks @ethanwhite !
I believe that the new open access is a recent and intentional change, but I'd wait to hear back from eBird to be certain.
Having the text script way to set this up would be ideal, as our current alternative is to take our own time to stand it up in PostgreSQL, which presents substantial overhead (I can passingly handle using PostgreSQL), so I'd be happy to do what I can to make this available to the Retriever so that we can save some time on our end, especially as the quarterly updates to the eBird Reference Dataset are produced.
@ethanwhite The CLO is now operating as a DataONE member node, and they distribute the eBird Observation Dataset via that repository now. So it should have a stable download link. It differs from the eBird Reference Dataset that is discussed and linked above. I've pasted some relevant identifiers and URLs below. The nice thing about the DataONE API is that it keeps track if data sets move and/or get replaced by newer versions. You should always be able to call the DataONE resolution REST service to find the current location for a given versioned data object. For the eBird Observation Dataset data object, the resolution service call could be done with curl
using:
curl https://cn.dataone.org/cn/v1/resolve/CLOEODDATA.05192014.1
Using curl, that should return an XML document listing all the URLs from which the data object can be downloaded (currently just one, listed below). Here are some relevant identifiers and URLs for the eBird Observation Dataset (EOD):
You can call all of the DataONE service APIs (resolve(), getSystemMetadata(), get()) using these identifiers in REST calls.
Unfortunately, the eBird Observation Dataset (EOD) is pretty sparse in terms of detail, with most of the attributes stripped out of it to the point that's is basically just a presence only (with count) description of whether a species is at a given lat/lng.
My team is moving forward with the eBird Reference Dataset (ERD), so if you decide to pursue that avenue, I'm happy to help with column optimization, as I'm very familiar with that dataset and what the field values are.
Thanks @mbjones! I hadn't heard that CLO had joined up. That's great news. And thanks for the pointers on the DataONE APIs.
@tomauer - thanks for the information on the differences between the sets. Given this move towards making the data openly available I'm happy to work on incorporating both of these datasets.
Any information on optimal column types/sizes would definitely be useful given the sizes of the datasets. I'd also be curious to hear how you're envisioning working with the cross-tab aspect of the ERD. I'm not sure why they've structured it this way, and my initial plan would be to convert it to a standard format of one row per checklist per species (dropping all of the zeros).
I've been spending the past few days trying to pull the eBird Basic Dataset into Postgres with no luck. I've gotten the Sampling Event Data file in, but the World File always ends with a set of different errors. If anyone has had any luck getting the eBird Basic Dataset or the eBird Reference Dataset into a database, please let me know.
@ethanwhite I'd be happy to help work on this and contributing to the retriever with help from someone who can troubleshoot Postgres errors better than I can.
@tomauer what error are you getting? I vaguely remember @davharris having some troubles with character encoding issues in the middle of the very large file. Segmenting the file and parallel processing it with a tool like iconv might help if you have issues like that. And as @davharris said above there were unbalanced quotes as well. Are you seeing either of these errors in the ERD load?
@tomauer we're starting work on getting eBird into the Retriever (see #252, #253, #254). I'd be curious to hear why you switched to the eBird Basic Dataset from the ERD, to help us prioritize our efforts. Also, is it correct that the Basic dataset has to be requested rather than downloaded from a stable link?
@ethanwhite @davharris After a few more days of fighting with it, I was able to import the EBD World File and the EBD Sampling Data File into Postgres. Below are the SQL statements that I used to successfully do that. I had to first manually strip out two null characters in the dataset. I did that using EmEditor and the line number that Postgres gave me as an error when it hit the null characters. Some of my column types for new columns in this version might be a little weak or unnecessarily long.
I switched to the EBD because the structure of the ERD is absurd (every species as a column?!?!?!). We don't have a need for the co-variate data at this point, so I didn't want to deal with the cross-tab aspect of the ERD either. To access the Basic dataset you do need to request access through eBird and the URL is locked. You have to be logged into eBird.
eBird World File
CREATE TABLE ebdrelNov2014 (
GLOBAL_UNIQUE_IDENTIFIER char(50),
TAXONOMIC_ORDER numeric,
CATEGORY varchar(20),
COMMON_NAME varchar(70),
SCIENTIFIC_NAME varchar(70),
SUBSPECIES_COMMON_NAME varchar(70),
SUBSPECIES_SCIENTIFIC_NAME varchar(70),
OBSERVATION_COUNT varchar(8),
BREEDING_BIRD_ATLAS_CODE char(2),
AGE_SEX text,
COUNTRY varchar(50),
COUNTRY_CODE char(2),
STATE_PROVINCE varchar(50),
STATE_CODE char(10),
COUNTY varchar(50),
COUNTY_CODE char(12),
IBA_CODE char(16),
BCR_CODE varchar(50),
LOCALITY text,
LOCALITY_ID char(10),
LOCALITY_TYPE char(2),
LATITUDE double precision,
LONGITUDE double precision,
OBSERVATION_DATE date,
TIME_OBSERVATIONS_STARTED time,
TRIP_COMMENTS text,
SPECIES_COMMENTS text,
OBSERVER_ID char(12),
FIRST_NAME text,
LAST_NAME text,
SAMPLING_EVENT_IDENTIFIER char(12),
PROTOCOL_TYPE varchar(50),
PROJECT_CODE varchar(20),
DURATION_MINUTES int,
EFFORT_DISTANCE_KM real,
EFFORT_AREA_HA real,
NUMBER_OBSERVERS int,
ALL_SPECIES_REPORTED int,
GROUP_IDENTIFIER varchar(10),
APPROVED int,
REVIEWED int,
REASON char(17),
X text
);
COPY ebdrelNov2014
FROM E'D:\\eBird\\ebd_relNov-2014.txt'
HEADER
CSV
QUOTE E'\5'
DELIMITER E'\t'
;
eBird Sampling Data File
CREATE TABLE ebdsamplingrelnov2014 (
COUNTRY varchar(50),
COUNTRY_CODE varchar(50),
STATE varchar(50),
STATE_CODE char(10),
COUNTY varchar(50),
COUNTY_CODE char(12),
IBA_CODE char(16),
BCR_CODE varchar(50),
LOCALITY text,
LOCALITY_ID char(10),
LOCALITY_TYPE char(2),
LATITUDE double precision,
LONGITUDE double precision,
OBSERVATION_DATE date,
TIME_OBSERVATIONS_STARTED time,
TRIP_COMMENTS text,
OBSERVER_ID char(12),
FIRST_NAME text,
LAST_NAME text,
SAMPLING_EVENT_IDENTIFIER char(12),
PROTOCOL_TYPE varchar(50),
PROJECT_CODE varchar(20),
DURATION_MINUTES int,
EFFORT_DISTANCE_KM real,
EFFORT_AREA_HA real,
NUMBER_OBSERVERS int,
ALL_SPECIES_REPORTED int,
GROUP_IDENTIFIER varchar(10),
X text
);
COPY ebdsamplingrelnov2014
FROM E'D:\\eBird\\ebd_sampling_relNov-2014.txt'
HEADER
CSV
ENCODING 'UTF8'
QUOTE E'\b'
DELIMITER E'\t'
;
The other reason I didn't want to use the eBird Reference Dataset is that it only currently goes through 2012. We need as much data as we can get for our purposes and the eBird Basic Dataset has everything.
Thanks for the feedback and insight into the table structure and issues with the data @tomauer. I also just heard back from the eBird folks and apparently having the ERD data public was actually a mistake and they are planning on putting it back behind the same access wall as the Basic Dataset is behind. As such I think we'll focus on the Observation Dataset (since it's publicly available; I'm almost done with this one) and working on supporting the Basic Dataset for fuller access to the data since it's has the core data and lacks the structural issues of ERD.
eBird has basically pulled all of their data back behind closed doors, so I'm going to go ahead and close all of the eBird issues until such time as they decide to participate in the community in a more open way that we can readily help support.
Hi @ethanwhite, are you referring to the eBird Basic Dataset (EBD) being behind the access wall? I now work at eBird and know that we don't want our data to be perceived as closed. Have you been in contact with anyone here? Or are you just basing this on the way EBD is accessed?
@tomauer - I'd love to be wrong about this data being openly accessible. My perception is based on a combination of meaningful efforts to access the data over several years and some interactions with folks at eBird.
Originally we could find no way to access the data without creating a login, explaining why we wanted to use the data, getting permission, and then downloading a one time snapshot of the data. The eBird Observation Dataset then became available via DataONE, which was awesome and we started working on supporting it in the Data Retriever. We then found the eBird Reference Dataset openly available online and started working to support it, but out of an abundance of caution I emailed the eBird folks to make sure it was supposed to be public and they said no and removed the public access. Then later the DataONE copy of the eBird Observation Dataset disappeared without announcement and repeated queries to DataONE folks went unanswered.
So, that's the story. If I'm wrong and there are publicly accessible links to the eBird data that would be great and we'd be happy to support it again.
I just did a little looking around and came across this description of how to currently get the data. I then started the process for requesting the data again (I've done this in the past, but apparently that permission expires). In general I would say that while there's clearly some desire to make the data available to the community, which is great, that this is basically a closed dataset.
To download the data you need:
Both (2) and (3) make the dataset closed (i.e., not open) by any definition of "open data" that I'm aware of. I'm sure that there are good reasons for the decisions that eBird has made in this area, but one of the consequences is that since the data isn't open others can't readily build tools to make engaging with that data easier and more effective.
If you're interested in retaining the restrictions but allowing us and others to support the dataset then providing access to a permanently linked version of the data via an API key is an option. In that context we can instruct users to obtain a personal key and then store it where the software can access it. I've suggested this to a couple of other closed data sources that want people to use their data, but haven't had any takers yet. Once a valuable data source goes this route we'll put supporting it on the road map. My understanding is that the current eBird API only supports recent sightings. We'd be interested in supporting one of the full datasets.
@ethanwhite You make a lot of good points about the "openness" of the data. We're talking about it here and may end up revising some of terms in the future. Most of this is quite a ways above me. Thanks for your feedback.
The current API does only support recent sightings, but there is a new API that is in the works that should have more features.
Thanks @tomauer. I greatly appreciate your engaging with this. Let me know if there's anything I can do to help at any point.
@tomauer Just FYI, I started with your table definition and COPY approach and have been using it to load eBird EBD data into PostgreSQL successfully. I customized the COPY a little, added some indexes and such... Just posted a gist of it: https://gist.github.com/ses4j/22ccd5350eb28697d25c962d32879431
Hey all,
I've mostly gotten the eBird data into a PostgreSQL/PostGIS database, and I thought I'd share my code with you in case you wanted to integrate it into something more robust with EcoDataRetriever. If you know how to optimize it better, I'd love to hear what you come up with.
If you do decide to include it, please acknowledge Matt Jones and Jim Regetz, since they helped me through this.
Let me know if you have any questions!
Dave
PS the "world" data set unzips to be 50 gigabytes, so you'll probably want to work with something smaller...
(Edited to add some amazing PostGIS queries and some better commets, etc.)
PS: After poking around a bit more, it looks like I should have used doubles rather than reals to store lat/lon. I had misread the documentation about how much precision was used for reals.