weecology / retriever

Quickly download, clean up, and install public datasets into a database management system
http://data-retriever.org
Other
306 stars 132 forks source link

an eBird Basic Dataset workflow #90

Closed davharris closed 7 years ago

davharris commented 11 years ago

Hey all,

I've mostly gotten the eBird data into a PostgreSQL/PostGIS database, and I thought I'd share my code with you in case you wanted to integrate it into something more robust with EcoDataRetriever. If you know how to optimize it better, I'd love to hear what you come up with.

If you do decide to include it, please acknowledge Matt Jones and Jim Regetz, since they helped me through this.

Let me know if you have any questions!

Dave

PS the "world" data set unzips to be 50 gigabytes, so you'll probably want to work with something smaller...

-- Data file available via http://ebird.org/ebird/data/download

-- commands to extract the text file from the tarball:
   -- tar xvf ebd_relMay-2013.tar
   -- gunzip ebd_relMay-2013.txt.gz
-- WARNING: The resulting file is almost 50 gigabytes!

-- In retrospect, there's probably some premature optimization for some of these columns: if the data set changes,
-- it might be safer to use longer varchar arguments.
CREATE TABLE eBird (
  GLOBAL_UNIQUE_IDENTIFIER     char(50),      -- always 45-47 characters needed (so far)
  TAXONOMIC_ORDER              numeric,       -- Probably not needed
  CATEGORY                     varchar(20),   -- Probably 10 would be safe
  COMMON_NAME                  varchar(70),   -- Some hybrids have really long names
  SCIENTIFIC_NAME              varchar(70),   --  ''
  SUBSPECIES_COMMON_NAME       varchar(70),   --  ''
  SUBSPECIES_SCIENTIFIC_NAME   varchar(70),   --  ''
  OBSERVATION_COUNT            varchar(8),    -- Someone saw 1.3 million Auklets.
                                              -- Unfortunately, it can't be an integer 
                                              -- because some are just presence/absence
  BREEDING_BIRD_ATLAS_CODE     char(2),       -- need to confirm that these are always length 2
  AGE_SEX                      text,          -- Potentially long, but almost always blank
  COUNTRY                      varchar(50),   -- long enough for "Saint Helena, Ascension and Tristan da Cunha"
  COUNTRY_CODE                 char(2),       -- alpha-2 codes
  STATE_PROVINCE               varchar(50),   -- no idea if this is long enough? U.S. Virgin Islands may be almost 30
  SUBNATIONAL1_CODE            char(10),      -- looks standardized at 5 characters?
  COUNTY                       varchar(50),   -- who knows how long it could be
  SUBNATIONAL2_CODE            char(12),      -- looks standardized at 9 characters?
  IBA_CODE                     char(16),
  LOCALITY                     text,          -- unstructured/potentially long
  LOCALITY_ID                  char(10),      -- maximum observed so far is 8
  LOCALITY_TYPE                char(2),       -- short codes
  LATITUDE                     real,          -- Is this the appropriate level of precision?
  LONGITUDE                    real,          --    ''
  OBSERVATION_DATE             date,          -- Do I need to specify YMD somehow?
  TIME_OBSERVATIONS_STARTED    time,          -- How do I make this a time?
  TRIP_COMMENTS                text,          -- Comments are long, unstructured, 
  SPECIES_COMMENTS             text,          --    and inconsistent, but sometimes interesting
  OBSERVER_ID                  char(12),      -- max of 9 in the data I've seen so far
  FIRST_NAME                   text,          -- Already have observer IDs
  LAST_NAME                    text,          -- ''
  SAMPLING_EVENT_IDENTIFIER    char(12),      -- Probably want to index on this.
  PROTOCOL_TYPE                varchar(50),   -- Needs to be at least 30 for sure.
  PROJECT_CODE                 varchar(20),   -- Needs to be at least 10 for sure.
  DURATION_MINUTES             int,           -- bigint?
  EFFORT_DISTANCE_KM           real,          -- precision?
  EFFORT_AREA_HA               real,          -- precision?
  NUMBER_OBSERVERS             int,           -- just a small int
  ALL_SPECIES_REPORTED         int,           -- Seems to always be 1 or 0.  Maybe I could make this Boolean?
  GROUP_IDENTIFIER             varchar(10),   -- Appears to be max of 7 or 8
  APPROVED                     int,           -- Can be Boolean?
  REVIEWED                     int,           -- Can be Boolean?
  REASON                       char(17),      -- May need to be longer if data set includes unvetted data
  X                            text           -- Blank
);

COPY eBird
  FROM '/home/dharris/eBird/ebd_relMay-2013.txt' 
  HEADER
  CSV
  QUOTE E'\5'       -- The file has unbalanced quotes. Using an obscure character as a quote mark instead.
  DELIMITER E'\t';

-- Note: it's probably slightly faster to load postgis and add a geographic column first (see below).
-- I'm keeping the original ordering in this document for accuracy's sake.
CREATE INDEX ON eBird (sampling_event_identifier)

-- Example query: SELECT SCIENTIFIC_NAME FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
-- Example query: SELECT count(SCIENTIFIC_NAME) FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';

CREATE EXTENSION postgis;
ALTER TABLE eBird ADD COLUMN geog geography(POINT,4326); -- I hope 4326 is correct...
UPDATE eBird SET geog = ST_GeogFromText('POINT(' || longitude || ' ' ||  latitude || ')');
CREATE INDEX geog_index ON eBird USING GIST (geog); 

-- Example query: find all the species within 1000 of my dorm:
-- SELECT SCIENTIFIC_NAME FROM eBird WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.6972 34.4208)'), 1000);

-- Slightly fancier version:
-- SELECT DISTINCT SCIENTIFIC_NAME, COMMON_NAME FROM eBird 
--   WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.855385 34.417239)'), 1000) 
--   ORDER BY SCIENTIFIC_NAME;

(Edited to add some amazing PostGIS queries and some better commets, etc.)

PS: After poking around a bit more, it looks like I should have used doubles rather than reals to store lat/lon. I had misread the documentation about how much precision was used for reals.

davharris commented 11 years ago

The other option is to use the eBird Reference Dataset, which includes landcover etc. and is more compact, but which is only available by special request and isn't updated as frequently. Apparently, they had problems with people selling the data or something.

bendmorris commented 11 years ago

Thanks, this is great!

We looked into adding eBird about a year ago and the conclusion at the time was to avoid adding datasets which require account creation to download. We could automate account creation and signing in pretty easily, but I'm sure some data owners won't appreciate that.

It's really too bad not to include eBird since it's such a high-value dataset. Now that the Retriever is being used by more than a handful of people, maybe it's time to revisit that conversation. I'm thinking we can just require users to download the raw data files themselves, and then automate everything else. @ethanwhite, do you have any thoughts?

ethanwhite commented 11 years ago

Thanks Dave!

Ben - I like this general idea. An ideal solution might be that when someone starts to install a dataset that requires either account creation or filling out a form we have the Retriever notify them that they need to do so and then automatically open a browser at the appropriate point in the process. The user then fills out the appropriate form or creates an account and downloads the file(s) to the appropriate destination, at which point the Retriever takes over again. If this gets a bit complicated to implement then simply notifying the user that they need to download the file(s) to the appropriate directory and then install the dataset again should work just fine (and might be the easier thing to start with).

davharris commented 11 years ago

Makes sense to me. I've made some minor changes to the code above, BTW.

When I talked to the eBird DBA about the Reference dataset (which will probably be more useful to most people than the Basic dataset I used here), he said that they were looking into ways to make it more accessible again (i.e. only a login required rather than a special request). Importing the reference dataset would probably be much more work, though.

ethanwhite commented 11 years ago

Making useful data that's a lot of work to import easy to deal with is pretty much the Retriever's goal in life, so if they make general access to it possible we'll definitely tackle the Reference dataset.

sarahsupp commented 11 years ago

I will be using eBird frequently next year - along with some members of my new lab. Anything to make getting and using the data easier would be fantastic!

On Sat, Jun 29, 2013 at 4:58 PM, Ben Morris notifications@github.comwrote:

Thanks, this is great!

We looked into adding eBird about a year ago and the conclusion at the time was to avoid adding datasets which require account creation to download. We could automate account creation and signing in pretty easily, but I'm sure some data owners won't appreciate that.

It's really too bad not to include eBird since it's such a high-value dataset. Now that the Retriever is being used by more than a handful of people, maybe it's time to revisit that conversation. I'm thinking we can just require users to download the raw data files themselves, and then automate everything else. @ethanwhite https://github.com/ethanwhite, do you have any thoughts?

— Reply to this email directly or view it on GitHubhttps://github.com/weecology/retriever/issues/90#issuecomment-20237952 .

dmcglinn commented 10 years ago

Wow this is really cool, thanks Dave!

ethanwhite commented 10 years ago

If someone would like to try setting up a Retriever script for this it should be really easy. I'd be happy to help coach someone through the process.

davharris commented 10 years ago

BTW, @mbjones reminded me that I might have some other useful stuff, so I thought I'd post it here.

Here's a set of criteria we used to determine which records to include. Lots of arbitrary decisions, but possibly a useful starting point. It's loosely based on the criteria used in Fink et al.'s paper on their Spatio-Temporal Ecological Model (or whatever STEM stands for).

CREATE VIEW validrows AS 
SELECT * FROM eBird
WHERE
  COUNTRY_CODE='US' AND 
  ALL_SPECIES_REPORTED = 1 AND 
  effort_distance_km < 8.1 AND
  duration_minutes < 180 AND
  PROTOCOL_TYPE='eBird - Traveling  Count' AND
  EXTRACT('year' FROM observation_date) >= 2002 AND
  EXTRACT('year' FROM observation_date) <= 2012 AND
  EXTRACT('month' FROM observation_date) >= 5 AND
  EXTRACT('month' FROM observation_date) <= 7 AND
  EXTRACT('hour' FROM time_observations_started) >= 5 AND
  EXTRACT('hour' FROM time_observations_started) < 20
;
tomauer commented 9 years ago

Where did this end up? I'm looking for this solution with the eBird Reference Dataset and it seems like this would work. Any pointers to the easiest way to get this going would be appreciated.

ethanwhite commented 9 years ago

Hi @tomauer - I just downloaded the eBird Reference Dataset to take a look. I downloaded it without a login from http://ebird.org/downloads/erd/, but in the past this hasn't been an openly available dataset. Do you know if this has changed or is this link just accidentally live? (I just emailed eBird to ask) If this is intentional it eliminates one of the things getting in the way of us supporting eBird, which was a need to build functionality (or tell folks how to hack around our current limitations) to work with data that wasn't available at a stable url.

The reference dataset appears to be structured quite differently from the one that @davharris was working with above, and is a great example of a dataset that the Retriever would be beneficial for working with (i.e., you need to combine a dozen files, from an archive, all of which are cross-tab structured for some reason). This is complex enough that it will need to be a Python script instead of one of the simple text scripts. I'd be happy to work on this after the holidays or even happier to help someone else work on it. This is a dataset that a lot of people, including my group, are interested in, so if it's really publicly available now it's definitely worth some effort.

tomauer commented 9 years ago

Thanks @ethanwhite !

I believe that the new open access is a recent and intentional change, but I'd wait to hear back from eBird to be certain.

Having the text script way to set this up would be ideal, as our current alternative is to take our own time to stand it up in PostgreSQL, which presents substantial overhead (I can passingly handle using PostgreSQL), so I'd be happy to do what I can to make this available to the Retriever so that we can save some time on our end, especially as the quarterly updates to the eBird Reference Dataset are produced.

mbjones commented 9 years ago

@ethanwhite The CLO is now operating as a DataONE member node, and they distribute the eBird Observation Dataset via that repository now. So it should have a stable download link. It differs from the eBird Reference Dataset that is discussed and linked above. I've pasted some relevant identifiers and URLs below. The nice thing about the DataONE API is that it keeps track if data sets move and/or get replaced by newer versions. You should always be able to call the DataONE resolution REST service to find the current location for a given versioned data object. For the eBird Observation Dataset data object, the resolution service call could be done with curl using:

curl https://cn.dataone.org/cn/v1/resolve/CLOEODDATA.05192014.1

Using curl, that should return an XML document listing all the URLs from which the data object can be downloaded (currently just one, listed below). Here are some relevant identifiers and URLs for the eBird Observation Dataset (EOD):

You can call all of the DataONE service APIs (resolve(), getSystemMetadata(), get()) using these identifiers in REST calls.

tomauer commented 9 years ago

Unfortunately, the eBird Observation Dataset (EOD) is pretty sparse in terms of detail, with most of the attributes stripped out of it to the point that's is basically just a presence only (with count) description of whether a species is at a given lat/lng.

My team is moving forward with the eBird Reference Dataset (ERD), so if you decide to pursue that avenue, I'm happy to help with column optimization, as I'm very familiar with that dataset and what the field values are.

ethanwhite commented 9 years ago

Thanks @mbjones! I hadn't heard that CLO had joined up. That's great news. And thanks for the pointers on the DataONE APIs.

@tomauer - thanks for the information on the differences between the sets. Given this move towards making the data openly available I'm happy to work on incorporating both of these datasets.

Any information on optimal column types/sizes would definitely be useful given the sizes of the datasets. I'd also be curious to hear how you're envisioning working with the cross-tab aspect of the ERD. I'm not sure why they've structured it this way, and my initial plan would be to convert it to a standard format of one row per checklist per species (dropping all of the zeros).

tomauer commented 9 years ago

I've been spending the past few days trying to pull the eBird Basic Dataset into Postgres with no luck. I've gotten the Sampling Event Data file in, but the World File always ends with a set of different errors. If anyone has had any luck getting the eBird Basic Dataset or the eBird Reference Dataset into a database, please let me know.

@ethanwhite I'd be happy to help work on this and contributing to the retriever with help from someone who can troubleshoot Postgres errors better than I can.

mbjones commented 9 years ago

@tomauer what error are you getting? I vaguely remember @davharris having some troubles with character encoding issues in the middle of the very large file. Segmenting the file and parallel processing it with a tool like iconv might help if you have issues like that. And as @davharris said above there were unbalanced quotes as well. Are you seeing either of these errors in the ERD load?

ethanwhite commented 9 years ago

@tomauer we're starting work on getting eBird into the Retriever (see #252, #253, #254). I'd be curious to hear why you switched to the eBird Basic Dataset from the ERD, to help us prioritize our efforts. Also, is it correct that the Basic dataset has to be requested rather than downloaded from a stable link?

tomauer commented 9 years ago

@ethanwhite @davharris After a few more days of fighting with it, I was able to import the EBD World File and the EBD Sampling Data File into Postgres. Below are the SQL statements that I used to successfully do that. I had to first manually strip out two null characters in the dataset. I did that using EmEditor and the line number that Postgres gave me as an error when it hit the null characters. Some of my column types for new columns in this version might be a little weak or unnecessarily long.

I switched to the EBD because the structure of the ERD is absurd (every species as a column?!?!?!). We don't have a need for the co-variate data at this point, so I didn't want to deal with the cross-tab aspect of the ERD either. To access the Basic dataset you do need to request access through eBird and the URL is locked. You have to be logged into eBird.

eBird World File

CREATE TABLE ebdrelNov2014 (
    GLOBAL_UNIQUE_IDENTIFIER    char(50),
    TAXONOMIC_ORDER             numeric,
    CATEGORY                    varchar(20),
    COMMON_NAME                 varchar(70),
    SCIENTIFIC_NAME             varchar(70),
    SUBSPECIES_COMMON_NAME      varchar(70),
    SUBSPECIES_SCIENTIFIC_NAME  varchar(70),
    OBSERVATION_COUNT           varchar(8),
    BREEDING_BIRD_ATLAS_CODE    char(2),
    AGE_SEX                     text,
    COUNTRY                     varchar(50),
    COUNTRY_CODE                char(2),
    STATE_PROVINCE              varchar(50),
    STATE_CODE          char(10),
    COUNTY                      varchar(50),
    COUNTY_CODE             char(12),
    IBA_CODE                    char(16),
    BCR_CODE                varchar(50),
    LOCALITY                    text,
    LOCALITY_ID                 char(10),
    LOCALITY_TYPE               char(2),
    LATITUDE                    double precision,
    LONGITUDE                   double precision,
    OBSERVATION_DATE            date,
    TIME_OBSERVATIONS_STARTED   time,
    TRIP_COMMENTS               text,
    SPECIES_COMMENTS            text,
    OBSERVER_ID                 char(12),
    FIRST_NAME                  text,
    LAST_NAME                   text, 
    SAMPLING_EVENT_IDENTIFIER   char(12),
    PROTOCOL_TYPE               varchar(50),
    PROJECT_CODE                varchar(20),
    DURATION_MINUTES            int,
    EFFORT_DISTANCE_KM          real,
    EFFORT_AREA_HA              real,
    NUMBER_OBSERVERS            int,
    ALL_SPECIES_REPORTED        int,
    GROUP_IDENTIFIER            varchar(10),
    APPROVED                    int,
    REVIEWED                    int,
    REASON                      char(17),
    X                           text
);

COPY ebdrelNov2014
    FROM E'D:\\eBird\\ebd_relNov-2014.txt' 
    HEADER
    CSV
    QUOTE E'\5'
    DELIMITER E'\t'
;

eBird Sampling Data File

CREATE TABLE ebdsamplingrelnov2014 (
    COUNTRY                     varchar(50),
    COUNTRY_CODE                varchar(50),
    STATE                       varchar(50),
    STATE_CODE                  char(10),
    COUNTY                      varchar(50),
    COUNTY_CODE                 char(12),
    IBA_CODE                    char(16),
    BCR_CODE                    varchar(50),
    LOCALITY                    text,
    LOCALITY_ID                 char(10),
    LOCALITY_TYPE               char(2),
    LATITUDE                    double precision,
    LONGITUDE                   double precision,
    OBSERVATION_DATE            date,
    TIME_OBSERVATIONS_STARTED   time,
    TRIP_COMMENTS               text,
    OBSERVER_ID                 char(12),
    FIRST_NAME                  text,
    LAST_NAME                   text, 
    SAMPLING_EVENT_IDENTIFIER   char(12),
    PROTOCOL_TYPE               varchar(50),
    PROJECT_CODE                varchar(20),
    DURATION_MINUTES            int,
    EFFORT_DISTANCE_KM          real,
    EFFORT_AREA_HA              real,
    NUMBER_OBSERVERS            int,
    ALL_SPECIES_REPORTED        int,
    GROUP_IDENTIFIER            varchar(10),
    X                           text
);

COPY ebdsamplingrelnov2014
    FROM E'D:\\eBird\\ebd_sampling_relNov-2014.txt' 
    HEADER
    CSV
    ENCODING 'UTF8'
    QUOTE E'\b'
    DELIMITER E'\t'
;
tomauer commented 9 years ago

The other reason I didn't want to use the eBird Reference Dataset is that it only currently goes through 2012. We need as much data as we can get for our purposes and the eBird Basic Dataset has everything.

ethanwhite commented 9 years ago

Thanks for the feedback and insight into the table structure and issues with the data @tomauer. I also just heard back from the eBird folks and apparently having the ERD data public was actually a mistake and they are planning on putting it back behind the same access wall as the Basic Dataset is behind. As such I think we'll focus on the Observation Dataset (since it's publicly available; I'm almost done with this one) and working on supporting the Basic Dataset for fuller access to the data since it's has the core data and lacks the structural issues of ERD.

ethanwhite commented 7 years ago

eBird has basically pulled all of their data back behind closed doors, so I'm going to go ahead and close all of the eBird issues until such time as they decide to participate in the community in a more open way that we can readily help support.

tomauer commented 7 years ago

Hi @ethanwhite, are you referring to the eBird Basic Dataset (EBD) being behind the access wall? I now work at eBird and know that we don't want our data to be perceived as closed. Have you been in contact with anyone here? Or are you just basing this on the way EBD is accessed?

ethanwhite commented 7 years ago

@tomauer - I'd love to be wrong about this data being openly accessible. My perception is based on a combination of meaningful efforts to access the data over several years and some interactions with folks at eBird.

Originally we could find no way to access the data without creating a login, explaining why we wanted to use the data, getting permission, and then downloading a one time snapshot of the data. The eBird Observation Dataset then became available via DataONE, which was awesome and we started working on supporting it in the Data Retriever. We then found the eBird Reference Dataset openly available online and started working to support it, but out of an abundance of caution I emailed the eBird folks to make sure it was supposed to be public and they said no and removed the public access. Then later the DataONE copy of the eBird Observation Dataset disappeared without announcement and repeated queries to DataONE folks went unanswered.

So, that's the story. If I'm wrong and there are publicly accessible links to the eBird data that would be great and we'd be happy to support it again.

ethanwhite commented 7 years ago

I just did a little looking around and came across this description of how to currently get the data. I then started the process for requesting the data again (I've done this in the past, but apparently that permission expires). In general I would say that while there's clearly some desire to make the data available to the community, which is great, that this is basically a closed dataset.

To download the data you need:

  1. A login
  2. Permission that requires explaining what you want to do with the data
  3. Agreeing to a non-standard license agreement that prohibits redistribution of raw data (and even derived results without the non-standard license attached)

Both (2) and (3) make the dataset closed (i.e., not open) by any definition of "open data" that I'm aware of. I'm sure that there are good reasons for the decisions that eBird has made in this area, but one of the consequences is that since the data isn't open others can't readily build tools to make engaging with that data easier and more effective.

If you're interested in retaining the restrictions but allowing us and others to support the dataset then providing access to a permanently linked version of the data via an API key is an option. In that context we can instruct users to obtain a personal key and then store it where the software can access it. I've suggested this to a couple of other closed data sources that want people to use their data, but haven't had any takers yet. Once a valuable data source goes this route we'll put supporting it on the road map. My understanding is that the current eBird API only supports recent sightings. We'd be interested in supporting one of the full datasets.

tomauer commented 7 years ago

@ethanwhite You make a lot of good points about the "openness" of the data. We're talking about it here and may end up revising some of terms in the future. Most of this is quite a ways above me. Thanks for your feedback.

The current API does only support recent sightings, but there is a new API that is in the works that should have more features.

ethanwhite commented 7 years ago

Thanks @tomauer. I greatly appreciate your engaging with this. Let me know if there's anything I can do to help at any point.

ses4j commented 4 years ago

@tomauer Just FYI, I started with your table definition and COPY approach and have been using it to load eBird EBD data into PostgreSQL successfully. I customized the COPY a little, added some indexes and such... Just posted a gist of it: https://gist.github.com/ses4j/22ccd5350eb28697d25c962d32879431