whosonfirst / whosonfirst-www-spelunker

A simple Flask-based spelunker for poking around Who's On First data
BSD 3-Clause "New" or "Revised" License
7 stars 9 forks source link

Bundler should not include features with an edtf:cessation date or mz:is_current of 0 #89

Open stepps00 opened 7 years ago

stepps00 commented 7 years ago

Features with an edtf:deprecated date are not showing up in bundles, which is correct. However, features with an edtf:cessation date or mz:is_current value of 0 are showing up in bundles.

Since features with an edtf:cessation date, an edtf:deprecated date, and/or a mz:is_current value of 0 are no longer current in Who's On First, they should not be returned in bundler results.

For example, downloading all neighbourhood descendants in San Jose (link) returns features with a edtf:cessation date.

nvkelso commented 7 years ago

Should this be the default (to not include them), but enable including them also possible to get the whole set of data (similar to Null Island flag)?

On Fri, Feb 10, 2017 at 12:11 PM, Stephen Epps notifications@github.com wrote:

Features with an edtf:deprecated date are not showing up in bundles, which is correct. However, features with an edtf:cessation date or mz:is_current value of 0 are showing up in bundles.

Since features with an edtf:cessation date, an edtf:deprecated date, and/or a mz:is_current value of 0 are no longer current on Who's On First, they should not be returned in bundler results.

For example, downloading all neighbourhood descendants in San Jose (link https://whosonfirst.mapzen.com/spelunker/download/85922347/?exclude=nullisland#12/37.3504/-121.9270) returns features with a edtf:cessation date.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/whosonfirst/whosonfirst-www-spelunker/issues/89, or mute the thread https://github.com/notifications/unsubscribe-auth/AA0EOyMrEG03KyNJ-JYLJdst7-XRw6_eks5rbMR0gaJpZM4L9zdD .

stepps00 commented 7 years ago

We should definitely remove features with a cessation or deprecated date for the default bundles. I'm open to having a check-box to enable bundling "all" features, if that's an option.

thisisaaronland commented 7 years ago

This should be a "sensible defaults" thing where unless the user/UI specifies that they want to include historical data the only current things are included.

Things that are deprecated should be excluded by default (which may already be the case... I don't remember?)

I would like to let the Bundler "bake" as-is a week or so, to see what else shakes out during that time, but we will schedule this as a 1.1 thing.

dphiffer commented 7 years ago

I'm looking into this one, and it does seem like something weird is going on. Still trying to pin down where in the process it's happening.

dphiffer commented 7 years ago

Here is an example query that illustrates the mz:is_current behavior. That second result for Berryessa includes edtf: properties, but not edtf:deprecated which is set to 2017/02/21.

curl -XPOST -s "https://whosonfirst-api.mapzen.com/?api_key=(snip)" -d "method=whosonfirst.places.getDescendants&id=85922347&placetype=neighbourhood&extras=edtf:,mz:&per_page=2&exclude=nullisland" | jq .
  1. Is the API sensitive to mis-encoded EDTF values?
  2. Is the API not finding edtf:deprecated for some reason?
dphiffer commented 7 years ago

Oh and here are the API results:

{
  "places": [
    {
      "wof:id": 85802971,
      "wof:parent_id": "1108782553",
      "wof:name": "Alviso",
      "wof:placetype": "neighbourhood",
      "wof:country": "US",
      "wof:repo": "whosonfirst-data",
      "edtf:cessation": "uuuu",
      "edtf:inception": "uuuu",
      "mz:tier_locality": "5",
      "mz:is_funky": "0",
      "mz:filesize": "49152",
      "mz:max_zoom": "18",
      "mz:is_landuse_aoi": "0",
      "mz:is_hard_boundary": "0",
      "mz:categories": [],
      "mz:hierarchy_label": "1",
      "mz:tier_metro": "1",
      "mz:min_zoom": "14",
      "mz:is_official": "0",
      "mz:uri": "https://whosonfirst.mapzen.com/data/858/029/71/85802971.geojson"
    },
    {
      "wof:id": 85805881,
      "wof:parent_id": "1108782553",
      "wof:name": "Berryessa",
      "wof:placetype": "neighbourhood",
      "wof:country": "US",
      "wof:repo": "whosonfirst-data",
      "edtf:cessation": "2017/02/21",
      "edtf:inception": "uuuu",
      "mz:tier_locality": "5",
      "mz:is_funky": "0",
      "mz:filesize": "61440",
      "mz:max_zoom": "18",
      "mz:is_landuse_aoi": "0",
      "mz:is_current": "0",
      "mz:is_hard_boundary": "0",
      "mz:categories": [],
      "mz:hierarchy_label": "1",
      "mz:tier_metro": "1",
      "mz:min_zoom": "12",
      "mz:is_official": "0",
      "mz:uri": "https://whosonfirst.mapzen.com/data/858/058/81/85805881.geojson"
    }
  ],
  "next_query": "method=whosonfirst.places.getDescendants&id=85922347&placetype=neighbourhood&exclude=nullisland&extras=edtf%3A%2Cmz%3A&per_page=2&cursor=(snip)",
  "total": 600,
  "page": null,
  "pages": 300,
  "per_page": 2,
  "cursor": "(snip)",
  "stat": "ok"
}
dphiffer commented 7 years ago

I updated Berryessa to use 2017-02-21 (and also mz:is_current integer 0 instead of string "0") and that seems to have fixed the problem (i.e., it doesn't appear in API results, and thus no longer gets bundled).

Checking a few other records, Evergreen, Almaden Valley, and Blossom Valley I see a similar issue with edtf:deprecated (all set to 2017/02/21).

dphiffer commented 7 years ago

Upshot is I think explanation #1 is true: the API is in fact sensitive to mis-encoded EDTF values. Personally I prefer this behavior as-is, since it can help surface issues in the data.

thisisaaronland commented 7 years ago

@dphiffer there is now code to explicitly cast mz:is_current as the correct data type in https://github.com/whosonfirst/py-mapzen-whosonfirst-export/commit/bb760b5be51e8e2c1557136317becc6c3671dbaa

There is also an update to the ES schema to not store those values are strings but it has neither been applied to the existing indices (that on deck...) nor have they been re-indexed yet: https://github.com/whosonfirst/es-whosonfirst-schema/commit/9a6a6959876f6484cf0646b16bdc5d5eeac7f0f7

thisisaaronland commented 7 years ago

@dphiffer FYI:

https://whosonfirst.mapzen.com/spelunker/recent/?&is_current=-1 https://whosonfirst.mapzen.com/spelunker/recent/?&is_current=0 https://whosonfirst.mapzen.com/spelunker/recent/?&is_current=1

I still need to add is_current hooks to the API.

thisisaaronland commented 7 years ago

@dphiffer :

curl -X GET 'https://whosonfirst-api.mapzen.com?method=whosonfirst.places.search&api_key=mapzen-xxxxxx&is_current=1&name=wine&placetype=venue&format=json'

{
  "places": [
    {
      "wof:id": 588378691,
      "wof:parent_id": "85882195",
      "wof:name": "Local Kitchen & Wine Merchant",
      "wof:placetype": "venue",
      "wof:country": "US",
      "wof:repo": "whosonfirst-data-venue-us-ca"
    },
    {
      "wof:id": 572153071,
      "wof:parent_id": "85869177",
      "wof:name": "Brooklyn Wine Exchange",
      "wof:placetype": "venue",
      "wof:country": "US",
      "wof:repo": "whosonfirst-data-venue-us-ny"
    }
  ],
  "next_query": null,
  "total": 2,
  "page": 1,
  "per_page": 100,
  "pages": 1,
  "cursor": null,
  "stat": "ok"
}

The is_current parameter does not consider anything other than the mz:is_current property. In theory any record a non-empty / not-uuuu value should have an appropriate is_current property but that may not always be the case as of this writing.

Anyway, that is a separate issue as is handling edtf: dates (for which I will file a separate ticket).

thisisaaronland commented 7 years ago

@dphiffer

curl -s -X GET 'https://whosonfirst-api.mapzen.com?method=whosonfirst.places.search&api_key=mapzen-xxxxxx&is_ceased=1&placetype=country&extras=edtf:&format=json' 

{
    "cursor": null,
    "next_query": null,
    "page": 1,
    "pages": 1,
    "per_page": 100,
    "places": [
        {
            "edtf:cessation": "1991-06~",
            "edtf:inception": "1945-11-29",
            "wof:country": "XX",
            "wof:id": 1126113567,
            "wof:name": "Socialist Federal Republic of Yugoslavia",
            "wof:parent_id": "102191581",
            "wof:placetype": "country",
            "wof:repo": "whosonfirst-data"
        },
        {
            "edtf:cessation": "1945-11-29",
            "edtf:inception": "1918-12~",
            "wof:country": "XX",
            "wof:id": 1108955789,
            "wof:name": "Kingdom of Yugoslavia",
            "wof:parent_id": "102191581",
            "wof:placetype": "country",
            "wof:repo": "whosonfirst-data"
        },
        {
            "edtf:cessation": "1991-09~",
            "edtf:inception": "1991-06~",
            "wof:country": "XX",
            "wof:id": 1108955787,
            "wof:name": "Socialist Federal Republic of Yugoslavia",
            "wof:parent_id": "102191581",
            "wof:placetype": "country",
            "wof:repo": "whosonfirst-data"
        },
        {
            "edtf:cessation": "1992-03~",
            "edtf:inception": "1991-09~",
            "wof:country": "XX",
            "wof:id": 1108955785,
            "wof:name": "Socialist Federal Republic of Yugoslavia",
            "wof:parent_id": "102191581",
            "wof:placetype": "country",
            "wof:repo": "whosonfirst-data"
        },
        {
            "edtf:cessation": "2006",
            "edtf:inception": "1992-03~",
            "wof:country": "XX",
            "wof:id": 1108955783,
            "wof:name": "Federal Republic of Yugoslavia",
            "wof:parent_id": "102191581",
            "wof:placetype": "country",
            "wof:repo": "whosonfirst-data"
        }
    ],
    "stat": "ok",
    "total": 5
}

Note: The API does not currently support OR queries so there is no way to query for the union of is_current=1 or is_ceased=0 (I mean that's the whole purpose of the mz:is_current property but whatever...) only the interesection. One step at a time...