whosonfirst-data / whosonfirst-data-postalcode-gb

Who's On First postal code data for GB
https://whosonfirst.mapzen.com
Other
2 stars 2 forks source link

Refresh the May 2020 sync #6

Closed tomtaylor closed 2 years ago

tomtaylor commented 2 years ago

The version of the ONS Postcode Directory in master is the May 2020 sync. This PR tidies up a few loose ends from that sync, in two commits:

The first updates all the records that have a Null Island geometry (mostly the Northern Ireland postcodes, starting with 'BT') to set their src:geom to unknown, rather than os.

The second re PIPs all the records against the latest GB admin repo.

Combined this should reduce the diff noise when do the August 2021 sync.

missinglink commented 2 years ago

Changeset Review

Method

rm -rf /tmp/whosonfirst-data-postalcode-gb
mkdir /tmp/whosonfirst-data-postalcode-gb

git clone \
  --progress \
  --branch exportify-may-2020 \
  --depth 3 \
  --single-branch \
  --bare \
  https://github.com/whosonfirst-data/whosonfirst-data-postalcode-gb.git \
  /tmp/whosonfirst-data-postalcode-gb

git --git-dir=/tmp/whosonfirst-data-postalcode-gb \
  diff b37d47eaf68f9c58c61753ac8079524cb7337e58 133455545f80696bf940373c4d7778d651b83c2d

Added campus_id. I've not seen these before in canonical WOF, I believe the campus placetype if fairly new and not widely adopted, where do these values come from?

The example below shows 1746415163 is a park.

Is this correct or not, I can't tell πŸ€·β€β™‚οΈ

diff --git a/data/116/071/283/8/1160712838.geojson b/data/116/071/283/8/1160712838.geojson
index b28a58be5914..dd9ca231642e 100644
--- a/data/116/071/283/8/1160712838.geojson
+++ b/data/116/071/283/8/1160712838.geojson
@@ -27,7 +27,8 @@
       101853469,
       404227469,
       1360698549,
-      85633159
+      85633159,
+      1746415163
     ],
     "wof:breaches": [],
     "wof:country": "GB",
@@ -35,6 +36,7 @@
     "wof:geomhash": "3db04d5d6cef40dd3e55534de1758879",
     "wof:hierarchy": [
       {
+        "campus_id": 1746415163,
         "country_id": 85633159,
         "county_id": 1360698815,
         "localadmin_id": 404439855,
@@ -45,7 +47,7 @@
       }
     ],
     "wof:id": 1160712838,
-    "wof:lastmodified": 1644779445,
+    "wof:lastmodified": 1660311824,
     "wof:name": "BN7 9DU",
     "wof:parent_id": -1,
     "wof:placetype": "postalcode",

Change to "src:geom": "unknown", is this correct, what is the purpose of this change?

diff --git a/data/116/071/297/0/1160712970.geojson b/data/116/071/297/0/1160712970.geojso
n
index 89d45433ea2a..5509117240c7 100644
--- a/data/116/071/297/0/1160712970.geojson
+++ b/data/116/071/297/0/1160712970.geojson
@@ -20,7 +20,7 @@
     "os:district_code": "",
     "os:positional_quality_indicator": "9",
     "os:region_code": "",
-    "src:geom": "os",
+    "src:geom": "unknown",
     "wof:belongsto": [],
     "wof:breaches": [],
     "wof:country": "GB",
@@ -32,7 +32,7 @@
       }
     ],
     "wof:id": 1160712970,
-    "wof:lastmodified": 1644779445,
+    "wof:lastmodified": 1660224399,
     "wof:name": "SN86 6DE",
     "wof:parent_id": -1,
     "wof:placetype": "postalcode",

Other than those two changes, which are numerous; a quick scan of the diff showed only PIP changes which I'd assume are correct as the parent polygons have changed.

missinglink commented 2 years ago

Regarding the hierarchy, I don't think it makes sense for a postalcode to be 'parented by' a campus: https://github.com/whosonfirst/whosonfirst-placetypes

Screenshot 2022-08-18 at 15 11 57
tomtaylor commented 2 years ago

Added campus_id. I've not seen these before in canonical WOF, I believe the campus placetype if fairly new and not widely adopted, where do these values come from?

Hmmm, I just ran a PIP against admin-gb. I guess there's campus records in there. So I should probably re-PIP and filter to the records that are definitely above postalcode in the hierarchy.

Change to "src:geom": "unknown", is this correct, what is the purpose of this change?

So, some postcodes which have invalid/license restricted geometry had src:geom set to os in the past. Where there's no geometry I've set it to unknown, which I believe is correct.

thisisaaronland commented 2 years ago

The campus placetype is principally used for airports but is meant for places like: airports, university campuses, office complexes (for example the "Googleplex" or Apple's "Space Donut").

I can't speak to Great Britain but there are examples of campuses intersecting and overlapping with postal codes in the US. For example SFO and 94128 are the same geographic place.

missinglink commented 2 years ago

@stepps00 can you please clarify if it's desirable/undesirable to have these postalcode records parented by a campus? there are numerous cases of this within the changeset.

stepps00 commented 2 years ago

Looking at the whosonfirst-placetypes repo (and the chart above), postalcode records should be parented by locality, localadmin, county, or region records.. that being said, I don't think a campus parent would break anything. Any postalcode record parented by a campus would just be updated the next time it was PIP'd.

It shouldn't cause any issues, if we were to import those changes.. also:

Where there's no geometry I've set it to unknown, which I believe is correct.

That is correct :+1:

missinglink commented 2 years ago

I don't think a campus parent would break anything

It's hard to say what this might, or might not break for consumers of the data 🀷

I'm actually not sure within Pelias what the effect will be, since we don't import the campus placetype, we might end up displaying references to WOF objects which we aren't able to resolve πŸ€”

A community contribution just landed recently which performs more robust checking of which parent fields are valid, so it could end up causing a fatal error at import time which would need to be resolved by the Pelias maintainers before WOF imports would work again.

Personally, I'd prefer to see this fixed before merging, it seems to be a tooling error, consumers of WOF code their integration based on the whosonfirst-placetypes taxonomy, so it's hard to say what might end up breaking across the various integrations which assume that campus is 'lower' in the taxonomy than postalcode.

Unfortunately redoing the PIP is a time-consuming process so I also understand the desire to 'let it slide' as Tom has put in significant time already, but it's worth considering that it might end up costing more time overall if we don't address it.

tomtaylor commented 2 years ago

Hello - thanks for reviewing this! I went on holiday, hence the slow reply.

Very happy to re-PIP this and limit to the valid parent features. Will follow up in the next few days.

thisisaaronland commented 2 years ago

Postal codes have always been considered to be "odd balls" because, as everyone is fond of pointing out, they often don't represent geographies.

That said, the definition for the postal code placetypes only lists [ "locality", "localadmin", "county", "region" ] as valid parent types:

https://github.com/whosonfirst/whosonfirst-placetypes/blob/main/placetypes/postalcode.json#L5

@tomtaylor - What were you using to do the PIP work for these postal codes? If you are doing a database query without ancestor filtering that would explain why the campus records are being used as parents. If you're using code I've written in the past then... bugs, maybe?

The following code should do the right thing:

https://github.com/whosonfirst/go-whosonfirst-spatial-hierarchy/blob/main/resolver.go#L125-L201

tomtaylor commented 2 years ago

@tomtaylor - What were you using to do the PIP work for these postal codes?

Ah ha! I was handrolling this stuff, just using the regular PIP endpoint in spatial.

thisisaaronland commented 2 years ago

Strictly speaking the Go code is incomplete and does not account for some things like parts of New York City having multiple counties but should be enough for postal codes in GB. See also:

https://github.com/whosonfirst/py-mapzen-whosonfirst-hierarchy/blob/master/mapzen/whosonfirst/hierarchy/__init__.py

And:

https://github.com/whosonfirst/go-whosonfirst-spatial-hierarchy/issues/1

tomtaylor commented 2 years ago

So, I've got this working, but I'm a little confused about the results. I've built an rtree index from all the admin GB files, inside my app. For features that are ceased, it seems to empty the hierarchy and belongs_to, but set the parent_id to the parent feature. For features that are active, it set the parent_id to -1. Is that correct behaviour?

thisisaaronland commented 2 years ago

No, that sounds like a bug or a gotcha setting up the code. Is the relevant code somewhere I can look at?

tomtaylor commented 2 years ago

The code is here: https://github.com/whosonfirst/wof-sync-os-postcodes/blob/fast-pip/pipclient/pipclient.go

Called from: https://github.com/whosonfirst/wof-sync-os-postcodes/blob/fast-pip/cmd/wof-sync-os-postcodes/main.go#L77-L84

tomtaylor commented 2 years ago

e.g.

diff --git a/data/391/159/62/39115962.geojson b/data/391/159/62/39115962.geojson
index a63d231782c7f..64230164c891e 100644
--- a/data/391/159/62/39115962.geojson
+++ b/data/391/159/62/39115962.geojson
@@ -1,64 +1,51 @@
 {
   "id": 39115962,
   "type": "Feature",
   "properties": {
     "date:cessation_lower": "2019-04-01",
     "date:cessation_upper": "2019-04-01",
     "date:inception_lower": "2017-03-01",
     "date:inception_upper": "2017-03-01",
     "edtf:cessation": "2019-04-01",
     "edtf:inception": "2017-03-01",
     "geom:area": 0,
     "geom:bbox": "-1.135252,52.998288,-1.135252,52.998288",
     "geom:latitude": 52.998288,
     "geom:longitude": -1.135252,
     "iso:country": "GB",
     "mz:hierarchy_label": 1,
     "mz:is_current": 0,
     "os:country_code": "E92000001",
     "os:county_code": "E10000024",
     "os:district_code": "E07000173",
     "os:positional_quality_indicator": "1",
     "os:region_code": "E12000004",
     "src:geom": "os",
-    "wof:belongsto": [
-      1125994433,
-      404227469,
-      1360698583,
-      85633159,
-      1360698809,
-      1360817357
-    ],
+    "wof:belongsto": [],
     "wof:breaches": [],
     "wof:country": "GB",
     "wof:created": 1583797869,
     "wof:geomhash": "f53de94684025bb717a654c4f48aa378",
     "wof:hierarchy": [
       {
-        "country_id": 85633159,
-        "county_id": 1360698809,
-        "localadmin_id": 1360817357,
-        "locality_id": 1125994433,
-        "macroregion_id": 404227469,
-        "postalcode_id": 39115962,
-        "region_id": 1360698583
+        "postalcode_id": 39115962
       }
     ],
     "wof:id": 39115962,
-    "wof:lastmodified": 1644779543,
+    "wof:lastmodified": 1661888343,
     "wof:name": "NG5 0JN",
-    "wof:parent_id": -1,
+    "wof:parent_id": 1125994433,
     "wof:placetype": "postalcode",
     "wof:repo": "whosonfirst-data-postalcode-gb",
     "wof:superseded_by": [],
     "wof:supersedes": [],
     "wof:tags": []
   },
   "bbox": [
     -1.135252,
     52.998288,
     -1.135252,
     52.998288
   ],
   "geometry": {"coordinates":[-1.135252,52.998288],"type":"Point"}
 }
tomtaylor commented 2 years ago

In the same run:

diff --git a/data/504/556/519/504556519.geojson b/data/504/556/519/504556519.geojson
index ebd56a0a595ed..9eef9ccba9a41 100644
--- a/data/504/556/519/504556519.geojson
+++ b/data/504/556/519/504556519.geojson
@@ -1,66 +1,66 @@
 {
   "id": 504556519,
   "type": "Feature",
   "properties": {
     "date:inception_lower": "1980-01-01",
     "date:inception_upper": "1980-01-01",
     "edtf:cessation": "",
     "edtf:inception": "1980-01-01",
     "geom:area": 0,
     "geom:bbox": "-4.724263,55.938221,-4.724263,55.938221",
     "geom:latitude": 55.938221,
     "geom:longitude": -4.724263,
     "gp:parent_id": "12696200",
     "iso:country": "GB",
     "mz:hierarchy_label": 1,
     "mz:is_current": 1,
     "os:country_code": "S92000003",
     "os:county_code": "S99999999",
     "os:district_code": "S12000018",
     "os:positional_quality_indicator": "1",
     "os:region_code": "S99999999",
     "src:geom": "os",
     "wof:belongsto": [
       1360698699,
       85633159,
       1360699121,
       404455299,
       101750635,
       404227471
     ],
     "wof:breaches": [],
     "wof:concordances": {
       "gp:id": "27731622"
     },
     "wof:country": "GB",
     "wof:created": 1583701424,
     "wof:geomhash": "724c53f40cf41bb01cd91d877ccd4517",
     "wof:hierarchy": [
       {
         "country_id": 85633159,
         "county_id": 1360699121,
         "localadmin_id": 404455299,
         "locality_id": 101750635,
         "macroregion_id": 404227471,
         "postalcode_id": 504556519,
         "region_id": 1360698699
       }
     ],
     "wof:id": 504556519,
-    "wof:lastmodified": 1644780132,
+    "wof:lastmodified": 1661885269,
     "wof:name": "PA15 2HG",
-    "wof:parent_id": 101750635,
+    "wof:parent_id": -1,
     "wof:placetype": "postalcode",
     "wof:repo": "whosonfirst-data-postalcode-gb",
     "wof:superseded_by": [],
     "wof:supersedes": [],
     "wof:tags": []
   },
   "bbox": [
     -4.724263,
     55.938221,
     -4.724263,
     55.938221
   ],
   "geometry": {"coordinates":[-4.724263,55.938221],"type":"Point"}
 }
thisisaaronland commented 2 years ago

Okay, I will poke around things shortly.

thisisaaronland commented 2 years ago

I am still poking around and need to step away from the computer shortly but have you / can you confirm that the underlying PointInPolygon method is returning > 0 results:

https://github.com/whosonfirst/go-whosonfirst-spatial-hierarchy/blob/245f73c1e27bca7297415e693ed5c08909b47f92/resolver.go#L125

https://github.com/whosonfirst/go-whosonfirst-spatial-hierarchy/blob/main/resolver.go#L93

If it's not (and it should) then the FirstButForgivingSPRResultsFunc callback will return nil, nil which would trigger the -1 parent ID here:

https://github.com/whosonfirst/go-whosonfirst-spatial-hierarchy/blob/245f73c1e27bca7297415e693ed5c08909b47f92/resolver.go#L41-L45

thisisaaronland commented 2 years ago

I haven't run this across the entirely of -postalcode-gb yet nor have I tested it with the in-memory rtree spatial database but it does seem to work with a SQLite spatial database (of -admin-gb). Details here:

https://gist.github.com/thisisaaronland/e2f60fc6f06c74803c7aceb0c8786956

tomtaylor commented 2 years ago

Thanks @thisisaaronland! So, I had this running with SQLite first, using the prebuilt database from https://data.geocode.earth/wof/dist/sqlite/whosonfirst-data-admin-gb-latest.db.bz2 but it slow enough that I quickly switched to use rtree. Will I get the same performance with your approach? I'm not sure whether the prebuilt DBs have the spatial indexes in place. cc: @missinglink.

tomtaylor commented 2 years ago

I think maybe the distribution SQLite databases don't have quite the same indexes - building an SQLite database with go-whosonfirst-sqlite-features-index seems a lot faster.

missinglink commented 2 years ago

I'm not familiar with the WOF PIP tooling, but you're right about the distribution files, they have never contained any spatial indices.

We publish an alternative version of the data which is in a format SpatiaLite recognises, maybe this is the correct file to use? I'm not sure if that's compatible with the WOF tooling.

note: some time ago SpatiaLite changed their recommendation on how to use the sparse spatial index, they now advocate querying the SpatialIndex directly, I'm not sure if this is strictly compatible with the SQLite Rtree module, if that's what being used in the code, although I believe that table still exists internally, it might be invisible by default.

sqlite3 whosonfirst-data-admin-gb-latest.db .tables
ancestors     concordances  geojson       names         spr

sqlite3 whosonfirst-data-admin-gb-latest.spatial.db .tables
ElementaryGeometries                point_in_polygon
KNN                                 property
SpatialIndex                        search
data_licenses                       search_config
geom_cols_ref_sys                   search_data
geometry                            search_docsize
geometry_columns                    search_idx
geometry_columns_auth               search_vocab
geometry_columns_field_infos        shard
geometry_columns_statistics         shard_subdivide
geometry_columns_time               spatial_ref_sys
hierarchy                           spatial_ref_sys_all
hierarchy_insert_proxy              spatial_ref_sys_aux
idx_geometry_geom                   spatialite_history
idx_geometry_geom_node              sql_statements_log
idx_geometry_geom_parent            vector_layers
idx_geometry_geom_rowid             vector_layers_auth
idx_shard_geom                      vector_layers_field_infos
idx_shard_geom_node                 vector_layers_statistics
idx_shard_geom_parent               views_geometry_columns
idx_shard_geom_rowid                views_geometry_columns_auth
idx_shard_subdivide_geom            views_geometry_columns_field_infos
idx_shard_subdivide_geom_node       views_geometry_columns_statistics
idx_shard_subdivide_geom_parent     virts_geometry_columns
idx_shard_subdivide_geom_rowid      virts_geometry_columns_auth
name                                virts_geometry_columns_field_infos
place                               virts_geometry_columns_statistics

you can find both versions at https://geocode.earth/data/whosonfirst/ the spatial ones are lower down the page.

missinglink commented 2 years ago

The underlying rtree index does exist in the SpatiaLite databases:

VIRTUAL TABLE "idx_shard_subdivide_geom" USING rtree(pkid, xmin, xmax, ymin, ymax)
thisisaaronland commented 2 years ago

@missinglink The go-whosonfirst-spatial package defines interfaces for common WOF spatial operations (basically just PIP right now) in order to support a variety of implementations. Both the in-memory RTree and SQLite version use rtrees to do the heavy-lifting and then an in-memory raycasting for final filtering. It would be useful to have a Spatialite version as well but once SQLite got rtrees by default I opted for that because Spatialite has proven to be fussy and complicated to install in the past. Maybe that's changed?

missinglink commented 2 years ago

Spatialite has proven to be fussy and complicated to install in the past

Yeah I would definitely agree with that statement, but yes I do believe that since version 5 was released that hopefully it's easier to use.

I have opted to compile everything from scratch and dockerize it so that I don't have to worry about the dynamic C libs.

From a performance perspective the SpatiaLite files I generate offer both far lower latency along with more predictable performance independent of the geometry size.

The trick is simple, there is a second table called 'shard' where the geometries are quartered recursively as long as they contain more than n vertices.

The resulting shards are indexed within an r-tree, then when you search your raycasting is limited to n.

They're two very different models, one where you rely on in-memory data structures which is more convenient and the other where you leverage an external library and database to do the work.

thisisaaronland commented 2 years ago

@tomtaylor FYI:

For example:

$> go run cmd/update/main.go \
     -spatial-database-uri rtree:// \
     -spatial-source /usr/local/data/whosonfirst-data-admin-gb \
     -writer-uri stdout:// \
     /usr/local/data/whosonfirst-data-postalcode-gb

By default this imports the -spatial-rtree package. If you need to import a different implementation you'll need to clone the cmd/update/main.go tool and add the relevant import statement.

tomtaylor commented 2 years ago

OK, I've force pushed this branch to two commits.

The first is exactly the same as before - sorts out the src:geom property.

The second is a full re-PIP + export of all records, using the tooling that correctly sets wof:belongs_to, wof:hierarchy and wof:parent_id. Things to look for:

missinglink commented 2 years ago

Looks good, nice to add the continent placetype :+1:

I did see something might be concerning in the new diff:

index 8930aa3a275c..fcfcaa8826ac 100644
--- a/data/116/071/308/2/1160713082.geojson
+++ b/data/116/071/308/2/1160713082.geojson
@@ -23,11 +23,12 @@
     "src:geom": "os",
     "wof:belongsto": [
       404227469,
-      1360698619,
       85633159,
       1360699087,
       404452219,
-      1360757901
+      1360757901,
+      102191581,
+      1360698621
     ],
     "wof:breaches": [],
     "wof:country": "GB",
@@ -35,19 +36,20 @@
     "wof:geomhash": "5b8e3e3f899fe8e186b92a25aa561958",
     "wof:hierarchy": [
       {
+        "continent_id": 102191581,
         "country_id": 85633159,
         "county_id": 1360699087,
         "localadmin_id": 404452219,
         "locality_id": 1360757901,
         "macroregion_id": 404227469,
         "postalcode_id": 1160713082,
-        "region_id": 1360698619
+        "region_id": 1360698621
       }
     ],
     "wof:id": 1160713082,
-    "wof:lastmodified": 1644779445,
+    "wof:lastmodified": 1661974369,
     "wof:name": "TS2 1YD",
-    "wof:parent_id": -1,
+    "wof:parent_id": 1360757901,
     "wof:placetype": "postalcode",
     "wof:repo": "whosonfirst-data-postalcode-gb",
     "wof:superseded_by": [],

You can see the region_id has been updated from 1360698619 to 1360698621, I spot checked the polygon intersection and found that they don't seem to intersect at all πŸ€”

Screenshot 2022-09-01 at 12 00 06 Screenshot 2022-09-01 at 12 00 15

I did a quick search for TS2 1YD and found that North Yorkshire is actually the correct County, is this an error with the PIP process?

missinglink commented 2 years ago

Agh so maybe Durham:1360698621 is being inherited from the immediate parent Thornaby:1360757901 and not assigned by the PIP process directly?

tomtaylor commented 2 years ago

Ugh, I'm honestly not sure. My code blindly follows whatever the default resolver behaviour is: https://github.com/whosonfirst/wof-sync-os-postcodes/blob/master/pipclient/pipclient.go#L42

missinglink commented 2 years ago

Makes sense if that's what it's doing, we have a similar strategy in some PIP implementations, it's less resource intensive to just query for the lowest layer which intersects and then copy the hierarchy from the parent.

So.. while it's not technically correct I think we can roll with it and potentially address it in subsequent PR.

This diff looks great, let's merge it? 🚒

tomtaylor commented 2 years ago

Hurrah!

missinglink commented 2 years ago

The bundling code runs at 0 */6 * * * β€œAt minute 0 past every 6th hour.”. So these changes should appear at https://geocode.earth/data/whosonfirst/ in the next half day or so.

thisisaaronland commented 2 years ago

In the interest of getting this merged I am okay with "not dealing with this" this time but: Technically, a record that ends up with a new parent ID (that is not -1) should be superseded.

I will have a think about whether/where this can happen inside of the existing "hierarchy" code. It may require a different layer since the hierarchy code only deals with "bytes" rather than "writers" (something along the lines of the "update" code above).