whosonfirst-data / whosonfirst-data-postalcode-gb

Who's On First postal code data for GB
https://whosonfirst.mapzen.com
Other
2 stars 2 forks source link

Sync with ONS Feb 2020 data release #4

Closed tomtaylor closed 4 years ago

tomtaylor commented 4 years ago

This supersedes #3, but does the same thing.

It's now synced against the Feb 2020 release of the ONS Postcode Directory and uses an updated go-whosonfirst-export which includes go-whosonfirst-format, to format the GeoJSON files to WOF standards.

I'd really like to get this merged, as the UK postcode data is badly out of date.

I appreciate it's a big diff, so please let me know what I can do to help.

stepps00 commented 4 years ago

Hey @tomtaylor - thanks for getting this PR in!

I'm in the process of QAing these changes now.. I'd also like to get them into WOF. A few notes:

  1. It looks like the spacing between the properties and property values is different between what is in master and what is being proposed. Example:
-    "wof:id":438039223,
-    "wof:lastmodified":1474571147,
-    "wof:name":"B93 0AA",
-    "wof:parent_id":404453641,
-    "wof:placetype":"postalcode",
-    "wof:repo":"whosonfirst-data-postalcode-gb",
-    "wof:superseded_by":[],
-    "wof:supersedes":[],
-    "wof:tags":[]
-},
+    "wof:id": 438039223,
+    "wof:lastmodified": 1583786247,
+    "wof:name": "B93 0AA",
+    "wof:parent_id": 404453641,
+    "wof:placetype": "postalcode",
+    "wof:repo": "whosonfirst-data-postalcode-gb",
+    "wof:superseded_by": [],
+    "wof:supersedes": [],
+    "wof:tags": []
+  },

I assume this is coming from your usage of the go-whosonfirst-export tool? Maybe @thisisaaronland can chime in about the spacing - is this expected when using the Go export tools? I haven't seen this when using the wof-exportify tool (or Python version of the export code).

  1. It looks like brand new features are being created in this changeset. An example being:
{
  "id": 1160712856,
  "type": "Feature",
  "properties": {
    "edtf:cessation": "2001-01-01",
    "edtf:inception": "1999-06-01",
    "geom:area": 0,
    "geom:bbox": "-1.467511,51.791435,-1.467511,51.791435",
    "geom:latitude": 51.791435,
    "geom:longitude": -1.467511,
    "iso:country": "GB",
    "mz:hierarchy_label": 1,
    "mz:is_current": 0,
    "os:country_code": "E92000001",
    "os:county_code": "E10000025",
    "os:district_code": "E07000181",
    "os:positional_quality_indicator": "6",
    "os:region_code": "E12000008",
    "src:geom": "os",
    "wof:belongsto": [
      1360698839, 
      404450543, 
      101852913, 
      404227469, 
      1360698585, 
      85633159
    ],
    "wof:breaches": [],
    "wof:country": "GB",
    "wof:created": 1583786921,
    "wof:geomhash": "f4a74f77b275d57d7607b49c1cd10d83",
    "wof:hierarchy": [
      {
        "country_id": 85633159,
        "county_id": 1360698839,
        "localadmin_id": 404450543,
        "locality_id": 101852913,
        "macroregion_id": 404227469,
        "region_id": 1360698585
      }
    ],
    "wof:id": 1160712856,
    "wof:lastmodified": 1583786921,
    "wof:name": "OX8 6FT",
    "wof:parent_id": -1,
    "wof:placetype": "postalcode",
    "wof:repo": "whosonfirst-data-postalcode-gb",
    "wof:superseded_by": [],
    "wof:supersedes": [],
    "wof:tags": []
  },
  "bbox": [
    -1.467511, 
    51.791435, 
    -1.467511, 
    51.791435
  ],
  "geometry": {"coordinates":[-1.467511,51.791435],"type":"Point"}
}

Could you explain a bit about these features? They seem to be new, but also include a cessation date in 2001 and are flagged as "not current".

  1. Do you have examples of any records that have changed geometries? I'm unable to find one or a case where we're storing an alt-geometry. I just want to take a look at a few more examples of changes.
tomtaylor commented 4 years ago

Thanks @stepps00 - I'll try and answer below.

  1. It looks like the spacing between the properties and property values is different between what is in master and what is being proposed. [...]

I assume this is coming from your usage of the go-whosonfirst-export tool? Maybe @thisisaaronland can chime in about the spacing - is this expected when using the Go export tools? I haven't seen this when using the wof-exportify tool (or Python version of the export code).

That's right - that's what go-whosonfirst-format does. I think the outcome of this discussion wasn't conclusive about having a binary consistent WOF format.

For this situation, I would personally prefer the Go tooling to be aligned with the outputs from other languages to reduce the diff noise, but that feels like it might be a minority view (or at least we hadn't concluded how to implement that).

FWIW, collapsing the whitespace between JSON key and value would be a one-line change in go-whosonfirst-format: https://github.com/tomtaylor/go-whosonfirst-format/blob/ef23a835c34e0ef0cf8022483d946e38ad785df7/format.go#L93

  1. It looks like brand new features are being created in this changeset. [...]

Could you explain a bit about these features? They seem to be new, but also include a cessation date in 2001 and are flagged as "not current".

That's right. These are historical postcodes. The ONS Postcode Directory contains every postcode ever commissioned, including those that have since been deprecated. I felt it important to include these for the fullest data set possible. In my situation, it's reasonably likely a user may search for a postcode that has been removed, and I want to return a result.

Just to be super clear: there are also new features that are current. 39136626 is an example.

  1. Do you have examples of any records that have changed geometries? I'm unable to find one or a case where we're storing an alt-geometry. I just want to take a look at a few more examples of changes.

I don't think I've seen any significant geometry changes. My understanding is that UK postcodes get deprecated, not reassigned to a new location.

Let me know if I can help with anything else!

stepps00 commented 4 years ago

Hey @tomtaylor - sorry for the delay here.. I've had a bit of time to take another look, here are some notes:

I'm fine with the spacing as is, though I don't know the impact that will have on he distribution files and/or other tools like the Spelunker, for example. I'll default to what @thisisaaronland suggests here.

As far as the historical postcodes go, that all makes sense and it looks like the edtf:* and mz:* properties have been filled out correctly for these records. Should any of these historical postalcodes be superseded by another postalcode record?

A few other things I noticed about the changes:

I've assigned a reviewer to this PR.. aside from the comments above, I think this PR looks good. Thanks for all the work!

tomtaylor commented 4 years ago

Thanks @stepps00, comments below.

As far as the historical postcodes go, that all makes sense and it looks like the edtf:* and mz:* properties have been filled out correctly for these records. Should any of these historical postalcodes be superseded by another postalcode record?

Just so I understand your question, do you mean have I deprecated a postcode that was then reintroduced in a different feature, and so should refer to the newer feature in a 'superseded by' key? I don't believe so. In some situations I've fixed the case of the postcode name from mixed case (Se14 1aa to SE14 1AA), but I don't believe anything has been deprecated and then replaced.

  • postalcode_id is no longer in the wof:hierarchy for many (all?) records.
    • Was this key/value removed in the wof:hierarchy for any particular reason?

Hmmm, good question. The hierarchy was generated by from go-whosonfirst-pip-v2, by pointing it at whosonfirst-data-admin-gb and using the response verbatim. I don't think I know enough about how the hierarchy should work - happy to take advice here.

  • New records' wof:id values are even-numbered

    • We typically derive our wof:id values from Brooklyn Integers, but other sources are okay to use, as long as the values are new.
    • Up until this point, we've only used odd-numbered wof:id values, so I'll default to @thisisaaronland on what to do here, too.

Apologies, I hadn't realised this. I generated >500k new IDs and was trying to reduce the load on the ID servers by balancing between the three. They are all newly minted values. I can rerun this again if it's an issue.

  • Looks like os:positional_quality_indicator values are strings, should they be int values?

I think they were originally strings in the features that had this key already, so I've kept them as that. I might be wrong though.

I've assigned a reviewer to this PR.. aside from the comments above, I think this PR looks good. Thanks for all the work!

Thanks for your help!

stepps00 commented 4 years ago

Some more comments, in line:

Just so I understand your question, do you mean have I deprecated a postcode that was then reintroduced in a different feature, and so should refer to the newer feature in a 'superseded by' key? I don't believe so. In some situations I've fixed the case of the postcode name from mixed case (Se14 1aa to SE14 1AA), but I don't believe anything has been deprecated and then replaced.

Yeah, I'm just curious if there are cases where a "new" postalcode actually replaces an "old" postalcode. As in, are there cases where a postalcode was deprecated, but one (or more) postalcodes were introduced in it's place? It might be tough to untangle since we're dealing with Point-type geometries.. but I'm curious.

Hmmm, good question. The hierarchy was generated by from go-whosonfirst-pip-v2, by pointing it at whosonfirst-data-admin-gb and using the response verbatim. I don't think I know enough about how the hierarchy should work - happy to take advice here.

Interesting. I usually point folks to this doc, which outlines how to setup a PIP service to update hierarchies. Was your process similar to this?

Apologies, I hadn't realised this. I generated >500k new IDs and was trying to reduce the load on the ID servers by balancing between the three. They are all newly minted values. I can rerun this again if it's an issue.

Gotcha.. I'll wait for a second pair of eyes before I suggest replacing all of these new integer values but we may need to mint new, unique IDs.

I think they were originally strings in the features that had this key already, so I've kept them as that. I might be wrong though.

If we know that these are all integer values, we should store them as integer values.. if not, string values are okay. Either way, we should also update the properties repo to make sure we classify each of these properties.

tomtaylor commented 4 years ago

Yeah, I'm just curious if there are cases where a "new" postalcode actually replaces an "old" postalcode. As in, are there cases where a postalcode was deprecated, but one (or more) postalcodes were introduced in it's place? It might be tough to untangle since we're dealing with Point-type geometries.. but I'm curious.

I don't think it's possible to know from the ONS data whether this is the case I'm afraid.

Interesting. I usually point folks to this doc, which outlines how to setup a PIP service to update hierarchies. Was your process similar to this?

Yeah, I ran: ./bin/wof-pip-server -mode directory ../whosonfirst-data-admin-gb, which returned a JSON blob of places, which I converted to a hierarchy here: https://github.com/tomtaylor/wof-sync-os-postcodes/blob/cbbd2cff941a62075f993d116f65248b8bae84a1/wofdata/wofdata.go#L500

I wouldn't expect postalcodes to be in the hierarchy (postalcodes aren't hierarchical in the UK), unless the hierarchy is meant to include the feature it represents?

If we know that these are all integer values, we should store them as integer values.. if not, string values are okay. Either way, we should also update the properties repo to make sure we classify each of these properties.

It does look like this will always be 1-6 or 8-9. (Reference page 35 of the User Guide PDF provided with the data). It's the status of the positional information.

Shows the status of the assigned grid reference.
1 = within the building of the matched address closest to the postcode mean;
2 = as for status value 1, except by visual inspection of Landline maps (Scotland only); 3 = approximate to within 50 metres;
4 = postcode unit mean (mean of matched addresses with the same postcode, but not snapped to a building);
5 = imputed by ONS, by reference to surrounding postcode grid references;
6 = postcode sector mean, (mainly PO Boxes);
8 = postcode terminated prior to Gridlink® initiative, last known ONS postcode grid reference1;
9 = no grid reference available

I can update the properties reference. Would prefer not to change to an int, but only because rebuilding all these files takes bloody ages! 😂

stepps00 commented 4 years ago

I wouldn't expect postalcodes to be in the hierarchy (postalcodes aren't hierarchical in the UK), unless the hierarchy is meant to include the feature it represents?

Yes, the wof:hierarchy is typically inclusive of the feature itself. The existing values that I've QA'd do look okay, but it would be inconsistent for us to import the wof:hierarchy properties without listing out the postalcode_ids.

I can update the properties reference. Would prefer not to change to an int, but only because rebuilding all these files takes bloody ages! 😂

Yeah, I definitely understand.. it wouldn't break anything per se, but it is ideal for us to keep these values as integers.

I've been able to validate the repo and build a SQLite distribution file with your changes, which is great. I'm going to suggest the following before importing:

If you're unable to update the hierarchy or os:* properties, no problem. I have the branch checked out and I can work on it, too. Let me know what you think.

thisisaaronland commented 4 years ago

Comments:

There is also some (relatively) complex logic around hierarchies here that I don't think has been ported to Go yet:

https://github.com/whosonfirst/py-mapzen-whosonfirst-hierarchy/blob/master/mapzen/whosonfirst/hierarchy/__init__.py

The distinction between "PIP" and "hierarchy" being that the former is Just Math (tm) and the latter tries to account for the vagueries and subtleties of place. I may be able to spend a bit of time next week with PIP/hierarchy stuff and if nothing draft a document outlining the various pieces, known-knowns and the various "to do" s.

stepps00 commented 4 years ago

Since the changes in this PR all validate and I was able to create a distribution file, I'm going to suggest merging this PR as-is. There may be follow-up work to update the hierarchies, but I don't think those are blocking this work from being imported.

Once merged, this changeset would also be a useful test case in using non-BI wof:id values.

I'll leave this open for any follow-up comments and to QA a few more records, but I'd like to merge this by early next week.

thisisaaronland commented 4 years ago

Okay, I will continue having a think about this and raise any concerns (if there any) by Monday.

tomtaylor commented 4 years ago

Thanks folks - let me know if you need me to do anything.

thisisaaronland commented 4 years ago

I guess my only concern is consistency with regards to odd numbers (Brooklyn Integers) but, I mean... really, it's not that important.

So, let's just it's okay and move on to other things.

stepps00 commented 4 years ago

Sounds good, I'll merge this now. We can always file follow-ups if the even-numbered wof:id values turn out to be an issue. Thanks @tomtaylor for submitting this PR!

And FWIW, some feature counts and times from the dist build:

12:13:35.766464 [wof-dist-build] STATUS time to index all (2642440) : 21h26m13.135946606s

tomtaylor commented 4 years ago

Hurrah! Thank you everyone!