Closed tomtaylor closed 4 years ago
Hey @tomtaylor - thanks for getting this PR in!
I'm in the process of QAing these changes now.. I'd also like to get them into WOF. A few notes:
master
and what is being proposed. Example:- "wof:id":438039223,
- "wof:lastmodified":1474571147,
- "wof:name":"B93 0AA",
- "wof:parent_id":404453641,
- "wof:placetype":"postalcode",
- "wof:repo":"whosonfirst-data-postalcode-gb",
- "wof:superseded_by":[],
- "wof:supersedes":[],
- "wof:tags":[]
-},
+ "wof:id": 438039223,
+ "wof:lastmodified": 1583786247,
+ "wof:name": "B93 0AA",
+ "wof:parent_id": 404453641,
+ "wof:placetype": "postalcode",
+ "wof:repo": "whosonfirst-data-postalcode-gb",
+ "wof:superseded_by": [],
+ "wof:supersedes": [],
+ "wof:tags": []
+ },
I assume this is coming from your usage of the go-whosonfirst-export
tool? Maybe @thisisaaronland can chime in about the spacing - is this expected when using the Go export tools? I haven't seen this when using the wof-exportify tool (or Python version of the export code).
{
"id": 1160712856,
"type": "Feature",
"properties": {
"edtf:cessation": "2001-01-01",
"edtf:inception": "1999-06-01",
"geom:area": 0,
"geom:bbox": "-1.467511,51.791435,-1.467511,51.791435",
"geom:latitude": 51.791435,
"geom:longitude": -1.467511,
"iso:country": "GB",
"mz:hierarchy_label": 1,
"mz:is_current": 0,
"os:country_code": "E92000001",
"os:county_code": "E10000025",
"os:district_code": "E07000181",
"os:positional_quality_indicator": "6",
"os:region_code": "E12000008",
"src:geom": "os",
"wof:belongsto": [
1360698839,
404450543,
101852913,
404227469,
1360698585,
85633159
],
"wof:breaches": [],
"wof:country": "GB",
"wof:created": 1583786921,
"wof:geomhash": "f4a74f77b275d57d7607b49c1cd10d83",
"wof:hierarchy": [
{
"country_id": 85633159,
"county_id": 1360698839,
"localadmin_id": 404450543,
"locality_id": 101852913,
"macroregion_id": 404227469,
"region_id": 1360698585
}
],
"wof:id": 1160712856,
"wof:lastmodified": 1583786921,
"wof:name": "OX8 6FT",
"wof:parent_id": -1,
"wof:placetype": "postalcode",
"wof:repo": "whosonfirst-data-postalcode-gb",
"wof:superseded_by": [],
"wof:supersedes": [],
"wof:tags": []
},
"bbox": [
-1.467511,
51.791435,
-1.467511,
51.791435
],
"geometry": {"coordinates":[-1.467511,51.791435],"type":"Point"}
}
Could you explain a bit about these features? They seem to be new, but also include a cessation date in 2001 and are flagged as "not current".
Thanks @stepps00 - I'll try and answer below.
- It looks like the spacing between the properties and property values is different between what is in
master
and what is being proposed. [...]I assume this is coming from your usage of the
go-whosonfirst-export
tool? Maybe @thisisaaronland can chime in about the spacing - is this expected when using the Go export tools? I haven't seen this when using the wof-exportify tool (or Python version of the export code).
That's right - that's what go-whosonfirst-format
does. I think the outcome of this discussion wasn't conclusive about having a binary consistent WOF format.
For this situation, I would personally prefer the Go tooling to be aligned with the outputs from other languages to reduce the diff noise, but that feels like it might be a minority view (or at least we hadn't concluded how to implement that).
FWIW, collapsing the whitespace between JSON key and value would be a one-line change in go-whosonfirst-format
: https://github.com/tomtaylor/go-whosonfirst-format/blob/ef23a835c34e0ef0cf8022483d946e38ad785df7/format.go#L93
- It looks like brand new features are being created in this changeset. [...]
Could you explain a bit about these features? They seem to be new, but also include a cessation date in 2001 and are flagged as "not current".
That's right. These are historical postcodes. The ONS Postcode Directory contains every postcode ever commissioned, including those that have since been deprecated. I felt it important to include these for the fullest data set possible. In my situation, it's reasonably likely a user may search for a postcode that has been removed, and I want to return a result.
Just to be super clear: there are also new features that are current. 39136626
is an example.
- Do you have examples of any records that have changed geometries? I'm unable to find one or a case where we're storing an alt-geometry. I just want to take a look at a few more examples of changes.
I don't think I've seen any significant geometry changes. My understanding is that UK postcodes get deprecated, not reassigned to a new location.
Let me know if I can help with anything else!
Hey @tomtaylor - sorry for the delay here.. I've had a bit of time to take another look, here are some notes:
I'm fine with the spacing as is, though I don't know the impact that will have on he distribution files and/or other tools like the Spelunker, for example. I'll default to what @thisisaaronland suggests here.
As far as the historical postcodes go, that all makes sense and it looks like the edtf:*
and mz:*
properties have been filled out correctly for these records. Should any of these historical postalcodes be superseded by another postalcode record?
A few other things I noticed about the changes:
postalcode_id
is no longer in the wof:hierarchy
for many (all?) records.
wof:hierarchy
for any particular reason?wof:id
values are even-numbered
wof:id
values from Brooklyn Integers, but other sources are okay to use, as long as the values are new.wof:id
values, so I'll default to @thisisaaronland on what to do here, too.os:positional_quality_indicator
values are strings, should they be int values?I've assigned a reviewer to this PR.. aside from the comments above, I think this PR looks good. Thanks for all the work!
Thanks @stepps00, comments below.
As far as the historical postcodes go, that all makes sense and it looks like the
edtf:*
andmz:*
properties have been filled out correctly for these records. Should any of these historical postalcodes be superseded by another postalcode record?
Just so I understand your question, do you mean have I deprecated a postcode that was then reintroduced in a different feature, and so should refer to the newer feature in a 'superseded by' key? I don't believe so. In some situations I've fixed the case of the postcode name from mixed case (Se14 1aa
to SE14 1AA
), but I don't believe anything has been deprecated and then replaced.
postalcode_id
is no longer in thewof:hierarchy
for many (all?) records.
- Was this key/value removed in the
wof:hierarchy
for any particular reason?
Hmmm, good question. The hierarchy was generated by from go-whosonfirst-pip-v2
, by pointing it at whosonfirst-data-admin-gb
and using the response verbatim. I don't think I know enough about how the hierarchy should work - happy to take advice here.
New records'
wof:id
values are even-numbered
- We typically derive our
wof:id
values from Brooklyn Integers, but other sources are okay to use, as long as the values are new.- Up until this point, we've only used odd-numbered
wof:id
values, so I'll default to @thisisaaronland on what to do here, too.
Apologies, I hadn't realised this. I generated >500k new IDs and was trying to reduce the load on the ID servers by balancing between the three. They are all newly minted values. I can rerun this again if it's an issue.
- Looks like
os:positional_quality_indicator
values are strings, should they be int values?
I think they were originally strings in the features that had this key already, so I've kept them as that. I might be wrong though.
I've assigned a reviewer to this PR.. aside from the comments above, I think this PR looks good. Thanks for all the work!
Thanks for your help!
Some more comments, in line:
Just so I understand your question, do you mean have I deprecated a postcode that was then reintroduced in a different feature, and so should refer to the newer feature in a 'superseded by' key? I don't believe so. In some situations I've fixed the case of the postcode name from mixed case (Se14 1aa to SE14 1AA), but I don't believe anything has been deprecated and then replaced.
Yeah, I'm just curious if there are cases where a "new" postalcode actually replaces an "old" postalcode. As in, are there cases where a postalcode was deprecated, but one (or more) postalcodes were introduced in it's place? It might be tough to untangle since we're dealing with Point-type geometries.. but I'm curious.
Hmmm, good question. The hierarchy was generated by from go-whosonfirst-pip-v2, by pointing it at whosonfirst-data-admin-gb and using the response verbatim. I don't think I know enough about how the hierarchy should work - happy to take advice here.
Interesting. I usually point folks to this doc, which outlines how to setup a PIP service to update hierarchies. Was your process similar to this?
Apologies, I hadn't realised this. I generated >500k new IDs and was trying to reduce the load on the ID servers by balancing between the three. They are all newly minted values. I can rerun this again if it's an issue.
Gotcha.. I'll wait for a second pair of eyes before I suggest replacing all of these new integer values but we may need to mint new, unique IDs.
I think they were originally strings in the features that had this key already, so I've kept them as that. I might be wrong though.
If we know that these are all integer values, we should store them as integer values.. if not, string values are okay. Either way, we should also update the properties repo to make sure we classify each of these properties.
Yeah, I'm just curious if there are cases where a "new" postalcode actually replaces an "old" postalcode. As in, are there cases where a postalcode was deprecated, but one (or more) postalcodes were introduced in it's place? It might be tough to untangle since we're dealing with Point-type geometries.. but I'm curious.
I don't think it's possible to know from the ONS data whether this is the case I'm afraid.
Interesting. I usually point folks to this doc, which outlines how to setup a PIP service to update hierarchies. Was your process similar to this?
Yeah, I ran: ./bin/wof-pip-server -mode directory ../whosonfirst-data-admin-gb
, which returned a JSON blob of places, which I converted to a hierarchy here: https://github.com/tomtaylor/wof-sync-os-postcodes/blob/cbbd2cff941a62075f993d116f65248b8bae84a1/wofdata/wofdata.go#L500
I wouldn't expect postalcodes to be in the hierarchy (postalcodes aren't hierarchical in the UK), unless the hierarchy is meant to include the feature it represents?
If we know that these are all integer values, we should store them as integer values.. if not, string values are okay. Either way, we should also update the properties repo to make sure we classify each of these properties.
It does look like this will always be 1-6 or 8-9. (Reference page 35 of the User Guide PDF provided with the data). It's the status of the positional information.
Shows the status of the assigned grid reference.
1 = within the building of the matched address closest to the postcode mean;
2 = as for status value 1, except by visual inspection of Landline maps (Scotland only); 3 = approximate to within 50 metres;
4 = postcode unit mean (mean of matched addresses with the same postcode, but not snapped to a building);
5 = imputed by ONS, by reference to surrounding postcode grid references;
6 = postcode sector mean, (mainly PO Boxes);
8 = postcode terminated prior to Gridlink® initiative, last known ONS postcode grid reference1;
9 = no grid reference available
I can update the properties reference. Would prefer not to change to an int, but only because rebuilding all these files takes bloody ages! 😂
I wouldn't expect postalcodes to be in the hierarchy (postalcodes aren't hierarchical in the UK), unless the hierarchy is meant to include the feature it represents?
Yes, the wof:hierarchy
is typically inclusive of the feature itself. The existing values that I've QA'd do look okay, but it would be inconsistent for us to import the wof:hierarchy
properties without listing out the postalcode_ids.
I can update the properties reference. Would prefer not to change to an int, but only because rebuilding all these files takes bloody ages! 😂
Yeah, I definitely understand.. it wouldn't break anything per se, but it is ideal for us to keep these values as integers.
I've been able to validate the repo and build a SQLite distribution file with your changes, which is great. I'm going to suggest the following before importing:
wof:hierarchy
properties, to include the postalcode_id
values of the records themselvesos:positional_quality_indicator
If you're unable to update the hierarchy or os:* properties, no problem. I have the branch checked out and I can work on it, too. Let me know what you think.
Comments:
White space is not relevant with regards to the Spelunker or any tooling. As @tomtaylor mentioned there is an ongoing discussion about formatting and canonical representations but it hasn't been finalized.
Brooklyn Integers versus the Mission, London Integers. Strictly speaking it doesn't (shouldn't) matter but the convention for "core" WOF records has been to use BI.
Hierarchy / PIP stuff - I think what's happening is that the PIP server only returns ancestors and the Python export code ensures that any given record's ID is included in the hierarchy. If the Go export code doesn't do this it should.
There is also some (relatively) complex logic around hierarchies here that I don't think has been ported to Go yet:
The distinction between "PIP" and "hierarchy" being that the former is Just Math (tm) and the latter tries to account for the vagueries and subtleties of place. I may be able to spend a bit of time next week with PIP/hierarchy stuff and if nothing draft a document outlining the various pieces, known-knowns and the various "to do" s.
Since the changes in this PR all validate and I was able to create a distribution file, I'm going to suggest merging this PR as-is. There may be follow-up work to update the hierarchies, but I don't think those are blocking this work from being imported.
Once merged, this changeset would also be a useful test case in using non-BI wof:id
values.
I'll leave this open for any follow-up comments and to QA a few more records, but I'd like to merge this by early next week.
Okay, I will continue having a think about this and raise any concerns (if there any) by Monday.
Thanks folks - let me know if you need me to do anything.
I guess my only concern is consistency with regards to odd numbers (Brooklyn Integers) but, I mean... really, it's not that important.
So, let's just it's okay and move on to other things.
Sounds good, I'll merge this now. We can always file follow-ups if the even-numbered wof:id
values turn out to be an issue. Thanks @tomtaylor for submitting this PR!
And FWIW, some feature counts and times from the dist build:
12:13:35.766464 [wof-dist-build] STATUS time to index all (2642440) : 21h26m13.135946606s
Hurrah! Thank you everyone!
This supersedes #3, but does the same thing.
It's now synced against the Feb 2020 release of the ONS Postcode Directory and uses an updated
go-whosonfirst-export
which includesgo-whosonfirst-format
, to format the GeoJSON files to WOF standards.I'd really like to get this merged, as the UK postcode data is badly out of date.
I appreciate it's a big diff, so please let me know what I can do to help.