whosonfirst / whosonfirst-cookbook

Instructions and How-To Documentation for Who's on First

Formalize an "exporting WOF documents" specification #35

Open thisisaaronland opened 4 years ago

thisisaaronland commented 4 years ago

Some background:

@Joxit noted that:

I minified a WOF document and then played `Spot the difference` between the Go and Python exporters; here is what I found:

1. No precision reduction on coordinates
2. No use of UTF-8 hex codes
3. Extra space after the colon
4. 2 spaces instead of 4 for the indentation
5. Sometimes an extra space at the end of the line

Example for 2, 3, and 4:

-    "name:jpn_x_preferred":[
-        "\u30a2\u30f3\u30c6\u30a3\u30fc\u30d6"
+    "name:jpn_x_preferred": [
+      "アンティーブ"
    ],

Example for 1 and 5:

  "bbox": [
-    7.064374,
-    43.54155,
-    7.145075,
-    43.622545
+    7.06437411574864, 
+    43.54155041682094, 
+    7.14507537244949, 
+    43.62254464156442
]

To which I replied:

It's probably time to put together a reference document that

1. Makes explicit encoding decisions (for example, I think we decided decimal precision would be capped at 13 places but could be less).
2. Language bindings can consult and test against.

And then @tomtaylor said:

@straup we should probably decide how we want to proceed here. I think the options are:

1. Treat the Python exporter as 'the spec', and align all the other implementations to that. Pros: the least noise in the commits. Cons: the spec is pretty weird in places and can be hard to implement in other languages (e.g. 4-space, then 2-space indentation).

2. Come up with a new spec, and align all the implementations to that. Pros: a sensible spec could be easy to implement in all languages. Cons: potentially a lot of noise in the diffs.

3. Some kind of compromise, trying to make the spec a little more consistent, while also trying to minimise the noise in the diffs.

Also - where's the canonical discussion of this?

cc @nvkelso @stepps00

References:

Which are used respectively by:

For the purposes of this issue I think we are only discussing the first two packages. As in: how do blobs of GeoJSON get marshaled, not what the content of those blobs is.

thisisaaronland commented 4 years ago

Here are the rules, such as they have been formalized in code or explicitly written down anywhere. Rules 1-4 have been copied from the comments in py-mapzen-whosonfirst-geojson.

  1. Until further notice geometries may have up to, but not exceeding, (14) decimal places. This is probably serious overkill but it's also what Quattroshapes does so we're just going to leave it as is for now.

  2. Everything else is truncated to (6) decimal places.

  3. Trailing zeros are removed from all coordinates. Mostly this is to account for Python deciding to use scientific notation on a whim, which is super annoying. To that end we are enforcing some standards, which raises the larger question of why we let anyone specify a precision at all. But that is tomorrow's problem...

  4. WOF (GeoJSON) properties are indented but geometries are not. Additionally geometries are placed at the end of a Feature record. This is an explicit design decision to make opening a WOF record in a vanilla text editor easier. The idea being that geometries aren't going to be edited by hand but properties might be. This decision means we can't use off-the-shelf JSON encoders and generally have to write our own marshaling code which is unfortunate.

  5. WOF (GeoJSON) properties (keys) are sorted and encoded alphabetically.

  6. WOF (GeoJSON) documents use 2-space indenting. More specifically: That is what we decided in the beginning and what the py-mapzen-whosonfirst-export encoder defaults to but we have never been strict about spacing. Maybe we should?

  7. WOF (GeoJSON) Features should have a top-level id property (that maps to properties.wof:id).

  8. WOF (GeoJSON) Features should have a top-level bbox array.
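
For concreteness, here is a minimal Python sketch of rules 1-4 (plus the sorting and 2-space indenting from rules 5 and 6). This is not the actual exporter code; the names are hypothetical and it glosses over plenty of details:

# Hypothetical sketch of the export rules; illustrative only.
import json

def format_coordinate(value, precision=14):
    # Rules 1-3: cap the number of decimal places (14 for geometries,
    # 6 for everything else) and strip trailing zeros so that Python
    # never falls back to scientific notation.
    return ("%.*f" % (precision, value)).rstrip("0").rstrip(".")

def marshal_feature(feature):
    # Rule 4: everything except the geometry is pretty-printed (with
    # sorted keys and a 2-space indent, per rules 5 and 6); the
    # geometry is compact and written last, since coordinates are not
    # edited by hand. format_coordinate() would be applied to each
    # coordinate before this step.
    head = {k: v for k, v in feature.items() if k != "geometry"}
    body = json.dumps(head, indent=2, sort_keys=True)
    geom = json.dumps(feature["geometry"], separators=(",", ":"))
    return body[:-2] + ',\n  "geometry": %s\n}' % geom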

thisisaaronland commented 4 years ago

To @Joxit's point about "No use of UTF-8 hex codes" in the go-whosonfirst-export and -format packages:

  1. I think this is as much about Python's weird relationship with Unicode prior to the 3.x versions.
  2. I don't know how Python 3.x would encode the same values. Does it use hex codes or...?
  3. I like the idea of being able to open and edit a WOF document in a pre-Unicode text editor but maybe a) those don't exist anymore, b) even if they do it's not really helpful since the benefits of Unicode outweigh everything else, and c) who can read and decipher hex codes in their head anyway?
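
For reference, Python 3 answers point 2 directly: its json module still defaults to ASCII-safe \uXXXX escapes, and the behaviour is controlled by the ensure_ascii flag:

import json

# Python 3 defaults to \uXXXX escapes, just like Python 2:
print(json.dumps({"name": "アンティーブ"}))
# {"name": "\u30a2\u30f3\u30c6\u30a3\u30fc\u30d6"}

# ensure_ascii=False emits the characters directly instead:
print(json.dumps({"name": "アンティーブ"}, ensure_ascii=False))
# {"name": "アンティーブ"}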

tomtaylor commented 4 years ago

Thanks for bringing this together!

On point 6: it's worth mentioning that there are a lot of documents out there with mixed space indentation, varying between 2, 4 and 0 spaces, depending on the depth and key!

To try and summarise, I think we're trying to decide on:

Things that I'd consider fixed are:

Have I missed anything?

thisisaaronland commented 4 years ago

That looks about right, yeah. Can you talk a bit more about these and how/why they are important:

  • line lengths
  • (related to above) how arrays break over lines

One thing that hasn't been mentioned so far are "remarks" files which were developed to address the need for comments (remarks, even) that weren't suited for a WOF document specifically:

https://github.com/whosonfirst/whosonfirst-cookbook/blob/master/definition/remarks_files.md

I mention them because I guess I've never imagined a situation where any given property would need to be wrapped at a given length (outside of having an explicit newline character which I think we assume(d) would never happen). Is there a particular use case you're thinking of?

As far as indenting offsets go, is there value/benefit in being strict about the length of those offsets? The principal aim of indenting was to make it easier to read a WOF document in a text editor or a browser window and not anything else.

There is something to be said for being able to compare the bytes of two documents in canonical form but maybe that's a secondary formatting that follows all the default rules but has no indenting? I don't know if that just makes things more confusing or not...

Joxit commented 4 years ago

3. I like the idea of being able to open and edit a WOF document in a pre-Unicode text editor but maybe a) those don't exist anymore, b) even if they do it's not really helpful since the benefits of Unicode outweigh everything else, and c) who can read and decipher hex codes in their head anyway?

a) true, b) true, c) nobody except cyborgs :robot: :laughing:

I think your list is complete @tomtaylor

If a spec is created, the update should be lazy => only for new and updated documents. Too many changes will increase the size of the git index anyway, but it would be cool if it could be bounded (using squash?).

My feelings:

tomtaylor commented 4 years ago

Can you talk a bit more about these and how/why they are important:

  • line lengths
  • (related to above) how arrays break over lines

Different formatters do different things with arrays, sometimes guided by an ideal maximum line length.

For example:

{
  "key": [1, 2, 3]
}

vs

{
  "key": [
    1,
    2,
    3
  ]
}

thisisaaronland commented 4 years ago

Apologies for the radio silence on this.

For indenting, I think the rule(s) should be:

Trailing whitespace SHOULD be avoided as a rule but, as with indentation, it is not worth making a big deal over: if this is not the representation used to compare bytes, insignificant whitespace should be left to the discretion of users, I think.

I agree with @Joxit about encodings but can someone confirm that Python 3 does the "right" thing?

I also agree that 14 decimal places is a lot but there were reasons (not that I remember them... @nvkelso ?). They can be shorter (per the rule about trimming zeros from coordinates) but I think the compromise was that they just can't be longer than 14 decimal places.

@tomtaylor : Arrays have always been indented with one item per line since that's what Python's JSON encoder has done so I'd prefer to just stick with that.
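
For what it's worth, that is what Python's stock encoder does whenever an indent is set; a quick check:

import json

# With indent set, Python's encoder puts one array element per line:
print(json.dumps({"key": [1, 2, 3]}, indent=2))
# {
#   "key": [
#     1,
#     2,
#     3
#   ]
# }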

Thoughts?

Joxit commented 4 years ago

I'm OK with your 4 rules!

I'm not a Python guy, but I did two tests:

$ python2
>>> print("அஜாக்ஸியோ")
அஜாக்ஸியோ
>>> print("\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb")
\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb
>>> unicode(u"அஜாக்ஸியோ")
u'\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb'
$ python3
>>> print("\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb")
அஜாக்ஸியோ
>>> print("அஜாக்ஸியோ")
அஜாக்ஸியோ

I found the unicode() built-in being used on strings, so I added that to my test.

So, if I'm understanding correctly: the default encoding was ASCII for Python 2, which is why we need to declare unicode strings (with u"content"). The exporter uses the unicode() function to get the hex codes and then works in ASCII. For Python 3 the default encoding is UTF-8, so it will transform hex codes into UTF-8 characters. So that's true: hex codes are artifacts from Python 2's past.

tomtaylor commented 4 years ago

Thoughts?

I think I've got a slightly different perspective from yours, having just wrangled a ~2.5-million-file repo of UK postcodes.

I wrote a Go tool to sync against the source data and manipulate the files. It rewrites every file, because it's difficult to compare the representation precisely in Go (as it stands).

Because there is no canonical formatting spec, I ended up creating a lot of unnecessary changes in the diffs because my tool formatted things slightly differently. Now it's difficult to see what I introduced/edited/deprecated, because there's a load of whitespace noise.

Without a tight canonical spec for how a WOF file should be formatted, this'll happen again and again, as multiple tools serving different purposes write and rewrite files. It'll produce a git history where it's difficult to work out what has actually changed and why.

That might be fine, and maybe we can live with that. Maybe hand editing is a big enough use case that it should take priority? But I feel like most of the WOF changes I see are being performed by tools and scripts.

Maybe there's a way of solving this in the tools themselves, by preserving some of the formatting as the files pass in/out, but that sounds quite tricky to me.

I think I'd prefer to decide a tight formatting spec, but I'd like to hear other arguments.

Joxit commented 4 years ago

So, what you would like is to have a spec, format all the WOF documents from all repositories, and then work from this new base?

tomtaylor commented 4 years ago

So, what you would like is to have a spec, format all the WOF documents from all repositories, and then work from this new base?

No need to touch everything at once. But by updating the formatting libraries in each implementation it'll tend towards that over time. I'm not pushing this hard - but I think it's worth thinking about.

Joxit commented 4 years ago

Okay, in a lazy way then. That way, only the first contribution will be hard to compare.

I feel like your case is covered by @thisisaaronland's rules. He wrote permissive rules in order to limit side effects on diffs. But you're right, that is useful only for hand editing... :thinking:

missinglink commented 4 years ago

Heya looks like I'm a bit late to the party on this one :tada:

I was actually thinking over the holiday break about how to simplify reading/writing different WOF collection formats.

Talking about interoperability between different tools/languages, I think this quote is pertinent:

Interfaces are just contracts or signatures and they don't know anything about implementations.

As well as an excerpt from the 'Unix Philosophy', which is an oldie (1978) but a goodie:

  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".

  2. Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats.

So I wrote a reference implementation of how the 'Unix Philosophy' could work for WOF data, the code is only two days old but it's capable of reading/writing all the major formats including fs, sqlite and bundle via unix pipes:

# convert git repo to sqlite database

wof git export /data/whosonfirst-data-admin-nz | wof sqlite import nz.db

sqlite3 nz.db '.tables'
ancestors     concordances  geojson       names         spr

# convert a sqlite database to a tar.bz2 bundle

wof sqlite export nz.db | wof bundle import nz.tar.bz2

tar -tf nz.tar.bz2
... many geojson files
data/857/829/01/85782901.geojson
data/857/829/05/85782905.geojson
data/857/842/67/85784267.geojson
data/857/846/57/85784657.geojson
meta/locality.csv
meta/county.csv
meta/localadmin.csv
meta/neighbourhood.csv
meta/dependency.csv
meta/country.csv
meta/region.csv

The magic here is in the interface: each process just knows it's either reading or writing a stream of geojson features; it doesn't care how the bytes are encoded/marshalled, only that the contents are valid geojson.

The only time the marshalling is actually relevant is when using byte-for-byte comparison tools like diff (and git diff).

These traditional diff tools are not a good fit for this task, but they can still be used when the JSON marshal algorithm is the same for both documents being compared, which I think is what we're discussing here :smile:

The current marshalling format makes working with existing line-based unix tooling difficult:

wof git export /data/whosonfirst-data-admin-nz --path=data/857/846/57/85784657.geojson \
  | wc -l

105

But it's simple enough to reformat the stream so that each feature is minified and printed one-per-line by piping the stream to a formatter:

wof git export /data/whosonfirst-data-admin-nz --path=data/857/846/57/85784657.geojson \
  | wof feature format \
  | wc -l

1

The same idea can be applied to a 'canonical marshalling': as per the 'Unix Philosophy', we could simply have one program which accepts geojson via stdin and outputs the canonical format to stdout.
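
As a minimal sketch (hypothetical; the actual canonical rules are exactly what this issue is trying to pin down), such a filter could be as small as:

# canonical.py: hypothetical stdin-to-stdout canonical formatter.
# The choices below (sorted keys, no insignificant whitespace, raw
# UTF-8) are placeholders for whatever rules get agreed on.
import json
import sys

feature = json.load(sys.stdin)
json.dump(feature, sys.stdout, sort_keys=True, ensure_ascii=False,
          separators=(",", ":"))
sys.stdout.write("\n")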

The nice thing about this is that tooling written in other languages doesn't need to worry about what the git format is exactly, they can simply pipe their own output to this one program to be guaranteed bug-for-bug compatibility.

I could be wrong about this but I suspect that if we all try to implement a common marshalling format in various languages it's going to be time-consuming, error-prone and will lose out on the performance benefits of using the native serialisation provided by each environment.

I haven't looked into it deeply but I'm assuming the string & number encoding problems will go away so long as each implementation agrees to be lossless, avoiding truncating float precision or otherwise mutating the underlying data?
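
For the float side of that, Python 3 at least round-trips values losslessly, since repr() of a float is guaranteed to parse back to the same double:

import json

# Round-tripping preserves the underlying double values exactly,
# even if the textual form could in principle differ slightly:
s = "[7.06437411574864, 43.54155041682094]"
once = json.dumps(json.loads(s))
assert json.loads(once) == json.loads(s)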

thisisaaronland commented 4 years ago

I think there are two separate, but equally valid, use cases here:

  1. The need to do some sort of byte-level comparison to see if a document has changed. Am I correct in thinking this is one of the issues you're trying to tackle @tomtaylor ?

  2. The need to encode a document in a consistent manner that allows it to be easily read or written to without the need for specialized tools. This has always been a central goal of WOF with the idea being that the failure scenario is that a WOF document can and should be editable in any old plain-text editor. Not the ideal scenario per se, but still workable.

That's why I suggested earlier that perhaps we settle on two different marshaling formats, one for publishing and one for comparisons. Rules for the latter might be as simple as:

The former would be the list as we've discussed so far with recommendations about indenting and white space but deviations from those suggestions would not trigger errors.

Would that work for you @tomtaylor ?

@missinglink There is ongoing work on something along the lines of what you're suggesting, but most of the data "source" abstraction happens at the reader (and writer) level:

In a WOF context the idea is that bytes (specifically io.ReadCloser instances) are formatted:

And then exported:

To generic "writer" interfaces:

(This one lacking documentation but it is just like go-reader but for... writing things.)

All of which can then be encapsulated in WOF specific code and rules:

Related is the standard WOF GeoJSON package (which is probably due for an overhaul as plain old whosonfirst/go-geojson to remove v1 stuff and so I can stop typing v2 everywhere):

And the validation package which

There are some outstanding inconsistencies in many of these interfaces but that's what I am trying to figure out.

missinglink commented 4 years ago

"That's why I suggested earlier that perhaps we settle on two different marshaling formats, one for publishing and one for comparisons"

I was actually hoping for 0 marshalling formats but it seems 1 will be required 😝 Surely we can use the same format for both publishing & comparisons?

Insofar as I see it, there are three distinct problems:

  1. Visual display for 👨 should be easy to read (and hand edit if need be), and UI friendly.
  2. Byte-for-byte comparisons by 🤖 should yield true for the same data encoded/marshalled differently.
  3. Hashing of geojson (ie. md5 hashes in meta files for both features and geometry) should hash the same.

Regarding the last 2 points, I think the only viable option is a program (preferably a single binary) which encodes/marshals geojson in a predictable way.

For :robot: it doesn't really matter which format is chosen as long as it is deterministic.

...and I think getting different languages to marshall in a deterministic way is going to be difficult or near impossible!

# some examples:

# a python2 serializer using 'indent=2, sort_keys=True'
function wof_write_python(){ python2 -c 'import sys, json; print json.dumps(json.load(sys.stdin), separators=(",", ":"), indent=2, sort_keys=True);'; }

# a nodejs serializer using 'indent=2' (sorted keys requires user code)
function wof_write_javascript(){ node -e 'fs=require("fs"); console.log(JSON.stringify(JSON.parse(fs.readFileSync("/dev/stdin")), null, 2))'; }

# a jq serializer using '--indent 2' and '-S' to sort keys
function wof_write_jq(){ jq -rMS --indent 2; }

I exported a random record from git and serialized it using these various methods and none of them produced the same result:

cat 85784657.git | wof_write_python > 85784657.python
cat 85784657.git | wof_write_javascript > 85784657.javascript
cat 85784657.git | wof_write_jq > 85784657.jq
ls -l 85784657.*
-rw-r--r--   1 peter  staff      2636 Jan  8 10:21 85784657.git
-rw-r--r--   1 peter  staff      2685 Jan  8 10:55 85784657.javascript
-rw-r--r--   1 peter  staff      2685 Jan  8 11:03 85784657.jq
-rw-r--r--   1 peter  staff      2612 Jan  8 10:55 85784657.python
md5sum 85784657.*
b83e91ab7f05c326e8f2a1cea5f73df8  85784657.git
f74e3caee5c3f52e0caedacc41c74967  85784657.javascript
3ddf393285f856078c1bf61697c2bd0e  85784657.jq
a459cb675e3991806009d6cbf4183a8f  85784657.python

This leads me to believe that we should just have a single program which is responsible for this task.

The program will need to be deterministic and ideally for portability reasons it would be a compiled binary available for multiple architectures (here's looking at you Go 😉).

cat 85784657.git | magic_serializer | md5sum
3ddf393285f856078c1bf61697c2bd0e  -

cat 85784657.python | magic_serializer | md5sum
3ddf393285f856078c1bf61697c2bd0e  -

So, what format does magic_serializer produce? This brings me back to point 1, visual display for :man:.

What is quite nice about the current format (formats?!) I see in git is that the geometry object is compact yet the properties are expanded; this makes visual inspection by :man: easy, and it also means you can do things like this on Github:

[screenshot: a WOF document rendered on GitHub, properties expanded and geometry compact]

Another bonus is that the geometry field is printed last, since it's very large and difficult to visually inspect anyway.

So I actually like the format as-is and would be :+1: for keeping it the same, albeit resolving the string and number encoding issues of Python 2.

One final thought is that it would be nice if magic_serializer didn't have any actual awareness of the WOF schema itself. It might need to know about GeoJSON in order to sort/display fields accordingly, but it shouldn't ever mutate the input data. This will ensure that it remains forwards/backwards compatible as new fields are added and removed.

missinglink commented 4 years ago

If anyone fancies playing around with the node wof CLI, you can install it with:

npm i -g @whosonfirst/wof
wof --help
wof <cmd> [args]

Commands:
  wof bundle   interact with tar bundles
  wof feature  misc functions for working with feature streams
  wof fs       interact with files in the filesystem
  wof git      interact with git repositories
  wof sqlite   interact with SQLite databases

Options:
  --version      Show version number                                   [boolean]
  --verbose, -v  enable verbose logging               [boolean] [default: false]
  --help         Show help                                             [boolean]

Joxit commented 4 years ago

Hey @missinglink, great minds think alike! I was also working on a wof CLI during my holidays (in Rust): https://github.com/joxit/wof-cli

$ wof --help
wof 0.1.0
Jones Magloire @Joxit
The Who's On First command line aggregator.

USAGE:
    wof <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    completion    Generate autocompletion file for your shell
    export        Export tools for the Who's On First documents
    fetch         Fetch WOF data from github
    help          Prints this message or the help of the given subcommand(s)
    install       Install what you need to use this CLI (needs python2 and go)
    print         Print to stdout WOF document by id
    shapefile     Who's On First documents to ESRI shapefiles
    sqlite        Who's On First documents to SQLite database

thisisaaronland commented 4 years ago

@missinglink I think we are saying the same thing?

The so-called magic_serializer is simply the canonical encoding for byte-level comparison. As such it consumes any other GeoJSON/WOF serialization and produces a secondary JSON document that follows strict formatting rules.

Those rules should be simple enough that they can be implemented in any language. The benefit of Go, for example, is that it can be cross-compiled for people who don't know, or care to know, the boring details of computers. For those who do it's easiest to assume that they have legitimate reasons for using [ some other language ] and shouldn't be forced to use [ some specific language ] just to compare WOF representations.

What I am proposing introduces non-zero CPU/brain cycles when comparing WOF records. Specifically, a program must first read and parse a source record, for example the human-readable "published" document, and then marshal it into a second byte-level representation. The reasons this seems like a reasonable trade-off are:

Thoughts?

thisisaaronland commented 4 years ago

@Joxit @missinglink On the subject of your respective CLI tools: Blog posts about their "theory, practice and gotchas" would be welcomed and encouraged:

missinglink commented 4 years ago

@Joxit that's awesome :smile_cat:

It would be amazing if we all wrote small programs which read and write geojson streams, then we could just pipe them all together to achieve complex workflows :tada:

nvkelso commented 4 years ago

👋 lots of great discussion here :)

The above items Aaron lists are workable for me. Stephen and I rely extensively on the git diff tooling, on the command line and in the web interface, so that's a primary concern for me; even as we develop better edit tooling we will still be relying on diff for review.

I'm willing to relax the 14-decimal precision if it makes everyone's life easier. A few years ago I wanted the option to preserve the original geometries from providers as they provided them, in all their crazy precision... but we end up needing to modify them for topology and other consistency reasons, and it just balloons file size otherwise.

nvkelso commented 4 years ago

(now that my newborn is sleeping)...

I'm fine with pretty Unicode versus the escaped codes – it's often challenging for Stephen and me to manually decode them during diff reviews (which we could build more GUI around)... but if this is for humans then I have a mild preference towards switching them in the src.

We've done several large every-file, every-repo change sets in the last 12 months – my preference is that we settle on something and then submit PRs to update all files in all repos so we're not half old, half new.

And I have a mild preference for following GitHub convention with a trailing newline so it's easier to make quick changes – but then files have to be exportified anyhow, so really whatever works best for our tooling.