thisisaaronland opened 4 years ago
Here are the rules, such as they have been formalized in code and/or explicitly written down anywhere. Rules 1-4 have been copied from the comments in py-mapzen-whosonfirst-geojson.

1. Until further notice geometries may have up to, but not exceeding, (14) decimal places. This is probably serious overkill but it's also what Quattroshapes does so we're just going to leave it as is for now.
2. Everything else is truncated to (6) decimal places.
3. Trailing zeros are removed from all coordinates. Mostly this is to account for Python deciding to use scientific notation on a whim, which is super annoying. To that end we are enforcing some standards, which raises the larger question of why we let anyone specify a precision at all. But that is tomorrow's problem... (A sketch of rules 2-3 follows this list.)
4. WOF (GeoJSON) properties are indented but geometries are not. Additionally, geometries are placed at the end of a Feature record. This is an explicit design decision to make opening a WOF record in a vanilla text editor easier, the idea being that geometries aren't going to be edited by hand but properties might be. This decision means we can't use off-the-shelf JSON encoders and generally have to write our own marshaling code, which is unfortunate.
5. WOF (GeoJSON) properties (keys) are sorted and encoded alphabetically.
6. WOF (GeoJSON) documents use 2-space indenting. More specifically: that is what we decided in the beginning and what the py-mapzen-whosonfirst-export encoder defaults to, but we have never been strict about spacing. Maybe we should?
7. WOF (GeoJSON) Features should have a top-level `id` property (that maps to `properties.wof:id`).
8. WOF (GeoJSON) Features should have a top-level `bbox` array.
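As a rough illustration of rules 2-3, here is a minimal Python sketch (not the actual py-mapzen-whosonfirst-geojson code) that truncates a coordinate and trims trailing zeros without falling into scientific notation:

```python
def format_coord(value, places=6):
    """Round to at most `places` decimal places and strip trailing zeros,
    avoiding scientific notation (repr() may emit e.g. 1e-05)."""
    text = '%.*f' % (places, round(value, places))
    return text.rstrip('0').rstrip('.')

assert format_coord(174.7645689) == '174.764569'
assert format_coord(-41.2000000) == '-41.2'
assert format_coord(0.00001) == '0.00001'  # not '1e-05'
```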
To @Joxit's point about "No use of utf-8 hexa codes" in the `go-whosonfirst-export` and `-format` packages:
Thanks for bringing this together!
On point 6: it's worth mentioning that there are a lot of documents out there with mixed space indentation, varying between 2, 4 and 0 spaces, depending on the depth and key!
To try and summarise, I think we're trying to decide on:
Things that I'd consider fixed are:
Have I missed anything?
That looks about right, yeah. Can you talk a bit more about these and how/why they are important:

- line lengths
- (related to above) how arrays break over lines
One thing that hasn't been mentioned so far are "remarks" files which were developed to address the need for comments (remarks, even) that weren't suited for a WOF document specifically:
https://github.com/whosonfirst/whosonfirst-cookbook/blob/master/definition/remarks_files.md
I mention them because I guess I've never imagined a situation where any given property would need to be wrapped at a given length (outside of having an explicit newline character which I think we assume(d) would never happen). Is there a particular use case you're thinking of?
As far as indenting offsets go, is there value/benefit in being strict about the length of those offsets? The principal aim of indenting was to make it easier to read a WOF document in a text editor or a browser window and not anything else.
There is something to be said for being able to compare the bytes of two documents in canonical form but maybe that's a secondary formatting that follows all the default rules but has no indenting? I don't know if that just makes things more confusing or not...
3. I like the idea of being able to open and edit a WOF document in a pre-Unicode text editor but maybe a) those don't exist anymore and b) even if they do it's not really helpful since the benefits of Unicode outweigh everything else and c) who can read and decipher hexa-codes in their head anyway?
a) true, b) true, c) nobody except cyborgs :robot: :laughing:
I think your list is complete @tomtaylor
If a spec is created, the update should be lazy => only for new and updated documents. Too many changes will increase the size of the git index anyway, but it would be cool if it could be bounded (using squash?).
My feelings:
> Can you talk a bit more about these and how/why they are important:
>
> - line lengths
> - (related to above) how arrays break over lines
Different formatters do different things with arrays, sometimes guided by an ideal maximum line length.
For example:
```json
{
  "key": [1, 2, 3]
}
```

vs

```json
{
  "key": [
    1,
    2,
    3
  ]
}
```
Apologies for the radio silence on this.
For indenting, I think the rule(s) should be:
Trailing whitespace SHOULD be avoided as a rule, but as with indentation it is not worth making a big deal over. If this is not the representation used to compare bytes, insignificant whitespace should be left to the discretion of users, I think.
I agree with @Joxit about encodings but can someone confirm that Python 3 does the "right" thing?
I also agree that 14 decimal places is a lot, but there were reasons (not that I remember them... @nvkelso?). They can be shorter (per the rule about trimming zeros from coordinates) but I think the compromise was that they just can't be longer than 14 decimal places.
@tomtaylor: Arrays have always been indented with one item per line, since that's what Python's JSON encoder has done, so I'd prefer to just stick with that.
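For reference, this is what Python's encoder does whenever an `indent` is set – each array element lands on its own line:

```python
>>> import json
>>> print(json.dumps({"key": [1, 2, 3]}, indent=2))
{
  "key": [
    1,
    2,
    3
  ]
}
```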
Thoughts?
I'm ok with your 4 rules! I'm not a python guy, but I did two tests:
```
$ python2
>>> print("அஜாக்ஸியோ")
அஜாக்ஸியோ
>>> print("\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb")
\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb
>>> unicode(u"அஜாக்ஸியோ")
u'\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb'

$ python3
>>> print("\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb")
அஜாக்ஸியோ
>>> print("அஜாக்ஸியோ")
அஜாக்ஸியோ
```
I noticed the export code uses the `unicode()` function on strings, so I added that to my tests.

So, if I'm understanding correctly, the default encoding was ASCII for Python 2, which is why unicode strings had to be declared explicitly (with `u"content"`). The export uses the `unicode()` function to get the hexa codes and then works with ASCII. For Python 3 the default encoding is UTF-8, so it will render hexa codes as UTF-8 characters.

So that's true: hexa codes are artifacts from Python 2's past.
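One caveat worth noting: Python's `json.dumps` defaults to `ensure_ascii=True` on both Python 2 and 3, so for JSON output specifically the escaping has to be switched off explicitly regardless of version:

```python
>>> import json
>>> json.dumps("அஜாக்ஸியோ")                      # default: hexa escapes
'"\\u0b85\\u0b9c\\u0bbe\\u0b95\\u0bcd\\u0bb8\\u0bbf\\u0baf\\u0bcb"'
>>> json.dumps("அஜாக்ஸியோ", ensure_ascii=False)  # raw UTF-8
'"அஜாக்ஸியோ"'
```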
Thoughts?
I think I've got a slightly different perspective to you, having just wrangled a ~2.5 million file repo of UK postcodes.
I wrote a Go tool to sync against the source data and manipulate the files. It rewrites every file, because it's difficult to compare the representation precisely in Go (as it stands).
Because there is no canonical formatting spec, I ended up creating a lot of unnecessary changes in the diffs because my tool formatted things slightly differently. Now it's difficult to see what I introduced/edited/deprecated, because there's a load of whitespace noise.
Without a tight canonical spec for how a WOF file should be formatted, this'll happen again and again, as multiple tools serving different purposes write and rewrite files. It'll produce a git history where it's difficult to work out what has actually changed and why.
That might be fine, and maybe we can live with that. Maybe hand editing is a big enough use case that it should take priority? But I feel like most of the WOF changes I see are being performed by tools and scripts.
Maybe there's a way of solving this in the tools themselves, by preserving some of the formatting as the files pass in/out, but that sounds quite tricky to me.
I think I'd prefer to decide a tight formatting spec, but I'd like to hear other arguments.
So, what you would like is to have a spec, format all the WOF documents from all repositories and then work with this new base?
> So, what you would like is to have a spec, format all the WOF documents from all repositories and then work with this new base?
No need to touch everything at once. But by updating the formatting libraries in each implementation it'll tend towards that over time. I'm not pushing this hard - but I think it's worth thinking about.
Okay, in a lazy way then. That way only the first contribution will be hard to compare.
I feel like your case is covered by @thisisaaronland's rules. He wrote permissive rules in order to limit side effects on diffs. But you're right, this is only useful for hand editing... :thinking:
Heya looks like I'm a bit late to the party on this one :tada:
I was actually thinking over the holiday break about how to simplify reading/writing different WOF collection formats.
Talking about interoperability between different tools/languages, I think this quote is pertinent:
> Interfaces are just contracts or signatures and they don't know anything about implementations.
As well as an excerpt from the 'Unix Philosophy', which is an oldie (1978) but a goodie:
> Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
>
> Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats.
So I wrote a reference implementation of how the 'Unix Philosophy' could work for WOF data. The code is only two days old but it's capable of reading/writing all the major formats including `fs`, `sqlite` and `bundle` via unix pipes:
```bash
# convert git repo to sqlite database
wof git export /data/whosonfirst-data-admin-nz | wof sqlite import nz.db

sqlite3 nz.db '.tables'
ancestors  concordances  geojson  names  spr

# convert a sqlite database to a tar.bz2 bundle
wof sqlite export nz.db | wof bundle import nz.tar.bz2

tar -tf nz.tar.bz2
... many geojson files
data/857/829/01/85782901.geojson
data/857/829/05/85782905.geojson
data/857/842/67/85784267.geojson
data/857/846/57/85784657.geojson
meta/locality.csv
meta/county.csv
meta/localadmin.csv
meta/neighbourhood.csv
meta/dependency.csv
meta/country.csv
meta/region.csv
```
The magic here is in the interface: each process just knows it's either reading or writing a stream of geojson features; it doesn't care how the bytes are encoded/marshalled, only that the contents are valid geojson.
The only time the marshalling is actually relevant is when using byte-for-byte comparison tools like `diff` (and `git diff`). These traditional diff tools are not a good fit for this task, but can still be used when the json marshal algorithm is the same for both data being compared, which I think is what we're discussing here :smile:
The current marshalling format makes working with existing line-based unix tooling difficult:
```bash
wof git export /data/whosonfirst-data-admin-nz --path=data/857/846/57/85784657.geojson \
  | wc -l
105
```
But it's simple enough to reformat the stream so that each feature is minified and printed one-per-line by piping the stream to a formatter:
```bash
wof git export /data/whosonfirst-data-admin-nz --path=data/857/846/57/85784657.geojson \
  | wof feature format \
  | wc -l
1
```
The same idea can be applied to a 'canonical marshalling'. As per the 'Unix Philosophy', we could simply have one program which accepts geojson via `stdin` and outputs the canonical format to `stdout`.
The nice thing about this is that tooling written in other languages doesn't need to worry about what the git format is exactly, they can simply pipe their own output to this one program to be guaranteed bug-for-bug compatibility.
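For example, a tool written in Python could shell out to such a formatter rather than re-implementing the rules. A sketch, assuming the `wof feature format` command shown above plays the role of the canonical formatter:

```python
import json
import subprocess
import sys

# hypothetical: delegate canonical formatting to an external program
feature = {"type": "Feature", "properties": {}, "geometry": None}
result = subprocess.run(
    ["wof", "feature", "format"],
    input=json.dumps(feature).encode("utf-8"),
    capture_output=True,
    check=True,
)
sys.stdout.buffer.write(result.stdout)
```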
I could be wrong about this but I suspect that if we all try to implement a common marshalling format in various languages it's going to be time-consuming, error-prone and will lose out on the performance benefits of using the native serialisation provided by each environment.
I haven't looked into it deeply but I'm assuming the string & number encoding issues will no longer be an issue so long as each implementation agrees to be lossless, by avoiding truncating float precision or otherwise mutating the underlying data?
I think there are two separate, but equally valid, use cases here:

1. The need to do some sort of byte-level comparison to see if a document has changed. Am I correct in thinking this is one of the issues you're trying to tackle @tomtaylor?
2. The need to encode a document in a consistent manner that allows it to be easily read or written without the need for specialized tools. This has always been a central goal of WOF, the idea being that the failure scenario is that a WOF document can and should be editable in any old plain-text editor. Not the ideal scenario per se, but still workable.
That's why I suggested earlier that perhaps we settle on two different marshaling formats, one for publishing and one for comparisons. Rules for the latter might be as simple as:

- Top-level keys (`properties`, `geometry`, etc.) are also sorted alphabetically.
- `id` and `bbox` properties are required.

The former would be the list as we've discussed so far, with recommendations about indenting and white space, but deviations from those suggestions would not trigger errors.
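A comparison encoding along those lines could be as small as this Python sketch – an illustration of the idea, not an agreed spec; it assumes sorted keys and no insignificant whitespace:

```python
import json

def comparison_bytes(feature):
    """Encode a parsed WOF feature for byte-level comparison:
    sorted keys, no indenting, raw UTF-8."""
    return json.dumps(
        feature,
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    ).encode("utf-8")
```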
Would that work for you @tomtaylor ?
@missinglink There is on-going work on something along the lines of what you're suggesting, but most of the data "source" abstraction happens at the reader (and writer) level:

In a WOF context the idea is that bytes (specifically `io.ReadCloser` instances) are formatted:
And then exported:
To generic "writer" interfaces:
(This one is lacking documentation but it is just like `go-reader` but for... writing things.)
All of which can then be encapsulated in WOF specific code and rules:
Related is the standard WOF GeoJSON package (which is probably due for an overhaul as plain old `whosonfirst/go-geojson` to remove `v1` stuff and so I can stop typing `v2` everywhere):
And the validation package:
There are some outstanding inconsistencies in many of these interfaces but that's what I am trying to figure out.
"That's why I suggested earlier that perhaps we settle on two different marshaling formats, one for publishing and one for comparisons"
I was actually hoping for 0 marshalling formats but it seems 1 will be required 😝 Surely we can use the same format for both publishing & comparisons?
Insofar as I see it, there are three distinct problems:

1. Visual display for :man:.
2. Byte-level comparison for :robot:.
3. Two semantically identical documents (i.e. the same `features` and `geometry`) should hash the same.

Regarding the last 2 points, I think the only viable option is a program (preferably a single binary) which encodes/marshals geojson in a predictable way.
For :robot: it doesn't really matter which format is chosen as long as it is deterministic.
...and I think getting different languages to marshall in a deterministic way is going to be difficult or near impossible!
```bash
# some examples:

# a python2 serializer using 'indent=2, sort_keys=True'
function wof_write_python(){ python2 -c 'import sys, json; print json.dumps(json.load(sys.stdin), separators=(",", ":"), indent=2, sort_keys=True);'; }

# a nodejs serializer using 'indent=2' (sorted keys requires user code)
function wof_write_javascript(){ node -e 'fs=require("fs"); console.log(JSON.stringify(JSON.parse(fs.readFileSync("/dev/stdin")), null, 2))'; }

# a jq serializer using '--indent 2' and '-S' to sort keys
function wof_write_jq(){ jq -rMS --indent 2 .; }
```
I exported a random record from git and serialized it using these various methods and none of them produced the same result:
```bash
cat 85784657.git | wof_write_python > 85784657.python
cat 85784657.git | wof_write_javascript > 85784657.javascript
cat 85784657.git | wof_write_jq > 85784657.jq

-rw-r--r--  1 peter  staff  2636 Jan  8 10:21 85784657.git
-rw-r--r--  1 peter  staff  2685 Jan  8 10:55 85784657.javascript
-rw-r--r--  1 peter  staff  2685 Jan  8 11:03 85784657.jq
-rw-r--r--  1 peter  staff  2612 Jan  8 10:55 85784657.python

md5sum 85784657.*
b83e91ab7f05c326e8f2a1cea5f73df8  85784657.git
f74e3caee5c3f52e0caedacc41c74967  85784657.javascript
3ddf393285f856078c1bf61697c2bd0e  85784657.jq
a459cb675e3991806009d6cbf4183a8f  85784657.python
```
This leads me to believe that we should just have a single program which is responsible for this task.
The program will need to be deterministic and ideally for portability reasons it would be a compiled binary available for multiple architectures (here's looking at you Go 😉).
```bash
cat 85784657.git | magic_serializer | md5sum
3ddf393285f856078c1bf61697c2bd0e  -

cat 85784657.python | magic_serializer | md5sum
3ddf393285f856078c1bf61697c2bd0e  -
```
So, what format does `magic_serializer` produce? This brings me back to point 1, visual display for :man:.
What is quite nice about the current format (formats?!) I see in git is that the `geometry` object is compact yet the `properties` are expanded. This makes visual inspection by :man: easy, and it also means you can do things like this on Github:

Another bonus is that the `geometry` field is printed last, since it's very large and difficult to visually inspect anyway.
So I actually like the format as-is and would be :+1: for keeping that the same, albeit resolving the string and number encoding issues of python2.
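For illustration, a rough Python sketch of that layout – pretty-printed, alphabetically sorted properties with a compact `geometry` spliced in last. This is not the actual exporter code, and it assumes `type` sorts as the final top-level key, as it does for WOF documents:

```python
import json

def marshal_wof(feature):
    """Pretty-print everything except `geometry`, then append the
    geometry, minified, as the last member of the document."""
    head = {k: v for k, v in feature.items() if k != "geometry"}
    body = json.dumps(head, indent=2, sort_keys=True, ensure_ascii=False)
    geom = json.dumps(feature["geometry"], sort_keys=True,
                      ensure_ascii=False, separators=(",", ":"))
    lines = body.splitlines()
    lines[-2] += ","                             # last pretty-printed member
    lines.insert(-1, '  "geometry": %s' % geom)  # compact geometry, placed last
    return "\n".join(lines)
```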
One final thought is that it would be nice if `magic_serializer` didn't have any actual awareness of the WOF schema itself.
It might need to know about GeoJSON in order to sort/display fields accordingly, but it shouldn't ever mutate the input data.
This will ensure that it remains forwards/backwards compatible as new fields are added and removed.
If anyone fancies playing around with node wof CLI you can install it with:
```bash
npm i -g @whosonfirst/wof

wof --help
wof <cmd> [args]

Commands:
  wof bundle   interact with tar bundles
  wof feature  misc functions for working with feature streams
  wof fs       interact with files in the filesystem
  wof git      interact with git repositories
  wof sqlite   interact with SQLite databases

Options:
  --version      Show version number                                 [boolean]
  --verbose, -v  enable verbose logging              [boolean] [default: false]
  --help         Show help                                           [boolean]
```
Hey @missinglink, great minds think alike, I was also working on a wof cli during my holidays (in rust): https://github.com/joxit/wof-cli
```bash
$ wof --help
wof 0.1.0
Jones Magloire @Joxit
The Who's On First command line aggregator.

USAGE:
    wof <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    completion    Generate autocompletion file for your shell
    export        Export tools for the Who's On First documents
    fetch         Fetch WOF data from github
    help          Prints this message or the help of the given subcommand(s)
    install       Install what you need to use this CLI (needs python2 and go)
    print         Print to stdout WOF document by id
    shapefile     Who's On First documents to ESRI shapefiles
    sqlite        Who's On First documents to SQLite database
```
@missinglink I think we are saying the same thing?
The so-called `magic_serializer` is simply the canonical encoding for byte-level comparison. As such it consumes any other GeoJSON/WOF serialization and produces a secondary JSON document that follows strict formatting rules.
Those rules should be simple enough that they can be implemented in any language. The benefit of Go, for example, is that it can be cross-compiled for people who don't know, or care to know, the boring details of computers. For those who do it's easiest to assume that they have legitimate reasons for using [ some other language ] and shouldn't be forced to use [ some specific language ] just to compare WOF representations.
What I am proposing introduces non-zero CPU/brain cycles when comparing WOF records. Specifically, a program must first read and parse a source record, for example the human-readable "published" document, and then marshal it into a second byte-level representation. The reason this seems like a reasonable trade-off is that the published documents stay easy to read and edit by hand, while the secondary representation gives tools a deterministic target for hashing and `diff`-ed WOF records.

Thoughts?
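As a sketch of that parse-then-remarshal flow (hypothetical helper name; it assumes the comparison rules discussed above – sorted keys, no insignificant whitespace):

```python
import hashlib
import json

def wof_digest(doc):
    """Hash a parsed WOF document via a canonical encoding so that
    formatting differences don't register as changes."""
    canonical = json.dumps(doc, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# two differently formatted serializations of the same feature compare equal
a = json.loads('{"id": 1, "type": "Feature"}')
b = json.loads('{\n  "type": "Feature",\n  "id": 1\n}')
assert wof_digest(a) == wof_digest(b)
```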
@Joxit @missinglink On the subject of your respective CLI tools: Blog posts about their "theory, practice and gotchas" would be welcomed and encouraged:
@Joxit that's awesome :smile_cat:
It would be amazing if we all wrote small programs which read and write geojson streams, then we could just pipe them all together to achieve complex workflows :tada:
👋 lots of great discussion here :)
The above items Aaron lists are workable for me. Stephen and I rely extensively on the git diff tooling, on the command line and in the web interface, so that's a primary concern for me; even as we develop better edit tooling we will still be relying on diff for review.
I'm willing to relax the 14 decimal precision if it makes everyone's life easier. A few years ago I wanted the option to preserve the original geometries from providers as they provided them, in all their crazy precision... but we end up needing to modify them for topology reasons and other consistency reasons, and it just balloons file size otherwise.
(now that my newborn is sleeping)...
I'm fine with pretty Unicode versus the escaped codes – often it's challenging for Stephen and me to manually decode them during the diff reviews, which we could build more GUI around... but if this is for humans then I have a mild preference towards switching them in the src.
We've done several large every-file, every-repo change sets in the last 12 months – my preference is we settle on something and then submit PRs to update all files in all repos so we're not half old, half new.
And I have a mild preference for following convention on Github with a trailing newline so it's easier to make quick changes – but then they have to be exportified anyhow, so really whatever works best for our tooling.
Some background:
@Joxit noted that:
To which I replied:
And then @tomtaylor said:
cc @nvkelso @stepps00
References:
Which are used respectively by:
For the purposes of this issue I think we are only discussing the first two packages. As in: How do blobs of GeoJSON get marshaled and not what the content of those blobs is.