whosonfirst / whosonfirst-properties

What things mean in Who's On First documents
Other
5 stars 5 forks source link

Newly added properties are missing from this repo #103

Open stepps00 opened 3 years ago

stepps00 commented 3 years ago

Some newly added properties on admin records are not being catalogued in this repo, like wof:postal_locality (and related properties). Opening this to crawl all admin repos, then add any missing properties to this repo as new .json files.

missinglink commented 3 years ago

The JSON1 SQLite extension (which is included in most builds of sqlite3) can be really helpful with this task.

keystats.sql


.separator "\t"

SELECT key, COUNT(*) AS cnt FROM geojson, json_each(json_extract(geojson.body, '$.properties')) GROUP BY key ORDER BY key;


```bash
sqlite3 whosonfirst-data-admin-latest.db < keystats.sql > keystats.tsv

The output TSV file contains two columns, the key name and the total amount of occurrences.

grep '^wof:' keystats.tsv

wof:belongs 13
wof:belongs_to  327
wof:belongsto   4984780
wof:breaches    4984780
wof:capital 385
wof:capital_of  392
wof:category    2
wof:children    7
wof:concordance 1859
wof:concordances    4969245
wof:concordances:hasc:id    35
wof:concordances_alt    11083
wof:controlled  3157
wof:coterminous 126459
wof:country 4984780
wof:country_alpha3  222
wof:created 366932
wof:fullname    376
wof:geom_alt    286041
wof:geomhash    5388745
wof:hierarchy   4984781
wof:id  5388745
wof:label   16528
wof:lang    126683
wof:lang_x_official 65150
wof:lang_x_spoken   65249
wof:lastmodified    4984780
wof:megacity    1347
wof:name    4984780
wof:parent_id   4984780
wof:placetype   4990860
wof:placetype_alt   1266
wof:placetype_local 93479
wof:populaiton_rank 1
wof:population  542786
wof:population_date 2
wof:population_rank 541539
wof:repo    5388745
wof:scale   1129
wof:shortcode   28994
wof:statistical_gore    2916
wof:subdivision 3389
wof:superseded  2154
wof:superseded_by   4984780
wof:supersedes  4984780
wof:tags    4984780

Here's the full list of keys: https://gist.github.com/missinglink/3d3e61e77d9dd7d088c1f3e4395e6c31

missinglink commented 3 years ago

I noticed a few errors when generating that script BTW, here's a list of keys with less than 10 occurrences. note: I excluded zs:* because its very noisy and name:* which was mostly the odd historic name here-and-there:

xsv cat rows -n -d '\t' keystats.tsv \
  | xsv search -n '^[0-9]$' \
  | grep -v '^zs:' \
  | grep -v '^name:'

A few of them are just typos, such as edtf:deprecate, goem:longitude, lbl:latitiude and wof:populaiton_rank, others may just be legitimately rare.

addr:website,2
bowie:latitude,1
bowie:longitude,1
can-abog:CITY_ID,1
can-abog:GEOCODE,2
can-abog:GEONAME,2
can-abog:HAMLET_ID,1
can-abog:PID,2
canvec-hydro:definit,9
canvec-hydro:definit_en,9
edtf:deprecate,1
figov:aluejako,6
fsgov:Aluejako,6
fsgov:Kunta,5
fsgov:Tunnus,5
fsgov:ajo_pvm,7
fsgov:aluejako,2
fsgov:nimi,5
fsgov:nimi_se,5
geom:src,1
gn:adm0_cc,2
gn:local,2
gn:nam_loc,2
gn:namadm1,2
goem:longitude,1
label:asm_x_preferred_placetype,7
label:bbox,1
label:deu_x_preferred_disambiguation,9
label:eng_x_abbreviation,1
label:eng_x_preferred,4
label:eng_x_variant,1
label:eng_x_variant_abbreviation,1
label:eng_x_variant_placetype,2
label:fre_x_preferred_disambiguation,3
label:rus_x_variant_placetype,2
label:spa_x_preferred_disambiguation,1
label:spa_x_variant_longname,1
lbl:latitiude,1
meso:c_desc,8
meso:c_name,8
meso:county_id,8
meso:objectid,8
misc:ara_x_preferred,1
mz:hours,4
mz:remarks,1
ne:importname,1
ne:map_color,8
ne:name_forma,8
ne:ne_10m_adm,8
ne:oid_,8
ne:terr_,8
porbps:SHARED,1
qs:iso_cc,1
sfomuseum:airport_id,2
sfomuseum:is_sfo,2
sfomuseum:name,2
sfomuseum:placetype,2
sg:address,1
sg:city,1
sg:classifiers,1
sg:owner,1
sg:phone,1
sg:postcode,1
sg:province,1
sg:tags,1
sg:website,1
src:bowie_centroid,1
src:population_year,4
src:reversegeo:centroid,1
src:reversegeo_centroid,1
src:tourist_centroid,1
statoids:areakm,4
statoids:date,4
statoids:tz,8
tourist:latitude,1
tourist:longitude,1
wd:population_date,1
wk:elevation,1
wk:lat,3
wk:latitude,6
wk:long,3
wk:longitude,6
wk:population,1
woe:adm0,2
woe:adm1,2
woe:adm2,2
woe:lau,2
woe:local,2
woe:ver,2
wof:category,2
wof:children,7
wof:populaiton_rank,1
wof:population_date,2
stepps00 commented 3 years ago

Here's the full list of keys: https://gist.github.com/missinglink/3d3e61e77d9dd7d088c1f3e4395e6c31

Awesome, thanks for this!

I also noticed some odd looking properties.. so I think I'll use this issue to track importing the json files to the properties repo, then open up a issue in the whosonfirst-data repo to crawl/correct any odd or incorrect properties.