whosonfirst-data / whosonfirst-data-postalcode

Postal codes for Who's On First

Import GeoPlanet postal codes #2

Closed thisisaaronland closed 8 years ago

thisisaaronland commented 8 years ago

De-dupe against existing Geonames import.

thisisaaronland commented 8 years ago

Or not, because Geonames doesn't even have multi-part postal codes for Canadia, only the three-character prefixes:

less allCountries.txt | grep -e '^CA' | grep 'Montreal'
CA  H1B Montreal East   Quebec  QC                  45.632  -73.5075    4
CA  H1G Montreal North North    Quebec  QC                  45.6109 -73.6211    1
CA  H1H Montreal North South    Quebec  QC                  45.5899 -73.6389    1
CA  H2Y Old Montreal    Quebec  QC                  45.5057 -73.555 
CA  H2Z Downtown Montreal Northeast Quebec  QC                  45.5052 -73.5622    
CA  H3A Downtown Montreal North Quebec  QC                  45.504  -73.5747    1
CA  H3B Downtown Montreal East  Quebec  QC                  45.5005 -73.5684    1
CA  H3G Downtown Montreal Southeast Quebec  QC                  45.4987 -73.5793    1
CA  H3H Downtown Montreal South & West  Quebec  QC                  45.5009 -73.5877    1
CA  H4X Montreal West   Quebec  QC

thisisaaronland commented 8 years ago

Basically, start with GeoPlanet and then append Geonames coordinate data where we know it's not insane (probably the US)
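
The merge described above can be sketched roughly as follows. This is a minimal sketch, not the importer that was actually run: the function name, the record keys (`country`, `postalcode`, `latitude`, `longitude`) and the default trusted set of just `US` are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Iterator, Tuple

Coord = Tuple[float, float]


def append_coordinates(
    geoplanet_rows: Iterable[Dict[str, str]],
    geonames_coords: Dict[Tuple[str, str], Coord],
    trusted_countries: FrozenSet[str] = frozenset({"US"}),
) -> Iterator[dict]:
    """Yield GeoPlanet records, adding Geonames lat/lon only for trusted countries.

    All key names here are hypothetical; adjust them to the real column layout.
    """
    for row in geoplanet_rows:
        record = dict(row)  # copy so the caller's data is left untouched
        key = (record["country"], record["postalcode"])
        if record["country"] in trusted_countries and key in geonames_coords:
            record["latitude"], record["longitude"] = geonames_coords[key]
        yield record
```

Records from countries outside the trusted set pass through unchanged, so GeoPlanet stays the source of truth for which postal codes exist.
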

thisisaaronland commented 8 years ago

Total number of unique postal codes:

cat allCountries.txt | awk '{ print $2 }' | sort | uniq | wc -l
482795

Which is a bit of a misnomer since postal codes are not unique between countries.
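
Counting (country, postal code) pairs instead would avoid that. A minimal sketch, assuming the Geonames postal code layout where the first two tab-separated columns are the country code and the postal code; the function name is hypothetical. Splitting on tabs rather than on any whitespace also keeps postal codes that contain a space in one piece.

```python
import csv
from typing import Iterable, Set, Tuple


def count_unique_postal_codes(lines: Iterable[str]) -> int:
    """Count distinct (country code, postal code) pairs in tab-separated lines."""
    # QUOTE_NONE: treat quote characters in place names as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    seen: Set[Tuple[str, str]] = set()
    for fields in reader:
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seen.add((fields[0], fields[1]))
    return len(seen)
```

Usage would look something like `count_unique_postal_codes(open("allCountries.txt", encoding="utf-8"))`.
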

grep Zip geoplanet_places_7.10.0.tsv | awk '{ print $3 }' | sort | uniq | wc -l
505502
grep Zip geoplanet_places_7.10.0.tsv | awk '{ print $3 }' | wc -l
3457144

thisisaaronland commented 8 years ago

GeoPlanet has explicit parent IDs so the first step should be to see what the counts are for unique parent IDs and WOF concordances
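
That count could be sketched along these lines. This is an illustration only and not a script from this thread: the function name and both inputs (an iterable of GeoPlanet parent IDs and a set of parent IDs that already have a WOF concordance) are assumptions.

```python
from typing import Iterable, Set, Tuple


def tally_parents(parent_ids: Iterable[str], concordances: Set[str]) -> Tuple[int, int]:
    """Return (found, missing) counts over the unique parent IDs.

    `concordances` is assumed to hold the GeoPlanet IDs that WOF already knows about.
    """
    found = 0
    missing = 0
    for parent_id in set(parent_ids):  # de-duplicate before counting
        if parent_id in concordances:
            found += 1
        else:
            missing += 1
    return found, missing
```
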

thisisaaronland commented 8 years ago

Hrmph. As in concordances between WOF and the (WOE/GP) parent ID for a postal code...

python ./parents.py ./zip.tsv
found 22028 missing 182021

thisisaaronland commented 8 years ago

Non-optimized imports are averaging about 1M records per 24 hours, so another day or so, unless something blows its brains out...

thisisaaronland commented 8 years ago

find ./data -name '*.geojson' -print | wc -l
2072271

thisisaaronland commented 8 years ago

find ./data -name '*.geojson' -print | wc -l
3176709

thisisaaronland commented 8 years ago

Complete. Waiting for issue #3 to complete.

Once that's done will migrate all the data per issue #6