whosonfirst-data / whosonfirst-data

Who's On First is a gazetteer of places.
http://www.whosonfirst.org/
Other
415 stars 9 forks source link

Small cities named São Paulo seem to have inherited the population of the big one #1924

Closed orangejulius closed 3 years ago

orangejulius commented 3 years ago

Hi all, Here's the WOF record for the Brazilian megacity São Paulo. As expected it has a large population: 11967825.

Here's the WOF record for another Brazilian city called São Paulo, located far away elsewhere in Brazil. While it's clear from satellite imagery and other data sources that this place is a small, relatively unpopulated place, it nonetheless has the same population: 11967825

In fact, many other records in Brazil have the exact same population. Here's a full list

https://spelunker.whosonfirst.org/id/101982243/ https://spelunker.whosonfirst.org/id/101976113/ https://spelunker.whosonfirst.org/id/101965533/ (this one is correct) https://spelunker.whosonfirst.org/id/101975313/ https://spelunker.whosonfirst.org/id/101983201/ https://spelunker.whosonfirst.org/id/101969707/ https://spelunker.whosonfirst.org/id/101976989/

The population_rank value is also affected, but it doesn't look like any others have, for example, the megacity property set.

Let me know how you'd like to handle this. I'd be happy to go through the non-megacity records and remove their population in pull requests, for example.

stepps00 commented 3 years ago

Nice catch, thanks for the issue @orangejulius.

I've opened a PR with fixes here: https://github.com/whosonfirst-data/whosonfirst-data-admin-br/pull/16

Let me know how you'd like to handle this. I'd be happy to go through the non-megacity records and remove their population in pull requests, for example.

It's okay for non-megacity records to have population data, but I think a good indicator of where we might have an issue in WOF would be:

^ I'm sure there are more sanity checks we could run against WOF data, but records that fall under these categories would be worth double-checking.

I'll likely close this issue once I merge the PR, but let me know your thoughts on the above and we can track in a new issue(s).

orangejulius commented 3 years ago

Sounds good. I'll see if there are other megacities at least that share a population value with other cities.

missinglink commented 3 years ago

Here's a script to help try detect records with this issue:

.headers on
.separator "\t"

 SELECT
  json_extract(body, '$.properties."wof:repo"') AS repo,
  json_extract(body, '$.properties."wof:placetype"') AS placetype,
  json_extract(body, '$.properties."wof:name"') AS name,
  json_extract(body, '$.properties."wof:population"') AS pop,
  COUNT(*) AS count,
  group_concat(id) AS ids
FROM geojson
WHERE is_alt=0
AND LENGTH(pop) > 0
AND LENGTH(TRIM(name)) > 0
AND pop > 0
AND json_extract(body, '$.properties."mz:is_current"') NOT IN (0, '0')
GROUP BY repo, placetype, LOWER(TRIM(name)), pop
HAVING count > 1
ORDER BY count DESC;
sqlite3 whosonfirst-data-admin-latest.db < pop.sql > pop.tsv

The full list includes 7912 potential dupes: https://gist.github.com/missinglink/fddf8840ee557ba83e60a24f7b8de4b9

I also noticed that there's a lot of records with a 0 population, or a population <100 which seems like maybe an error?

orangejulius commented 3 years ago

Funny, I also spent some time writing SQL to find duplicates like this.

Here's a list of 200 places that mostly look to be either true duplicates of megacities, or cases like the Sao Paulos sharing population, but being distinct places. A couple, like Cincinnati, are legitimate because there's both a locality and a localadmin

city_id,city_name,city_population,megacity_id,megacity_name
890506443,Delhi,16314838,102030425,Delhi
890508309,Kolkata,14112536,102030585,Kolkata
890509223,Mumbai,12442373,102030609,Mumbai
101969707,"São Paulo",11967825,101965533,"São Paulo"
101975313,"São Paulo",11967825,101965533,"São Paulo"
101976113,"São Paulo",11967825,101965533,"São Paulo"
101976989,"São Paulo",11967825,101965533,"São Paulo"
101982243,"São Paulo",11967825,101965533,"São Paulo"
101983201,"São Paulo",11967825,101965533,"São Paulo"
1126091017,Pechs,11624219,421191101,Karachi
85672001,"Jakarta Raya",9607787,102020439,Jakarta
85669985,Bandundu,8000000,102027633,Dongguan
1125371489,"成都市",7415590,102027703,Chengdu
1125290269,"南充市",7150000,102027293,Nanchong
85676889,Khartoum,6527500,890441687,Khartoum
1125371469,"杭州市",6241971,102027571,Hangzhou
1125371549,"苏州市",5345961,102027777,Suzhou
1126010063,"St. Petersburg",5028000,102008123,"Saint Petersburg"
85688069,"Saint Petersburg",4879566,1158857531,"St. Petersburg"
890505403,Chennai,4646732,102029537,Chennai
85679675,Dar-es-Salaam,4364541,421189813,"Dar es Salaam"
1125405091,"济南市",4335989,102027459,Jinan
85632179,Panama,3864170,890445081,"Panamá"
1125289475,"佛山市",3600000,102027615,Foshan
1159339499,Somaliland,3500000,421186705,Guayaquil
1326900545,Tianshui,3500000,421186705,Guayaquil
1125357457,"宁波市",3491597,102027273,Ningbo
1125329017,"太原市",3426519,102027117,Taiyuan
1125277863,"淄博市",3129228,102026921,Zibo
890510133,Pune,3115431,102028965,Pune
1125326677,"乌鲁木齐市",3029372,102026823,Urunchi
890445275,"Fitzroy's Elecctronics",3000000,890442147,Caracas
101855843,"Каштани",2845023,101752489,"Kiev City"
1125411433,"石家庄市",2834942,102027161,Shijiazhuang
85679547,"Kaohsiung City",2777384,102026701,Kaohsiung
1108808547,Kaohsiung,2777384,102026701,Kaohsiung
85765525,Salvador,2711840,101950447,Salvador
404496273,Chicago,2695598,85940195,Chicago
85679583,"Taipei City",2655515,102026641,Taipei
1125371479,"兰州市",2628426,102027397,Lanzhou
421202467,"البصرة",2600000,101735835,Toronto
890502209,"Navi Mumbai",2600000,101735835,Toronto
101735835,Toronto,2600000,421202467,"البصرة"
890502209,"Navi Mumbai",2600000,421202467,"البصرة"
101946525,"Belo Horizonte",2479175,101948979,"Belo Horizonte"
101959007,Fortaleza,2452185,101944733,Fortaleza
101960371,Fortaleza,2452185,101944733,Fortaleza
101978429,Fortaleza,2452185,101944733,Fortaleza
101979209,Fortaleza,2452185,101944733,Fortaleza
101979457,Fortaleza,2452185,101944733,Fortaleza
101982761,Fortaleza,2452185,101944733,Fortaleza
101983233,Fortaleza,2452185,101944733,Fortaleza
1125309619,"南昌市",2357839,102027295,Nanchang
85953133,Houston,2195914,101725629,Houston
890456811,Perth,2021200,101937679,Perth
890494131,Samarang,1621384,102019121,Semarang
890495997,Samarang,1621384,102019121,Semarang
102081353,Philadelphia,1526006,101718083,Philadelphia
404483701,Philadelphia,1526006,101718083,Philadelphia
85883483,"Paso Del Norte",1512354,101994153,"Juárez"
1729238453,Auckland,1454300,101914257,Auckland
85784291,"Kuala Lumpur",1453975,102023407,"Kuala Lumpur"
102063167,"San Antonio",1436697,101724653,"San Antonio"
1125888989,Kharkiv,1430885,101752963,"Харків"
102012947,Novosibirsk,1419007,102012919,Novosibirsk
102012919,Novosibirsk,1419007,102012947,Novosibirsk
101978599,Manaus,1405835,101941913,Manaus
1125822397,"抚顺",1400646,102027929,Fushun
101971589,"Porto Alegre",1360590,101960525,"Porto Alegre"
101975397,"Alto Alegre I",1360590,101960525,"Porto Alegre"
101979817,"Porto Alegre",1360590,101960525,"Porto Alegre"
101982687,"Porto Alegre",1360590,101960525,"Porto Alegre"
1175610857,"Porto Alegre",1360590,101960525,"Porto Alegre"
102011897,Sverdlovsk,1349772,102011881,""
1125880545,Yekaterinburg,1349772,102011881,""
1125782447,"Nizhniy Novgorod",1284164,102002899,Gorkiy
1175613483,Gorkiy,1284164,102002899,Gorkiy
102002899,Gorkiy,1284164,1175613483,Gorkiy
1125782447,"Nizhniy Novgorod",1284164,1175613483,Gorkiy
1377787805,Munich,1260391,101748479,"München"
890437895,Omdurman,1200000,102027827,Yueyang
1125342789,"岳阳市",1200000,102027827,Yueyang
1125377893,"铜山县",1199193,102026983,Tongshan
85981123,Dallas,1197816,101724385,Dallas
421171509,Rosario,1193605,421172799,Rosario
102011631,Samara,1134730,102011627,""
102012389,Omsk,1129281,102012351,Omsk
102012393,Omsk,1129281,102012351,Omsk
102012351,Omsk,1129281,102012393,Omsk
102012389,Omsk,1129281,102012393,Omsk
85671301,Tbilisi,1118035,890443227,Tbilisi
1125783867,"大 同 市",1052678,102027935,Datong
1125377897,"咸阳市",1034081,102027031,Xianyang
890506759,Jodhpur,1033918,102029947,Jodhpur
890507095,Jodhpur,1033918,102029947,Jodhpur
1125847831,Oufa,1033338,102009691,Ufa
102005387,Voronezh,1032895,102005377,Voronezh
102005377,Voronezh,1032895,102005387,Voronezh
421180763,Managua,1028808,890451731,Managua
85675291,Niamey,1026848,421197251,Niamey
1511799789,Calgary,1019942,890458845,Calgary
101913231,Brussel,1019022,1175610569,Brussel
101913235,Brussel,1019022,1175610569,Brussel
101913243,Brussel,1019022,1175610569,Brussel
101913245,Brussel,1019022,1175610569,Brussel
101913247,Brussel,1019022,1175610569,Brussel
1125371215,Brussel,1019022,1175610569,Brussel
101925987,Johannesburg,1009035,101925753,Johannesburg
101941607,Campinas,969396,101956069,Campinas
101953469,Campinas,969396,101956069,Campinas
101959217,Campinas,969396,101956069,Campinas
101960363,Campinas,969396,101956069,Campinas
101977453,Campinas,969396,101956069,Campinas
101990819,Campinas,969396,101956069,Campinas
890452049,Freetown,951000,421176817,Freetown
102013953,Krasnoyarsk,927200,102013949,Krasnoyarsk
102013949,Krasnoyarsk,927200,102013953,Krasnoyarsk
890508369,Tiruchirappalli,916857,102028441,Tiruchirappalli
421168899,Bishkek,900000,102027425,Kaifeng
421199797,Qom,900000,102027425,Kaifeng
890434609,Abobo,900000,102027425,Kaifeng
1125924245,"Abū Ghurayb",900000,102027425,Kaifeng
1126008595,"开封",900000,102027425,Kaifeng
102027425,Kaifeng,900000,421199797,Qom
421168899,Bishkek,900000,421199797,Qom
890434609,Abobo,900000,421199797,Qom
1125924245,"Abū Ghurayb",900000,421199797,Qom
1126008595,"开封",900000,421199797,Qom
102027425,Kaifeng,900000,421168899,Bishkek
421199797,Qom,900000,421168899,Bishkek
890434609,Abobo,900000,421168899,Bishkek
1125924245,"Abū Ghurayb",900000,421168899,Bishkek
1126008595,"开封",900000,421168899,Bishkek
890510867,Mysore,887446,102029303,Mysore
85862773,"Italia '61",870456,101752697,Turin
102011129,"",863725,102011123,Saratov
85945009,Jacksonville,821784,85932547,Jacksonville
85675741,Islamabad,805235,85922583,"San Francisco"
102087579,"San Francisco",805235,85922583,"San Francisco"
101944861,"Maceió",797759,101948377,"Maceió"
101976425,"Maceió",797759,101948377,"Maceió"
1209499251,Anyang,781129,102027815,Anyang
1125405095,"呼和浩特市",774477,102027913,Hohhot
1125294731,"西宁市",767531,102027001,Xining
1125783617,"秦皇岛",759718,102027225,Qinhuangdao
1125880223,"Campo Grande",729151,101961475,"Campo Grande"
1125945699,"烟台",719332,102026967,Yantai
1125783403,Pekanbaru,703956,102019475,Pekanbaru
1125844261,Liaoyangxian,687890,102027891,Liaoyang
85939649,Detroit,680250,85951091,Detroit
1125983249,"Culiacan Rosales",675000,101993731,"Culiacán Rosales"
1125415649,"牡丹江市",665915,102027885,Mudanjiang
1125338477,Athens,664046,1175612731,Athens
1125783885,"丹东市",631973,102027939,Dandong
1125375845,"盐城市",628441,102026977,Yancheng
1125842555,Bhilauni,625138,102030693,Bhilai
1126085069,"Nueva España",621250,102000871,Saltillo
102019025,Situbondo,600000,101741075,Vancouver
890516827,Shuangyashan,600000,101741075,Vancouver
404492091,Milwaukee,594833,101732987,Milwaukee
85835959,Multnomah,583776,101715829,Portland
102030477,Cuttack,580000,1495123997,Oslo
890437999,Rabat,577827,890444507,Rabat
1125374969,"常德市",517780,102027057,Changde
1125297277,"珠海市",501199,102027037,Zhuhai
101926403,Vereeniging,474681,101926389,Vereeniging
1360698971,Liverpool,466415,101750547,Liverpool
1125289307,"遵义市",466292,102026873,Zunyi
1125375871,"襄樊市",462956,102027039,Xiangyang
102081929,"Virginia Beach",437994,101728745,"Virginia Beach"
1125289371,"绍兴县",421283,102027169,Shaoxing
1125822383,"鸡西",403759,102027895,Jixi
85953441,Cleveland,396815,101712563,Cleveland
404523697,Cleveland,396815,101712563,Cleveland
404511883,Minneapolis,382578,85969169,Minneapolis
1125811423,Thessaloniki,354290,101752331,"ΘΕΣΣΑΛΟΝΙΚΗ"
101715023,"St. Louis",318416,85971343,"St. Louis"
404482523,Pittsburgh,305704,101718805,Pittsburgh
404523793,Cincinnati,296943,101712203,Cincinnati
421199285,"San Jose",288054,421171925,"San Jose"
1125289389,"绵阳市",264136,102027309,Mianyang
404522329,Buffalo,261310,85978023,Buffalo
1125289501,"通辽市",261110,102027867,Tongliao
1125827413,Porto,249633,101752093,Porto
404521467,Rochester,210565,85977591,Rochester
102085959,Richmond,204214,101728675,Richmond
1125309113,"宜春市",152169,102026959,Yichun
404495939,Bridgeport,144229,85930811,Bridgeport
404524075,Dayton,141527,101712161,Dayton
85829035,"Laguna West",115041,421200319,Mendoza
1108718353,Capital,115041,421200319,Mendoza
1125414509,"新余市",97480,102026991,Xinyu
1125309605,"富阳市",70183,102027613,Fuyang
404488363,"East Providence",47037,101720571,"East Providence"
890426825,"Al Ḩudaydah",40260,890432439,"Al Ḩudaydah"
890427113,"Al Ḩudaydah",40260,890432439,"Al Ḩudaydah"
101828199,Morlaix,17516,102029167,"North Guwāhāti"
101828523,"Château-Thierry",15938,101955041,"Valparaíso"
890440685,"Lamharza Essahel",15938,101955041,"Valparaíso"

Mine looks at only places that share an exact population with a megacity.

Because it's a massive self join on a large table, and doing lots of json_extract, I found it was only practical to run by first extracting out a bunch of data into an intermediate sqlite db, so here's a script that does it all

#configure file here
wof_db=$1

if [[ "$wof_db" == "" ]]; then
  wof_db="whosonfirst-data-admin-latest.db"
fi

# extract records with population that aren't alt geoms or superseded
sqlite3 -header -csv $wof_db << EOF > population_data.csv
SELECT
  city.id as id,
  json_extract(city.body, '$.properties."wof:name"') as name,
  json_extract(city.body, '$.properties."wof:population"') as population,
  json_extract(city.body, '$.properties."wof:megacity"') as megacity
FROM
  geojson city
WHERE
  population != '' AND
  json_extract(city.body, '$.properties."wof:superseded_by"') = '[]' AND
  city.is_alt = 0
;
EOF

# create new sqlite DB with records and just the data we need
sqlite3 wof_population_data.db << EOF
drop table if exists cities;
create table cities( id BIGINT not null, name text, population integer not null, megacity integer);
.mode csv
.import population_data.csv cities
EOF

# query new sqlite DB
sqlite3 -csv -header wof_population_data.db << EOF
SELECT
  city.id as city_id,
  city.name as city_name,
  city.population as city_population,
  megacity.id as megacity_id,
  megacity.name as megacity_name
FROM
  cities city, cities megacity
WHERE
  city_population > 10000 AND
  megacity.population = city_population AND
  megacity_id != city_id AND
  megacity.megacity = 1
ORDER BY city_population desc
LIMIT 10000;
EOF
stepps00 commented 3 years ago

Nice, thanks for these lists to review..

Here's a list of 200 places that mostly look to be either true duplicates of megacities, or cases like the Sao Paulos sharing population, but being distinct places. A couple, like Cincinnati, are legitimate because there's both a locality and a localadmin

This list is interesting and seems manageable to hand review. I'm seeing the same issue we had with Sao Paolo in some of these cases, but also valid cases (like San Jos, Costa Rica), where a coterminous locality / county have the same population. I'll try to review this soon and push changes to PRs to fix.

The full list includes 7912 potential dupes: https://gist.github.com/missinglink/fddf8840ee557ba83e60a24f7b8de4b9 I also noticed that there's a lot of records with a 0 population, or a population <100 which seems like maybe an error?

This list may be a bit difficult to review, but I'll take a shot and finding a clear pattern that can easily be fixed.. maybe a good start is to review some of those smaller placetypes.

orangejulius commented 3 years ago

Okay, here's an updated script that compares parent_id values to make sure that nothing in the list is the direct parent of another.

That brings the list down to 166 and all the duplicates I checked look pretty legitimate now.

city_id,city_name,city_population,megacity_id,megacity_name
890506443,Delhi,16314838,102030425,Delhi
890509223,Mumbai,12442373,102030609,Mumbai
101969707,"São Paulo",11967825,101965533,"São Paulo"
101975313,"São Paulo",11967825,101965533,"São Paulo"
101976113,"São Paulo",11967825,101965533,"São Paulo"
101976989,"São Paulo",11967825,101965533,"São Paulo"
101982243,"São Paulo",11967825,101965533,"São Paulo"
101983201,"São Paulo",11967825,101965533,"São Paulo"
85672001,"Jakarta Raya",9607787,102020439,Jakarta
85669985,Bandundu,8000000,102027633,Dongguan
1125371489,"成都市",7415590,102027703,Chengdu
1125290269,"南充市",7150000,102027293,Nanchong
85676889,Khartoum,6527500,890441687,Khartoum
1125371469,"杭州市",6241971,102027571,Hangzhou
1125371549,"苏州市",5345961,102027777,Suzhou
1126010063,"St. Petersburg",5028000,102008123,"Saint Petersburg"
85688069,"Saint Petersburg",4879566,1158857531,"St. Petersburg"
85679675,Dar-es-Salaam,4364541,421189813,"Dar es Salaam"
1125405091,"济南市",4335989,102027459,Jinan
85632179,Panama,3864170,890445081,"Panamá"
1125289475,"佛山市",3600000,102027615,Foshan
1159339499,Somaliland,3500000,421186705,Guayaquil
1326900545,Tianshui,3500000,421186705,Guayaquil
1125357457,"宁波市",3491597,102027273,Ningbo
1125329017,"太原市",3426519,102027117,Taiyuan
1125277863,"淄博市",3129228,102026921,Zibo
1125326677,"乌鲁木齐市",3029372,102026823,Urunchi
890445275,"Fitzroy's Elecctronics",3000000,890442147,Caracas
101855843,"Каштани",2845023,101752489,"Kiev City"
1125411433,"石家庄市",2834942,102027161,Shijiazhuang
1108808547,Kaohsiung,2777384,102026701,Kaohsiung
1125371479,"兰州市",2628426,102027397,Lanzhou
421202467,"البصرة",2600000,101735835,Toronto
890502209,"Navi Mumbai",2600000,101735835,Toronto
101735835,Toronto,2600000,421202467,"البصرة"
890502209,"Navi Mumbai",2600000,421202467,"البصرة"
101946525,"Belo Horizonte",2479175,101948979,"Belo Horizonte"
101959007,Fortaleza,2452185,101944733,Fortaleza
101960371,Fortaleza,2452185,101944733,Fortaleza
101978429,Fortaleza,2452185,101944733,Fortaleza
101979209,Fortaleza,2452185,101944733,Fortaleza
101979457,Fortaleza,2452185,101944733,Fortaleza
101982761,Fortaleza,2452185,101944733,Fortaleza
101983233,Fortaleza,2452185,101944733,Fortaleza
1125309619,"南昌市",2357839,102027295,Nanchang
85953133,Houston,2195914,101725629,Houston
890456811,Perth,2021200,101937679,Perth
890494131,Samarang,1621384,102019121,Semarang
890495997,Samarang,1621384,102019121,Semarang
102081353,Philadelphia,1526006,101718083,Philadelphia
85883483,"Paso Del Norte",1512354,101994153,"Juárez"
102063167,"San Antonio",1436697,101724653,"San Antonio"
1125888989,Kharkiv,1430885,101752963,"Харків"
102012947,Novosibirsk,1419007,102012919,Novosibirsk
102012919,Novosibirsk,1419007,102012947,Novosibirsk
101978599,Manaus,1405835,101941913,Manaus
1125822397,"抚顺",1400646,102027929,Fushun
101971589,"Porto Alegre",1360590,101960525,"Porto Alegre"
101975397,"Alto Alegre I",1360590,101960525,"Porto Alegre"
101979817,"Porto Alegre",1360590,101960525,"Porto Alegre"
101982687,"Porto Alegre",1360590,101960525,"Porto Alegre"
1175610857,"Porto Alegre",1360590,101960525,"Porto Alegre"
102011897,Sverdlovsk,1349772,102011881,""
1125880545,Yekaterinburg,1349772,102011881,""
1125782447,"Nizhniy Novgorod",1284164,102002899,Gorkiy
1175613483,Gorkiy,1284164,102002899,Gorkiy
102002899,Gorkiy,1284164,1175613483,Gorkiy
1125782447,"Nizhniy Novgorod",1284164,1175613483,Gorkiy
890437895,Omdurman,1200000,102027827,Yueyang
1125342789,"岳阳市",1200000,102027827,Yueyang
1125377893,"铜山县",1199193,102026983,Tongshan
85981123,Dallas,1197816,101724385,Dallas
102011631,Samara,1134730,102011627,""
102012389,Omsk,1129281,102012351,Omsk
102012393,Omsk,1129281,102012351,Omsk
102012351,Omsk,1129281,102012393,Omsk
102012389,Omsk,1129281,102012393,Omsk
85671301,Tbilisi,1118035,890443227,Tbilisi
1125783867,"大 同 市",1052678,102027935,Datong
1125377897,"咸阳市",1034081,102027031,Xianyang
890506759,Jodhpur,1033918,102029947,Jodhpur
1125847831,Oufa,1033338,102009691,Ufa
102005387,Voronezh,1032895,102005377,Voronezh
102005377,Voronezh,1032895,102005387,Voronezh
421180763,Managua,1028808,890451731,Managua
85675291,Niamey,1026848,421197251,Niamey
101913231,Brussel,1019022,1175610569,Brussel
101913235,Brussel,1019022,1175610569,Brussel
101913243,Brussel,1019022,1175610569,Brussel
101913245,Brussel,1019022,1175610569,Brussel
101913247,Brussel,1019022,1175610569,Brussel
1125371215,Brussel,1019022,1175610569,Brussel
101925987,Johannesburg,1009035,101925753,Johannesburg
101941607,Campinas,969396,101956069,Campinas
101953469,Campinas,969396,101956069,Campinas
101959217,Campinas,969396,101956069,Campinas
101960363,Campinas,969396,101956069,Campinas
101977453,Campinas,969396,101956069,Campinas
101990819,Campinas,969396,101956069,Campinas
890452049,Freetown,951000,421176817,Freetown
102013953,Krasnoyarsk,927200,102013949,Krasnoyarsk
102013949,Krasnoyarsk,927200,102013953,Krasnoyarsk
421168899,Bishkek,900000,102027425,Kaifeng
421199797,Qom,900000,102027425,Kaifeng
890434609,Abobo,900000,102027425,Kaifeng
1125924245,"Abū Ghurayb",900000,102027425,Kaifeng
1126008595,"开封",900000,102027425,Kaifeng
102027425,Kaifeng,900000,421199797,Qom
421168899,Bishkek,900000,421199797,Qom
890434609,Abobo,900000,421199797,Qom
1125924245,"Abū Ghurayb",900000,421199797,Qom
1126008595,"开封",900000,421199797,Qom
102027425,Kaifeng,900000,421168899,Bishkek
421199797,Qom,900000,421168899,Bishkek
890434609,Abobo,900000,421168899,Bishkek
1125924245,"Abū Ghurayb",900000,421168899,Bishkek
1126008595,"开封",900000,421168899,Bishkek
102011129,"",863725,102011123,Saratov
85945009,Jacksonville,821784,85932547,Jacksonville
85675741,Islamabad,805235,85922583,"San Francisco"
101944861,"Maceió",797759,101948377,"Maceió"
101976425,"Maceió",797759,101948377,"Maceió"
1209499251,Anyang,781129,102027815,Anyang
1125405095,"呼和浩特市",774477,102027913,Hohhot
1125294731,"西宁市",767531,102027001,Xining
1125783617,"秦皇岛",759718,102027225,Qinhuangdao
1125880223,"Campo Grande",729151,101961475,"Campo Grande"
1125945699,"烟台",719332,102026967,Yantai
1125783403,Pekanbaru,703956,102019475,Pekanbaru
1125844261,Liaoyangxian,687890,102027891,Liaoyang
85939649,Detroit,680250,85951091,Detroit
1125983249,"Culiacan Rosales",675000,101993731,"Culiacán Rosales"
1125415649,"牡丹江市",665915,102027885,Mudanjiang
1125338477,Athens,664046,1175612731,Athens
1125783885,"丹东市",631973,102027939,Dandong
1125375845,"盐城市",628441,102026977,Yancheng
1125842555,Bhilauni,625138,102030693,Bhilai
102019025,Situbondo,600000,101741075,Vancouver
890516827,Shuangyashan,600000,101741075,Vancouver
85835959,Multnomah,583776,101715829,Portland
102030477,Cuttack,580000,1495123997,Oslo
1125374969,"常德市",517780,102027057,Changde
1125297277,"珠海市",501199,102027037,Zhuhai
101926403,Vereeniging,474681,101926389,Vereeniging
1125289307,"遵义市",466292,102026873,Zunyi
1125375871,"襄樊市",462956,102027039,Xiangyang
1125289371,"绍兴县",421283,102027169,Shaoxing
1125822383,"鸡西",403759,102027895,Jixi
85953441,Cleveland,396815,101712563,Cleveland
1125811423,Thessaloniki,354290,101752331,"ΘΕΣΣΑΛΟΝΙΚΗ"
101715023,"St. Louis",318416,85971343,"St. Louis"
421199285,"San Jose",288054,421171925,"San Jose"
1125289389,"绵阳市",264136,102027309,Mianyang
1125289501,"通辽市",261110,102027867,Tongliao
1125827413,Porto,249633,101752093,Porto
1125309113,"宜春市",152169,102026959,Yichun
85829035,"Laguna West",115041,421200319,Mendoza
1108718353,Capital,115041,421200319,Mendoza
1125414509,"新余市",97480,102026991,Xinyu
1125309605,"富阳市",70183,102027613,Fuyang
890426825,"Al Ḩudaydah",40260,890432439,"Al Ḩudaydah"
890427113,"Al Ḩudaydah",40260,890432439,"Al Ḩudaydah"
101828199,Morlaix,17516,102029167,"North Guwāhāti"
101828523,"Château-Thierry",15938,101955041,"Valparaíso"
890440685,"Lamharza Essahel",15938,101955041,"Valparaíso"
#configure file here
wof_db=$1

if [[ "$wof_db" == "" ]]; then
  wof_db="whosonfirst-data-admin-latest.db"
fi

# extract records with population that aren't alt geoms or superseded
sqlite3 -header -csv $wof_db << EOF > population_data.csv
SELECT
  city.id as id,
  json_extract(city.body, '$.properties."wof:name"') as name,
  json_extract(city.body, '$.properties."wof:population"') as population,
  json_extract(city.body, '$.properties."wof:megacity"') as megacity,
  json_extract(city.body, '$.properties."wof:parent_id"') as parent
FROM
  geojson city
WHERE
  population != '' AND
  json_extract(city.body, '$.properties."wof:superseded_by"') = '[]' AND
  city.is_alt = 0
;
EOF

# create new sqlite DB with records and just the data we need
sqlite3 wof_population_data.db << EOF
drop table if exists cities;
create table cities( id BIGINT not null, name text, population integer not null, megacity integer, parent);
.mode csv
.import population_data.csv cities
EOF

# query new sqlite DB
sqlite3 -csv -header wof_population_data.db << EOF
SELECT
  city.id as city_id,
  city.name as city_name,
  city.population as city_population,
  megacity.id as megacity_id,
  megacity.name as megacity_name
FROM
  cities city, cities megacity
WHERE
  city_population > 10000 AND
  city.parent != megacity_id AND
  megacity.parent != city_id AND
  megacity.population = city_population AND
  megacity_id != city_id AND
  megacity.megacity = 1
ORDER BY city_population desc
LIMIT 10000;
EOF
stepps00 commented 3 years ago

We've updated the Sao Paolo records identified in the initial comment - that work is now merged to the whosonfirst-data-admin-br repo. I've opened https://github.com/whosonfirst-data/whosonfirst-data/issues/1927 as a follow up to tackle the other cases found in this issue.

missinglink commented 3 years ago

Thanks, we also made some small label fixes and results are looking much better now 🎉

Screenshot 2021-02-23 at 15 38 11