whosonfirst / whosonfirst-www-spelunker

A simple Flask-based spelunker for poking around Who's On First data
BSD 3-Clause "New" or "Revised" License

Inconsistent feature counts between WOF features and GeoJSON collection #82

Closed stepps00 closed 7 years ago

stepps00 commented 7 years ago

I navigated the Spelunker to the record for Russia and clicked the "Download Descendants of Russia" link, which led me here.

The count shown for counties was 2,271 features, but the GeoJSON collection only contained 1,641 features. I tried a second time; the second GeoJSON bundle only contained 1,571 features.

Here's a screen shot of the GeoJSON bundle (light blue) over what was expected (light brown):

[Image: screen shot 2017-01-19 at 5:01:11 pm]

It may be a coincidence, but both GeoJSON bundles I grabbed from the Spelunker seemed to be missing larger geometries. Also, the progress seemed to slow significantly around 80%.

Screenshot of the Spelunker, for what it's worth:

[Image: screen shot 2017-01-19 at 4:57:31 pm]

dphiffer commented 7 years ago

Weird, I was able to grab all 2,271 counties. I can send you the GeoJSON file to make sure it has what you were looking for. I double-checked the count like this:

cat wof_bundle_85632685_county.geojson | jq ".features | length"

stepps00 commented 7 years ago

Interesting.. I downloaded a bundle of county features in Russia just now and the file contained 1,528 features. At first glance, it looks like the same large geometries are not included.

If you send over your GeoJSON, I can compare it to a file of what I expect to see.

thisisaaronland commented 7 years ago

Can you tell which counties are missing?

stepps00 commented 7 years ago

No, not without a bundle of expected features to compare the bundle to...

stepps00 commented 7 years ago

Actually, the GeoJSON bundle could be joined to the CSV summary. This would list the missing wof:ids.

The CSV summary seems to include all features, but the GeoJSON bundle does not.
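
A rough sketch of that join in Python, for anyone who wants to repeat it. It assumes the CSV summary has a header row with a `wof:id` column and that each feature in the bundle carries its ID in `properties["wof:id"]`; neither is confirmed in this thread. The function name and the sample data are made up for illustration.

```python
import csv
import io
import json


def missing_wof_ids(csv_text, geojson_text, id_column="wof:id"):
    """Return IDs listed in the CSV summary but absent from the GeoJSON bundle."""
    # Assumption: the CSV summary has a header row containing `id_column`.
    csv_ids = {row[id_column] for row in csv.DictReader(io.StringIO(csv_text))}

    # Assumption: each feature stores its ID in properties["wof:id"].
    bundle = json.loads(geojson_text)
    bundle_ids = {str(f["properties"]["wof:id"]) for f in bundle["features"]}

    # IDs are compared as strings, so the result is sorted lexicographically.
    return sorted(csv_ids - bundle_ids)


# Tiny made-up sample: three rows in the summary, two features in the bundle.
sample_csv = "wof:id,wof:name\n1,A\n2,B\n3,C\n"
sample_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"wof:id": 1}, "geometry": None},
        {"type": "Feature", "properties": {"wof:id": 3}, "geometry": None},
    ],
})

print(missing_wof_ids(sample_csv, sample_geojson))
```

With real files, the two text arguments would come from reading the downloaded CSV summary and GeoJSON bundle from disk.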

thisisaaronland commented 7 years ago

If you can pull out the missing features, that would help with debugging.

stepps00 commented 7 years ago

Interesting... after joining, I realized the CSV summary file actually contains 496 duplicate features.

The CSV summary contains 2,271 rows in total (matching the feature count in WOF), and the 496 duplicates equal the number of features missing from the GeoJSON bundle (2,271 - 496 = 1,775, which is how many features the bundle contains).
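
One way to check for duplicates like these, sketched in Python. It assumes the CSV summary has a header row with a `wof:id` column (not confirmed here) and counts a "duplicate" as any row beyond the first for a given ID. The function name and sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def duplicate_summary(csv_text, id_column="wof:id"):
    """Return (total_rows, unique_ids, extra_rows) for a CSV summary."""
    # Assumption: the CSV summary has a header row containing `id_column`.
    ids = [row[id_column] for row in csv.DictReader(io.StringIO(csv_text))]
    counts = Counter(ids)

    # Every row beyond the first occurrence of an ID counts as a duplicate.
    extra_rows = len(ids) - len(counts)
    return len(ids), len(counts), extra_rows


# Tiny made-up sample: five rows, two of which repeat an earlier ID.
sample_csv = "wof:id,wof:name\n1,A\n2,B\n2,B\n3,C\n3,C\n"

total, unique, extra = duplicate_summary(sample_csv)
print(total, unique, extra)
```

If the 496 figure counts extra rows in this sense, the same arithmetic gives 2,271 total rows and 1,775 unique IDs.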

dphiffer commented 7 years ago

So after some testing I was able to reproduce this. @thisisaaronland and I worked with it for a bit and I think we've fixed the underlying problem. Could you try re-downloading those features and see if you get something more reasonable?

Thanks for the test case, btw. It's kind of an easy-to-miss one.

stepps00 commented 7 years ago

I downloaded all campus records parented by the United States in one bundle - all counts matched what was expected. I also downloaded the same set of county records parented by Russia - again, all counts matched what was expected. (!)

I will keep testing the descender tool with new bundles, but it looks like the original issue is now fixed.

dphiffer commented 7 years ago

Gonna close this, we can open it again if we see the issue crop up again.