wdm0006 / pygeohash

Python module for interacting with geohashes
https://pygeohash.mcginniscommawill.com/
MIT License

Scripts for decoding and encoding using numba for performance gain #9

Closed IlyasMoutawwakil closed 2 years ago

IlyasMoutawwakil commented 3 years ago

Copied from what I did in geohash-on-steroids:

Dependencies

The optimized functions are built with Numba's njit decorator and operate on arrays, so the only dependencies are Numba and NumPy.
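For illustration, here is a minimal sketch of the scalar decoder (not the exact geohash-on-steroids code, which works on arrays; the name follows the benchmarks below). It is the standard interval-halving geohash decode, compiled with Numba's njit:

from numba import njit

_BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'

@njit
def nb_point_decode(geohash):
    # Standard geohash decoding: each base32 character carries 5 bits
    # that alternately halve the longitude and latitude intervals.
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for ch in geohash:
        idx = _BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (idx >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2.0
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2.0
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    # return the midpoint of the final bounding box
    return (lat_lo + lat_hi) / 2.0, (lon_lo + lon_hi) / 2.0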

Performance

As you can see in my notebook, the performance gain compared to the pure-Python pygeohash functions is the following:

%%timeit
point_decode(geohash) # pygeohash
# Output: 20.4 µs ± 367 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
nb_point_decode(geohash) # nbgeohash
# Output: 4.48 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
point_encode(latitude, longitude) # pygeohash
# Output: 92.8 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
nb_point_encode(latitude, longitude) # nbgeohash
# Output: 11.2 µs ± 663 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But geohashing is generally performed on large numbers of data points, so I made a vector-wise implementation that performs well at scale (see the sketch after the timings below):

%%timeit
np_vector_decode(geohashes) # a numpy vectorization of pygeohash's decode function
# Output: 2.09 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
nb_vector_decode(geohashes) # nbgeohash
# Output: 164 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
np_vector_encode(latitudes, longitudes) # a numpy vectorization of pygeohash's encode function
# Output: 2.57 s ± 53.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
nb_vector_encode(latitudes, longitudes) # nbgeohash
# Output: 443 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
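The vector version is essentially the scalar kernel applied across NumPy arrays inside one compiled loop. A hedged sketch of the idea (illustrative, not the exact PR code; it assumes geohashes is a NumPy array of fixed-width unicode strings):

import numpy as np
from numba import njit

@njit
def nb_vector_decode(geohashes):
    n = len(geohashes)
    lats = np.empty(n, dtype=np.float64)
    lons = np.empty(n, dtype=np.float64)
    for i in range(n):
        # str() turns NumPy's fixed-width unicode element into a plain
        # string for the scalar kernel sketched above (recent Numba
        # versions support this conversion in nopython mode)
        point = nb_point_decode(str(geohashes[i]))
        lats[i] = point[0]
        lons[i] = point[1]
    return lats, lons

The whole loop runs in compiled code, so the Python-level overhead is paid once per array instead of once per geohash.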
IlyasMoutawwakil commented 3 years ago

@wdm0006 I still haven't written any unit tests, so I'm wondering: should they go in the same file you used, as separate methods, or should I create a new file?

wdm0006 commented 3 years ago

I think keeping the new tests in separate files (one for the vectorized versions and one for the Numba ones) would be cleanest.
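A possible skeleton (the import path and sample values are illustrative; the expected geohash is pygeohash's documented encode of (42.6, -5.6)):

import unittest
import pygeohash as pgh  # assumes the numba functions are exposed at top level

class TestNumbaPointGeohash(unittest.TestCase):
    def test_encode(self):
        self.assertEqual(pgh.nb_point_encode(42.6, -5.6), 'ezs42e44yx96')

    def test_decode(self):
        lat, lon = pgh.nb_point_decode('ezs42')
        self.assertAlmostEqual(lat, 42.6, places=1)
        self.assertAlmostEqual(lon, -5.6, places=1)

if __name__ == '__main__':
    unittest.main()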

IlyasMoutawwakil commented 3 years ago

@wdm0006 There are some failures in distance and stats, but I guess they're just Python rounding errors. Anyway, here's the unittest log:

test_check_validity (tests.test_geohash.TestGeohash) ... ok  
test_decode (tests.test_geohash.TestGeohash) ... ok
test_distance (tests.test_geohash.TestGeohash) ... FAIL      
test_encode (tests.test_geohash.TestGeohash) ... ok
test_stats (tests.test_geohash.TestGeohash) ... FAIL
test_decode (tests.test_nbgeohash.TestNumbaPointGeohash) ... ok
test_encode (tests.test_nbgeohash.TestNumbaPointGeohash) ... ok
test_decode (tests.test_nbgeohash.TestNumbaVectorGeohash) ... ok
test_encode (tests.test_nbgeohash.TestNumbaVectorGeohash) ... ok
IlyasMoutawwakil commented 3 years ago

@wdm0006 What do you think?

IlyasMoutawwakil commented 3 years ago

> Overall looks good, but let's use assertAlmostEqual to avoid rounding-error issues in the tests, and make sure that the soft dependencies are actually optional.

Actually, there's no rounding issue in decoding or encoding geohashes; the issue is in distance and stats (which I didn't implement). I can change them, but shouldn't that be in another PR?
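For the distance and stats tests specifically, a tolerance-based assertion could look like this (a sketch; the input values and expected result are pygeohash's documented example, and the delta is illustrative):

import unittest
import pygeohash as pgh

class TestGeohashDistance(unittest.TestCase):
    def test_distance(self):
        # allow a small absolute error instead of requiring exact equality
        self.assertAlmostEqual(
            pgh.geohash_approximate_distance('bcd3u', 'bc83n'),
            625441,
            delta=1,
        )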