yinqiwen / ardb

A Redis-protocol-compatible NoSQL database that supports multiple storage engines as backends, such as Google's LevelDB, Facebook's RocksDB, OpenLDAP's LMDB, PerconaFT, WiredTiger, and ForestDB.
BSD 3-Clause "New" or "Revised" License

In memory geohash based search #445

Open winash12 opened 6 years ago

winash12 commented 6 years ago

I have downloaded your software and I have started to compute geohashes.

https://github.com/yinqiwen/ardb/wiki/Spatial-Index

Is it possible to do an in-memory geohash-based indexer?

I can create a HashMap of 18 entries: the pair (geohash, geohash+1) and the eight neighbors of each. Supposing I want to search for coordinates within 5 km, what should my precision value be? Also, I do not understand why I must left-shift to 52 bits.
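A rough sketch of one way to turn a radius into a step count, loosely modeled on the step-estimation logic in Redis's geohashEstimateStepsByRadius(); the constant and the back-off are simplifications:

MERCATOR_MAX = 20037726.37  # meters; roughly half the equatorial circumference

def estimate_step(radius_meters):
    # find the geohash step whose cell is just wide enough to cover the radius
    step = 1
    while radius_meters < MERCATOR_MAX:
        radius_meters *= 2
        step += 1
    step -= 2  # back off so one cell plus its neighbors covers the circle
    return max(1, min(step, 26))  # 26 steps = 52 bits, the maximum precision

print(estimate_step(5000))  # -> 11 with these simplified constants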

What data structure would be best to store this so I can do fast searches in memory (without an external database)? A quadtree, an R-tree, or a HashMap?

What I thought was to store all 18 members in a HashMap and do lookups.

yinqiwen commented 6 years ago
  1. this spatial index has been ported to Redis since 3.2; it's all in memory.
  2. a 52-bit geohash value represents a box on the map whose width/height is less than 0.6 meters, so you can do a range search starting from 0.6 meters.
  3. since all data encoded by geohash must be stored in a sorted data structure, a tree/skiplist seems better (see the sketch after this list).
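A minimal sketch of point 3 (not ardb's actual implementation): keep every point's full 52-bit geohash in a sorted list, and note that a cell computed at a coarser step s covers the contiguous range [bits << shift, (bits + 1) << shift) with shift = 2 * (26 - s), which is also why values are left-aligned to 52 bits:

import bisect

point_hashes = []  # fill with the 52-bit (step-26) geohash of every stored point
point_hashes.sort()

def points_in_cell(cell_bits, step):
    # all stored hashes that fall inside the given coarser cell,
    # found with two binary searches over the sorted list
    shift = 2 * (26 - step)
    lo = cell_bits << shift
    hi = (cell_bits + 1) << shift
    return point_hashes[bisect.bisect_left(point_hashes, lo):
                        bisect.bisect_left(point_hashes, hi)]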
winash12 commented 6 years ago

It looks like you are right about the choice of data structure. I extended your code with Cython so that I can call your encode and get_neighbors APIs from Python (I can give you the Cython extension). Then I created a lat/lon grid and populated the geohash, geohash+1, and the 16 neighbors by calling encode and get_neighbors. Then I bit-shifted all of them by 52.

Then I encoded a test value and checked whether it is present in the HashMap, and I got a NO. So I will have to use an OrderedSet to do range queries.

from geo import GeoHash
from collections import OrderedDict
import numpy as np

minLat = 27.401436
maxLat = 62.54858
minLon = -180.0
maxLon = 179.95000000000002

latGrid = np.arange(minLat, maxLat, 0.05)
lonGrid = np.arange(minLon, maxLon, 0.05)
geoHash = GeoHash()

geoHashTable = OrderedDict()
geohash1 = {}
# note: zip pairs the two grids element-wise, not as a full lat/lon product
for lon, lat in zip(lonGrid, latGrid):
    # index the cell containing the grid point
    geohash = geoHash.encode(lon, lat, 24)
    bitsOriginal = geohash["bits"]
    geoHashTable[bitsOriginal << 52] = np.c_[lon, lat]
    # index its eight neighbors
    for v in geoHash.get_neighbors(geohash).values():
        if isinstance(v, dict):
            geoHashTable[v["bits"] << 52] = np.c_[lon, lat]
    # index geohash+1 and its eight neighbors as well
    bitsOriginal1 = bitsOriginal + 1
    geohash1["bits"] = bitsOriginal1
    geohash1["step"] = 24
    geoHashTable[bitsOriginal1 << 52] = np.c_[lon, lat]
    for v in geoHash.get_neighbors(geohash1).values():
        if isinstance(v, dict):
            geoHashTable[v["bits"] << 52] = np.c_[lon, lat]

# exact-match lookup for a test point
lonTest = 172.76843
latTest = 61.560745
geohashTest = geoHash.encode(lonTest, latTest, 24)
finalBits = geohashTest["bits"] << 52

if finalBits in geoHashTable:
    print("Yes")
else:
    print("No")
winash12 commented 5 years ago
1. this spatial index has been ported to Redis since 3.2; it's all in memory.

2. a 52-bit geohash value represents a box on the map whose width/height is less than 0.6 meters, so you can do a range search starting from 0.6 meters.

3. since all data encoded by geohash must be stored in a sorted data structure, a tree/skiplist seems better.

I think I have a way to do this using a HashTable. The idea is the same, i.e. approximate nearest neighbors.

What one can do is create a geohash at a very low precision (to be identified). Then points within a given radius will all probably get the same geohash.

Then one can create a hash table, create a key (what determines a key is to be determined), and store as values the approximate nearest neighbors.

Then, given a query point whose geohash is also calculated at the same low precision, one can do an O(1) lookup.

This is the same idea that is expressed in Locality-Sensitive Hashing. Does that make sense, or is it not possible?
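A minimal sketch of that bucketing scheme, reusing the encode/get_neighbors wrapper API from the code earlier in the thread (the right coarse step for a given radius is left as an assumption):

from collections import defaultdict

def build_index(points, geo, step):
    # bucket every point by its coarse geohash cell
    index = defaultdict(list)
    for lon, lat in points:
        index[geo.encode(lon, lat, step)["bits"]].append((lon, lat))
    return index

def query(index, geo, lon, lat, step):
    # O(1) dict lookups: the query point's cell plus its eight neighbors,
    # so points just across a cell boundary are not missed
    cell = geo.encode(lon, lat, step)
    candidates = list(index.get(cell["bits"], []))
    for v in geo.get_neighbors(cell).values():
        if isinstance(v, dict):
            candidates.extend(index.get(v["bits"], []))
    return candidates  # approximate; filter by true distance if exactness matters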

yinqiwen commented 5 years ago

Storing all points in a skiplist/tree can be used to search by any given radius value.
IMO, storing points in the hash + vectors you described can only be used to search by one fixed radius value.

winash12 commented 5 years ago

I agree with what you are saying, but if you look at this article - https://arxiv.org/pdf/1408.2927.pdf - each one of those radius values can be stored as its own hash + vectors. So more memory is consumed, but that comes with a benefit in the form of O(1) lookups. My original example was for a single radius value, but you can replicate the hash + vectors for all supported radius values and do O(1) lookups.
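A compact sketch of that multi-radius variant, assuming a hypothetical radius-to-step table and the build_index/query helpers sketched above; each supported radius gets its own coarse index, trading memory for O(1) lookups:

RADIUS_TO_STEP = {1000: 12, 5000: 11, 50000: 9}  # assumed radius (m) -> step mapping

def build_all(points, geo):
    # one hash index per supported radius
    return {r: build_index(points, geo, s) for r, s in RADIUS_TO_STEP.items()}

def query_radius(indexes, geo, lon, lat, radius):
    # only the pre-built radii are supported; any other value raises KeyError
    return query(indexes[radius], geo, lon, lat, RADIUS_TO_STEP[radius])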

Also, have you looked at http://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/ ? S2 cells have the advantage that they are indexed along a true space-filling curve.

yinqiwen commented 5 years ago

IMO, most CPU time is spent calculating the distances between the given lat/lon and the points in the matched areas. Using an O(1) lookup to find the areas does not help performance much.
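A minimal sketch of that per-point distance check, using the standard haversine formula (the function name is illustrative):

import math

def haversine_m(lon1, lat1, lon2, lat2):
    # great-circle distance in meters between two lon/lat points
    R = 6372797.560856  # earth radius in meters, the value Redis's geo code uses
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# keep only candidates that are truly inside the radius
# results = [p for p in candidates if haversine_m(qlon, qlat, *p) <= radius]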

The Hilbert space-filling curve is actually a better algorithm than the current geohash encoding, as it would reduce the search range.
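For reference, a sketch of the textbook iterative Hilbert mapping (not ardb code): consecutive indices always correspond to map-adjacent cells, which keeps the ranges a sorted scan must cover smaller than with geohash's Z-order interleaving:

def xy2d(n, x, y):
    # map cell (x, y) of an n x n grid (n a power of two) to its Hilbert index
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the sub-curve has standard orientation
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# sanity check: walking the curve in index order only ever steps to an adjacent cell
n = 8
order = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda p: xy2d(n, *p))
assert all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
           for a, b in zip(order, order[1:]))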