zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Set based and vector based blocking functions #317

Closed by sonalgoyal 1 year ago

sonalgoyal commented 2 years ago

Current blocking functions are compared on equality. What if they returned custom objects instead, and we defined fuzzy similarity on those objects? E.g. n-gram sets with a 0.6 overlap, or word2vec vectors with 0.7 cosine similarity? This would really lift the recall - or would it? ;-)
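
To make the idea concrete, here is a rough sketch of what fuzzy comparison of blocking outputs could look like. It is not Zingg code: every function name is hypothetical, character trigrams and Jaccard overlap are just one assumed way to read "n-gram match", and the vectors are plain lists standing in for word2vec embeddings. Only the 0.6 and 0.7 thresholds come from the comment above.

```python
import math


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string (hypothetical helper)."""
    s = text.lower()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def same_block_by_ngrams(a, b, threshold=0.6):
    """Set-based check: n-gram sets overlap by at least `threshold`."""
    return jaccard(char_ngrams(a), char_ngrams(b)) >= threshold


def same_block_by_vector(u, v, threshold=0.7):
    """Vector-based check: cosine similarity of at least `threshold`."""
    return cosine(u, v) >= threshold
```

One design caveat: unlike equality, a similarity threshold is not transitive (A can match B and B can match C without A matching C), so these checks do not split records into disjoint blocks on their own. Something like locality-sensitive hashing would probably be needed to turn them into block keys at scale.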

sonalgoyal commented 1 year ago

Added vector similarity to the open-source version in 0.4.0. The other ideas need a thorough design and are on hold for now.