ychantit / fuzzymatch_hiveUDF

A Hive UDF for fuzzy string matching using Jaro-Winkler, Levenshtein or N-gram distance

fuzzy_match Hive UDF

A Hive UDF utility for fuzzy matching of two strings using the Jaro-Winkler (JW), Levenshtein (LV) or N-gram (NG) distance.

The fuzzy_match UDF is a wrapper around the string distance implementations available in the Lucene spell checker package (org.apache.lucene.search.spell):

JaroWinklerDistance

LevensteinDistance

NGramDistance
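To give a feel for the kind of score such a distance class produces, here is a minimal plain-Java sketch of a normalized Levenshtein similarity. It is an illustration only: the class name LevenshteinSketch and its methods are hypothetical, the code is neither the UDF's source nor Lucene's, and it assumes the score is normalized as 1 - editDistance / max(length), so that 1.0 means identical strings.

```java
// Illustrative sketch only: hypothetical class, not the UDF's source and not Lucene's code.
final class LevenshteinSketch {

    /** Classic edit distance (insert, delete, substitute), computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 for maximally different ones. */
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) editDistance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

With this normalization, "kitten" and "sitting" are three edits apart over a maximum length of seven, which gives a score of 1 - 3/7.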

This project also serves as an example implementation of a Hive GenericUDF.

fuzzy_match Hive UDF input & output

Param 1 : First string to match.

Param 2 : Second string to match with the first one.

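Param 3 : Algorithm to use for matching : JW, LV or NG.

Return : Double, the distance score computed between the two strings. Lucene's StringDistance implementations return a value between 0 and 1, where 1 means the strings are identical, so, assuming the UDF passes that value through unchanged, a higher score means a closer match.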
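As a rough sketch of how the third parameter could be validated before a distance is picked, the snippet below maps the codes JW, LV and NG onto a Java enum. The enum MatchAlgo and its parse method are hypothetical names introduced for illustration, and the trimming and case-insensitive handling are assumptions; the actual UDF may treat its argument differently.

```java
import java.util.Locale;

// Hypothetical helper for illustration; MatchAlgo is not a name taken from the project.
enum MatchAlgo {
    JW("Jaro-Winkler"),
    LV("Levenshtein"),
    NG("N-gram");

    private final String label;

    MatchAlgo(String label) {
        this.label = label;
    }

    /** Human-readable name of the algorithm behind the code. */
    String label() {
        return label;
    }

    /** Parses an algorithm code, ignoring case and surrounding whitespace (an assumption). */
    static MatchAlgo parse(String code) {
        if (code == null) {
            throw new IllegalArgumentException("Algorithm code must not be null; expected JW, LV or NG");
        }
        try {
            return MatchAlgo.valueOf(code.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown algorithm code '" + code + "'; expected JW, LV or NG");
        }
    }
}
```

Failing fast on an unknown code gives a clearer error than silently falling back to a default algorithm, which is why this sketch throws instead of guessing.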

How to build the fuzzy_match project

fuzzy_match is a Maven project, so building and installing it is straightforward: run mvn clean install. The build produces a fat jar that includes all the dependencies of the fuzzy_match UDF.

How to use fuzzy_match in a Hive script

  1. Put the fat jar fuzzytext-1.0-SNAPSHOT-fat.jar in a directory of your choice, in my case /home/ych/fuzzy_match
  2. In your Hive script or shell, add the following two lines :

     add jar /home/ych/fuzzy_match/fuzzytext-1.0-SNAPSHOT-fat.jar;
     CREATE TEMPORARY FUNCTION fuzzy_match as 'com.ych.fuzzytext.hive.udf.FuzzyMatch';

That's it, you are good to go!

Start using fuzzy_match :

     select a, b, fuzzy_match(a, b, "JW") from mytable;