zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Exception in StringSimilarityDistanceFunction during 'match' #353

Closed: navinrathore closed this issue 2 years ago

navinrathore commented 2 years ago

Describe the bug: The following exception is thrown during "match". It can be reproduced with the amazon-google dataset by creating a new model and then running match.

2022-06-17 13:01:41,454 [Executor task launch worker for task 3.0 in stage 62.0 (TID 102)] WARN  org.apache.spark.storage.BlockManager - Block rdd_141_3 could not be removed as it was not found on disk or in memory
 2022-06-17 13:01:41,455 [Executor task launch worker for task 3.0 in stage 62.0 (TID 102)] WARN  org.apache.spark.storage.BlockManager - Putting block rdd_147_3 failed due to exception org.apache.spark.TaskKilledException.
 2022-06-17 13:01:41,455 [Executor task launch worker for task 3.0 in stage 62.0 (TID 102)] WARN  org.apache.spark.storage.BlockManager - Block rdd_147_3 could not be removed as it was not found on disk or in memory
 Caused by: java.lang.NullPointerException
    at java.base/java.util.TreeMap.rotateLeft(TreeMap.java:2221)
    at java.base/java.util.TreeMap.fixAfterInsertion(TreeMap.java:2288)
    at java.base/java.util.TreeMap.put(TreeMap.java:580)
    at com.wcohen.ss.tokens.SimpleTokenizer.intern(SimpleTokenizer.java:80)
    at com.wcohen.ss.tokens.SimpleTokenizer.internSomething(SimpleTokenizer.java:66)
    at com.wcohen.ss.tokens.SimpleTokenizer.tokenize(SimpleTokenizer.java:44)
    at com.wcohen.ss.Jaccard.prepare(Jaccard.java:33)
    at com.wcohen.ss.AbstractStringDistance.score(AbstractStringDistance.java:30)
    at zingg.similarity.function.StringSimilarityDistanceFunction.call(StringSimilarityDistanceFunction.java:28)
    at zingg.similarity.function.StringSimilarityDistanceFunction.call(StringSimilarityDistanceFunction.java:8)
    at org.apache.spark.sql.UDFRegistration.$anonfun$register$354(UDFRegistration.scala:793)
    ... 62 more
navinrathore commented 2 years ago

Related issues:

navinrathore commented 2 years ago

Reproduction steps

navinrathore commented 2 years ago

Inputs to the function at the time of the failure:

########### 
First: sonicwall 01-ssc-6997 : usually ships in 24 hours : : sonicwall client/server anti-virus suite leverages the award-winning mcafee netshield and groupshield applications for networks with windows -based file print and exchange servers., 
Second: sonicwall 01-ssc-5670 : usually ships in 24 hours : : more and more businesses schools government agencies and libraries are connecting to the internet to meet their organizational and educational goals.
 2022-06-21 18:37:01,771 [Executor task launch worker for task 6.0 in stage 62.0 (TID 683)] ERROR org.apache.spark.executor.Executor - Exception in task 6.0 in stage 62.0 (TID 683)
navinrathore commented 2 years ago

Add a JUnit test for the class, covering the sample above and inputs with /
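As a starting point for such a test, here is a minimal plain-Java sketch that scores the two failing inputs with a token-level Jaccard similarity. It is a stand-in under stated assumptions, not the SecondString or Zingg implementation: the class `JaccardSketch`, its methods, the tokenization rule (split on non-alphanumeric characters), and the choice to return 0.0 for two empty inputs are all hypothetical. A real JUnit test would call `StringSimilarityDistanceFunction` directly instead.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a token-level Jaccard scorer.
// NOT the com.wcohen.ss.Jaccard or Zingg code; names and rules are assumptions.
class JaccardSketch {

    // Lowercase the input and split on runs of non-alphanumeric characters,
    // so separators such as '/', ':' and '-' never end up inside a token.
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>();
        if (s == null) {
            return out; // assumption: treat null as an empty token set
        }
        for (String t : s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
    static double jaccard(String first, String second) {
        Set<String> a = tokens(first);
        Set<String> b = tokens(second);
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: no evidence of similarity for two empty inputs
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // The two inputs reported in this issue.
        String first = "sonicwall 01-ssc-6997 : usually ships in 24 hours : : sonicwall client/server anti-virus suite leverages the award-winning mcafee netshield and groupshield applications for networks with windows -based file print and exchange servers., ";
        String second = "sonicwall 01-ssc-5670 : usually ships in 24 hours : : more and more businesses schools government agencies and libraries are connecting to the internet to meet their organizational and educational goals.";
        System.out.println("score = " + jaccard(first, second));
        System.out.println("slash case = " + jaccard("client/server", "client server"));
    }
}
```

Because the failure is intermittent, a single-threaded test like this is unlikely to reproduce it on its own; it mainly pins down the expected score range for the reported inputs and for strings containing /.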

navinrathore commented 2 years ago

The issue was seen again, but only once in tens of trials. To debug further, we may need to add log statements inside the wcohen SecondString library. That library has had no updates since 2017.

sonalgoyal commented 2 years ago

From what we have seen so far, this happens when match is run while the model is out of sync with the features.