zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Running training when dataset only has matches/non matches or limited samples throws errors. We should instead inform the user about this so they can add training samples. #86

Closed by sonalgoyal 10 months ago

sonalgoyal commented 2 years ago

Reported by Luke from Databricks

Attachments: zingg_Dec21_0823_log4j-active (1).txt, zingg_Dec21_0823_sdtderr.txt

navinrathore commented 2 years ago

When no training data is available, a NullPointerException is thrown.

navinrathore commented 2 years ago

Other problematic scenarios: (refer to attached log file)

  1. When only negative or only positive training data is available.
  2. When too few training samples are available.

Error 1: java.lang.IllegalArgumentException: requirement failed: rawPredictionCol vectors must have length=2, but got 1

Error 2: java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer

An appropriate error message should be added that asks the user to add more training data.
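A check run before training could produce such a message. The following is only a minimal sketch and not Zingg's actual code: the class name `TrainingDataCheck`, the label encoding (1 = match, 0 = non-match), and the `minPerClass` threshold are all assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical pre-training check; not part of Zingg's real API. */
class TrainingDataCheck {

    // Assumed label encoding for this sketch: 1 = match, 0 = non-match.
    static final int MATCH = 1;
    static final int NON_MATCH = 0;

    /**
     * Throws IllegalArgumentException with a user-facing message when the
     * labels are missing, one-sided, or too few to train on.
     */
    static void validate(List<Integer> labels, int minPerClass) {
        if (labels == null || labels.isEmpty()) {
            throw new IllegalArgumentException(
                "No training data found. Please label some pairs before training.");
        }
        long matches = labels.stream().filter(l -> l == MATCH).count();
        long nonMatches = labels.stream().filter(l -> l == NON_MATCH).count();
        if (matches < minPerClass || nonMatches < minPerClass) {
            throw new IllegalArgumentException(String.format(
                "Not enough training data: found %d matches and %d non-matches, "
                    + "need at least %d of each. Please add more training samples.",
                matches, nonMatches, minPerClass));
        }
    }
}
```

Calling `TrainingDataCheck.validate(labels, 5)` before fitting the model would reject the null, one-sided, and too-small cases up front, instead of letting them surface later as a NullPointerException or a low-level IllegalArgumentException.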

sonalgoyal commented 2 years ago

let's fix all

sonalgoyal commented 11 months ago

@gnanaprakash-ravi please verify this

gnanaprakash-ravi commented 10 months ago

Hi,

  1. When trainingData is null: [image]

  2. When neg is null and pos is less than 5: [image] Result: [image]

  3. When pos is equal to 5 and neg is null: [image] Result: [image]

  4. When pos and neg are both exactly equal to 5 in the train phase (needs to be analyzed intensively). This behavior occurs with a new model and a new zinggdir: [image] I suspect this error might be related to the Apache Spark library, but it was intercepted by zinggbusinessexception (after the code change): [image]
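The interception described in case 4 can be pictured as translating a low-level exception into one that carries a user-facing message. This is a hedged sketch only: `FriendlyTrainingErrors`, `TrainingDataException`, and `runTraining` are hypothetical names, and the actual code change in Zingg may look quite different.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper; the names here are illustrative, not Zingg's. */
class FriendlyTrainingErrors {

    /** Stand-in for a domain-level exception shown to the user. */
    static class TrainingDataException extends RuntimeException {
        TrainingDataException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs a training step and rewraps low-level failures with a hint. */
    static <T> T runTraining(Supplier<T> trainStep) {
        try {
            return trainStep.get();
        } catch (IllegalArgumentException | NullPointerException e) {
            // Keep the original exception as the cause so the stack trace survives.
            throw new TrainingDataException(
                "Training failed, possibly because there is too little labelled data. "
                    + "Please add more training samples. Original error: " + e.getMessage(),
                e);
        }
    }
}
```

Keeping the original exception as the cause means the underlying library message still appears in the logs, while the user sees the hint about adding training data first.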

sonalgoyal commented 10 months ago

Cases 1, 2, and 3 are working as expected. Case 4 throws an exception whose error message points to insufficient data. No fix needed.