zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Don't treat null values as same #892

Closed: LYSTURM closed this issue 2 weeks ago

LYSTURM commented 2 weeks ago

Describe the question: I'm testing Zingg in a project where one column has many null values. Ideally, records should match when the values in that column are an exact match, but not when the values are null. I read in the docs that Zingg treats nulls as matches by default. Is there a way to turn this off?

sonalgoyal commented 2 weeks ago

Nulls are treated as matches by default because, in most real-world matches, one or two attributes are missing from the data and the rest are similar. You can signal otherwise to the models by labeling such pairs differently and using the NULL_OR_BLANK match type along with your other match type for the nullable column. https://docs.zingg.ai/zingg0.4.0/stepbystep/configuration/field-definitions#showconcise
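
As a rough illustration of the suggestion above, here is a minimal sketch that builds one field definition for the config JSON. It assumes the match type referred to above is Zingg's NULL_OR_BLANK, that match types can be combined as a comma-separated string, and that the keys are fieldName, fields, dataType and matchType as on the linked field-definitions page. The column name member_id is hypothetical, and the exact dataType spelling may differ by Zingg version, so check the docs for the release you run.

```python
import json

# Hypothetical nullable column; replace "member_id" with your own column name.
field_definition = {
    "fieldName": "member_id",
    "fields": "member_id",
    # Assumption: plain "string"; some Zingg versions may expect a different spelling.
    "dataType": "string",
    # Assumption: comma-separated match types. EXACT compares the values,
    # NULL_OR_BLANK adds an explicit signal for missing values.
    "matchType": "EXACT,NULL_OR_BLANK",
}

# Print the fragment that would go into the fieldDefinition list of the config file.
print(json.dumps({"fieldDefinition": [field_definition]}, indent=2))
```

The match type alone does not change the default; the labeled pairs still have to show the model that two nulls in this column should not count as agreement.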

HTH

LYSTURM commented 2 weeks ago

Thank you, that's clear. I was a little unsure about NULL_OR_BLANK from the docs, but this makes sense. Appreciate it!