zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Doubt about unify data from multiple source #231

Closed sidharth16395 closed 2 years ago

sidharth16395 commented 2 years ago

Hi Sonal, I am Sidharth and I work as a data science engineer at FirstHive. I have been going through Zingg to use it for unification, and I have a few doubts about the process that I need to clear up. Below I briefly explain the problem statement and my doubts.

Problem Statement:

We have one use case: unifying customers based on data from multiple sources.

sonalgoyal commented 2 years ago

hey Sidharth,

here are some thoughts

  1. You can use the link phase to match across sources. The data has to be in the same schema.
  2. The cluster number is assigned by Zingg.
  3. The similarity model is a classifier that predicts whether two records match.
  4. Similarity and clustering happen as per the match types and field types defined in the config. We have a blocking tree, which is a set coverage algorithm, that does the clustering. Features for that are learnt from the training data.
  5. You can check the link phase and its output: the cluster id and the probability of matching are provided.

hth

sidharth16395 commented 2 years ago

Hi Sonal, thanks for clearing up the doubts. I have also started working on it by pulling the Docker image. Then I got stuck on this issue:

  1. If we use multiple input data sources, the column names of each data source must be the same; otherwise it gives an error.

For example:

sonalgoyal commented 2 years ago

Yes, the column names currently need to be the same across all datasets. You can preprocess your data and select the relevant columns before passing it through Zingg. Providing different column names is on the roadmap.
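The preprocessing step suggested here could look roughly like the sketch below. It is not part of Zingg; it only shows one way to rename columns per source so that every dataset ends up with the same column names before it is passed in. The source names (`crm`, `web`), the original column names, the sample rows, and the helper `to_shared_schema` are all hypothetical.

```python
# Minimal sketch (not Zingg code): align two sources to one shared schema.
# All mappings, source names, and sample rows below are hypothetical.

SHARED_COLUMNS = ["usr_id", "name", "city", "emailid", "mobile"]

# Per-source mapping: original column name -> shared column name.
CRM_MAPPING = {
    "customer_id": "usr_id",
    "full_name": "name",
    "town": "city",
    "email": "emailid",
    "phone": "mobile",
}
WEB_MAPPING = {"uid": "usr_id", "mail": "emailid", "msisdn": "mobile"}


def to_shared_schema(rows, mapping, shared_columns=SHARED_COLUMNS):
    """Rename keys via `mapping`, then keep only the shared columns, in order."""
    aligned = []
    for row in rows:
        renamed = {mapping.get(key, key): value for key, value in row.items()}
        # Columns a source does not have are filled with None.
        aligned.append({col: renamed.get(col) for col in shared_columns})
    return aligned


crm_rows = [
    {"customer_id": "1", "full_name": "Asha Rao", "town": "Pune",
     "email": "asha@example.com", "phone": "555-0100"},
]
web_rows = [
    {"uid": "w9", "name": "A. Rao", "city": "Pune",
     "mail": "asha@example.com", "msisdn": "555-0100", "session": "abc"},
]

crm_aligned = to_shared_schema(crm_rows, CRM_MAPPING)
web_aligned = to_shared_schema(web_rows, WEB_MAPPING)
```

After this step both lists of rows carry exactly the same column names in the same order, and extra columns such as the hypothetical `session` are dropped.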

sidharth16395 commented 2 years ago

Okay, got it. Yes, I have currently preprocessed the data to keep the same names, and that is done. I need one suggestion from your side: for probabilistic unification, apart from demographic data, we want to use fields like browser type, IP address, device type, and operating system to unify records as a single customer. Can we do that, even though the schema will change?

sonalgoyal commented 2 years ago

Yes, you can define your own schema. You don't have to have the same schema as the examples.

sonalgoyal commented 2 years ago

Regarding nulls: you can have null values, but if your columns are very sparsely populated, it makes sense not to use them.

sidharth16395 commented 2 years ago

Yes, got it. If a column has many null values, it will not be useful, so I need to either impute the missing values or drop the column.
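A quick way to decide which columns are too sparse is to measure the fraction of missing values per column before building the config. The sketch below is plain Python, not a Zingg feature; the 0.5 threshold, the helper names, and the sample rows are assumptions chosen only for illustration.

```python
# Minimal sketch (not Zingg code): find and drop sparsely populated columns.
# The threshold and the sample rows are hypothetical.

def null_fraction(rows, column):
    """Fraction of rows where `column` is missing, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def drop_sparse_columns(rows, max_null_fraction=0.5):
    """Return (filtered rows, kept column names) after removing sparse columns."""
    columns = list(rows[0].keys()) if rows else []
    keep = [c for c in columns if null_fraction(rows, c) <= max_null_fraction]
    return [{c: row.get(c) for c in keep} for row in rows], keep


rows = [
    {"usr_id": "1", "browser": "Firefox", "device_type": None},
    {"usr_id": "2", "browser": "Chrome", "device_type": ""},
    {"usr_id": "3", "browser": None, "device_type": "mobile"},
    {"usr_id": "4", "browser": "Safari", "device_type": None},
]

filtered_rows, kept_columns = drop_sparse_columns(rows)
```

Here `device_type` is missing in three of four rows, so it is dropped, while `browser` (one missing value out of four) is kept. Imputation would be the alternative to dropping; which is better depends on the data.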

sidharth16395 commented 2 years ago

You wrote: "Yes, you can define your own schema. You don't have to have the same schema as the examples."

We can define the schema, but if we have two input datasets, their schemas should be the same, as you said earlier. For example, data 1: usr_id, name, city, emailid, mobile; data 2: usr_id, city, emailid, mobile. That will not work, right? I tried it and it gives an error.

sonalgoyal commented 2 years ago

Yes, I meant you don't need to have the same fields as the febrl example. Sorry for the confusion.
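For the two datasets mentioned above, one option that follows the earlier advice to select the relevant columns is to keep only the columns both sources share. The sketch below is plain Python rather than Zingg code; the helper names and the sample rows are hypothetical, and whether losing `name` is acceptable depends on the use case.

```python
# Minimal sketch (not Zingg code): reduce both datasets to their common columns.
# Helper names and sample rows are hypothetical.

def common_columns(*schemas):
    """Columns present in every schema, in the order of the first schema."""
    first, *rest = schemas
    return [col for col in first if all(col in other for other in rest)]


def select_columns(rows, columns):
    """Keep only `columns` from each row."""
    return [{col: row[col] for col in columns} for row in rows]


data1_columns = ["usr_id", "name", "city", "emailid", "mobile"]
data2_columns = ["usr_id", "city", "emailid", "mobile"]

shared = common_columns(data1_columns, data2_columns)

data1_rows = [
    {"usr_id": "1", "name": "Asha Rao", "city": "Pune",
     "emailid": "asha@example.com", "mobile": "555-0100"},
]
data1_aligned = select_columns(data1_rows, shared)
```

With these inputs `shared` contains `usr_id`, `city`, `emailid`, and `mobile`, so both datasets can be written out with identical column names.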