Doubt about unifying data from multiple sources

sidharth16395 commented 2 years ago

Hi Sonal, I am Sidharth, and I work as a data science engineer at FirstHive. I have been going through Zingg to use it for unification, and I have a few doubts about the process that I need to clear up. Below I briefly explain the problem statement and my doubts.

Problem Statement:

We have a use case of unifying customers based on data from multiple sources.

For example:
a. We have one master dataset and 3 types of channel data.
b. We need to unify the master data by matching it against the channel data and updating the id.

Doubts:
1. If we have data from multiple sources (cross-channel data), how will Zingg unify it?
2. How is the number of clusters assigned? Do we need to define the number of clusters while training, or is the number of clusters picked randomly?
3. Does the similarity model check similarity within a cluster or across clusters?
4. How many sets of features are the similarity and clustering models trained on? (In my existing unification work, I trained a k-means clustering model on email and name, then did a cross join among the clusters to unify records by giving a probability value for city, email, and name.)
5. If we automate the process, how will it produce the labels, i.e. cluster id, match case, and probability of matching? Or is it using record linkage?

sonalgoyal commented 2 years ago

Hey Sidharth,

Here are some thoughts:

1. You can use the link phase to match across sources. The data has to be in the same schema.
2. The cluster number is assigned by Zingg.
3. The similarity model is a classifier that predicts whether two records match.
4. Similarity and clustering happen as per the match types and field types defined in the config. We have a blocking tree, which is a set coverage algorithm, which does the clustering. Features for that are learnt from the training data.
5. You can check the link phase and its output: the cluster id and the probability of matching are provided.
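
To make the point about cluster numbers more concrete, here is a toy sketch in plain Python. It is not Zingg's code or API. It only illustrates the general idea that when a pairwise classifier decides which records match, cluster ids can be derived from those matches instead of being fixed in advance as in k-means. The function name, record ids, and matched pairs are all made up for illustration.

```python
# Toy illustration only: this is NOT Zingg's implementation or API.
# Records that are connected through matched pairs end up in the same cluster.

def cluster_from_matches(record_ids, matched_pairs):
    """Assign a cluster id to every record using a simple union-find."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the clusters in order of first appearance.
    cluster_ids = {}
    result = {}
    for rid in record_ids:
        root = find(rid)
        if root not in cluster_ids:
            cluster_ids[root] = len(cluster_ids)
        result[rid] = cluster_ids[root]
    return result


# Hypothetical ids: "m*" from the master data, "c*" from channel data.
records = ["m1", "m2", "c1", "c2", "c3"]
# Hypothetical pairs that a pairwise classifier predicted as matches.
matches = [("m1", "c1"), ("c1", "c3")]

clusters = cluster_from_matches(records, matches)
print(clusters)
```

In this sketch the number of clusters is simply whatever falls out of the matched pairs, so nothing has to be specified up front.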

hth

sidharth16395 commented 2 years ago

Hi Sonal, thanks for clearing the doubts. I have started working with it by pulling the Docker image, but I got stuck on an issue:

If we use multiple input data sources, the column names of each data source have to be the same; otherwise it gives an error.

For example:

if one input data source has the column name: email_id
and another input data source has the column name: email_address

then training fails with a column mismatch error. If we keep the same column names across the input data sources, it works fine. But my doubts are:
- Can't we keep different column names?
- Can we handle missing values?
- As you said, the schema of multiple input data sources should be the same, otherwise it will not be possible. But we have some input data whose schema differs from the master data, as we are doing probabilistic unification.

sonalgoyal commented 2 years ago

Yes, the column names currently need to be the same across all datasets. You can preprocess your data and select the relevant columns before passing it through Zingg. Providing different column names is on the roadmap.
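
The preprocessing step described above could look roughly like this. It is a minimal sketch in plain Python, run before the data ever reaches Zingg, and it does not use any Zingg API. The alias mapping, the list of shared columns, and the sample rows are assumptions made for illustration; only email_id and email_address come from the example in this thread.

```python
# Hypothetical mapping from source-specific column names to the shared names.
COLUMN_ALIASES = {
    "email_address": "email_id",
}

# Assumed set of columns that every source should expose after preprocessing.
SHARED_COLUMNS = ["email_id", "name", "city"]


def harmonize(rows, aliases=COLUMN_ALIASES, columns=SHARED_COLUMNS):
    """Rename aliased columns and keep only the shared columns, in a fixed order."""
    harmonized = []
    for row in rows:
        renamed = {aliases.get(key, key): value for key, value in row.items()}
        # Columns missing from a source become None instead of raising an error.
        harmonized.append({col: renamed.get(col) for col in columns})
    return harmonized


# Made-up sample rows.
master = [{"email_id": "a@x.com", "name": "Asha", "city": "Pune", "segment": "gold"}]
channel = [{"email_address": "a@x.com", "name": "Asha K", "city": "Pune"}]

master_clean = harmonize(master)
channel_clean = harmonize(channel)
print(master_clean)
print(channel_clean)
```

After this step both sources carry the same column names in the same order, and can be written out (for example as CSV) as inputs with a single schema.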

sidharth16395 commented 2 years ago

Okay, got it. Yes, for now I have preprocessed the data and kept the same names, and that part is done. I need one suggestion from your side: for probabilistic unification, apart from demographic data, we want to use attributes like browser type, IP address, device type, and operating system to unify records as a single customer. Can we do that, given that the schema will change?

sonalgoyal commented 2 years ago

Yes, you can define your own schema. You don't have to have the same schema as the examples.

sonalgoyal commented 2 years ago

Regarding nulls: you can have null values, but if your columns are very sparsely populated, it makes sense not to use them.

sidharth16395 commented 2 years ago

Yes, got it. If a column has many null values, it will not be useful, so I need to either impute the missing values or drop the column.
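
One way to check how sparsely populated each column is, and to drop the sparse ones before matching, is sketched below in plain Python. The 0.5 threshold, the function names, and the sample rows are assumptions for illustration; nothing in this thread prescribes a specific cutoff.

```python
def null_fraction(rows, column):
    """Fraction of rows where the column is missing, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def drop_sparse_columns(rows, columns, max_null_fraction=0.5):
    """Keep only columns whose null fraction is at or below the (assumed) threshold."""
    kept = [c for c in columns if null_fraction(rows, c) <= max_null_fraction]
    return kept, [{c: row.get(c) for c in kept} for row in rows]


# Made-up sample rows: "browser" is mostly empty, the other columns are mostly filled.
rows = [
    {"email_id": "a@x.com", "city": "Pune", "browser": None},
    {"email_id": "b@x.com", "city": None, "browser": None},
    {"email_id": "c@x.com", "city": "Delhi", "browser": "Firefox"},
    {"email_id": None, "city": "Mumbai", "browser": None},
]

kept, dense_rows = drop_sparse_columns(rows, ["email_id", "city", "browser"])
print(kept)
```

Columns that survive this check can still contain some nulls, which is fine as noted above; imputation is a separate decision that depends on the data.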

sidharth16395 commented 2 years ago

Quoting: "yes, you can define your own schema. you dont have to have the same schema as the examples."

We can define the schema, but as you said earlier, if we have 2 input datasets their schemas should be the same. For example, with data 1: usr_id, name, city, emailid, mobile and data 2: usr_id, city, emailid, mobile, it will not work, right? I tried this and it gives an error.

sonalgoyal commented 2 years ago

Yes, I meant you don't need to have the same fields as the febrl example. Sorry for the confusion.
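
For the data 1 / data 2 case above, one possible workaround is to project both inputs onto the columns they have in common before running Zingg. This is only a sketch in plain Python under that assumption, with made-up sample rows and hypothetical helper names; it does not call any Zingg API.

```python
def common_columns(*column_lists):
    """Columns present in every source, in the order of the first source."""
    first, *rest = column_lists
    return [c for c in first if all(c in other for other in rest)]


def project(rows, columns):
    """Keep only the given columns from each row."""
    return [{c: row[c] for c in columns} for row in rows]


# Column lists taken from the example in this thread.
data1_columns = ["usr_id", "name", "city", "emailid", "mobile"]
data2_columns = ["usr_id", "city", "emailid", "mobile"]

shared = common_columns(data1_columns, data2_columns)

# Made-up sample rows.
data1 = [{"usr_id": 1, "name": "Asha", "city": "Pune",
          "emailid": "a@x.com", "mobile": "9000000001"}]
data2 = [{"usr_id": 77, "city": "Pune",
          "emailid": "a@x.com", "mobile": "9000000001"}]

data1_aligned = project(data1, shared)
data2_aligned = project(data2, shared)
print(shared)
```

The trade-off is that name is dropped from data 1 because data 2 does not have it. An alternative would be to add an empty name column to data 2, since null values are allowed, but a column that is entirely null for one source may not help the matching much.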

zinggAI / zingg

Doubt about unifying data from multiple sources #231
