Closed sidharth16395 closed 2 years ago
hey Sidharth,
here are some thoughts
hth
Hi Sonal, thanks for clearing up my doubts. I have started working with Zingg by pulling the Docker image, but I got stuck on the following issue:
For example:
if one input data source has a column named email_id
and another input data source has a column named email_address,
then training fails with a column mismatch error. If we keep the same column names across all input data sources, it works fine. But my doubt is:
Yes, the column names currently need to be the same across all datasets. You can preprocess your data and select the relevant columns before passing it through Zingg. Support for different column names is on the roadmap.
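A minimal sketch of that preprocessing step, using only the Python standard library. It is not part of Zingg; the alias map, the column list, and the `align_columns` helper are hypothetical names chosen for illustration, and the sample rows are made up.

```python
import csv
import io

# Hypothetical mapping from source-specific column names to one shared name.
COLUMN_ALIASES = {"email_address": "email_id"}
# Hypothetical list of columns every source should end up with, in this order.
COMMON_COLUMNS = ["usr_id", "email_id"]


def align_columns(csv_text):
    """Rename aliased columns and keep only the shared columns, in a fixed order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    aligned = []
    for row in reader:
        # Replace any aliased header with the shared name; leave other headers alone.
        renamed = {COLUMN_ALIASES.get(name, name): value for name, value in row.items()}
        aligned.append({name: renamed[name] for name in COMMON_COLUMNS})
    return aligned


# Made-up sample data standing in for the two input sources.
source_a = "usr_id,email_id\n1,a@example.com\n"
source_b = "usr_id,email_address\n2,b@example.com\n"

rows_a = align_columns(source_a)
rows_b = align_columns(source_b)
```

After this step both sources carry the columns usr_id and email_id, so they can be written back out and used as inputs with matching column names.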
Okay, got it. For now I have preprocessed the data to keep the same column names, and that is done. I also need one suggestion from your side: for probabilistic unification, apart from demographic data we want to use attributes like browser type, IP address, device type, and operating system to unify records into a single customer. Can we do that, given that the schema will change?
Yes, you can define your own schema. You don't have to use the same schema as the examples.
Regarding nulls: you can have null values, but if your columns are very sparsely populated, it makes sense not to use them.
Yes, got it. If a column has many null values it will not be useful, so I need to either impute the missing values or drop the column.
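One way to decide which columns to drop is to measure how sparsely each one is populated. The sketch below is plain Python, not a Zingg feature; the helper names, the 0.5 threshold, and the sample rows are assumptions made for illustration, and the right cutoff depends on your data.

```python
def null_fraction(rows, column):
    """Share of rows where the column is missing, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def drop_sparse_columns(rows, columns, max_null_fraction=0.5):
    """Keep only columns whose null fraction is at or below the threshold."""
    kept = [c for c in columns if null_fraction(rows, c) <= max_null_fraction]
    return kept, [{c: row.get(c) for c in kept} for row in rows]


# Made-up sample rows with non-demographic attributes.
rows = [
    {"usr_id": "1", "browser": "Firefox", "ip_address": None},
    {"usr_id": "2", "browser": None, "ip_address": None},
    {"usr_id": "3", "browser": "Chrome", "ip_address": "10.0.0.1"},
]

kept, cleaned = drop_sparse_columns(rows, ["usr_id", "browser", "ip_address"])
```

Here browser is null in one of three rows and is kept, while ip_address is null in two of three rows and is dropped.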
we can
We can define our own schema, but if we have two input datasets their schemas should be the same, as you said earlier. For example, with data 1: usr_id, name, city, emailid, mobile and data 2: usr_id, city, emailid, mobile, it will not work, right? I tried this and it gives an error.
Yes, I meant that you don't need to have the same fields as the febrl example. Sorry for the confusion.
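For the data 1 / data 2 example in this thread, one possible preprocessing approach is to keep only the columns both datasets share. This is a sketch in plain Python under that assumption, not something Zingg does for you; `shared_columns`, `project`, and the sample row are hypothetical.

```python
# Column lists taken from the example in this thread.
data_1_columns = ["usr_id", "name", "city", "emailid", "mobile"]
data_2_columns = ["usr_id", "city", "emailid", "mobile"]


def shared_columns(first, second):
    """Columns present in both lists, in the order they appear in the first."""
    second_set = set(second)
    return [c for c in first if c in second_set]


def project(rows, columns):
    """Keep only the given columns from each row."""
    return [{c: row[c] for c in columns} for row in rows]


common = shared_columns(data_1_columns, data_2_columns)

# Made-up sample row for data 1.
data_1 = [
    {"usr_id": "1", "name": "Asha", "city": "Pune",
     "emailid": "a@example.com", "mobile": "111"},
]
data_1_projected = project(data_1, common)
```

With name removed from data 1, both datasets have the fields usr_id, city, emailid, and mobile.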
Hi Sonal, I am Sidharth, and I work as a data science engineer at FirstHive. I have been going through Zingg to use it for unification, and I have a few doubts about the process that I need to clear up. Below I briefly explain the problem statement and my doubts.
Problem Statement:
We have a use case where we need to unify customers based on data from multiple sources.
Doubt: