zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0
954 stars · 118 forks

Suggestions for FindTrainingData step #213

Open lsbilbro opened 2 years ago

lsbilbro commented 2 years ago

Hello!

As you know I'm working with the NC 5M dataset. I am frequently restarting from scratch to test the framework I'm building. Each time, I'll run the findTrainingData + labeler sequence until I get 40 positive labels.

Usually this takes ~150 total labels (i.e. 150 pairs / 300 records) ... but sometimes it takes significantly longer.

For example, in my current iteration, I've labeled 419 pairs and I only have 28 positive labels so far. Moreover, 17 of my first 50 labels were positive, but only 11 of the last 369 have been positive. Is this an indication that something has gone wrong?

Are there any tuning/config tips I could be making use of here?
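One simple way to spot the drift described above is to track the positive-label rate over a trailing window of recent labels and watch whether it collapses. This is a generic monitoring sketch, not part of Zingg itself; the window size is arbitrary.

```python
def rolling_positive_rate(labels, window=50):
    """Return the fraction of positive labels (1s) in each trailing window.

    labels: sequence of 0/1 label decisions in the order they were made.
    """
    rates = []
    for i in range(window, len(labels) + 1):
        chunk = labels[i - window:i]
        rates.append(sum(chunk) / window)
    return rates
```

If the last few windows sit far below the early ones (e.g. 0.34 at the start vs. 0.03 at the end, as in the numbers above), that is a signal the learner may be stuck probing one region of the pair space.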

sonalgoyal commented 2 years ago

Hmm... I have seen this happen once or twice, but the learning always came back. It just needs some time and a couple of extra rounds.

I would advise starting from scratch and going from there. Can you please share the model folder of the run that has gone wrong?

Our current active learner implementation takes a sample and makes predictions. It then splits the records at a probability of 0.5 and takes the 10 records in each split closest to 0.5. I can check what's going wrong. Exposing these internals (confusion matrix etc.) is on the roadmap, which will help you decide better. Thanks for showing me your notebooks the other day, sparked a lot of fresh ideas :D
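The selection strategy described above (split the scored sample at 0.5 and pick the pairs nearest the decision boundary on each side) could be sketched roughly like this. This is an illustrative approximation, not Zingg's actual code; `scored_pairs` and `per_side` are hypothetical names.

```python
def select_uncertain_pairs(scored_pairs, per_side=10):
    """Pick the most uncertain pairs for labeling.

    scored_pairs: list of (pair_id, match_probability) tuples.
    Returns per_side pairs from each side of the 0.5 boundary,
    ordered by closeness to 0.5.
    """
    likely_matches = [p for p in scored_pairs if p[1] >= 0.5]
    likely_non_matches = [p for p in scored_pairs if p[1] < 0.5]
    # distance from the decision boundary = uncertainty
    by_uncertainty = lambda p: abs(p[1] - 0.5)
    return (sorted(likely_matches, key=by_uncertainty)[:per_side]
            + sorted(likely_non_matches, key=by_uncertainty)[:per_side])
```

A strategy like this explains the behavior in the report above: if the model gets confidently wrong about a region of the data, it can keep serving borderline non-matches from that region until new labels shift the boundary.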

Another approach could be to run trainMatch anyway and see what it gives you.

lsbilbro commented 2 years ago

Thanks Sonal, I'll attach it here.

I ended up doing a few more rounds before trying something different. What I found was interesting. During one of my last rounds I "lied" to zingg - i.e. I labeled some pairs that were close (but not real matches) as matches. I did this for about 5 or so pairs. On the very next batch, I was presented with an excellent distribution of matches and non-matches again... almost like zingg had broken out of its loop... I know it sounds weird. After this last batch I had 48 positive labels, so I moved forward with trainMatch.

I'll definitely be restarting from scratch again soon.

lsbilbro commented 2 years ago

NCVoter360.tar.gz

sonalgoyal commented 2 years ago

The black magic of machine learning Luke ;-)

How did the match output look with the wrong labels? I suspect it may not be too good.

We have an updateLabel utility which you could use to edit the labels: https://docs.zingg.ai/zingg/updatinglabels - we haven't built it for Databricks yet, but I believe our original notebooks can be used as a base to build something similar?
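For a notebook-based workaround along those lines, fixing the deliberately wrong labels could look something like the sketch below. This assumes the marked training pairs load into a DataFrame with a cluster/pair id column (`z_cluster`) and a 0/1 label column (`z_isMatch`) - check the actual layout in your model folder before relying on these names.

```python
import pandas as pd

def flip_labels(marked: pd.DataFrame, cluster_ids, new_label: int) -> pd.DataFrame:
    """Return a copy of the marked pairs with z_isMatch set to new_label
    for every row belonging to one of the given clusters."""
    fixed = marked.copy()
    fixed.loc[fixed["z_cluster"].isin(cluster_ids), "z_isMatch"] = new_label
    return fixed
```

The corrected DataFrame would then be written back to the marked-training-data location so the next train phase picks up the fixed labels.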

sonalgoyal commented 2 years ago


thanks @lsbilbro

lsbilbro commented 2 years ago

The match output didn't look too bad, actually... but it's hard to tell on the NC voter set since there are so few columns and too much ambiguity to begin with.

I did rerun again from the beginning and it only took about 250 labels... so not too bad.

I suppose we can close this question out ... unless you found anything interesting in the model.

sonalgoyal commented 2 years ago

Yeah, the models are generally resilient to a few wrong labels. I still need to check the model closely, so let's leave the question open for now.

sonalgoyal commented 2 years ago

hey @lsbilbro - wanted to confirm that the config.json for this model is the same as the example in the repo?

lsbilbro commented 2 years ago

Hey Sonal, it's similar but slightly different. I'll attach it here: NCVoter360_config.txt