zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Make Zingg More Usable - Part 1. Blocking #902

Closed sonalgoyal closed 1 week ago

sonalgoyal commented 1 month ago

Sometimes Zingg jobs are slow or fail because of a poorly learnt blocking tree. This can happen for a variety of reasons: for example, when a user adds significantly more trainingSamples than the labels learnt through Zingg, or when the data has a natural bias, with lots of nulls in the columns used for matching. Understanding how blocking is working may be a good step before deciding to run a match or link job.

Let us add a new phase, debugBlocking, which will block the incoming data and output the results.

We can save the results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples

The timestamp is the same for both paths.
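To make the proposed output concrete, here is a minimal sketch in plain Python (not Zingg's Java/Spark code). It assumes each record has already been assigned a block hash, counts the records per hash, keeps a few sample records per hash, and builds the two output paths from one shared timestamp. The function names, the `samples_per_block` parameter, the timestamp format and the sample values are all hypothetical.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def block_output_paths(zingg_dir, model_id, timestamp):
    """Build the counts and blockSamples paths under one shared timestamp."""
    base = PurePosixPath(zingg_dir) / str(model_id) / "blocks" / str(timestamp)
    return base / "counts", base / "blockSamples"


def summarize_blocks(hashed_records, samples_per_block=2):
    """Summarize blocks from an iterable of (block_hash, record) pairs.

    Returns (counts, samples): the number of records per hash and
    the first few records seen for each hash.
    """
    counts = Counter()
    samples = defaultdict(list)
    for block_hash, record in hashed_records:
        counts[block_hash] += 1
        if len(samples[block_hash]) < samples_per_block:
            samples[block_hash].append(record)
    return counts, dict(samples)


# Hypothetical input: three records fall into block 101, one into block 202.
hashed = [
    (101, {"name": "ann"}),
    (101, {"name": "anne"}),
    (101, {"name": "anna"}),
    (202, {"name": "bob"}),
]
counts, samples = summarize_blocks(hashed)
counts_path, samples_path = block_output_paths("/tmp/zingg", 100, "1700000000")
```

A very large count for a single hash is the kind of skew that would point to a poorly learnt blocking tree, and the samples show which records ended up together.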

sonalgoyal commented 1 month ago

This is a new phase. Define a new class, Blocker, which has the blocking logic copied from Matcher. It will take the blocking tree and return blocks. In Matcher.getBlocked, call new Blocker<S,D,R,C,T>.getBlocked(getBlockingTreeUtil).

In BlockingTreeDebugger, call the same.
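The shape of this refactor can be sketched as follows, in plain Python for illustration only. The real classes are Java generics in Zingg, and the method names and the modeling of the blocking tree as a simple function from record to hash are assumptions made for this sketch. The point is that both the matcher and the debugger delegate to one shared Blocker.

```python
class Blocker:
    """Hypothetical stand-in holding the blocking logic shared by both callers."""

    def get_blocked(self, records, blocking_tree):
        # The blocking tree is modeled as a function: record -> block hash.
        return [(blocking_tree(record), record) for record in records]


class Matcher:
    def __init__(self, blocking_tree):
        self.blocking_tree = blocking_tree

    def get_blocked(self, records):
        # Delegate instead of keeping a private copy of the logic.
        return Blocker().get_blocked(records, self.blocking_tree)


class BlockingTreeDebugger:
    def __init__(self, blocking_tree):
        self.blocking_tree = blocking_tree

    def execute(self, records):
        # Same call as Matcher, so the debug output reflects real blocking.
        return Blocker().get_blocked(records, self.blocking_tree)


# Hypothetical tree: block on the first letter of the name.
def first_letter_tree(record):
    return record["name"][:1]


records = [{"name": "ann"}, {"name": "anne"}, {"name": "bob"}]
matcher_blocks = Matcher(first_letter_tree).get_blocked(records)
debugger_blocks = BlockingTreeDebugger(first_letter_tree).execute(records)
```

Because both paths go through the same class, whatever debugBlocking reports should be the blocking that a match job would actually apply.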

sonalgoyal commented 1 month ago

If there is more than one source, we need to group the hashes by source.
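A minimal sketch of that per-source grouping, again in plain Python and not Zingg's actual implementation: it assumes each record is reduced to a (source, block hash) pair, and the source names and hash values are made up for illustration.

```python
from collections import Counter


def counts_per_source(source_hashes):
    """Count records per (source, block_hash) from an iterable of such pairs."""
    return Counter((source, block_hash) for source, block_hash in source_hashes)


# Hypothetical input from two sources sharing block hash 101.
pairs = [
    ("crm", 101),
    ("crm", 101),
    ("erp", 101),
    ("erp", 202),
]
per_source = counts_per_source(pairs)
```

Keeping the source in the grouping key makes it visible when one source dominates a block, which a single combined count would hide.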

sonalgoyal commented 1 month ago

See also https://github.com/zinggAI/zingg/issues/893

sania-16 commented 1 month ago

zingg.sh --phase debugBlocking --conf config.json --zinggDir /location

What will the run command look like?

sonalgoyal commented 1 month ago

--zinggDir is optional