Closed sonalgoyal closed 1 week ago
This is a new phase.
Define a new class, Blocker, which holds the blocking logic copied from Matcher. It takes a blocking tree and returns blocks.
In Matcher.getBlocked, call new Blocker<S,D,R,C,T>.getBlocked(getBlockingTreeutil).
In BlockingTreeDebugger, call the same.
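To illustrate the refactoring described above, here is a minimal sketch in plain Java. It is not Zingg's actual code: `BlockerSketch`, its single type parameter, and the idea of modelling the blocking tree as a `Function` from record to hash are all assumptions made for illustration. The point is only that once the blocking logic lives in one class, both the matcher and the debugger can call it the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical, simplified stand-in for the proposed Blocker class.
// R is the record type; the "blocking tree" is modelled as a function
// that maps a record to its block hash (an assumption for this sketch).
class BlockerSketch<R> {
    // Applies the blocking tree to every record and returns blocks keyed by hash.
    Map<Integer, List<R>> getBlocked(List<R> records, Function<R, Integer> blockingTree) {
        Map<Integer, List<R>> blocks = new TreeMap<>();
        for (R record : records) {
            blocks.computeIfAbsent(blockingTree.apply(record), k -> new ArrayList<>()).add(record);
        }
        return blocks;
    }
}

class BlockerSketchDemo {
    // Toy blocking function: hash on the lower-cased first letter.
    static int firstLetterHash(String name) {
        return Character.toLowerCase(name.charAt(0));
    }

    public static void main(String[] args) {
        List<String> records = List.of("anna", "Alan", "bob");
        // A matcher and a debugger would both make this same call.
        Map<Integer, List<String>> blocks =
            new BlockerSketch<String>().getBlocked(records, BlockerSketchDemo::firstLetterHash);
        blocks.forEach((hash, rows) -> System.out.println(hash + " -> " + rows));
    }
}
```

A debugging phase could then report block sizes from the returned map, while the matcher continues to pair records within each block.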
If there is more than one source, we need to group the hashes by source.
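The per-source grouping could look roughly like the sketch below. It uses plain Java collections rather than Zingg's dataset abstraction, and `HashedRow`, its fields, and `countBySourceAndHash` are hypothetical names introduced only for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class HashCountsPerSource {
    // Hypothetical row: the source it came from and the hash assigned by blocking.
    static final class HashedRow {
        final String source;
        final int hash;

        HashedRow(String source, int hash) {
            this.source = source;
            this.hash = hash;
        }
    }

    // Groups rows by source first, then counts how many rows fall into each hash.
    static Map<String, Map<Integer, Long>> countBySourceAndHash(List<HashedRow> rows) {
        return rows.stream().collect(
            Collectors.groupingBy(r -> r.source,
                Collectors.groupingBy(r -> r.hash, Collectors.counting())));
    }

    public static void main(String[] args) {
        List<HashedRow> rows = List.of(
            new HashedRow("customers", 11),
            new HashedRow("customers", 11),
            new HashedRow("customers", 42),
            new HashedRow("vendors", 11));
        System.out.println(countBySourceAndHash(rows));
    }
}
```

Keeping the counts separate per source makes it visible when one source dominates a block, which a single combined count would hide.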
zingg.sh --phase debugBlocking --conf config.json --zinggDir /location
What will the run command look like?
--zinggDir is optional.
Sometimes Zingg jobs are slow or fail because of a poorly learnt blocking tree. This can happen for a variety of reasons. For example, a user may add significantly more trainingSamples than the data labelled through Zingg. Or the data may have a natural bias, with lots of nulls in the columns used for matching. Understanding how blocking is working is a good step before deciding to run a match or link job.
Let us add a new phase, debugBlocking, which will block the incoming data and output the results.
We can save results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples.
The timestamp is the same for both.
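As a sketch of the output layout, both locations can be derived from one base path so that the timestamp is guaranteed to match. The class name `BlockOutputPaths` is hypothetical, and using epoch milliseconds for the timestamp is an assumption, since the format is not specified here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds both output locations from a single timestamp.
class BlockOutputPaths {
    final Path counts;
    final Path blockSamples;

    BlockOutputPaths(String zinggDir, String modelId, long timestamp) {
        // zinggDir/modelId/blocks/timestamp is shared by both outputs.
        Path base = Paths.get(zinggDir, modelId, "blocks", Long.toString(timestamp));
        this.counts = base.resolve("counts");
        this.blockSamples = base.resolve("blockSamples");
    }

    public static void main(String[] args) {
        // Assumption: epoch milliseconds taken once per run serve as the timestamp.
        long timestamp = System.currentTimeMillis();
        BlockOutputPaths paths = new BlockOutputPaths("/location", "100", timestamp);
        System.out.println(paths.counts);
        System.out.println(paths.blockSamples);
    }
}
```

Taking the timestamp once and passing it in, rather than reading the clock separately for each output, is what keeps the two directories under the same run folder.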