Closed. KilianLissSMRTR closed this issue 1 year ago.
Thanks for using Zingg and reporting this issue, @KilianLissSMRTR! It is likely that the blocking model is not optimal in this case. How much training data do you have, and how many of the pairs are matches?
Closed the issue.
Hi @sonalgoyal , thanks for responding.
On average, it looks like we have around 2-3 matches per customer in our data. Our total dataset consists of 46 million rows and should contain about 16 million customers according to our old process (which we are looking to replace).
For our training data, we have 396 matches, 2,411 no-matches, and 165 uncertain matches.
The full list of columns that we would like to use Zingg on are:
from zingg.client import FieldDefinition, MatchType

fieldDefs = [
FieldDefinition('SourceTableName', 'string', MatchType.DONT_USE)
, FieldDefinition('SourceTableID', 'integer', MatchType.DONT_USE)
, FieldDefinition('FirstName', 'string', MatchType.FUZZY)
, FieldDefinition('LastName', 'string', MatchType.EXACT)
, FieldDefinition('dateOfBirth', 'string', MatchType.FUZZY)
, FieldDefinition('Address', 'string', MatchType.FUZZY)
, FieldDefinition('Suburb', 'string', MatchType.FUZZY)
, FieldDefinition('State', 'string', MatchType.FUZZY)
, FieldDefinition('Postcode', 'string', MatchType.NUMERIC)
, FieldDefinition('Email', 'string', MatchType.EMAIL)
, FieldDefinition('Landline', 'string', MatchType.NUMERIC)
, FieldDefinition('Mobile', 'string', MatchType.NUMERIC)
]
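For reference, this field list gets wired into the rest of our Zingg setup roughly as follows. This is only a sketch based on Zingg's published Python examples (method names and signatures may differ by version); the model id, directories, sample size, and CSV pipes below are placeholders rather than our real configuration, which reads from internal tables:

from zingg.client import Arguments, ClientOptions, ZinggWithSpark
from zingg.pipes import CsvPipe

args = Arguments()
args.setFieldDefinition(fieldDefs)        # the field list above
args.setModelId("100")                    # placeholder model id
args.setZinggDir("/dbfs/zingg/models")    # placeholder model directory
args.setNumPartitions(32)
args.setLabelDataSampleSize(0.1)          # placeholder sample size

# Placeholder schema and pipes; our real input is not a CSV file.
schema = ("SourceTableName string, SourceTableID int, FirstName string, "
          "LastName string, dateOfBirth string, Address string, Suburb string, "
          "State string, Postcode string, Email string, Landline string, Mobile string")
args.setData(CsvPipe("customers", "/dbfs/zingg/input.csv", schema))
args.setOutput(CsvPipe("matched", "/dbfs/zingg/output"))

options = ClientOptions([ClientOptions.PHASE, "match"])
ZinggWithSpark(args, options).initAndExecute()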
The dataset and match numbers you reported should be doable with Zingg.
Are you seeing a similar issue with other data sizes and other models as well? Can you share a reproducible test case for us to debug this? I will need the config, data, and training data.
Since I am not able to share our data, I did try replicating the issue with the NCVoters example dataset, but the issue didn't appear there.
I did manage to partially solve it by using a single-node cluster and setting Zingg's "numPartitions" parameter to the number of cores my cluster is running (32 partitions in my case), instead of the 20-30 times the number of cores mentioned in the documentation.
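Concretely, the numPartitions change looks like the first line below (assuming the args object from the sketch above). The second line is a related Spark-side knob in the same spirit; treat it as an optional extra rather than part of the fix described here:

# Match Zingg's partition count to the cluster's core count (32 here)
# instead of the 20-30x multiple suggested in the docs.
args.setNumPartitions(32)

# Optionally keep Spark's shuffle partition count in the same range, which
# avoids scheduling a very large number of tiny shuffle tasks on a small cluster.
spark.conf.set("spark.sql.shuffle.partitions", "32")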
Occasionally I did get some memory spill, but I'm still investigating that. Restarting my cluster seemed to solve it, but I need to check that this consistently resolves the issue.
Thanks for sharing this information. Please keep us posted on how this goes and whether we can help in any way.
@KilianLissSMRTR - are you still seeing this?
@KilianLissSMRTR - please let us know if this is still an issue.
Hi @sonalgoyal, this is no longer an issue. I've found that tweaking the number of partitions and adjusting our clusters speeds things up.
Thanks @KilianLissSMRTR
Describe the bug
According to the hardware sizing recommendations in the documentation, my cluster should have more than enough resources for this workload.
On smaller datasets (e.g. 55,878 records), the match stage takes less than 5 minutes, as expected. Whenever I increase the number of records (e.g. 78,305), Zingg's match stage often gets stuck on various Spark shuffle write operations, causing it to take 1-2 hours to run, even though I have assigned far more hardware resources to the task than those quoted in the documentation.
Eventually, we would like to run Zingg on 40 million records, which would become prohibitively expensive unless this issue gets resolved.
To Reproduce
I am running Zingg in a Databricks environment with the following cluster configuration:
In general, I have tried switching between single-node and multi-node compute, using bigger compute, and changing Zingg's "numPartitions" parameter.
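For reference, the cluster-side numbers that numPartitions gets compared against can be read off the active Spark session; this is generic PySpark, nothing Zingg-specific:

# Total task slots (cores) the cluster exposes to Spark.
print(spark.sparkContext.defaultParallelism)

# Partition count Spark uses for shuffle stages (defaults to 200).
print(spark.conf.get("spark.sql.shuffle.partitions"))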
Having already created sufficient training data, my Python code looks as follows:
My resulting arguments look like:
As we can see from the Spark job run logs, just one of the multiple shuffle write stages took about 8 minutes (longer than the entire match stage should take):
Expected behavior
I don't think Zingg should be getting stuck on Spark shuffle write stages. Please let me know if you need more information, and thanks in advance for helping out!