zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Matching and non matching pairs count displayed as 0 #842

Open AppyKul opened 1 month ago

AppyKul commented 1 month ago

Hi,

After the interactive labeler phase, when the code below runs:

print(f'You have accumulated {n_pos} pairs labeled as positive matches.')
print(f'You have accumulated {n_neg} pairs labeled as not matches.')

the counts are displayed incorrectly (both show 0).

The issue is probably at the line start = pair.children[0].value.find('data-title="'), because the value of start is -1.

Can you please tell what exactly happens in the data-title section?

The value of pair.children[0] is :

HTML(value='z_cluster1719220687105:01719220687105:0custid1010410103fname joshua joshuslname george georgestNo77add1 jelbart street jelbart streetadd2 city hyam sbeach hyams beachareacode vic vicstate40324032dob1945022019450220ssn43322624332262')
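For context on the question above, Python's str.find returns -1 when the substring is absent, which is what a start value of -1 indicates. The following is a minimal sketch of that kind of extraction, not the notebook's actual code: the helper name extract_pair_id and both sample strings are hypothetical, and the sketch returns None when the marker is missing instead of slicing with invalid indices.

```python
from typing import Optional

MARKER = 'data-title="'


def extract_pair_id(html: str) -> Optional[str]:
    """Return the value of the first data-title attribute, or None if absent."""
    start = html.find(MARKER)  # -1 when the marker is not in the string
    if start == -1:
        return None
    start += len(MARKER)  # move past the marker to the attribute value
    end = html.find('"', start)  # closing quote of the attribute value
    if end == -1:
        return None
    return html[start:end]


# Hypothetical widget contents: one with the attribute, one without it.
with_marker = '<table data-title="1719220687105:0"><tr><td>joshua</td></tr></table>'
without_marker = 'z_cluster1719220687105:0custid10104fname joshua'

print(extract_pair_id(with_marker))
print(extract_pair_id(without_marker))
```

If the widget value contains no data-title attribute at all, as the printed value above suggests, no pair id can be recovered from it, and any count that depends on that id would plausibly stay at 0.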

sonalgoyal commented 1 month ago

Can you please check whether you are saving the labels correctly? Please check this video in case it helps.

AppyKul commented 1 month ago

Hi Sonal, yes I did. Please find the screenshots attached. There is an issue at these lines:

if start > 0:
    start += len('data-title="')
    end = pair.children[0].value.find('"', start+2)
pair_id = pair.children[0].value[start:end]   ---> issue at this line

The variable end is not recognized on the last line, because the code never enters the if clause.

The call start = pair.children[0].value.find('data-title="') returns -1. So it is clearly not finding the data-title text?

I also tried printing the contents of pair.children[0]. It is as below:

HTML(value='z_cluster1719220687105:01719220687105:0custid1010410103fname joshua joshuslname george georgestNo77add1 jelbart street jelbart streetadd2 city hyam sbeach hyams beachareacode vic vicstate40324032dob1945022019450220ssn43322624332262')

Please help.

sonalgoyal commented 1 month ago

Do you see the widget properly? Is the record getting parsed, and are the columns displayed properly?

AppyKul commented 1 month ago

Apologies for not attaching the screenshots earlier. Please find them attached. The widgets do not display the records the way they are displayed in the reference video. [Screenshots: Cpt1, cpt2, cpt3]

sonalgoyal commented 1 month ago

OK, so it seems that the data itself is not getting parsed or viewed correctly. Can you please check the settings for your input pipes?

AppyKul commented 1 month ago

It's the same as in the example code:

schema = "id string, fname string, lname string, stNo string, add1 string, add2 string, city string, state string, areacode string, dob string, ssn string"
inputPipe = CsvPipe("testFebrl", "/FileStore/tables/test_1.csv", schema)
args.setData(inputPipe)

test_1.csv is a subset of the test.csv in example with just 100 records
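One way to sanity-check an input file against a schema like this, independent of Zingg and Spark, is to confirm that every row has as many fields as the schema declares. The sketch below uses only the Python standard library. The helper names and the two sample rows are hypothetical, and an in-memory string stands in for the real test_1.csv.

```python
import csv
import io

# Schema string as quoted in this comment.
SCHEMA = ("id string, fname string, lname string, stNo string, add1 string, "
          "add2 string, city string, state string, areacode string, "
          "dob string, ssn string")


def schema_columns(schema):
    """Extract the column names from a 'name type, name type' schema string."""
    return [field.split()[0] for field in schema.split(",")]


def rows_with_wrong_width(lines, expected, delimiter=","):
    """Return (row_number, field_count) for rows whose width differs from expected."""
    reader = csv.reader(lines, delimiter=delimiter)
    return [(n, len(row)) for n, row in enumerate(reader, start=1)
            if len(row) != expected]


# Hypothetical sample: the first row has 11 fields, the second is truncated.
sample = io.StringIO(
    "10104,joshua,george,77,jelbart street,,hyam sbeach,vic,4032,19450220,4332262\n"
    "10103,joshu,george,77,jelbart street\n"
)

columns = schema_columns(SCHEMA)
bad = rows_with_wrong_width(sample, len(columns))
print(len(columns), bad)
```

To run this against a real file, an open file handle (for example open(path, newline="")) can be passed in place of the StringIO object. An empty result only shows that the column counts agree; it does not rule out a problem in how the labeler widget renders the records.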

sonalgoyal commented 1 month ago

Can you do a normal PySpark read on the data to see whether you can read it properly?

AppyKul commented 1 month ago

Yes, please find the screenshot of a simple dataframe read from the CSV. The data appears in tabular format. [Screenshot: cpt4]

AppyKul commented 1 month ago

Hi @sonalgoyal , my free AWS trial ends in 9 days. Can you please help on this?

sonalgoyal commented 1 month ago

Can you try passing the above df as an InMemoryPipe and see if that displays correctly? I am afraid we may be hitting a bug if that doesn't work. If it is not resolved, please add the browser version and DBR information to the issue.

Also can you please try with the exact test file we supply?