zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

XML support #794

Closed by philipus 6 months ago

philipus commented 7 months ago


Can the Zingg framework handle XML as a source format in the link phase, for matching/linking two XML sources/files with different schemas?

Thank you in advance

sonalgoyal commented 7 months ago

If you use the Zingg Pipe configuration and set format and props as described in the first answer, you should be able to pass the xml format to Zingg. We haven't tried reading in different schemas for matching and linking, so give it a try. If it doesn't work, you can always read the XML data yourself, shape the dataframes to the same schema, and then use an InMemoryPipe to send them to Zingg.
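A minimal sketch of the "shape both sources to the same schema" step, written in plain Python rather than Spark so that it runs anywhere. The column names and mappings here are hypothetical; in practice the same renaming would be applied to the Spark dataframes before they are handed to Zingg.

```python
# Hypothetical mappings: source column name -> common column name.
SOURCE_A_MAP = {"firstName": "fname", "surname": "lname", "town": "city"}
SOURCE_B_MAP = {"given": "fname", "family": "lname", "locality": "city"}

# The single flat schema both sources are shaped to.
COMMON_COLUMNS = ["fname", "lname", "city"]


def to_common(record, mapping):
    """Rename a record's keys to the common schema, filling gaps with ''."""
    renamed = {mapping[key]: value for key, value in record.items() if key in mapping}
    return {column: renamed.get(column, "") for column in COMMON_COLUMNS}


# Two made-up records with different shapes end up with identical columns.
row_a = to_common({"firstName": "Ana", "surname": "Silva", "town": "Porto"}, SOURCE_A_MAP)
row_b = to_common({"given": "Ana", "family": "Silva"}, SOURCE_B_MAP)
```

Once both sources share the same columns, they can be matched field by field regardless of how the original files were laid out.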

philipus commented 7 months ago

I will give it a try! Does Zingg expect a flat schema, as in a CSV file, or can it handle complex XML and JSON schemas with deep hierarchies?

sonalgoyal commented 7 months ago

Zingg uses the underlying compute engine (Spark) as its read/write layer. So if Spark can read it, Zingg will be able to read it as well.

philipus commented 7 months ago

extracted_schema.json

@sonalgoyal, do you think a schema like this can be handled by Zingg, and would I need to put this schema into config.json?

sonalgoyal commented 7 months ago

Can you try loading it into PySpark and see what happens? Here is a link to get you started: https://stackoverflow.com/questions/50429315/read-xml-in-spark

philipus commented 7 months ago

I could not load a complex JSON file (see the error below) using the JSON schema above, which I first read into a string and then passed in as follows:

(1)

schema: json_schema

file: /tmp/sdn-2024-2-28.json

inputPipe = Pipe("senzingofac", "json")
inputPipe.addProperty(FilePipe.LOCATION, "/tmp/sdn-2024-2-28.json")
inputPipe.setSchema(json_schema)
args.setData(inputPipe)


(2)

fname = FieldDefinition("NAME_FIRST", "string", MatchType.FUZZY)
lname = FieldDefinition("NAME_LAST", "string", MatchType.FUZZY)
add1 = FieldDefinition("ADDR_LINE1", "string", MatchType.FUZZY)
add2 = FieldDefinition("ADDR_LINE2", "string", MatchType.FUZZY)
add3 = FieldDefinition("ADDR_LINE3", "string", MatchType.FUZZY)
city = FieldDefinition("ADDR_CITY", "string", MatchType.FUZZY)
postal = FieldDefinition("ADDR_POSTAL_CODE", "string", MatchType.FUZZY)
state = FieldDefinition("ADDR_STATE", "string", MatchType.FUZZY)
dob = FieldDefinition("DATE_OF_BIRTH", "string", MatchType.FUZZY)
ssn = FieldDefinition("SSN_NUMBER", "string", MatchType.FUZZY)

fieldDefs = [fname, lname, add1, add2, add3, city, postal, state, dob, ssn]
args.setFieldDefinition(fieldDefs)


(3)

options = ClientOptions([ClientOptions.PHASE, "findTrainingData"])

# Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()


Calling zingg.initAndExecute() raises the following error:


329 else:
330     raise Py4JError(
331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o554.execute.
: zingg.common.client.ZinggClientException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601 (line 1, pos 0)

== SQL == { ^^^ "$schema": "http://json-schema.org/schema#", "type": "object", "properties": { "DATA_SOURCE": { "type": "string" }, "RECORD_TYPE": { "type": "string" }, "RECORD_ID": { "type": "string" }, "OFAC_ID": { "type": "string" }, "PUBLISH_DATE": { "type": "string" }, "SDN_PROGRAM": { "type": "string" }, "NAME_LIST": { "type": "array", "items": { "type": "object", "properties": { "NAME_TYPE": { "type": "string" }, "NAME_ORG": { "type": "string" }, "NAME_LAST": { "type": "string" }, "NAME_FIRST": { "type": "string" } }, "required": [ "NAME_TYPE" ] } }, "ADDR_LIST": { "type": "array", "items": { "type": "object", "properties": { "ADDR_CITY": { "type": "string" }, "ADDR_COUNTRY": { "type": "string" }, "ADDR_LINE1": { "type": "string" }, "ADDR_POSTAL_CODE": { "type": "string" }, "ADDR_LINE2": { "type": "string" }, "ADDR_STATE": { "type": "string" }, "ADDR_LINE3": { "type": "string" } } } }, "SWIFT/BIC": { "type": "string" }, "ID_LIST": { "type": "array", "items": { "type": "object", "properties": { "NATIONAL_ID_NUMBER": { "type": "string" }, "NATIONAL_ID_TYPE": { "type": "string" }, "NATIONAL_ID_COUNTRY": { "type": "string" }, "WEBSITE_ADDRESS": { "type": "string" }, "PASSPORT_NUMBER": { "type": "string" }, "PASSPORT_COUNTRY": { "type": "string" }, "GENDER": { "type": "string" }, "SSN_NUMBER": { "type": "string" }, "SSN_COUNTRY": { "type": "string" }, "TAX_ID_NUMBER": { "type": "string" }, "TAX_ID_COUNTRY": { "type": "string" }, "OTHER_ID_NUMBER": { "type": "string" }, "OTHER_ID_TYPE": { "type": "string" }, "OTHER_ID_COUNTRY": { "type": "string" }, "DRIVERS_LICENSE_NUMBER": { "type": "string" }, "DRIVERS_LICENSE_COUNTRY": { "type": "string" }, "EMAIL_ADDRESS": { "type": "string" }, "PHONE_NUMBER": { "type": "string" }, "LEI_NUMBER": { "type": "string" }, "LEI_COUNTRY": { "type": "string" }, "IMO_NUMBER": { "type": "string" }, "DUNS_NUMBER": { "type": "string" }, "MSB_LICENSE_NUMBER": { "type": "string" }, "MSB_LICENSE_COUNTRY": { "type": "string" } } } }, 
"SDN_TITLE": { "type": "string" }, "ATTR_LIST": { "type": "array", "items": { "type": "object", "properties": { "DATE_OF_BIRTH": { "type": "string" }, "PLACE_OF_BIRTH": { "type": "string" }, "NATIONALITY": { "type": "string" }, "CITIZENSHIP": { "type": "string" } } } }, "Additional Sanctions Information -": { "type": "string" }, "SDN_REMARKS": { "type": "string" }, "Secondary sanctions risk:": { "type": "string" }, "Serial No.": { "type": "string" }, "Transactions Prohibited For Persons Owned or Controlled By U.S. Financial Institutions:": { "type": "string" }, "IFCA Determination -": { "type": "string" }, "Target Type": { "type": "string" }, "Organization Established Date": { "type": "string" }, "CNP (Personal Numerical Code)": { "type": "string" }, "Chinese Commercial Code": { "type": "string" }, "Executive Order 13662 Directive Determination -": { "type": "string" }, "Executive Order 14024 Directive Information -": { "type": "string" }, "Listing Date (EO 14024 Directive 2):": { "type": "string" }, "Listing Date (EO 14024 Directive 3):": { "type": "string" }, "Effective Date (EO 14024 Directive 2):": { "type": "string" }, "Effective Date (EO 14024 Directive 3):": { "type": "string" }, "Executive Order 14024 Directive Information": { "type": "string" }, "UN/LOCODE": { "type": "string" }, "Organization Type:": { "type": "string" }, "MICEX Code": { "type": "string" }, "Nationality of Registration": { "type": "string" }, "Birth Certificate Number": { "type": "string" }, "CAATSA Section 235 Information:": { "type": "string" }, "Executive Order 13846 information:": { "type": "string" }, "Digital Currency Address - XBT": { "type": "string" }, "Digital Currency Address - LTC": { "type": "string" }, "Digital Currency Address - ETH": { "type": "string" }, "Digital Currency Address - XMR": { "type": "string" }, "Digital Currency Address - ETC": { "type": "string" }, "Digital Currency Address - ZEC": { "type": "string" }, "Digital Currency Address - DASH": { "type": "string" 
}, "Digital Currency Address - BTG": { "type": "string" }, "Digital Currency Address - BSV": { "type": "string" }, "Digital Currency Address - BCH": { "type": "string" }, "Digital Currency Address - XVG": { "type": "string" }, "Digital Currency Address - USDT": { "type": "string" }, "Equity Ticker": { "type": "string" }, "ISIN": { "type": "string" }, "Digital Currency Address - XRP": { "type": "string" }, "Registration Country": { "type": "string" }, "Organization Code": { "type": "string" }, "Digital Currency Address - ARB": { "type": "string" }, "Digital Currency Address - BSC": { "type": "string" }, "Digital Currency Address - USDC": { "type": "string" }, "Digital Currency Address - TRX": { "type": "string" }, "Military Registration Number": { "type": "string" } }, "required": [ "DATA_SOURCE", "NAME_LIST", "OFAC_ID", "PUBLISH_DATE", "RECORD_ID", "RECORD_TYPE", "SDN_PROGRAM" ] }

sonalgoyal commented 6 months ago

The recommended way to resolve this issue is to use an InMemoryPipe.
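The trace above shows the schema string being handed to a SQL parser ("== SQL =="), which would explain why a JSON Schema document fails at the first '{'. A rough sketch of the flattening that would come before an InMemoryPipe, in plain Python: it collapses one nested record into the flat string columns named in the FieldDefinition list. The "first non-empty value wins" rule and the sample record are assumptions made for illustration, not something taken from Zingg or from the real data.

```python
# Target columns, taken from the FieldDefinition list earlier in the thread.
FLAT_FIELDS = [
    "NAME_FIRST", "NAME_LAST",
    "ADDR_LINE1", "ADDR_LINE2", "ADDR_LINE3",
    "ADDR_CITY", "ADDR_POSTAL_CODE", "ADDR_STATE",
    "DATE_OF_BIRTH", "SSN_NUMBER",
]

# Nested list fields named in the JSON schema from the error output.
LIST_FIELDS = ["NAME_LIST", "ADDR_LIST", "ATTR_LIST", "ID_LIST"]


def flatten_record(record):
    """Collapse one nested record into a flat dict of strings.

    Assumption: for each target column, the first non-empty value found
    in any nested list entry wins. Real data may need a different rule,
    for example one output row per name/address combination.
    """
    flat = {"RECORD_ID": str(record.get("RECORD_ID", ""))}
    flat.update({field: "" for field in FLAT_FIELDS})
    for list_name in LIST_FIELDS:
        for entry in record.get(list_name) or []:
            for key, value in entry.items():
                if key in FLAT_FIELDS and not flat[key] and value:
                    flat[key] = str(value)
    return flat


# Made-up sample record in the shape described by the schema.
sample = {
    "RECORD_ID": "1",
    "NAME_LIST": [{"NAME_TYPE": "PRIMARY", "NAME_LAST": "DOE", "NAME_FIRST": "JANE"}],
    "ADDR_LIST": [{"ADDR_CITY": "Havana", "ADDR_COUNTRY": "Cuba"}],
}
row = flatten_record(sample)
```

Rows flattened this way can be turned into a Spark dataframe (for example with spark.createDataFrame) and passed to Zingg through an InMemoryPipe, so that no JSON schema string needs to be set on the pipe at all.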