Closed philipus closed 6 months ago
If you use the Zingg Pipe configuration and set the format and props as described in the first answer, you should be able to pass XML format to Zingg. We haven't tried reading in different schemas for matching and linking, so give it a try. If it doesn't work, you can always read the XML data, shape the dataframes to the same schema, and then use an InMemoryPipe to send them to Zingg.
I will give it a try! Is Zingg expecting a flat schema, as in a CSV file, or can it handle complex XML and JSON schemas with deep hierarchies?
Zingg uses the underlying compute engine (Spark) as its read/write layer. So if it works in Spark, Zingg will be able to read it as well.
extracted_schema.json @sonalgoyal do you think this kind of schema can be handled with Zingg, and would I need to place this schema into the config.json?
Can you try loading it into pyspark and see what happens? Here is a link to get you started: https://stackoverflow.com/questions/50429315/read-xml-in-spark
I could not load a complex JSON file (see the error below) using the JSON schema above, which I first read in as a string and handed over via
(1)
inputPipe = Pipe("senzingofac", "json")
inputPipe.addProperty(FilePipe.LOCATION, "/tmp/sdn-2024-2-28.json")
inputPipe.setSchema(json_schema)
args.setData(inputPipe)
(2)
fname = FieldDefinition("NAME_FIRST", "string", MatchType.FUZZY)
lname = FieldDefinition("NAME_LAST", "string", MatchType.FUZZY)
add1 = FieldDefinition("ADDR_LINE1", "string", MatchType.FUZZY)
add2 = FieldDefinition("ADDR_LINE2", "string", MatchType.FUZZY)
add3 = FieldDefinition("ADDR_LINE3", "string", MatchType.FUZZY)
city = FieldDefinition("ADDR_CITY", "string", MatchType.FUZZY)
postal = FieldDefinition("ADDR_POSTAL_CODE", "string", MatchType.FUZZY)
state = FieldDefinition("ADDR_STATE", "string", MatchType.FUZZY)
dob = FieldDefinition("DATE_OF_BIRTH", "string", MatchType.FUZZY)
ssn = FieldDefinition("SSN_NUMBER", "string", MatchType.FUZZY)
fieldDefs = [fname, lname, add1, add2, add3, city, postal, state, dob, ssn]
args.setFieldDefinition(fieldDefs)
(3)
options = ClientOptions([ClientOptions.PHASE,"findTrainingData"])
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()
zingg.initAndExecute() raises the following error:
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling o554.execute. : zingg.common.client.ZinggClientException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601 (line 1, pos 0)
== SQL == { ^^^ "$schema": "http://json-schema.org/schema#", "type": "object", "properties": { "DATA_SOURCE": { "type": "string" }, "RECORD_TYPE": { "type": "string" }, "RECORD_ID": { "type": "string" }, "OFAC_ID": { "type": "string" }, "PUBLISH_DATE": { "type": "string" }, "SDN_PROGRAM": { "type": "string" }, "NAME_LIST": { "type": "array", "items": { "type": "object", "properties": { "NAME_TYPE": { "type": "string" }, "NAME_ORG": { "type": "string" }, "NAME_LAST": { "type": "string" }, "NAME_FIRST": { "type": "string" } }, "required": [ "NAME_TYPE" ] } }, "ADDR_LIST": { "type": "array", "items": { "type": "object", "properties": { "ADDR_CITY": { "type": "string" }, "ADDR_COUNTRY": { "type": "string" }, "ADDR_LINE1": { "type": "string" }, "ADDR_POSTAL_CODE": { "type": "string" }, "ADDR_LINE2": { "type": "string" }, "ADDR_STATE": { "type": "string" }, "ADDR_LINE3": { "type": "string" } } } }, "SWIFT/BIC": { "type": "string" }, "ID_LIST": { "type": "array", "items": { "type": "object", "properties": { "NATIONAL_ID_NUMBER": { "type": "string" }, "NATIONAL_ID_TYPE": { "type": "string" }, "NATIONAL_ID_COUNTRY": { "type": "string" }, "WEBSITE_ADDRESS": { "type": "string" }, "PASSPORT_NUMBER": { "type": "string" }, "PASSPORT_COUNTRY": { "type": "string" }, "GENDER": { "type": "string" }, "SSN_NUMBER": { "type": "string" }, "SSN_COUNTRY": { "type": "string" }, "TAX_ID_NUMBER": { "type": "string" }, "TAX_ID_COUNTRY": { "type": "string" }, "OTHER_ID_NUMBER": { "type": "string" }, "OTHER_ID_TYPE": { "type": "string" }, "OTHER_ID_COUNTRY": { "type": "string" }, "DRIVERS_LICENSE_NUMBER": { "type": "string" }, "DRIVERS_LICENSE_COUNTRY": { "type": "string" }, "EMAIL_ADDRESS": { "type": "string" }, "PHONE_NUMBER": { "type": "string" }, "LEI_NUMBER": { "type": "string" }, "LEI_COUNTRY": { "type": "string" }, "IMO_NUMBER": { "type": "string" }, "DUNS_NUMBER": { "type": "string" }, "MSB_LICENSE_NUMBER": { "type": "string" }, "MSB_LICENSE_COUNTRY": { "type": "string" } } } }, 
"SDN_TITLE": { "type": "string" }, "ATTR_LIST": { "type": "array", "items": { "type": "object", "properties": { "DATE_OF_BIRTH": { "type": "string" }, "PLACE_OF_BIRTH": { "type": "string" }, "NATIONALITY": { "type": "string" }, "CITIZENSHIP": { "type": "string" } } } }, "Additional Sanctions Information -": { "type": "string" }, "SDN_REMARKS": { "type": "string" }, "Secondary sanctions risk:": { "type": "string" }, "Serial No.": { "type": "string" }, "Transactions Prohibited For Persons Owned or Controlled By U.S. Financial Institutions:": { "type": "string" }, "IFCA Determination -": { "type": "string" }, "Target Type": { "type": "string" }, "Organization Established Date": { "type": "string" }, "CNP (Personal Numerical Code)": { "type": "string" }, "Chinese Commercial Code": { "type": "string" }, "Executive Order 13662 Directive Determination -": { "type": "string" }, "Executive Order 14024 Directive Information -": { "type": "string" }, "Listing Date (EO 14024 Directive 2):": { "type": "string" }, "Listing Date (EO 14024 Directive 3):": { "type": "string" }, "Effective Date (EO 14024 Directive 2):": { "type": "string" }, "Effective Date (EO 14024 Directive 3):": { "type": "string" }, "Executive Order 14024 Directive Information": { "type": "string" }, "UN/LOCODE": { "type": "string" }, "Organization Type:": { "type": "string" }, "MICEX Code": { "type": "string" }, "Nationality of Registration": { "type": "string" }, "Birth Certificate Number": { "type": "string" }, "CAATSA Section 235 Information:": { "type": "string" }, "Executive Order 13846 information:": { "type": "string" }, "Digital Currency Address - XBT": { "type": "string" }, "Digital Currency Address - LTC": { "type": "string" }, "Digital Currency Address - ETH": { "type": "string" }, "Digital Currency Address - XMR": { "type": "string" }, "Digital Currency Address - ETC": { "type": "string" }, "Digital Currency Address - ZEC": { "type": "string" }, "Digital Currency Address - DASH": { "type": "string" 
}, "Digital Currency Address - BTG": { "type": "string" }, "Digital Currency Address - BSV": { "type": "string" }, "Digital Currency Address - BCH": { "type": "string" }, "Digital Currency Address - XVG": { "type": "string" }, "Digital Currency Address - USDT": { "type": "string" }, "Equity Ticker": { "type": "string" }, "ISIN": { "type": "string" }, "Digital Currency Address - XRP": { "type": "string" }, "Registration Country": { "type": "string" }, "Organization Code": { "type": "string" }, "Digital Currency Address - ARB": { "type": "string" }, "Digital Currency Address - BSC": { "type": "string" }, "Digital Currency Address - USDC": { "type": "string" }, "Digital Currency Address - TRX": { "type": "string" }, "Military Registration Number": { "type": "string" } }, "required": [ "DATA_SOURCE", "NAME_LIST", "OFAC_ID", "PUBLISH_DATE", "RECORD_ID", "RECORD_TYPE", "SDN_PROGRAM" ] }
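A note on the error above: the `== SQL ==` section suggests the schema string was handed to Spark's SQL/DDL parser, which fails on the leading `{` of a JSON Schema document. Assuming the pipe expects a DDL-style schema string (as in `"id string, fname string"`; this is my reading of the error, not something confirmed in this thread), a rough sketch of converting a JSON Schema like the one above into such a string could look like this. The helper names `ddl_type` and `json_schema_to_ddl` and the `sample` schema are made up for illustration.

```python
def quote(name):
    """Backtick-quote a column name so spaces, dots and colons survive DDL parsing."""
    return "`" + name.replace("`", "``") + "`"


def ddl_type(node):
    """Map one JSON Schema node to a Spark DDL type string."""
    kind = node.get("type", "string")
    if kind == "object":
        fields = ", ".join(
            f"{quote(name)}: {ddl_type(child)}"
            for name, child in node.get("properties", {}).items()
        )
        return f"struct<{fields}>"
    if kind == "array":
        return f"array<{ddl_type(node.get('items', {}))}>"
    # Assumption: unknown or missing types fall back to string.
    simple = {"string": "string", "integer": "bigint",
              "number": "double", "boolean": "boolean"}
    return simple.get(kind, "string")


def json_schema_to_ddl(schema):
    """Render the top-level properties as comma-separated `name type` columns."""
    return ", ".join(
        f"{quote(name)} {ddl_type(child)}"
        for name, child in schema["properties"].items()
    )


# A cut-down, illustrative version of the schema from the error message.
sample = {
    "type": "object",
    "properties": {
        "RECORD_ID": {"type": "string"},
        "NAME_LIST": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "NAME_FIRST": {"type": "string"},
                    "NAME_LAST": {"type": "string"},
                },
            },
        },
        "Serial No.": {"type": "string"},
    },
}

print(json_schema_to_ddl(sample))
```

Even with a schema string that parses, fields such as NAME_FIRST would still sit inside NAME_LIST as an array of structs, so the field definitions in (2) would not find them as top-level columns without flattening first.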
It was recommended to use an InMemoryPipe to resolve this issue.
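For anyone landing here later: before the data goes into an InMemoryPipe, the nested records need to be shaped into flat columns that match the field definitions in (2). Below is a minimal pure-Python sketch of that shaping step, assuming each record is a dict parsed from one JSON line. The helpers `first_of` and `flatten`, the `FLAT_FIELDS` mapping and the sample record are all made up for illustration; none of this is a Zingg API.

```python
def first_of(record, list_key, field):
    """Return `field` from the first entry of record[list_key] that has it, else None."""
    for entry in record.get(list_key) or []:
        if entry.get(field) is not None:
            return entry[field]
    return None


# (nested list, field) pairs, following the schema and the field definitions in (2).
FLAT_FIELDS = [
    ("NAME_LIST", "NAME_FIRST"),
    ("NAME_LIST", "NAME_LAST"),
    ("ADDR_LIST", "ADDR_LINE1"),
    ("ADDR_LIST", "ADDR_LINE2"),
    ("ADDR_LIST", "ADDR_LINE3"),
    ("ADDR_LIST", "ADDR_CITY"),
    ("ADDR_LIST", "ADDR_POSTAL_CODE"),
    ("ADDR_LIST", "ADDR_STATE"),
    ("ATTR_LIST", "DATE_OF_BIRTH"),
    ("ID_LIST", "SSN_NUMBER"),
]


def flatten(record):
    """Turn one nested record into a flat dict with one key per match field."""
    flat = {"RECORD_ID": record.get("RECORD_ID")}
    for list_key, field in FLAT_FIELDS:
        flat[field] = first_of(record, list_key, field)
    return flat


# Entirely fictional record, shaped like the schema in the error message.
sample_record = {
    "RECORD_ID": "1",
    "NAME_LIST": [
        {"NAME_TYPE": "PRIMARY", "NAME_FIRST": "JANE", "NAME_LAST": "DOE"},
        {"NAME_TYPE": "ALIAS", "NAME_FIRST": "J", "NAME_LAST": "DOE"},
    ],
    "ADDR_LIST": [{"ADDR_CITY": "SPRINGFIELD"}],
    "ATTR_LIST": [{"DATE_OF_BIRTH": "1970-01-01"}],
}

print(flatten(sample_record))
```

Taking only the first value per field drops aliases and additional addresses. If those matter for matching, one row per name or address entry may be the better shape. The resulting list of flat dicts can then be turned into a Spark dataframe (for example with `spark.createDataFrame`) and passed on through the InMemoryPipe.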
Describe the question
Can the Zingg framework handle XML as a source format in the link phase, for matching/linking two XML sources/files with different schemas?
Thank you in advance