zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0
952 stars 120 forks

`exportModel` encounters `NullPointerException` #817

Open knguyen1 opened 6 months ago

knguyen1 commented 6 months ago

Describe the bug
Cannot generate a CSV of the model because of a NullPointerException. The generateDocs phase works just fine. From the documentation: https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/exportlabeleddata

To Reproduce
Steps to reproduce the behavior, run:

(.venv) spark@496208741a60:/workspaces/foo-zingg-entity-resolution $ ~/zingg-0.4.0/scripts/zingg.sh --phase exportModel --conf /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json --location tmp --properties-file /workspaces/foo-zingg-entity-resolution/zingg.conf

Expected behavior
The phase should export a CSV of the model.

Screenshots

24/04/12 15:35:03 INFO ClientOptions: --phase
24/04/12 15:35:03 INFO ClientOptions: exportModel
24/04/12 15:35:03 INFO ClientOptions: --conf
24/04/12 15:35:03 INFO ClientOptions: /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json
24/04/12 15:35:03 INFO ClientOptions: --location
24/04/12 15:35:03 INFO ClientOptions: tmp
24/04/12 15:35:03 INFO ClientOptions: --email
24/04/12 15:35:03 INFO ClientOptions: zingg@zingg.ai
24/04/12 15:35:03 INFO ClientOptions: --license
24/04/12 15:35:03 INFO ClientOptions: zinggLicense.txt
24/04/12 15:35:03 WARN ArgumentsUtil: Config Argument is /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json
24/04/12 15:35:03 WARN ArgumentsUtil: phase is exportModel
24/04/12 15:35:03 INFO Client: 
24/04/12 15:35:03 INFO Client: **************************************************************************
24/04/12 15:35:03 INFO Client: *            ** Note about analytics collection by Zingg AI **           *
24/04/12 15:35:03 INFO Client: *                                                                        *
24/04/12 15:35:03 INFO Client: *  Please note that Zingg captures a few metrics about application's     *
24/04/12 15:35:03 INFO Client: *  runtime parameters. However, no user's personal data or application   *
24/04/12 15:35:03 INFO Client: *  data is captured. If you want to switch off this feature, please      *
24/04/12 15:35:03 INFO Client: *  set the flag collectMetrics to false in config. For details, please   *
24/04/12 15:35:03 INFO Client: *  refer to the Zingg docs (https://docs.zingg.ai/docs/security.html)    *
24/04/12 15:35:03 INFO Client: **************************************************************************
24/04/12 15:35:03 INFO Client: 
java.lang.NullPointerException
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Unknown Source)
        at zingg.spark.core.executor.SparkZFactory.get(SparkZFactory.java:40)
        at zingg.common.client.Client.setZingg(Client.java:68)
        at zingg.common.client.Client.<init>(Client.java:46)
        at zingg.spark.client.SparkClient.<init>(SparkClient.java:29)
        at zingg.spark.client.SparkClient.getClient(SparkClient.java:68)
        at zingg.common.client.Client.mainMethod(Client.java:185)
        at zingg.spark.client.SparkClient.main(SparkClient.java:76)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Additional context

{
    "fieldDefinition": [
        {
            "fieldName": "data_source",
            "fields": "data_source",
            "dataType": "string",
            "matchType": "DONT_USE"
        },
        // other fields...
    ],
    "output": [
        {
            "name": "output",
            "format": "csv",
            "props": {
                "location": "/tmp/zinggOutput",
                "delimiter": ",",
                "header": true
            }
        }
    ],
    "data": [{
        "name": "salesforce",
        "format": "jdbc",
        "props": {
            "url": "jdbc:redshift://my-redshift-server:5439/my-redshift-db",
            "dbtable": "my_schema.my_table",
            "driver": "com.amazon.redshift.jdbc42.Driver",
            "user": "test",
            "password": "password123"
        }
    }],
    "labelDataSampleSize" : 0.15,
    "numPartitions": 50,
    "modelId": 101,
    "zinggDir": "/workspaces/foo-zingg-entity-resolution/models"
}
sonalgoyal commented 6 months ago

Thanks for reporting this. If you are stuck, you can try reading the model folder at zinggDir/modelId/trainingData/marked using pyspark. This location will have your labeled data in parquet format.

vikasgupta78 commented 5 months ago

Will be handled alongside the SparkConnect change; putting on hold for now.

havardox commented 2 months ago

For anyone who just wants to get their training data:

from pathlib import Path

from pyspark.sql import SparkSession

MODEL_PATH: str = "{your model folder}/{your model ID}"
OUTPUT_PATH: str = "output.csv"

spark: SparkSession = SparkSession.builder.getOrCreate()

# Zingg writes the labeled pairs as parquet under trainingData/marked
df = spark.read.parquet(str((Path(MODEL_PATH) / "trainingData/marked").absolute()))
print(df.toPandas())

# Save to CSV
df.toPandas().to_csv(Path(OUTPUT_PATH), header=True, index=False)
iqoOopi commented 1 month ago

same null pointer error on zingg:0.4.0 from docker img

iqoOopi commented 1 month ago


Thanks havardox.

I'm running Zingg from Docker and I'm new to Spark. How can I export the model from Docker?

sonalgoyal commented 1 month ago

Can you try running pyspark inside the Docker container and then the commands shared above by @havardox?

Nitish1814 commented 1 week ago

Fixed: https://github.com/zinggAI/zingg/pull/860/files