Open timtimich35 opened 1 month ago
The error I get says: "Py4JJavaError: An error occurred while calling o100353.save. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: milvus." However, the jar file is right where I specified it. The SparkSession raises no error for the ClickHouse driver I specify the same way, and it uses that driver just fine.
@timtimich35 Obviously the spark-milvus jar is not being loaded correctly. I'm not very familiar with pyspark. Have you tried

```
.config("spark.driver.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar') \
.config("spark.executor.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar')
```

or

```
pyspark --jars /data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar,/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar
```
@wayblink Nope, haven't tried this approach yet. Will do and get back to you.
How can I get this part (marked yellow) done if I'm on Windows?
Dependencies:
- python 3.8.12
- pyspark 3.5.0
- pymilvus 2.4.1
- grpcio-tools 1.60.0
- protobuf 4.25.3

The Milvus cluster was deployed in k8s using milvus-operator 0.9.13.
SparkSession setup:
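(The original screenshot of the session setup is not preserved. Below is a minimal sketch of what it presumably looked like, assuming the two jar paths quoted elsewhere in this thread; the app name is a placeholder. Note that `spark.jars` takes a comma-separated list, while `extraClassPath` uses the OS path separator, which is also relevant to the Windows question above.)

```python
# Hypothetical reconstruction of the SparkSession setup; the jar paths
# come from the thread, everything else is a placeholder.
JARS = [
    "/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar",
    "/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar",
]

def build_session():
    # Imported inside the function so the sketch reads without pyspark installed.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .appName("milvus-load")  # placeholder name
        # spark.jars is a comma-separated list of jars to ship ...
        .config("spark.jars", ",".join(JARS))
        # ... while extraClassPath uses the OS path separator
        # (":" on Linux/macOS, ";" on Windows).
        .config("spark.driver.extraClassPath", ":".join(JARS))
        .config("spark.executor.extraClassPath", ":".join(JARS))
        .getOrCreate()
    )
```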
Milvus setup:
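(The Milvus setup screenshot is likewise missing. A sketch of a pymilvus 2.4 collection definition that would match the data described below, assuming an INT64 primary key and a 3000-dimensional float vector; the collection name, host, and port are placeholders.)

```python
DIM = 3000  # vector length stated in the thread

def create_collection(name="vectors_demo", host="milvus-host", port="19530"):
    # Imported inside the function so the sketch reads without pymilvus installed.
    from pymilvus import (
        Collection, CollectionSchema, DataType, FieldSchema, connections,
    )
    connections.connect(host=host, port=port)  # placeholder endpoint
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vector", DataType.FLOAT_VECTOR, dim=DIM),
    ])
    return Collection(name, schema)
```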
Given: I work in a DataLore IDE deployed in k8s alongside Milvus and Spark. I have a Spark dataframe of 2.5 million rows and 2 columns: an id and a 3000-element vector of floats. I try to load it in batches of 100,000 records each, so there should be 25 iterations in total. None of the batches gets inserted.
Insert operation:
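(The insert-operation screenshot is also not preserved. A sketch of a batched write via the spark-milvus data source, under the assumption that `id` is a dense 0-based integer column; the option names follow the spark-milvus README, and the collection name and connection options are placeholders. The pure batching helper is split out so the 25-iteration arithmetic is easy to check.)

```python
def batch_ranges(total_rows, batch_size):
    """Yield half-open (start, end) id ranges covering total_rows."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)

def write_in_batches(df, total_rows=2_500_000, batch_size=100_000):
    # Assumes df has columns "id" (dense 0-based int) and "vector".
    for start, end in batch_ranges(total_rows, batch_size):
        (
            df.filter((df.id >= start) & (df.id < end))
            .write
            .mode("append")
            .option("milvus.host", "milvus-host")              # placeholder
            .option("milvus.port", "19530")                    # placeholder
            .option("milvus.collection.name", "vectors_demo")  # placeholder
            .format("milvus")
            .save()
        )
```

With 2,500,000 rows and a batch size of 100,000 this yields exactly 25 write calls, matching the iteration count described above.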
Can you please help me understand what I'm doing wrong?