I am trying to run a UDF pipeline on a dataset using Weld (or rather Grizzly, I suppose).
As far as I know, however, Grizzly does not offer an optimized way to apply, for example, a scalar UDF to a specific column of the dataset.
One way I found is to get at the underlying data with to_pandas(); the resulting pandas object has a function called “apply”, which I can use to run a Python UDF on a column.
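For reference, the workaround looks roughly like the sketch below. It uses plain pandas only; `square` and the `value` column are made-up names for illustration, and the DataFrame stands in for whatever to_pandas() returns in my actual pipeline.

```python
import pandas as pd

# Hypothetical scalar UDF, used only for illustration.
def square(x):
    return x * x

# Stand-in for the frame obtained via to_pandas(); the column name is made up.
df = pd.DataFrame({"value": [1, 2, 3]})

# Series.apply calls the Python function once per element in the interpreter,
# so nothing in this step is compiled or optimized by Weld.
df["value_squared"] = df["value"].apply(square)
```

As far as I can tell, every call to the UDF here goes through the regular Python interpreter.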
The problem is that I want to measure Weld’s performance on UDFs, and pulling the data out and applying the functions the way an ordinary Python program would is not a fair way to measure how Weld executes (Python) UDFs.
How can I apply a Python UDF to a column of the dataset in an optimized way using Weld?
Thanks in advance!