weld-project / weld

High-performance runtime for data analytics applications
https://www.weld.rs
BSD 3-Clause "New" or "Revised" License
2.99k stars, 260 forks

Running Python UDFs in Weld. #523

Open kchasialis opened 2 years ago

kchasialis commented 2 years ago

I am trying to run a UDF pipeline on a dataset using Weld (or Grizzly, I suppose).

As far as I know, however, Grizzly does not offer an optimized way to apply, for example, a scalar UDF to a specific column of the dataset.

One workaround I found is to access the internal data with to_pandas() and then use the pandas function "apply" to run a Python UDF on a column.
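For reference, the workaround looks roughly like the sketch below. It uses plain pandas only: the DataFrame is a stand-in for whatever to_pandas() returns, and double_price and the column names are hypothetical, made up for illustration.

```python
import pandas as pd


def double_price(price: float) -> float:
    """Hypothetical scalar UDF; it runs once per element in the Python interpreter."""
    return price * 2.0


# Stand-in for the pandas object returned by to_pandas() (assumption: a DataFrame).
df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# Series.apply calls the UDF element by element; Weld is not involved in this step.
df["price_doubled"] = df["price"].apply(double_price)

print(df)
```

Everything after the conversion is ordinary pandas, so timing this measures the Python interpreter rather than Weld.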

The problem is that I want to measure Weld's performance on UDFs. Accessing the internal data and applying the functions just as a normal Python program would is not a fair way to measure Weld's performance on (Python) UDF execution.

How can I apply a Python UDF to a column of the dataset in an optimized way using Weld?

Thanks in advance!