wildlife-dynamics / ecoscope-workflows

An extensible task specification and compiler for local and distributed workflows.
https://ecoscope-workflows.readthedocs.org/
BSD 3-Clause "New" or "Revised" License
7 stars 4 forks source link

Task: Data Transform (Custom function evaluation) #39

Open walljcg opened 4 months ago

walljcg commented 4 months ago

How feasible is it to allow users to define their own custom functions to achieve complex ETL tasks?

Pandas provides a Dataframe.eval() function that lets us do various column based operations.

Pandas also provides a Dataframe.assign() function that accepts lambda expressions as input. How safe are these and could a user mine bitcoin or print out all system variables within the lambda?

We win a lot of functionality if we can use either df.eval() or df.assign() without creating major security concerns when running workflows in the cloud. For example df.eval ('A' + 6) or df.eval('A'*5/9) gives us a way to do value math operations whereas df.assign(lambda x: datetime(x)) lets us do things like timezone conversions or more complicated expressions.

cisaacstern commented 4 months ago

My take is that .eval is clearly highly risky and probably should just be avoided entirely, unless someone can tell me something I don't know about how to guardrail it, which theres definitely a lot I don't know!

As for .assign, I asked ChatGPT and I think it gave a pretty compelling response that opens up a path for development there:

My question:

pandas docs warn about code injection risk of pd.Dataframe.eval. does pd.Dataframe.assign (which accepts a lamba function) also have code injection risk?

ChatGPT says:

Yes, there can be a code injection risk with pd.DataFrame.assign if it is used improperly, particularly when lambda functions or other expressions are used in a way that involves untrusted input. The risk arises if you allow user input to directly influence the expressions or functions passed to assign without proper sanitization.

The assign method in pandas allows you to create new columns or modify existing ones using a lambda function or other callable expressions. Here’s an example of how it can be used:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df = df.assign(B=lambda x: x['A'] * 2)
print(df)

In this case, the lambda function is safe because it doesn’t involve any external input. However, if the lambda function or the expressions inside assign are constructed using external input, there is a potential for code injection.

For example, consider this unsafe usage where user input is directly used to construct the expression:

user_input = "x['A'] * 2"  # Imagine this comes from an untrusted source
df = df.assign(B=lambda x: eval(user_input))

Here, eval is dangerous because it can execute arbitrary code. If the user_input is something malicious, it could potentially harm the system.

To mitigate this risk, avoid using eval or other dynamic code execution functions with untrusted input. Instead, prefer safer alternatives such as explicit column operations that do not involve user input in the expressions. For instance:

user_multiplier = 2  # Imagine this comes from an untrusted source
df = df.assign(B=lambda x: x['A'] * user_multiplier)

In this case, the risk is minimized because user_multiplier is a simple numeric value and is directly used in the multiplication operation without involving eval.

In summary, while pd.DataFrame.assign itself doesn’t inherently pose a code injection risk, improper use of lambda functions or expressions that involve untrusted input can lead to vulnerabilities. Always sanitize and validate any external input before using it in such contexts.

Yun-Wu commented 4 months ago

tl;dr I propose implementing dataframe filtering and assignment explicitly without relying on custom functions like pandas query and eval.

It's disappointing after extensive research, but I've realized that more restrictions on custom functions increase the likelihood of unexpected errors for our customers. Fewer restrictions, however, don't fully address safety concerns.

I plan to support the following filters:

Please suggests other transformations useful for MVP. For everything else, I can either implement them in our codebase, or wait until custom modules in the future.

Here are all the restrictions I’ve explored and documenting them here for future references.