[DISCUSSION] Broader SageMaker Training API support in Step Operator

christianversloot commented 1 year ago

In https://github.com/zenml-io/zenml/pull/1381, we are expanding the SageMaker Step Operator with additional estimator arguments and training data inputs.

Currently, the SageMaker Step Operator in ZenML uses a default Estimator for starting a Training Job. However, there is a broader range of Training APIs available in SageMaker: https://sagemaker.readthedocs.io/en/stable/api/training/index.html

In my view, and at least on our side I can foresee some use cases (especially for data processing steps in ZenML) where some of these Training APIs are needed:

For training steps: AutoML, specific Estimators (TensorFlow, PyTorch, HuggingFace, ...), pre-uploaded/third party algorithms (Algorithm Estimators), hyperparameter tuning (HyperparameterTuner), et cetera.
For data processing steps: Processing jobs

This would be quite a big feature. Before just shooting in a PR, I have a few questions that we should perhaps discuss first. I'm not sure whether an issue is the correct place for that, but I read somewhere that "before making a massive PR make an issue for discussion first", so here it is :)

Currently, there is just one SageMaker step operator with one flavor. What could be an appropriate structure? One SageMaker step operator with multiple flavors (SageMakerAutoMlStepOperator, SageMakerTensorFlowStepOperator, SageMakerProcessingStepOperator, ...)? Or multiple SageMaker step operators, each with their own possible flavors?
If the first, I could use a bit of explanation about a seemingly circular import structure in sagemaker_step_operator.py and sagemaker_step_operator_flavor.py. The first imports the flavor classes from the second; the second imports the implementation class from the first. If we'd have multiple implementation classes (one for each Training API I'd say)? A possible structure could be to have the SagemakerStepOperator import the flavors and based on the config pick the right one, where the flavors implement the Training API implementations which also inherit from BaseStepOperator. But that feels off; perhaps splitting the Training APIs over multiple SageMaker step operators would be simpler and less complex for the user ('just pick the right step operator') instead of ('pick the step operator, find how I need to configure it to use insert Training API, then find out how I need to add specific configuration for that training API').
What is your opinion about adding support for Processing jobs (https://sagemaker.readthedocs.io/en/stable/api/training/processing.html) into ZenML? Currently, there is no 'integration' category (https://docs.zenml.io/component-gallery/integrations) for data processing, and the ZenML pipelines seem to be oriented around 'import data' -> 'train model'. Allowing to create pipelines for data preprocessing jobs in ZenML would be a big use case for us (we recently had a ~300GB dataset which required data preprocessing); otherwise we'd have to do that manually or using a different tool (while our preference would be to have data preprocessing <> training pipelines integrated in one UI).

I'd love to hear your thoughts :)

Checks

[X] I added a descriptive title to this issue
[X] I have searched (google, github) for similar issues and couldn't find anything
[X] I have read and followed the docs and still think this is a bug - it's not a bug, but OK :)

htahir1 commented 1 year ago

@christianversloot Very interesting. Let me get the team together to discuss this and we'll write a more thoughtful response. In general, would love to receive a PR to enable more support for Sagemaker.

Btw, have you taken a look at the Sagemaker Orchestrator as well, or would you want to stick to Sagemaker step operators? In the latter case, what orchestrator are you using?

christianversloot commented 1 year ago

Thanks, looking forward to reading your thoughts!

We are in the very early stages of ZenML adoption - installed ZenML and created a local orchestrator and S3 based artifact store for now. We follow the 'local if possible, cloud if necessary' paradigm since we're also working with smaller-data traditional ML approaches which often don't require cloud-based instances and can run on a local orchestrator. We are thinking about Sagemaker orchestration for our pipelines, but may also choose for (self-hosted) Airflow to let our core remain independent from one of the major cloud vendors. Then, the step operators are very useful should we want to run specific steps with a specific cloud vendor (although we're primarily focused on AWS for ML for now).

htahir1 commented 1 year ago

Makes sense! Let me now answer some of your questions in-line. Still brainstorming internally but would love to get you looped in:

Currently, there is just one SageMaker step operator with one flavor. What could be an appropriate structure? One SageMaker step operator with multiple flavors (SageMakerAutoMlStepOperator, SageMakerTensorFlowStepOperator, SageMakerProcessingStepOperator, ...)? Or multiple SageMaker step operators, each with their own possible flavors?

Gut feeling is preferably it would be one step operator that takes different configurations that then define the internals. For example, I can pass operator="TensorFlowStepOperator" when I register my SagemakerStepOperator? We can have some sort of internal class structure within the flavor that then diffrentiates different behavior.

However, then the question is: Is there a common configuration set that spans across all these different types of operators, and how can one configure the individual operators that have specific configuration (I imagine AutoMLStepOperator has different config than TensorflowStepOperator.

We had a similar problem with the AirflowOrchestrator where airflow can also run on different 'operators'. The way we solved it is to have the operator and operator_args arguments. Operator specified the specific operator (lets say TensorflowStepOperator and operator_args specified the arguments for that operator specifically (lets say 'tensorflow_version`)

If the first, I could use a bit of explanation about a seemingly circular import structure in sagemaker_step_operator.py and sagemaker_step_operator_flavor.py. The first imports the flavor classes from the second; the second imports the implementation class from the first. If we'd have multiple implementation classes (one for each Training API I'd say)? A possible structure could be to have the SagemakerStepOperator import the flavors and based on the config pick the right one, where the flavors implement the Training API implementations which also inherit from BaseStepOperator. But that feels off; perhaps splitting the Training APIs over multiple SageMaker step operators would be simpler and less complex for the user ('just pick the right step operator') instead of ('pick the step operator, find how I need to configure it to use insert Training API, then find out how I need to add specific configuration for that training API').

If we go with one flavor, then I think we can get away without this complexity hidden away.

What is your opinion about adding support for Processing jobs (https://sagemaker.readthedocs.io/en/stable/api/training/processing.html) into ZenML? Currently, there is no 'integration' category (https://docs.zenml.io/component-gallery/integrations) for data processing, and the ZenML pipelines seem to be oriented around 'import data' -> 'train model'. Allowing to create pipelines for data preprocessing jobs in ZenML would be a big use case for us (we recently had a ~300GB dataset which required data preprocessing); otherwise we'd have to do that manually or using a different tool (while our preference would be to have data preprocessing <> training pipelines integrated in one UI).

I feel like this is a bit question that we need to addess with another abstraction in ZenML. Currently we have the SparkStepOperator that allows for massive data processing jobs but tbh we need to flesh this out with other ideas. Ideally, we'd have a SageMakerProcessingJob flavor that might solve this?

htahir1 commented 1 year ago

@christianversloot Eager to hear thoughts!

christianversloot commented 1 year ago

I think it would be a good idea to apply a similar solution as you've used for the AirflowOrchestrator for the simple reason that this then becomes a standard way of working for users, which makes using ZenML more simple. Users like that predictability.

The solution that we are implementing for expanding the current step operator is already moving into that direction by allowing users to specify everything in estimator_args which are then passed to SageMaker (possibly with some values being overruled by ZenML): https://github.com/zenml-io/zenml/pull/1381/files

Unfortunately, in SageMaker, not all estimators, tuners, etc, share the same way of working. Some rely more on keywords arguments; others more on regular arguments. For example:

A regular Estimator has just two regular arguments followed by keyword arguments: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator
A HyperparameterTuner has multiple regular arguments followed by keyword arguments: https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html

Maybe, we can solve this in a generic way by allowing users to pass instances of SagemakerStepOperatorSettings when defining the step operator on their pipeline steps:

SagemakerHyperparameterTunerSettings extends SagemakerStepOperatorSettings,
et cetera.

That way, there is a generic class structure for specifying settings. Then, depending on the settings object's instance type, a specific implementation of the SageMakerStepOperator is chosen (e.g. when we pass SagemakerHyperparameterTunerSettings we execute functionality that creates a hyperparameter tuner, possibly via a sub class).

In the various settings sub classes, we can specify the required arguments for each SageMaker functionality being requested (i.e. the varying amounts of regular arguments), and allow all keyword arguments to be passed as a dict: kwargs: Dict[str, Any] (perhaps that being present in the base class). This forces users to fill in the required arguments for their specific Sagemaker functionality while allowing non-required keywords arguments to be left empty, while also preventing users from providing configs for things they don't want by tying the settings instance type to what's being done under the hood.

In other words, configuration is then done in the pipeline (at step level), as that's where I'd prefer it to be:

It is closest to the work of the data scientist / ML engineer compared to the person configuring the stacks (typically an ops person).
It allows users to reuse one step operator for multiple things instead of having to define many step operators that can do one thing which would be 90% the same functionality wise.

That way, maybe it is no longer necessary to specify the specific type of operator during registration (operator="TensorFlowStepOperator"). The Sagemaker step operator is just the Sagemaker step operator. To summarize: how it behaves is defined by the settings passed to the step (in the regular way, i.e. https://docs.zenml.io/advanced-guide/pipelines/settings). Then all the SageMakerStepOperator internals handle the rest (via internal class structure).

Hope this makes sense?

htahir1 commented 1 year ago

@christianversloot Now had the time to digest and debrief also with @schustmi and @strickvl internally. I think we agree on the overall solution of implementing this all inside a single step flavor for sure. However, the proposal of passing different settings classes doesn't work with pydantic we think.

Instead, a solution would be to have those "operator-specific" configurations as an attribute on the SagemakerStepOperatorSettings:

class EstimatorSettings(BaseModel):
  ...

class HyperparameterTunerSettings(BaseModel):
  ...

class SagemakerStepOperatorSettings(BaseSettings):
  operator_settings: Union[EstimatorSettings, HyperparameterTunerSettings, ...]

That might make the most sense but maybe it would make it hard to make the base class extensible to future implementation. However, we can always figure this out when we get there.

If you guys are up for it, we're happy to help with this extension to ZenML. Would you have time/capacity to create a PR for it? We can jump on it asap to get it into the latest possible release! I think it would be a very impactful community driven feature

christianversloot commented 1 year ago

I would agree with the operator_settings approach as well as the let's see later approach to avoid building an overengineered solution.

Design-wise, do I understand correctly that below is an approach that we all agree on (except some details, maybe, but primarily the generics)?

We will have step operator implementations which are sub classes of SagemakerStepOperator, likely only implementing the launch definition.
The SagemakerStepOperator object will in its launch definition retrieve the SagemakerStepOperatorSettings settings object, check if there is a matching isinstance for any of the supported settings classes, and if so make an object for the specific implementation and pass everything to its launch definition.
Within each implementation, the actual SageMaker functionality is provided.

Time-wise, I do have approximately 1 day per week available for setting up the ZenML stack and its components. A lot is already running and we are now in the process of adding the first components and creating the first stacks. This means that there will be some time for me to start working on broader API support, especially because it's very likely that we're going to need it. Also, given the design (when we all consider it to be final), I don't think the changes will be very complex, so we should be able to make some nice progress. Perhaps, release wise, break it down into smaller steps (e.g. step A being the regular Estimator following the new setup, step B the TensorFlow estimator, etc.) so that if needed they can be released individually and some steps can be released during the next release?

htahir1 commented 1 year ago

@schustmi Do you agree with that summary? If yes, @christianversloot , lets have a 30 minute discussion about how to break this down in a public hellonext ticket and track it out?

christianversloot commented 1 year ago

Sure! Perhaps we can schedule via Slack. I'll take a look on Monday :)

htahir1 commented 1 year ago

Awesome!

christianversloot commented 1 year ago

Looking forward to hearing what the next steps are 😊

htahir1 commented 1 year ago

@christianversloot Reached out via Slack. if somebody else is on this thread in the future, please let us know here and we'll loop you in!

htahir1 commented 1 year ago

For those who read this at a later date, the design doc for this extension is found here: https://zenml.notion.site/Design-Document-Template-for-SageMakerStepOperator-Extension-in-ZenML-56a4ecda560c413eb9f70dd2d4c3c41a

zenml-io / zenml

[DISCUSSION] Broader SageMaker Training API support in Step Operator #1398

Checks