christianversloot opened this issue 1 year ago
@christianversloot Very interesting. Let me get the team together to discuss this and we'll write a more thoughtful response. In general, would love to receive a PR to enable more support for Sagemaker.
Btw, have you taken a look at the Sagemaker Orchestrator as well, or would you want to stick to Sagemaker step operators? In the latter case, what orchestrator are you using?
Thanks, looking forward to reading your thoughts!
We are in the very early stages of ZenML adoption - installed ZenML and created a local orchestrator and an S3-based artifact store for now. We follow the 'local if possible, cloud if necessary' paradigm, since we're also working with smaller-data traditional ML approaches which often don't require cloud-based instances and can run on a local orchestrator. We are thinking about Sagemaker orchestration for our pipelines, but may also opt for (self-hosted) Airflow to keep our core independent of any single major cloud vendor. The step operators are then very useful should we want to run specific steps with a specific cloud vendor (although we're primarily focused on AWS for ML for now).
Makes sense! Let me now answer some of your questions in-line. Still brainstorming internally but would love to get you looped in:
Currently, there is just one SageMaker step operator with one flavor. What could be an appropriate structure? One SageMaker step operator with multiple flavors (`SageMakerAutoMlStepOperator`, `SageMakerTensorFlowStepOperator`, `SageMakerProcessingStepOperator`, ...)? Or multiple SageMaker step operators, each with their own possible flavors?
Gut feeling is that preferably it would be one step operator that takes different configurations which then define the internals. For example, I could pass `operator="TensorFlowStepOperator"` when I register my `SagemakerStepOperator`? We can have some sort of internal class structure within the flavor that then differentiates the behavior.
However, then the question is: is there a common configuration set that spans all these different types of operators, and how can one configure the individual operators that have specific configuration? (I imagine `AutoMLStepOperator` has a different config than `TensorflowStepOperator`.)
We had a similar problem with the `AirflowOrchestrator`, where Airflow can also run with different 'operators'. The way we solved it is to have `operator` and `operator_args` arguments: `operator` specifies the specific operator (let's say `TensorflowStepOperator`) and `operator_args` specifies the arguments for that operator specifically (let's say `tensorflow_version`).
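For context, a minimal sketch of how that looks for the Airflow orchestrator today, assuming the `operator` and `operator_args` fields of ZenML's `AirflowOrchestratorSettings` (double-check the field names against the docs for your ZenML version):

```python
# Sketch of the existing Airflow approach: one orchestrator flavor, with the
# concrete Airflow operator and its arguments passed in as settings.
from zenml.integrations.airflow.flavors.airflow_orchestrator_flavor import (
    AirflowOrchestratorSettings,
)

airflow_settings = AirflowOrchestratorSettings(
    operator="airflow.providers.docker.operators.docker.DockerOperator",
    operator_args={"network_mode": "host"},
)

# The settings object is then attached to a pipeline or step under the
# "orchestrator.airflow" settings key.
```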
If the first, I could use a bit of explanation about a seemingly circular import structure in `sagemaker_step_operator.py` and `sagemaker_step_operator_flavor.py`. The first imports the flavor classes from the second; the second imports the implementation class from the first. What if we had multiple implementation classes (one for each Training API, I'd say)? A possible structure could be to have the `SagemakerStepOperator` import the flavors and, based on the config, pick the right one, where the flavors implement the Training API implementations which also inherit from `BaseStepOperator`. But that feels off; perhaps splitting the Training APIs over multiple SageMaker step operators would be simpler and less complex for the user ('just pick the right step operator') instead of ('pick the step operator, find out how I need to configure it to use *insert Training API*, then find out how I need to add specific configuration for that Training API').
If we go with one flavor, then I think we can keep this complexity hidden away.
What is your opinion about adding support for Processing jobs (https://sagemaker.readthedocs.io/en/stable/api/training/processing.html) to ZenML? Currently, there is no 'integration' category (https://docs.zenml.io/component-gallery/integrations) for data processing, and ZenML pipelines seem to be oriented around 'import data' -> 'train model'. Allowing users to create pipelines for data preprocessing jobs in ZenML would be a big use case for us (we recently had a ~300GB dataset which required preprocessing); otherwise we'd have to do that manually or with a different tool (while our preference would be to have data preprocessing and training pipelines integrated in one UI).
I feel like this is a big question that we need to address with another abstraction in ZenML. Currently we have the `SparkStepOperator` that allows for massive data processing jobs, but tbh we need to flesh this out with other ideas. Ideally, we'd have a `SageMakerProcessingJob` flavor that might solve this?
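For reference, this is roughly what a raw SageMaker Processing job looks like with the `sagemaker` SDK; a minimal sketch with placeholder names, to show what such a flavor would need to wrap:

```python
# A plain SageMaker Processing job via the sagemaker SDK. All URIs, roles
# and paths below are placeholders, not values from this thread.
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

processor = ScriptProcessor(
    image_uri="<your-image-uri>",
    command=["python3"],
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
processor.run(
    code="preprocess.py",  # the preprocessing script to execute
    inputs=[
        ProcessingInput(
            source="s3://<bucket>/raw",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)
```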
@christianversloot Eager to hear thoughts!
I think it would be a good idea to apply a similar solution as you've used for the `AirflowOrchestrator`, for the simple reason that this then becomes a standard way of working for users, which makes using ZenML simpler. Users like that predictability.
The solution that we are implementing for expanding the current step operator is already moving in that direction by allowing users to specify everything in `estimator_args`, which is then passed to SageMaker (possibly with some values being overruled by ZenML): https://github.com/zenml-io/zenml/pull/1381/files
Unfortunately, in SageMaker, not all estimators, tuners, etc. share the same way of working. Some rely more on keyword arguments; others more on positional arguments. For example:
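To illustrate the difference with the `sagemaker` SDK (signatures abbreviated; all values below are placeholders):

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# The generic Estimator is configured almost entirely via keyword arguments:
estimator = Estimator(
    image_uri="<your-image-uri>",
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# The HyperparameterTuner instead expects several required leading arguments
# (the estimator, the objective metric name, the hyperparameter ranges)
# before any optional keyword arguments:
tuner = HyperparameterTuner(
    estimator,
    "validation:accuracy",
    {"learning_rate": ContinuousParameter(0.001, 0.1)},
    max_jobs=10,
    max_parallel_jobs=2,
)
```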
Maybe we can solve this in a generic way by allowing users to pass instances of `SagemakerStepOperatorSettings` when defining the step operator on their pipeline steps: `SagemakerHyperparameterTunerSettings` extends `SagemakerStepOperatorSettings`, and so on. That way, there is a generic class structure for specifying settings. Then, depending on the settings object's instance type, a specific implementation of the `SageMakerStepOperator` is chosen (e.g. when we pass `SagemakerHyperparameterTunerSettings`, we execute functionality that creates a hyperparameter tuner, possibly via a subclass).

In the various settings subclasses, we can specify the required arguments for each piece of SageMaker functionality being requested (i.e. the varying sets of positional arguments), and allow all keyword arguments to be passed as a dict, `kwargs: Dict[str, Any]` (perhaps present in the base class). This forces users to fill in the required arguments for their specific Sagemaker functionality while allowing non-required keyword arguments to be left empty, and it also prevents users from providing configs for things they don't want, by tying the settings instance type to what's being done under the hood.
In other words, configuration is then done in the pipeline (at step level), as that's where I'd prefer it to be:
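A rough sketch of what I mean, using ZenML's standard settings mechanism; note that `SagemakerHyperparameterTunerSettings` and its fields are part of this proposal and don't exist yet:

```python
# Hypothetical step-level configuration. The settings class and its fields
# are the proposal from this comment, not existing ZenML API; only the
# @step decorator and the settings mechanism itself exist today.
from zenml.steps import step


@step(
    step_operator="sagemaker",
    settings={
        "step_operator.sagemaker": SagemakerHyperparameterTunerSettings(
            objective_metric_name="validation:accuracy",
            kwargs={"max_jobs": 10, "max_parallel_jobs": 2},
        )
    },
)
def train_model() -> None:
    ...
```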
That way, maybe it is no longer necessary to specify the specific type of operator during registration (`operator="TensorFlowStepOperator"`). The Sagemaker step operator is just the Sagemaker step operator. To summarize: how it behaves is defined by the settings passed to the step (in the regular way, i.e. https://docs.zenml.io/advanced-guide/pipelines/settings). Then all the `SageMakerStepOperator` internals handle the rest (via an internal class structure).
Hope this makes sense?
@christianversloot Now had the time to digest and debrief also with @schustmi and @strickvl internally. I think we agree on the overall solution of implementing this all inside a single step flavor for sure. However, we think the proposal of passing different settings classes doesn't work with pydantic.
Instead, a solution would be to have those "operator-specific" configurations as an attribute on the `SagemakerStepOperatorSettings`:

```python
from typing import Union

from pydantic import BaseModel

from zenml.config.base_settings import BaseSettings


class EstimatorSettings(BaseModel):
    ...


class HyperparameterTunerSettings(BaseModel):
    ...


class SagemakerStepOperatorSettings(BaseSettings):
    # One Union member per supported SageMaker API.
    operator_settings: Union[EstimatorSettings, HyperparameterTunerSettings, ...]
```
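For illustration, constructing such a settings object might then look like this (field values are placeholders):

```python
# Hypothetical usage of the attribute-based design: the outer settings class
# stays fixed, and the operator-specific configuration is nested inside it.
settings = SagemakerStepOperatorSettings(
    operator_settings=HyperparameterTunerSettings(
        # ...tuner-specific required fields would go here...
    ),
)
```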
That might make the most sense, but maybe it would make it hard to keep the base class extensible for future implementations. However, we can always figure this out when we get there.
If you guys are up for it, we're happy to help with this extension to ZenML. Would you have the time/capacity to create a PR for it? We can jump on it asap to get it into the next possible release! I think it would be a very impactful community-driven feature.
I would agree with the `operator_settings` approach, as well as with the 'let's see later' approach to avoid building an overengineered solution.
Design-wise, do I understand correctly that below is an approach that we all agree on (except some details, maybe, but primarily the generics)?
- A single `SagemakerStepOperator`, likely only implementing the `launch` definition.
- The `SagemakerStepOperator` object will, in its `launch` definition, retrieve the `SagemakerStepOperatorSettings` settings object, check if there is a matching `isinstance` for any of the supported settings classes, and if so create an object for the specific implementation and pass everything to its `launch` definition (a sketch of this dispatch follows below).

Time-wise, I do have approximately 1 day per week available for setting up the ZenML stack and its components. A lot is already running and we are now in the process of adding the first components and creating the first stacks. This means that there will be some time for me to start working on broader API support, especially because it's very likely that we're going to need it. Also, given the design (once we all consider it final), I don't think the changes will be very complex, so we should be able to make some nice progress. Perhaps, release-wise, break it down into smaller steps (e.g. step A being the regular Estimator following the new setup, step B the TensorFlow estimator, etc.) so that they can be released individually if needed, with some steps landing in the next release?
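To make that dispatch concrete, here is a minimal sketch of how the `launch` definition could branch on the settings type (the launcher classes are illustrative names, not existing ZenML API):

```python
# Hypothetical dispatch inside SagemakerStepOperator.launch, following the
# summary above. _EstimatorLauncher and _HyperparameterTunerLauncher are
# illustrative internal implementations, one per supported SageMaker API.
def launch(self, info, entrypoint_command):
    settings = self.get_settings(info)  # the SagemakerStepOperatorSettings

    if isinstance(settings.operator_settings, HyperparameterTunerSettings):
        launcher = _HyperparameterTunerLauncher(settings)
    elif isinstance(settings.operator_settings, EstimatorSettings):
        launcher = _EstimatorLauncher(settings)
    else:
        raise ValueError(
            f"Unsupported operator settings: {type(settings.operator_settings)!r}"
        )

    # Hand everything over to the specific implementation's launch definition.
    launcher.launch(info, entrypoint_command)
```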
@schustmi Do you agree with that summary? If yes, @christianversloot, let's have a 30-minute discussion about how to break this down in a public hellonext ticket and track it out?
Sure! Perhaps we can schedule via Slack. I'll take a look on Monday :)
Awesome!
Looking forward to hearing what the next steps are 😊
@christianversloot Reached out via Slack. If somebody else is on this thread in the future, please let us know here and we'll loop you in!
For those who read this at a later date, the design doc for this extension is found here: https://zenml.notion.site/Design-Document-Template-for-SageMakerStepOperator-Extension-in-ZenML-56a4ecda560c413eb9f70dd2d4c3c41a
In https://github.com/zenml-io/zenml/pull/1381, we are expanding the SageMaker Step Operator with additional estimator arguments and training data inputs.
Currently, the SageMaker Step Operator in ZenML uses a default Estimator for starting a Training Job. However, there is a broader range of Training APIs available in SageMaker: https://sagemaker.readthedocs.io/en/stable/api/training/index.html
In my view, and at least on our side, I can foresee some use cases (especially for data processing steps in ZenML) where some of these Training APIs are needed:
This would be quite a big feature. Before just sending in a PR, I have a few questions that we should perhaps discuss first. I'm not sure whether an issue is the correct place for that, but I read somewhere that "before making a massive PR, make an issue for discussion first", so here it is :)
If the first, I could use a bit of explanation about a seemingly circular `import` structure in `sagemaker_step_operator.py` and `sagemaker_step_operator_flavor.py`. The first imports the flavor classes from the second; the second imports the implementation class from the first. What if we had multiple implementation classes (one for each Training API, I'd say)? A possible structure could be to have the `SagemakerStepOperator` import the flavors and, based on the config, pick the right one, where the flavors implement the Training API implementations which also inherit from `BaseStepOperator`. But that feels off; perhaps splitting the Training APIs over multiple SageMaker step operators would be simpler and less complex for the user ('just pick the right step operator') instead of ('pick the step operator, find out how I need to configure it to use *insert Training API*, then find out how I need to add specific configuration for that Training API').

I'd love to hear your thoughts :)