zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

[FEATURE] Decouple configuration from first-class component executions #45

Closed htahir1 closed 2 years ago

htahir1 commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently, the first-class ZenML components (Pipelines, Steps, Datasources, and Backends) have both their configuration and their post-execution state built into them. For example, to run a Pipeline:

# Configuration
training_pipeline.add_split()
training_pipeline.add_preprocesser()
training_pipeline.add_trainer()

# After this, the state of `training_pipeline` changes from a config type to an execution type implicitly
training_pipeline.run()

# This gets configuration + execution -> state is preserved
pipeline_execution = repo.get_pipeline_by_name('name')

This has unintended consequences once the pipeline has run: the object silently becomes an immutable execution object, which gets in the way of fast iteration, for example when working in a Jupyter notebook.

Describe the solution you'd like

For several reasons, including testability, reduced complexity, and ease of understanding, the community has concluded that configuration and execution need to be separate Python objects. That is:

pipeline_execution = training_pipeline.run(name='unique name')

pipeline_execution and training_pipeline will be different objects: the former is the execution object and the latter is the configuration object. The name argument then binds the execution to its configuration for experiment tracking.
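To make the proposed split concrete, here is a minimal sketch of a mutable configuration object whose run() returns a separate, immutable execution object. PipelineConfig, PipelineExecution, and add_step are hypothetical names chosen for illustration and are not ZenML classes or methods.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class PipelineExecution:
    """Immutable record of one run (hypothetical, not a ZenML class)."""
    name: str
    steps: Tuple[str, ...]


@dataclass
class PipelineConfig:
    """Mutable configuration object (hypothetical, not a ZenML class)."""
    steps: List[str] = field(default_factory=list)

    def add_step(self, step: str) -> None:
        self.steps.append(step)

    def run(self, name: str) -> PipelineExecution:
        # Snapshot the current steps so that later edits to the config
        # cannot change an execution that has already happened.
        return PipelineExecution(name=name, steps=tuple(self.steps))


training_pipeline = PipelineConfig()
training_pipeline.add_step("split")
training_pipeline.add_step("trainer")
first_run = training_pipeline.run(name="run-1")

# The config stays editable after a run, which suits notebook iteration.
training_pipeline.add_step("evaluator")
second_run = training_pipeline.run(name="run-2")
```

In this sketch the config can keep changing between runs, while each execution keeps the steps it was started with.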

Describe alternatives you've considered

We tried maintaining immutable state after the run() and register() calls, but that led to the problems stated above.

dr3s commented 3 years ago

I think the terminology is confusing me a bit. Model and configuration to me are somewhat synonymous whereas execution is the result of calling run(). I think what you have described above is that:

var training_pipeline is a Pipeline Model/Configuration/Plan
var pipeline is a Pipeline Execution

Correct?

Each execution would be immutable. The model or config could also be immutable if there is a register() call that creates a history of immutable changes to the model. I would use this for linking the model to the execution and performing experiment tracking and analysis.

I think the key difference here is that there is one name training_pipeline that the user associates with the canonical pipeline model (all its versions and executions).

htahir1 commented 3 years ago

@dr3s You're right -> The word model is not reflective at all. I updated the comment and adopted the word execution.

> The model or config could also be immutable if there is a register() call that creates a history of immutable changes to the model.

I'm not sure about this part of your comment here. The config itself would be mutable, while the execution would be immutable. Perhaps my updated comment would clarify this -> Do let me know if I misunderstood your comment.

> I think the key difference here is that there is one name training_pipeline that the user associates with the canonical pipeline model (all its versions and executions).

Yes, the name is unique and should be defined at execution time rather than construction time.

dr3s commented 3 years ago

> I'm not sure about this part of your comment here. The config itself would be mutable, while the execution would be immutable. Perhaps my updated comment would clarify this -> Do let me know if I misunderstood your comment.

I think it's helpful to be specific here. There are at least two things about the PipelineConfig that could be mutable: the object reference in code and the data that ZenML persists to record that config. The former could be mutable or immutable (using something like the builder pattern). The latter could also be mutable or immutable, regardless of how the execution is treated. If the config is only persisted at execution time, it could either overwrite the config from the last execution or create a new immutable version of the config that is then attached to the execution when it's created. Having a history of immutable config versions can be useful IMO. You could do this as part of the execution, but I prefer to model the config and the execution as different domain models.
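One way to picture the "history of immutable config versions" option is the sketch below. It assumes an in-memory store in place of real persistence; ConfigStore, ConfigVersion, and this register() signature are hypothetical and are not part of ZenML.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ConfigVersion:
    """One immutable snapshot of a pipeline's config (hypothetical)."""
    pipeline_name: str
    version: int
    steps: Tuple[str, ...]


@dataclass(frozen=True)
class Execution:
    """An execution that points at the exact config version it used."""
    config: ConfigVersion


class ConfigStore:
    """In-memory stand-in for persisted config history (hypothetical)."""

    def __init__(self) -> None:
        self._history: Dict[str, List[ConfigVersion]] = {}

    def register(self, pipeline_name: str, steps: List[str]) -> ConfigVersion:
        versions = self._history.setdefault(pipeline_name, [])
        frozen_steps = tuple(steps)
        # Reuse the latest version if nothing changed; otherwise append
        # a new version instead of overwriting the old one.
        if versions and versions[-1].steps == frozen_steps:
            return versions[-1]
        new_version = ConfigVersion(pipeline_name, len(versions) + 1, frozen_steps)
        versions.append(new_version)
        return new_version

    def history(self, pipeline_name: str) -> List[ConfigVersion]:
        return list(self._history.get(pipeline_name, []))


store = ConfigStore()
steps = ["split", "trainer"]
run_a = Execution(store.register("training_pipeline", steps))
run_b = Execution(store.register("training_pipeline", steps))  # config unchanged
steps.append("evaluator")
run_c = Execution(store.register("training_pipeline", steps))  # new version
```

Here run_a and run_b share a config version, while run_c is attached to a new one, so every execution can be traced back to the config it actually ran with.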

> Yes, the name is unique and should be defined at execution time rather than construction time.

This is confusing to me because the issue is more about using a non-unique name across executions. Yes, it's unique insofar as the user wants to make it unique. We want training_pipeline to always refer to the same pipeline across all executions and all versions of its config. The name wouldn't be defined at execution time but at design time.

htahir1 commented 3 years ago

@dr3s I think we're on the same page here. I'd love for you to take a look as this develops. Please keep an eye on it.