zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

[BUG]: non-local Kubeflow Metadata store incorrectly reports not running #756

Closed strangemonad closed 2 years ago

strangemonad commented 2 years ago

Contact Details [Optional]

shawnmorel@gmail.com

System Information

ZenML version: 0.10.0
Install path: /home/jovyan/factory-data-algorithms/projects/common/process-health/.venv/lib/python3.9/site-packages/zenml
Python version: 3.9.9
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '20.04'}
Environment: docker
Integrations: ['aws', 'kubeflow', 's3', 'scipy', 'seldon', 'slack']

What happened?

When targeting a remote Kubeflow Pipelines deployment with a kubeflow orchestrator and a kubeflow metadata store, stack.deploy_pipeline() checks if not component.is_running for each component. The kubeflow orchestrator correctly determines that it's not running locally and returns True, but the kubeflow metadata store reports that it's not running. We don't want to run zenml stack up to start a local KFP metadata store; we just want to deploy the pipeline and let inside_kfp_pod do its magic when resolving get_tfx_metadata_config.
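
To make the failure mode concrete, here is a minimal sketch of the check as described above. It is not ZenML's actual code: the `Component` class and this `deploy_pipeline` function are simplified, hypothetical stand-ins, and only the `is_running` flag and the error name come from this report.

```python
from dataclasses import dataclass


class StackValidationError(Exception):
    """Stand-in for the error shown in the log output of this report."""


@dataclass
class Component:
    # Hypothetical, simplified stand-in for a stack component.
    name: str
    kind: str
    is_running: bool


def deploy_pipeline(components):
    # Refuse to deploy as soon as any component reports it is not running.
    for component in components:
        if not component.is_running:
            raise StackValidationError(
                f"The '{component.name}' {component.kind} stack component "
                "is not currently running."
            )
    return "deployed"


# The remote orchestrator reports True, the metadata store reports False.
orchestrator = Component("kubeflow", "orchestrator", is_running=True)
metadata_store = Component("kubeflow", "metadata_store", is_running=False)

try:
    deploy_pipeline([orchestrator, metadata_store])
except StackValidationError as err:
    message = str(err)
    print(message)
```

In this model a single component reporting False blocks the whole deployment, even when the pipeline itself would never need a local metadata store.
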

Reproduction steps

No response

Relevant log output

StackValidationError: The 'kubeflow' metadata_store stack component is not currently running. Please run the following command to provision and start the 
component:

    `zenml stack up`


stefannica commented 2 years ago

Hello @strangemonad and thank you for reporting this issue!

There seems to be some left-over confusion about what zenml stack up really does. Traditionally, this command was used exclusively to provision resources for local stack components, like the local k3d cluster and kubeflow deployment for the kubeflow orchestrator. However, this has changed with more recent ZenML versions to cover use-cases that connect directly to remote services, like the kubeflow orchestrator in your case.

Even when zenml stack up doesn't provision local resources, you still have to run it in some cases to forward remote ports locally. This is the case here with the kubeflow metadata store: you have to run zenml stack up to forward the remote gRPC metadata-store port locally via a kubectl port-forward command, otherwise you won't be able to access the metadata store in the post-execution workflow.
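
As a rough illustration of that port-forward, the sketch below only builds the kubectl command line and prints it. The namespace `kubeflow`, the service name `metadata-grpc-service`, and port 8080 are assumptions about a typical Kubeflow installation, not values taken from this issue, and the helper `port_forward_command` is hypothetical.

```python
import shlex


def port_forward_command(
    namespace="kubeflow",             # assumed namespace
    service="metadata-grpc-service",  # assumed service name
    local_port=8080,                  # assumed local port
    remote_port=8080,                 # assumed remote gRPC port
):
    # Build the argument list for `kubectl port-forward` without running it.
    return [
        "kubectl",
        "port-forward",
        "--namespace",
        namespace,
        f"svc/{service}",
        f"{local_port}:{remote_port}",
    ]


command = port_forward_command()
print(shlex.join(command))
```

While such a forward is active, a client on the local machine can reach the metadata store at localhost on the chosen local port.
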

In the case of the kubeflow orchestrator component, there are some configuration attributes that you can tweak to completely remove the need to forward ports locally and to connect directly to the remote ports (see this issue for more info).

We could implement a similar logic for the kubeflow metadata store:

Please let me know if you think that would address your use-case.

htahir1 commented 2 years ago

Isn't this also related somehow to #728? Perhaps @VictorW96 can confirm this is the behavior he sees?

strangemonad commented 2 years ago

@stefannica interesting and I see the reasoning. I think this might need more nuance though. @RoyerRamirez and I are preparing a more comprehensive writeup of all the rough edges we've run into getting a pipeline working against an AWS Kubeflow deployment.

For this one in particular, I think there needs to be a way to have it both ways.

  1. Sometimes you need to fetch artifacts after the run (in our case, though, we're already running our notebooks in the KF cluster, so we want to use the locally configured gRPC metadata endpoint rather than kubectl).
  2. Sometimes you just want to run the pipeline and don't care about doing anything after it completes (e.g. we run pipelines on external events).
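
One way to picture "both ways" is the rough sketch below. It is not something ZenML implements: the function `metadata_store_is_usable` is hypothetical, and the environment variable names `METADATA_GRPC_SERVICE_HOST` and `METADATA_GRPC_SERVICE_PORT` are assumptions about how an in-cluster gRPC endpoint might be exposed.

```python
import os


def metadata_store_is_usable(needs_post_execution, env=None, port_forward_active=False):
    # Hypothetical check; the env var names below are assumptions.
    env = os.environ if env is None else env

    # Case 2: fire-and-forget runs never read from the store afterwards.
    if not needs_post_execution:
        return True

    # Case 1: a configured in-cluster endpoint needs no kubectl port-forward.
    in_cluster = (
        "METADATA_GRPC_SERVICE_HOST" in env
        and "METADATA_GRPC_SERVICE_PORT" in env
    )
    return in_cluster or port_forward_active


in_cluster_env = {
    "METADATA_GRPC_SERVICE_HOST": "10.0.0.1",
    "METADATA_GRPC_SERVICE_PORT": "8080",
}
print(metadata_store_is_usable(True, env=in_cluster_env))
print(metadata_store_is_usable(False, env={}))
print(metadata_store_is_usable(True, env={}))
```

Only the last call, which wants post-execution access with neither a configured endpoint nor an active port-forward, would still need zenml stack up.
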

htahir1 commented 2 years ago

Sorry to butt into the conversation, but @strangemonad you might appreciate our new repo here: https://github.com/zenml-io/mlops-stacks

It allows you to quickly get a cloud-based stack running with some opinionated configuration. We also have an overhaul of the docs coming up with more focus on the cloud stuff.

It isn't finalized yet and we have not really launched it, but the goal would also be to link these stack recipes to zenml stack somehow. WDYT?

stefannica commented 2 years ago

> @stefannica interesting and I see the reasoning. I think this might need more nuance though. @RoyerRamirez and I are preparing a more comprehensive writeup of all the rough edges we've run into getting a pipeline working against an AWS Kubeflow deployment.

To say that I'm really looking forward to reading it would be an understatement :smile:

strangemonad commented 2 years ago

@stefannica @htahir1 @RoyerRamirez here are the rough notes establishing the context of what we're trying to set up and the roadblocks we hit: https://notes.strangemonad.com/Zenml+stack+setup+thoughts. They're still rough but hopefully sketch enough of an outline.

@htahir1 I had seen the repo. Setting up the infrastructure with Terraform isn't our roadblock (though I suspect it is for many who don't have in-house DevOps and k8s expertise). We have a functional Kubeflow stack and we're trying to target that for the time being. There's a lot in the way of ML metadata tracking, visualization and relative maturity with KFP that we're not willing to step away from yet in favor of, say, the zenml k8s orchestrator until that's more mature (e.g. how can I run dynamic conditional steps, or parallel steps controlling for max-concurrency, using results from a previous step?).

amirhessam88 commented 2 years ago

Hey @stefannica, @RoyerRamirez and I are still experiencing this issue (btw @strangemonad explained it in detail in his notes above). I am wondering if this is already on your radar.

htahir1 commented 2 years ago

@amirhessam88 With the new changes we are undergoing, this issue will resolve itself. For now, maybe one of @fa9r, @stefannica or @schustmi can help?

htahir1 commented 2 years ago

With the new release, there is no metadata store, so I would ask @strangemonad to close the issue if they think it's fine? :-)

amirhessam88 commented 2 years ago

@htahir1 Thanks Hamza! We have not had a chance to test out the new version and have pinned our work at v0.13.2. Shawn and I plan to try it out and see which parts of our work need to change accordingly. I have seen that you guys have some recipes in the docs. One question I have: do you think another release with breaking changes might come soon? I think I read somewhere that you are pushing towards release v1.0.0. Feel free to close the issue. Thanks!

htahir1 commented 2 years ago

@amirhessam88 We are racing towards 1.0.0! Can't promise a date yet, but I would say the biggest changes are behind us with the architecture change. Maybe some stuff will change around database schemas when we drop MLMD as a dependency, and secret managers might move out of the stack, but that is all migratable. I will close the issue now - let us know how your upgrade goes!