zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

[BUG]: Orchestrator is unable to access ML Pipelines API #720

Closed RoyerRamirez closed 2 years ago

RoyerRamirez commented 2 years ago

System Information

ZenML version: 0.9.0
Install path: /home/jovyan/factory-data-algorithms/projects/common/process-health/.venv/lib/python3.9/site-packages/zenml
Python version: 3.9.9
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '20.04'}
Environment: docker
Integrations: ['kubeflow', 'scipy', 'seldon']

What happened?

Hi, we are running ZenML from a Jupyter Notebook in the same cluster as ML Pipelines. Currently the orchestrator can only access the ML Pipelines API through kube-proxy or port-forwarding.

It does not make sense to port-forward or go through kube-proxy when ZenML is running in the same cluster as ML Pipelines. We should be able to hit the endpoint directly. We can directly access ML Pipelines and any other Kubeflow service using a DNS record of the form <service>.<namespace>.svc.cluster.local:<port>. This is demonstrated below:

curl -XGET ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz
{"commit_sha":"c8a18bde299f2fdf5f72144f15887915b8d11520", "tag_name":"1.8.1", "multi_user":true}
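The same health check can be sketched in Python using only the standard library. This is a minimal illustration, not ZenML code: the helper names `in_cluster_url` and `check_health` are hypothetical, and the service name, namespace, port, and path are taken from the curl command above.

```python
import json
from urllib.request import urlopen


def in_cluster_url(service: str, namespace: str, port: int, path: str = "") -> str:
    """Build a URL from the Kubernetes service DNS pattern
    <service>.<namespace>.svc.cluster.local:<port>."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}{path}"


def check_health(url: str, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode its JSON body.

    This only works from a pod that can resolve cluster-internal DNS names.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Values copied from the curl example in this issue.
HEALTHZ_URL = in_cluster_url("ml-pipeline", "kubeflow", 8888, "/apis/v1beta1/healthz")
```

Calling `check_health(HEALTHZ_URL)` from a notebook inside the cluster should return the same JSON document that curl printed, provided the service is reachable.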

Furthermore, ML Pipelines itself runs a step to determine whether it is running as a local deployment or on an external server by looking for a kube config file. If it can't find one, it assumes it "owns" the ML Pipelines deployment and does not use kube-proxy or port-forwarding to access other services.

Reproduction steps

  1. Create an orchestrator for Kubeflow Pipelines
  2. Deploy a Jupyter Notebook or Deployment in the same cluster as ML Pipelines
  3. Attempt to create a pipeline. You will receive a connection error and/or a kube-context error. ...

Relevant log output

Below is an example of how ML Pipelines determines whether it "owns" its deployment. Maybe ZenML can do something similar to resolve this issue? I recommend looking for a kube config file. If one does not exist, then assume ZenML is running in the cloud or can access the ML Pipelines endpoint directly.

ML Pipeline Startup Logs:
I0623 23:40:09.590195       7 client_manager.go:160] Initializing client manager
I0623 23:40:09.590498       7 config.go:57] Config DBConfig.ExtraParams not specified, skipping
I0623 23:40:13.018429       7 client_manager.go:415] We already own mlpipeline
I0623 23:40:13.019436       7 swf.go:64] (Expected when in cluster) Failed to create scheduled workflow client by out of cluster kubeconfig. Error: stat /root/.kube/config: no such file or directory
I0623 23:40:13.019462       7 swf.go:66] Starting to create scheduled workflow client by in cluster config.
I0623 23:40:13.026601       7 client_manager.go:204] Client manager initialized successfully
I0623 23:40:13.030082       7 main.go:183] Samples already loaded in the past. Skip loading.
I0623 23:40:13.036776       7 resource_manager.go:1021] Default experiment already exists! ID: ba6bb05d-565d-40fa-b589-6af95995cc34
I0623 23:40:13.036830       7 main.go:120] Starting Http Proxy
I0623 23:40:13.040199       7 main.go:90] Starting RPC server
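The kube config check suggested above could look roughly like the following sketch. It is not ZenML's or Kubeflow's implementation: `has_kubeconfig`, `resolve_kfp_host`, and `IN_CLUSTER_KFP_HOST` are hypothetical names, and the host value is assumed from the curl example in this issue. The only Kubernetes conventions relied on are the `KUBECONFIG` variable, the default `~/.kube/config` path, and the `KUBERNETES_SERVICE_HOST` variable that Kubernetes sets inside pods.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumption: in-cluster address taken from the curl example in this issue.
IN_CLUSTER_KFP_HOST = "http://ml-pipeline.kubeflow.svc.cluster.local:8888"


def has_kubeconfig(env: Mapping[str, str], home: Path) -> bool:
    """Return True if a kube config file exists, either at a path listed
    in KUBECONFIG or at the default location ~/.kube/config."""
    candidates = [Path(p) for p in env.get("KUBECONFIG", "").split(os.pathsep) if p]
    candidates.append(home / ".kube" / "config")
    return any(p.is_file() for p in candidates)


def resolve_kfp_host(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[str]:
    """Return the in-cluster host when no kube config is found and the
    process appears to run inside a pod; otherwise return None so the
    caller keeps its existing kube-context based behavior."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    if has_kubeconfig(env, home):
        return None
    if "KUBERNETES_SERVICE_HOST" in env:
        return IN_CLUSTER_KFP_HOST
    return None
```

A caller could pass a non-None result as the `host` argument of the Kubeflow Pipelines client and skip port-forwarding entirely. Whether the ZenML orchestrator exposes a matching setting is not confirmed by this thread.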

stefannica commented 2 years ago

@RoyerRamirez thank you for opening this issue! The Kubeflow orchestrator does provide some advanced configuration attributes that you can tweak, which might allow running pipelines from within the same cluster as Kubeflow, although we have never tried this use case:

RoyerRamirez commented 2 years ago

Hi @stefannica, thank you. I'm going to close this bug.

stefannica commented 2 years ago

@RoyerRamirez just for the sake of completeness, in case we run into this situation again in the future: were you able to implement your use case with those extra orchestrator parameters?