zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

[BUG]: Kubeflow Pipelines GCP example doesn't work. Kubeflow metadata store is not ready yet #728

Closed: VictorW96 closed this issue 2 years ago

VictorW96 commented 2 years ago

System Information

ZenML version: 0.9.0
Install path: /home/victor/.local/lib/python3.8/site-packages/zenml
Python version: 3.8.10
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '20.04'}
Environment: native
Integrations: ['gcp', 'kubeflow', 'mlflow', 'scipy', 'seldon', 'sklearn', 'tensorflow']

What happened?

The Kubeflow metadata store seems to be unreachable. If I list the metadata stores, the output is as follows: [screenshot]. I find it curious that the IP address points to localhost. I also saw that Kubeflow changed its metadata store implementation (https://www.kubeflow.org/docs/components/pipelines/concepts/metadata/). Is this a problem?

My GCP Kubeflow configuration looks like this: [screenshot]

Reproduction steps

Follow the kubeflow pipeline orchestration example. In particular follow the step: Run the same pipeline on Kubeflow Pipelines deployed to GCP

Relevant log output

Provisioning resources for stack 'gcp_kubeflow_stack'.
Provisioning local Kubeflow Pipelines deployment...
Provisioned resources for KubeflowMetadataStore(type=metadata_store, flavor=kubeflow, name=kubeflow_metadata_store, uuid=165eb418-8d7b-42b9-a2a9-29d01259486d, upgrade_migration_enabled=False, host=127.0.0.1, port=8081).
Resuming provisioned resources for stack gcp_kubeflow_stack.
Started Kubeflow Pipelines UI daemon (check the daemon logs at /home/victor/.config/zenml/kubeflow/eda1b333-39a8-4a3b-9647-0f00fb78dab2/kubeflow_daemon.log in case you're not able to view the UI). The Kubeflow Pipelines UI should now be accessible at http://localhost:8080/.
Resumed resources for KubeflowOrchestrator(type=orchestrator, flavor=kubeflow, name=gcp_kubeflow_orchestrator, uuid=eda1b333-39a8-4a3b-9647-0f00fb78dab2, custom_docker_base_image_name=None, kubeflow_pipelines_ui_port=8080, kubeflow_hostname=None, kubernetes_context=gke_ai-gilde_europe-west4-a_cluster-1, synchronous=False, skip_local_validations=False, skip_cluster_provisioning=False, skip_ui_daemon_provisioning=False).
Started Kubeflow Pipelines Metadata daemon (check the daemonlogs at /home/victor/.config/zenml/kubeflow/eda1b333-39a8-4a3b-9647-0f00fb78dab2/metadata-store/165eb418-8d7b-42b9-a2a9-29d01259486d/kubeflow_daemon.log in case you're not able to access the pipelinemetadata).
Waiting for the Kubeflow metadata store to be ready (this might take a few minutes).
The Kubeflow metadata store is not ready yet. Waiting for 10 seconds...
The Kubeflow metadata store is not ready yet. Waiting for 10 seconds...
The Kubeflow metadata store is not ready yet. Waiting for 10 seconds...
The Kubeflow metadata store is not ready yet. Waiting for 10 seconds...
The Kubeflow metadata store is not ready yet. Waiting for 10 seconds...
The Kubeflow metadata store is not ready yet. Waiting for 10 seconds...
The Kubeflow metadata store is not ready yet. Waiting for 10 seconds...


htahir1 commented 2 years ago

Can you try doing zenml stack up again and see if it solves the problem?

VictorW96 commented 2 years ago

No, the same thing happens again. I think the metadata store and the Kubeflow orchestrator are both treated as local by ZenML. When I deprovision the resources with zenml stack down --force, I get the following output:

Deprovisioning resources for stack 'gcp_kubeflow_stack'.
Deprovisioned resources for KubeflowOrchestrator(type=orchestrator, flavor=kubeflow, name=gcp_kubeflow_orchestrator, uuid=eda1b333-39a8-4a3b-9647-0f00fb78dab2, custom_docker_base_image_name=None, kubeflow_pipelines_ui_port=8080, kubeflow_hostname=None, kubernetes_context=gke_ai-gilde_europe-west4-a_cluster-1, synchronous=False, skip_local_validations=False, skip_cluster_provisioning=False, skip_ui_daemon_provisioning=False).
Local kubeflow pipelines deployment deprovisioned.
Deprovisioned resources for KubeflowMetadataStore(type=metadata_store, flavor=kubeflow, name=kubeflow_metadata_store, uuid=165eb418-8d7b-42b9-a2a9-29d01259486d, upgrade_migration_enabled=False, host=127.0.0.1, port=8081).

How do I tell ZenML to explicitly use the Kubeflow deployment on GCP?

stefannica commented 2 years ago

Hi Victor, thank you for opening this issue. Your stack setup looks good. Even though you see host=127.0.0.1 in your metadata store configuration, that is expected: when you run zenml stack up, the gRPC metadata service port is forwarded locally via a kubectl port-forward command, which is why ZenML tries to access the metadata store on your localhost.

Here are a few things you can try:

  1. check that the kubectl port-forward daemon command is indeed running on your system (e.g. by listing the processes with ps -ef|grep kubectl). You should see something like kubectl --context gke_ai-gilde_europe-west4-a_cluster-1 --namespace kubeflow port-forward svc/metadata-grpc-service 8081:8080

  2. if the answer to 1. is yes, then try to access the gRPC port yourself with curl. If the port is open, you should get an answer that looks like the one below:

    $ curl -v http://localhost:8081/
    *   Trying 127.0.0.1:8081...
    * TCP_NODELAY set
    * Connected to localhost (127.0.0.1) port 8081 (#0)
    > GET / HTTP/1.1
    > Host: localhost:8081
    > User-Agent: curl/7.68.0
    > Accept: */*
    > 
    * Received HTTP/0.9 when not allowed
    
    * Closing connection 0
    curl: (1) Received HTTP/0.9 when not allowed
  3. check the metadata UI daemon logs as indicated in the zenml stack up output (e.g. by looking at the /home/victor/.config/zenml/kubeflow/eda1b333-39a8-4a3b-9647-0f00fb78dab2/kubeflow_daemon.log file) and see if there is anything logged there that might point to a problem with the port-forwarding.

  4. if all previous points check out, you can try connecting to the Kubeflow gRPC metadata service manually by running something like this (this is more or less what the zenml stack up command check is doing):

    from zenml.repository import Repository
    r = Repository()
    r.get_pipelines()
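The first two checks above boil down to verifying that the forwarded port actually accepts TCP connections. As a quick alternative to curl, here is a minimal, generic Python sketch (not ZenML code; the host and port are assumptions matching the default forward shown in the logs):

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:8081 is where `zenml stack up` forwards the gRPC metadata port.
    if port_open("127.0.0.1", 8081):
        print("metadata gRPC port is reachable")
    else:
        print("port closed: the kubectl port-forward is probably not running")
```

If this reports the port as closed while the kubectl process is running, the forward itself may have died silently and restarting the stack usually recreates it.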

Basically, the problem is most likely in one of the areas covered above: the kubectl port-forward process not running, the forwarded port not being reachable, or the metadata service itself not responding.

VictorW96 commented 2 years ago

Thank you very much. It appears that the problem had to do with the kubectl port-forward daemon. After a fresh restart, the daemon started up and connected. A follow-up question: the example throws the following error after python run.py:

No pipelines found for name 'mnist_pipeline'
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/victor/ML-Projects/mlops-pipeline-poc/zenml_examples/kubeflow_pipelines_orchestration/run. │
│ py:79 in <module>                                                                                │
│                                                                                                  │
│   76                                                                                             │
│   77                                                                                             │
│   78 if __name__ == "__main__":                                                                  │
│ ❱ 79 │   main()                                                                                  │
│   80                                                                                             │
│                                                                                                  │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:1130 in __call__                   │
│                                                                                                  │
│   1127 │                                                                                         │
│   1128 │   def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:                           │
│   1129 │   │   """Alias for :meth:`main`."""                                                     │
│ ❱ 1130 │   │   return self.main(*args, **kwargs)                                                 │
│   1131                                                                                           │
│   1132                                                                                           │
│   1133 class Command(BaseCommand):                                                               │
│                                                                                                  │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:1055 in main                       │
│                                                                                                  │
│   1052 │   │   try:                                                                              │
│   1053 │   │   │   try:                                                                          │
│   1054 │   │   │   │   with self.make_context(prog_name, args, **extra) as ctx:                  │
│ ❱ 1055 │   │   │   │   │   rv = self.invoke(ctx)                                                 │
│   1056 │   │   │   │   │   if not standalone_mode:                                               │
│   1057 │   │   │   │   │   │   return rv                                                         │
│   1058 │   │   │   │   │   # it's not safe to `ctx.exit(rv)` here!                               │
│                                                                                                  │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:1404 in invoke                     │
│                                                                                                  │
│   1401 │   │   │   echo(style(message, fg="red"), err=True)                                      │
│   1402 │   │                                                                                     │
│   1403 │   │   if self.callback is not None:                                                     │
│ ❱ 1404 │   │   │   return ctx.invoke(self.callback, **ctx.params)                                │
│   1405 │                                                                                         │
│   1406 │   def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]:  │
│   1407 │   │   """Return a list of completions for the incomplete value. Looks                   │
│                                                                                                  │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:760 in invoke                      │
│                                                                                                  │
│    757 │   │                                                                                     │
│    758 │   │   with augment_usage_errors(__self):                                                │
│    759 │   │   │   with ctx:                                                                     │
│ ❱  760 │   │   │   │   return __callback(*args, **kwargs)                                        │
│    761 │                                                                                         │
│    762 │   def forward(                                                                          │
│    763 │   │   __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any  # noqa: B902             │
│                                                                                                  │
│ /home/victor/ML-Projects/mlops-pipeline-poc/zenml_examples/kubeflow_pipelines_orchestration/run. │
│ py:59 in main                                                                                    │
│                                                                                                  │
│   56 │   )                                                                                       │
│   57 │   p.run()                                                                                 │
│   58 │                                                                                           │
│ ❱ 59 │   visualize_tensorboard(                                                                  │
│   60 │   │   pipeline_name="mnist_pipeline",                                                     │
│   61 │   │   step_name="trainer",                                                                │
│   62 │   )                                                                                       │
│                                                                                                  │
│ /home/victor/.local/lib/python3.8/site-packages/zenml/integrations/tensorflow/visualizers/tensor │
│ board_visualizer.py:221 in visualize_tensorboard                                                 │
│                                                                                                  │
│   218 │   │   pipeline_name: the name of the pipeline                                            │
│   219 │   │   step_name: pipeline step name                                                      │
│   220 │   """                                                                                    │
│ ❱ 221 │   step = get_step(pipeline_name, step_name)                                              │
│   222 │   TensorboardVisualizer().visualize(step)                                                │
│   223                                                                                            │
│   224                                                                                            │
│                                                                                                  │
│ /home/victor/.local/lib/python3.8/site-packages/zenml/integrations/tensorflow/visualizers/tensor │
│ board_visualizer.py:198 in get_step                                                              │
│                                                                                                  │
│   195 │   repo = Repository()                                                                    │
│   196 │   pipeline = repo.get_pipeline(pipeline_name)                                            │
│   197 │   if pipeline is None:                                                                   │
│ ❱ 198 │   │   raise RuntimeError(f"No pipeline with name `{pipeline_name}` was found")           │
│   199 │                                                                                          │
│   200 │   last_run = pipeline.runs[-1]                                                           │
│   201 │   step = last_run.get_step(name=step_name)                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: No pipeline with name `mnist_pipeline` was found

But the pipeline executes under Runs in Kubeflow:

[screenshot] Can I tell ZenML to create a new pipeline? (Additionally, here you can also see bug https://github.com/zenml-io/zenml/issues/729.)

stefannica commented 2 years ago

Hey Victor, I think this happens because the pipeline is still running when the Tensorboard visualizer call is reached in the post-execution workflow, and the pipeline hasn't yet had a chance to be recorded in the metadata store.

There are two ways you can address this. You can either reconfigure your orchestrator to wait until the pipeline run is complete before it starts the visualizer (see below), or you can simply run the example a second time and the visualizer will find the previous pipeline run and start Tensorboard.

To have a synchronous type of pipeline execution in which your client code waits until the pipeline run is complete, you can run something like this:

zenml orchestrator update --synchronous=True
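If you prefer to keep the orchestrator asynchronous, another option is to wait on the client side until the pipeline shows up in the metadata store before starting the visualizer. The sketch below is a hypothetical helper, not part of ZenML; the commented-out usage assumes the 0.9.x Repository API from the traceback above:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(fetch: Callable[[], Optional[T]],
             timeout: float = 600.0,
             interval: float = 10.0) -> T:
    """Poll `fetch` until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("gave up waiting for the pipeline to appear")
        time.sleep(interval)


# Hypothetical usage with ZenML 0.9.x:
# from zenml.repository import Repository
# pipeline = wait_for(lambda: Repository().get_pipeline("mnist_pipeline"))
```

This mirrors what the synchronous orchestrator setting does for you, just moved into your own script.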
stefannica commented 2 years ago

I should also add that the evaluator step must execute successfully in order for Tensorboard to load and visualize the model from the artifact store. This currently doesn't happen because of the other bug you mentioned.

htahir1 commented 2 years ago

@VictorW96 Can I close this issue in favor of #729 ?