Closed: VictorW96 closed this issue 2 years ago.
Can you try doing zenml stack up again and see if it solves the problem?
No, the same thing happens again. I think both the metadata store and the Kubeflow orchestrator are treated as local by ZenML. When I deprovision the resources with zenml stack down --force, I get the following output:
Deprovisioning resources for stack 'gcp_kubeflow_stack'.
Deprovisioned resources for KubeflowOrchestrator(type=orchestrator, flavor=kubeflow, name=gcp_kubeflow_orchestrator, uuid=eda1b333-39a8-4a3b-9647-0f00fb78dab2, custom_docker_base_image_name=None, kubeflow_pipelines_ui_port=8080, kubeflow_hostname=None, kubernetes_context=gke_ai-gilde_europe-west4-a_cluster-1, synchronous=False, skip_local_validations=False, skip_cluster_provisioning=False, skip_ui_daemon_provisioning=False).
Local kubeflow pipelines deployment deprovisioned.
Deprovisioned resources for KubeflowMetadataStore(type=metadata_store, flavor=kubeflow, name=kubeflow_metadata_store, uuid=165eb418-8d7b-42b9-a2a9-29d01259486d, upgrade_migration_enabled=False, host=127.0.0.1, port=8081).
How do I tell ZenML to explicitly use the Kubeflow deployment on GCP?
Hi Victor, thank you for opening this issue. Your stack setup looks good. Even if you see host=127.0.0.1 in your metadata store configuration, that is expected: when you run zenml stack up, the gRPC metadata service port is forwarded locally via a kubectl port-forward command, which is why you see ZenML trying to access the metadata store on your localhost.
Here are a few things you can try:

1. Check that the kubectl port-forward daemon command is indeed running on your system (e.g. by listing the processes with ps -ef | grep kubectl). You should see something like kubectl --context gke_ai-gilde_europe-west4-a_cluster-1 --namespace kubeflow port-forward svc/metadata-grpc-service 8081:8080.
2. If the answer to 1. is yes, try to access the gRPC port yourself with curl. If the port is open, you should get an answer that looks like the one below:
$ curl -v http://localhost:8081/
* Trying 127.0.0.1:8081...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8081 (#0)
> GET / HTTP/1.1
> Host: localhost:8088
> User-Agent: curl/7.68.0
> Accept: */*
>
* Received HTTP/0.9 when not allowed
* Closing connection 0
curl: (1) Received HTTP/0.9 when not allowed
3. Check the metadata UI daemon logs as indicated in the zenml stack up output (e.g. by looking at the /home/victor/.config/zenml/kubeflow/eda1b333-39a8-4a3b-9647-0f00fb78dab2/kubeflow_daemon.log file) and see if there is anything logged there that might point to a problem with the port-forwarding.
4. If all previous points check out, you can try connecting to the Kubeflow gRPC metadata service manually by running something like this (this is more or less what the zenml stack up command check is doing):
from zenml.repository import Repository
r = Repository()
r.get_pipelines()
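If that call hangs or fails with a connection error, that would again point at the forwarded port rather than at your stack configuration.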
Basically, the problem could be one of the following:

- the port-forwarding is not working (i.e. the kubectl port-forward command is failing)
- the Kubeflow gRPC metadata service itself has a problem (you can check its logs with kubectl -n kubeflow logs metadata-grpc-deployment-6b5685488-48x4b)
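A quick way to tell these two cases apart is to check from Python whether the forwarded port is reachable at all. Here is a minimal sketch using only the standard library, assuming the default 127.0.0.1:8081 from your metadata store configuration above:

import socket

# Check whether the locally forwarded gRPC metadata port accepts connections.
# If this fails, the kubectl port-forward daemon is the likely culprit; if it
# succeeds but ZenML still cannot reach the store, look at the service logs.
try:
    with socket.create_connection(("127.0.0.1", 8081), timeout=5):
        print("Port 8081 is reachable, the port-forward seems to be up.")
except OSError as err:
    print(f"Cannot reach 127.0.0.1:8081, the port-forward is probably down: {err}")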
Thank you very much. It appears that the problem had to do with the kubectl port-forward daemon. After a fresh restart the daemon started up and connected. A follow-up question: the example throws the following error after python run.py:
No pipelines found for name 'mnist_pipeline'
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/victor/ML-Projects/mlops-pipeline-poc/zenml_examples/kubeflow_pipelines_orchestration/run. │
│ py:79 in <module> │
│ │
│ 76 │
│ 77 │
│ 78 if __name__ == "__main__": │
│ ❱ 79 │ main() │
│ 80 │
│ │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:1130 in __call__ │
│ │
│ 1127 │ │
│ 1128 │ def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any: │
│ 1129 │ │ """Alias for :meth:`main`.""" │
│ ❱ 1130 │ │ return self.main(*args, **kwargs) │
│ 1131 │
│ 1132 │
│ 1133 class Command(BaseCommand): │
│ │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:1055 in main │
│ │
│ 1052 │ │ try: │
│ 1053 │ │ │ try: │
│ 1054 │ │ │ │ with self.make_context(prog_name, args, **extra) as ctx: │
│ ❱ 1055 │ │ │ │ │ rv = self.invoke(ctx) │
│ 1056 │ │ │ │ │ if not standalone_mode: │
│ 1057 │ │ │ │ │ │ return rv │
│ 1058 │ │ │ │ │ # it's not safe to `ctx.exit(rv)` here! │
│ │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:1404 in invoke │
│ │
│ 1401 │ │ │ echo(style(message, fg="red"), err=True) │
│ 1402 │ │ │
│ 1403 │ │ if self.callback is not None: │
│ ❱ 1404 │ │ │ return ctx.invoke(self.callback, **ctx.params) │
│ 1405 │ │
│ 1406 │ def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]: │
│ 1407 │ │ """Return a list of completions for the incomplete value. Looks │
│ │
│ /home/victor/.local/lib/python3.8/site-packages/click/core.py:760 in invoke │
│ │
│ 757 │ │ │
│ 758 │ │ with augment_usage_errors(__self): │
│ 759 │ │ │ with ctx: │
│ ❱ 760 │ │ │ │ return __callback(*args, **kwargs) │
│ 761 │ │
│ 762 │ def forward( │
│ 763 │ │ __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any # noqa: B902 │
│ │
│ /home/victor/ML-Projects/mlops-pipeline-poc/zenml_examples/kubeflow_pipelines_orchestration/run. │
│ py:59 in main │
│ │
│ 56 │ ) │
│ 57 │ p.run() │
│ 58 │ │
│ ❱ 59 │ visualize_tensorboard( │
│ 60 │ │ pipeline_name="mnist_pipeline", │
│ 61 │ │ step_name="trainer", │
│ 62 │ ) │
│ │
│ /home/victor/.local/lib/python3.8/site-packages/zenml/integrations/tensorflow/visualizers/tensor │
│ board_visualizer.py:221 in visualize_tensorboard │
│ │
│ 218 │ │ pipeline_name: the name of the pipeline │
│ 219 │ │ step_name: pipeline step name │
│ 220 │ """ │
│ ❱ 221 │ step = get_step(pipeline_name, step_name) │
│ 222 │ TensorboardVisualizer().visualize(step) │
│ 223 │
│ 224 │
│ │
│ /home/victor/.local/lib/python3.8/site-packages/zenml/integrations/tensorflow/visualizers/tensor │
│ board_visualizer.py:198 in get_step │
│ │
│ 195 │ repo = Repository() │
│ 196 │ pipeline = repo.get_pipeline(pipeline_name) │
│ 197 │ if pipeline is None: │
│ ❱ 198 │ │ raise RuntimeError(f"No pipeline with name `{pipeline_name}` was found") │
│ 199 │ │
│ 200 │ last_run = pipeline.runs[-1] │
│ 201 │ step = last_run.get_step(name=step_name) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: No pipeline with name `mnist_pipeline` was found
But the pipeline executes under Runs in Kubeflow.
Can I tell ZenML to create a new pipeline? (Additionally, here you can also see bug https://github.com/zenml-io/zenml/issues/729.)
Hey Victor, I think this happens because the pipeline is still running when the Tensorboard visualizer call is reached in the post-execution workflow, and the pipeline hasn't yet had a chance to be recorded in the metadata store.
There are two ways you can address this. You can either reconfigure your orchestrator to wait until the pipeline run is complete before it starts the visualizer (see below), or you can simply run the example a second time and the visualizer will find the previous pipeline run and start Tensorboard.
To have a synchronous type of pipeline execution in which your client code waits until the pipeline run is complete, you can run something like this:
zenml orchestrator update --synchronous=True
I should also add that the evaluator step must execute successfully in order for Tensorboard to load and visualize the model from the artifact store. This currently doesn't happen because of the other bug you mentioned.
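If you want to start the visualizer yourself once the run has been recorded, a rough sketch along these lines should work. It reuses the Repository and visualize_tensorboard calls that already appear in the example and in the traceback above; the polling loop is just an illustration, not something the example script does:

import time

from zenml.integrations.tensorflow.visualizers.tensorboard_visualizer import (
    visualize_tensorboard,
)
from zenml.repository import Repository

# Wait until the pipeline run has been recorded in the metadata store,
# then start Tensorboard for the trainer step of the latest run.
repo = Repository()
while repo.get_pipeline("mnist_pipeline") is None:
    time.sleep(10)

visualize_tensorboard(pipeline_name="mnist_pipeline", step_name="trainer")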
@VictorW96 Can I close this issue in favor of #729?
System Information
ZenML version: 0.9.0
Install path: /home/victor/.local/lib/python3.8/site-packages/zenml
Python version: 3.8.10
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '20.04'}
Environment: native
Integrations: ['gcp', 'kubeflow', 'mlflow', 'scipy', 'seldon', 'sklearn', 'tensorflow']
What happened?
The Kubeflow metadata store seems to be unreachable. If I list the metadata stores, the output is as follows: I find it curious that the IP address points to localhost. I also saw that Kubeflow changed its metadata store implementation (https://www.kubeflow.org/docs/components/pipelines/concepts/metadata/). Is this a problem?
My GCP Kubeflow configuration looks like this:
Reproduction steps
Follow the Kubeflow pipelines orchestration example. In particular, follow the step "Run the same pipeline on Kubeflow Pipelines deployed to GCP".
Relevant log output