
[BUG]: Unable to deploy a ML model locally to MLFlow #2235

Closed PriyanshBhardwaj closed 1 week ago

PriyanshBhardwaj commented 8 months ago

System Information

python = 3.9
zenml version = 0.53.1
os = macOS
integration = mlflow (installed separately with pip install mlflow)

What happened?

Unable to deploy an ML model locally to MLflow. The problem lies in the MLFlowDeploymentService class in the file mlflow_deployment.py.

Please check the reproduction steps to understand the issue clearly.

Reproduction steps

I followed all the steps correctly: registered the experiment tracker and the model deployer, and created the stack:

zenml experiment-tracker register mlflow_tracker --flavor=mlflow

zenml model-deployer register mlflow --flavor=mlflow

zenml stack register mlflow_stack -a default -o default -d mlflow -e mlflow_tracker --set

I created a pipeline that ingests the data, trains the model, evaluates performance, and then deploys the model once it passes the trigger:

df = ingest_data(data_path = data_path)
x_train, x_test, y_train, y_test = clean_data(df)
model = train_model(x_train, x_test, y_train, y_test)
r2, mse, rmse = evaluate_model(model, x_test, y_test)
deployment_decision = deployment_trigger(accuracy=mse)  # deploying on the basis of the MSE score
mlflow_model_deployer_step(
    model=model,
    deploy_decision=deployment_decision,
    workers=workers,
    timeout=timeout,
)
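
For context, a minimal sketch of how these step calls are typically wired into a ZenML pipeline definition is shown below. The step functions (ingest_data, clean_data, train_model, evaluate_model, deployment_trigger) are assumed to be defined elsewhere in the project with the @step decorator; the decorator and import usage here are illustrative, not the exact project code.

from zenml import pipeline
from zenml.integrations.mlflow.steps import mlflow_model_deployer_step


@pipeline
def continuous_deployment_pipeline(data_path: str, workers: int = 1, timeout: int = 60):
    # The step functions below are assumed to be the project's own @step-decorated functions.
    df = ingest_data(data_path=data_path)
    x_train, x_test, y_train, y_test = clean_data(df)
    model = train_model(x_train, x_test, y_train, y_test)
    r2, mse, rmse = evaluate_model(model, x_test, y_test)
    deployment_decision = deployment_trigger(accuracy=mse)  # deploy based on the MSE score
    mlflow_model_deployer_step(
        model=model,
        deploy_decision=deployment_decision,
        workers=workers,
        timeout=timeout,
    )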

I debugged it: everything is working fine and the model is passing the deployment trigger, but the model deployer is not working properly. The problem is with this step:

mlflow_model_deployer_step(
    model=model,
    deploy_decision=deployment_decision,
    workers=workers,
    timeout=timeout,
)

When the pipeline calls it, the log that prints is:

Updating an existing MLflow deployment service: MLFlowDeploymentService[2ade1153-7fd3-45d1-8ecd-412f86b264b5] (type: model-serving, flavor: mlflow)

This gets logged from the deploy_model() function in /zenml/integrations/mlflow/model_deployers/mlflow_model_deployer.py. In the same function, at line 210, it calls service.start(), which is in /zenml/services/local/local_service.py. There, the start function logs Starting service 'MLFlowDeploymentService[2ade1153-7fd3-45d1-8ecd-412f86b264b5] (type: model-serving, flavor: mlflow)'. at line 387, and then, when it calls if not self.poll_service_status(timeout): at line 391, it logs this error:

Timed out waiting for service MLFlowDeploymentService[2ade1153-7fd3-45d1-8ecd-412f86b264b5] (type: model-serving, flavor: mlflow) to become active:
  Administrative state: active
  Operational state: inactive
  Last status message: 'service daemon is not running'
For more information on the service status, please see the following log file: 

When I opened the log file, it said:

TypeError: Cannot load service with unregistered service type:
type='model-serving' flavor='mlflow' name='mlflow-deployment'
description='MLflow prediction service'

This is raised from /zenml/services/service_registry.py:193, in load_service_from_dict.
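
For illustration only, a registry-based loader of roughly this shape (hypothetical names, not ZenML's actual implementation) raises exactly when the saved type/flavor pair was never registered in the process that tries to load the service:

# Hypothetical sketch of a service registry; names and structure are illustrative only.
SERVICE_REGISTRY: dict = {}  # maps (type, flavor) -> service class


def register_service_type(type_: str, flavor: str, service_class: type) -> None:
    """Record which class can load services of this type/flavor."""
    SERVICE_REGISTRY[(type_, flavor)] = service_class


def load_service_from_dict(config: dict):
    """Re-create a service from its saved config; fails if its type was never registered."""
    key = (config["type"], config["flavor"])
    if key not in SERVICE_REGISTRY:
        raise TypeError(f"Cannot load service with unregistered service type: {config}")
    return SERVICE_REGISTRY[key](**config)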

However, in mlflow_model_deployer.py, at line 187, service gets its value from service = cast(MLFlowDeploymentService, existing_service), which is MLFlowDeploymentService[2ade1153-7fd3-45d1-8ecd-412f86b264b5] (type: model-serving, flavor: mlflow).

And in the MLFlowDeploymentService class in mlflow_deployment.py, at line 128, the type is already defined as "model-serving", which I don't think can be changed:

SERVICE_TYPE = ServiceType(
        name="mlflow-deployment",
        type="model-serving",
        flavor="mlflow",
        description="MLflow prediction service",
    )

It is timing out because MLFlowDeploymentService[2ade1153-7fd3-45d1-8ecd-412f86b264b5] (type: model-serving, flavor: mlflow) never starts, so it will always hit the timeout error no matter what value I set for timeout.
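
To see why the timeout value makes no difference, consider a minimal polling loop of this shape (illustrative only, not ZenML's implementation): it can only succeed once the daemon reports itself as running, and here the daemon never starts because loading the service config raises the TypeError above, so every timeout ends the same way.

import time


def poll_service_status(check_status, timeout: int, poll_interval: float = 1.0) -> bool:
    """Illustrative poll loop: wait until check_status() is True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_status():  # True once the daemon is up and serving
            return True
        time.sleep(poll_interval)
    return False  # daemon never came up -> "Timed out waiting for service ..."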

So why is it giving this error in the log file?

TypeError: Cannot load service with unregistered service type:

I tried everything to solve it:

  1. created a new stack from scratch
  2. created new pipelines in different stacks
  3. re-initialized ZenML by deleting the .zen folder and running zenml init again in the same directory

I also tried using --type=mlflow when registering the model deployer, as mentioned somewhere in your old docs, with this command:

zenml model-deployer register mlflow --type=mlflow --flavor=mlflow

It obviously didn't work.

Nothing worked, and your docs don't offer a solution for this kind of problem.

My issue is that I have tried everything to debug this but still can't deploy my model locally to MLflow, because I can't change type in your internal class, which is the root cause. Please resolve the issue, and please make the logs clearer for users.

P.S.: The model is only about 1000 bytes. Timeouts of 60 and 120 both failed. I'm using a separate environment for this, with the latest versions of ZenML and MLflow.

Relevant log output

TypeError: Cannot load service with unregistered service type:
type='model-serving' flavor='mlflow' name='mlflow-deployment'
description='MLflow prediction service'
Cleanup: terminating children processes...

Vishal-Padia commented 8 months ago

@PriyanshBhardwaj

The inconsistent type naming between the model deployer registration and the MLFlowDeploymentService definition could also be contributing to this issue. One thing you could try is registering the MLflow model deployer as:

zenml model-deployer register mlflow --type=model-serving --flavor=mlflow

since it's defined in ServiceType as:

SERVICE_TYPE = ServiceType(
        name="mlflow-deployment",
        type="model-serving",
        flavor="mlflow",
        description="MLflow prediction service",
    )

Or we could change the type in src/zenml/integrations/mlflow/services/mlflow_deployment.py on line 128 to just mlflow rather than model-serving, so that the service type name matches the registered model deployer. That mismatch could be contributing to the issue you are facing.

PriyanshBhardwaj commented 8 months ago

@Vishal-Padia thanks for your response.

zenml model-deployer register mlflow --type=model-serving --flavor=mlflow will not work, because --type is an extra (unsupported) field in the model deployer registration, as you can see in the error below. I had tried it earlier because I saw it in some old docs.

ValidationError: 1 validation error for MLFlowModelDeployerConfig
type
  extra fields not permitted (type=value_error.extra)
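
The extra fields not permitted message is Pydantic's standard rejection of unknown fields on a model that forbids extras; a minimal stand-in model (not ZenML's actual MLFlowModelDeployerConfig) reproduces it:

from pydantic import BaseModel, ValidationError


class DeployerConfig(BaseModel):
    """Stand-in config model; Pydantic v1 syntax, which ZenML 0.53.x uses."""

    class Config:
        extra = "forbid"  # unknown fields such as `type` are rejected


try:
    DeployerConfig(type="model-serving")
except ValidationError as err:
    print(err)  # -> "extra fields not permitted (type=value_error.extra)"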

As for your second suggestion, I tried it but it didn't work.

Vishal-Padia commented 8 months ago

@PriyanshBhardwaj okay, I got it. I guess Alex will be able to zero in on the issue!

avishniakov commented 6 months ago

Hello @PriyanshBhardwaj , sorry about the delay!

From the information provided, I understand that you installed mlflow directly with pip, which might not work well with ZenML due to a version mismatch. Moreover, the mlflow integration also pulls in other components that are important for model deployment; the full list (for the 0.53.1 version you used) is: 'mlflow>=2.1.1,<=2.9.2', 'mlserver>=1.3.3', 'mlserver-mlflow>=1.3.3'.

Can you do the following and retest?

pip3 uninstall mlflow
zenml integration install mlflow -y
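
After reinstalling, a quick illustrative check (assuming the packaging library is installed) confirms that the installed mlflow falls within the pin quoted above:

# Illustrative sanity check that the installed mlflow matches ZenML 0.53.1's pin.
import mlflow
from packaging.version import Version

installed = Version(mlflow.__version__)
assert Version("2.1.1") <= installed <= Version("2.9.2"), f"mlflow {installed} is outside the pinned range"
print(f"mlflow {installed} satisfies the ZenML 0.53.1 pin")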

Also, forking on macOS does not always work smoothly if OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES is not set, so make sure this environment variable is properly set before rerunning.

jatinpreeet commented 1 month ago

Did anyone come up with a solution? I tried everything mentioned here and I'm still getting that error. @avishniakov @PriyanshBhardwaj

strickvl commented 1 month ago

@jatinpreeet Thank you for messaging here. Some questions so we can narrow down exactly what's going on in your case:

  1. Can you share the full traceback of the TypeError you're seeing in the logs? That will help identify exactly where the unregistered service type error is originating.
  2. What version of MLflow are you currently using? You mentioned installing it separately via pip - can you check if it matches the zenml 0.53.1 requirements of mlflow>=2.1.1,<=2.9.2?
  3. Just to double check - you mentioned setting the OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES environment variable, but the error still persists after that, correct?
  4. Are there any other relevant environment variables or zenml configurations that might be pertinent here?
  5. If you're able to share a minimal reproducible example of your code and setup (anonymized as needed), that could also be helpful for debugging. No pressure if not though!
  6. Sending zenml info -a -s would also be really useful. Feel free to message me on Slack with this information, or paste it into a Gist if it's too long etc.

Please let me know the answers to the above when you can. I know this has been frustrating to troubleshoot, but I appreciate you taking the time to provide all these details. We'll get to the bottom of it! Let me know if any other questions or issues come up in the meantime.

michalziobro commented 2 weeks ago

Hello there! It seems that I'm facing the same issue. Code to reproduce can be found here: https://github.com/michalziobro/mlops-projects-course. It is based on https://github.com/ayush714/mlops-projects-course/tree/main.

mlflow version is 2.15.1, installed with zenml integration install mlflow -y

OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES is set (otherwise zenml up fails)

zenml version is 0.64.0, more details here: https://gist.github.com/michalziobro/2bf39c55e8f696e8f4b20215a1319586

stefannica commented 1 week ago


Hello @michalziobro, thank you for reporting this.

I've tried to reproduce this issue using your code repository, but I ran into several issues unrelated to what you reported here:

INFO: pip is looking at multiple versions of streamlit to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 168), -r requirements.txt (line 28), -r requirements.txt (line 48), -r requirements.txt (line 90), -r requirements.txt (line 91) and click==8.1.7 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested click==8.1.7
    click-params 0.3.0 depends on click<9.0 and >=7.0
    flask 3.0.3 depends on click>=8.1.3
    mlflow-skinny 2.15.1 depends on click<9 and >=7.0
    mlserver 1.4.0 depends on click
    streamlit 1.8.1 depends on click<8.1 and >=7.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip to attempt to solve the dependency conflict

I had to manually adjust your requirements.txt several times to get everything to install correctly. Among other things, I removed backports.zoneinfo and streamlit.

Finally, after making several manual changes to the code you provided, I was able to successfully run the continuous deployment and inference pipelines.

In order for me to be able to help you, I would kindly ask you to revise your code and then provide me with a more accurate way to reproduce this on my machine.