zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

Fix step logging when using GCS Artifact Store #2211

Open strickvl opened 10 months ago

strickvl commented 10 months ago

Open Source Contributors Welcomed!

Please comment below if you would like to work on this issue!

Contact Details [Optional]

support@zenml.io

What happened?

There seems to be an issue with StepLogging when using GCS (Google Cloud Storage) as the artifact store. Specifically, only the last parts of the logs appear in the file, which suggests a problem with the log writing or saving mechanism.

Steps to Reproduce

Here's a snippet to reproduce the issue:

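import gcsfs
from zenml.client import Client
from zenml.logging.step_logging import StepLogsStorage

# Access the active stack first (fileio resolves gs:// through the stack's artifact store).
client = Client()
_ = client.active_stack

TEST_FILE = "gs://<<your_bucket>>/test_log.log"

# With max_messages=5 and 11 writes, the buffer should be flushed more than once.
log_storage = StepLogsStorage(logs_uri=TEST_FILE, max_messages=5)
for i in range(11):
    log_storage.write(f"I'm log line #{i}")
log_storage.save_to_file()

# Read the file back directly from GCS to see what was actually stored.
fs = gcsfs.GCSFileSystem()
with fs.open(TEST_FILE, "r") as f:
    all_of_it = f.read()

print(all_of_it)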

Expected Behavior

All log lines should be saved and visible in the GCS file, not just the last few.
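
The symptom described above would be consistent with each buffer flush overwriting the file instead of adding to it. The sketch below reproduces that pattern on the local filesystem only. `BufferedLogWriter` is a hypothetical stand-in, not ZenML's `StepLogsStorage`, and the `mode` parameter exists purely to contrast the two behaviours; whether this is the actual cause in ZenML is an assumption.

```python
import os
import tempfile


class BufferedLogWriter:
    """Hypothetical buffered log writer (not ZenML's implementation)."""

    def __init__(self, path, max_messages, mode):
        self.path = path
        self.max_messages = max_messages
        self.mode = mode  # "w" truncates on every flush, "a" appends
        self.buffer = []

    def write(self, text):
        self.buffer.append(text)
        # Flush as soon as the buffer reaches its limit.
        if len(self.buffer) >= self.max_messages:
            self.flush()

    def flush(self):
        with open(self.path, self.mode) as f:
            for line in self.buffer:
                f.write(line + "\n")
        self.buffer = []


def run(mode):
    """Write 11 lines through a 5-message buffer and return the file's lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test_log.log")
        writer = BufferedLogWriter(path, max_messages=5, mode=mode)
        for i in range(11):
            writer.write(f"I'm log line #{i}")
        writer.flush()  # final save of whatever is left in the buffer
        with open(path) as f:
            return f.read().splitlines()


overwritten = run("w")  # each flush replaces the previous contents
appended = run("a")     # each flush adds to the previous contents

print(overwritten)
print(len(appended))
```

In overwrite mode only the lines from the final flush survive, while append mode keeps all 11. Note that plain append is a local-filesystem convenience here: GCS objects are immutable once written, so a fix for a remote store would likely need a different strategy, such as keeping one stream open for the duration of the step.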

Potential Solution

Consider using the logging.StreamHandler facility to temporarily write logs to the remote file (GCS, S3, etc.). Here's an example:

import logging
import fsspec

# Open the remote file once and keep it open while logs are written.
f = fsspec.open("gs://<<my_gcs_bucket>>/test_log.log", "w")
with f as of:
    log_handler = logging.StreamHandler(of)
    logger = logging.getLogger()  # Root logger
    logger.addHandler(log_handler)
    try:
        for i in range(5000):
            logger.warning(f"I'm log line #{i}")
    finally:
        # Always detach the handler before the file is closed.
        logger.removeHandler(log_handler)

This approach could fit nicely in the StepLogsStorageContext class.
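
As a rough illustration of how that could look, the sketch below wraps the handler setup and teardown in a context manager. The names `capture_logs` and `step_logs_demo` are hypothetical, and this is not ZenML's `StepLogsStorageContext`. An in-memory `io.StringIO` stands in for the remote file so the example runs without cloud credentials; in practice the stream could be a file object opened through fsspec, as in the snippet above.

```python
import contextlib
import io
import logging


@contextlib.contextmanager
def capture_logs(stream, logger=None):
    """Attach a StreamHandler writing to `stream` for the duration of the block."""
    logger = logger or logging.getLogger()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    try:
        yield handler
    finally:
        # Flush and detach even if the wrapped code raises.
        handler.flush()
        logger.removeHandler(handler)


# Assumption: a dedicated demo logger; a real step would log through its own loggers.
demo_logger = logging.getLogger("step_logs_demo")
demo_logger.setLevel(logging.INFO)

buffer = io.StringIO()  # stands in for the remote log file
with capture_logs(buffer, demo_logger):
    for i in range(11):
        demo_logger.info("I'm log line #%d", i)

captured = buffer.getvalue().splitlines()
print(len(captured))
```

Because the stream stays open for the whole block, every message is written to the same file object and nothing depends on repeated re-opening of the target.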

Additional Context

Proper log handling is crucial for debugging and monitoring pipeline performance, especially when dealing with large-scale data processing in cloud environments.

adtygan commented 9 months ago

Hello @strickvl, I'm trying to reproduce this issue but can't. I made a GCS bucket and tried to run the first snippet and got the following error. Please let me know if you need the traceback.

ValueError: No file systems were found for the scheme: gs://. Please make sure that you are using the right path and the all the necessary integrations are properly installed.

The error was raised on the following line:

log_storage = StepLogsStorage(logs_uri=TEST_FILE, max_messages=5)

strickvl commented 9 months ago

Here I'd patch in @bcdurak who I think was most involved with that particular part of the codebase. I think he should be able to help with this. Other things to check:

import gcsfs

fs = gcsfs.GCSFileSystem()
with fs.open('gs://your-bucket-name/test.txt', 'w') as f:
    f.write('Hello, world!')

with fs.open('gs://your-bucket-name/test.txt', 'r') as f:
    print(f.read())

(Replace 'gs://your-bucket-name/test.txt' with a valid path in your GCS bucket.)

adtygan commented 9 months ago

Thank you for the code you provided. I did have some permission issues, which I resolved after trying this code, and it now correctly prints "Hello, world!". However, the earlier error still persists:

ValueError: No file systems were found for the scheme: gs://. Please make sure that you are using the right path and the all the necessary integrations are properly installed.

EDIT:

I think I understand the source of this error. I have attached the traceback below. The code uses fileio to open the URI, which raises the error. Instead, gcsfs needs to be used at this step, as in the code provided earlier.

[image: traceback screenshot]

strickvl commented 9 months ago

I think I see what's going on now. Are you running the code with a GCS artifact store configured in your ZenML stack? (fileio will use whatever stack you have configured and set up for ZenML, so if you have a GCS artifact store then it should work).
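
To make the point about fileio concrete, here is a simplified model of scheme-based filesystem lookup. `FilesystemRegistry` and its methods are hypothetical and are not ZenML's actual classes; the sketch only assumes that a path scheme such as `gs://` can be resolved once something has registered a handler for it, which in ZenML's case would happen through the configured stack.

```python
class FilesystemRegistry:
    """Hypothetical scheme-to-filesystem lookup (not ZenML's implementation)."""

    def __init__(self):
        self._by_scheme = {}

    def register(self, scheme, filesystem):
        self._by_scheme[scheme] = filesystem

    def resolve(self, uri):
        # Paths without "://" are treated as local and use the "" scheme.
        scheme = uri.split("://", 1)[0] + "://" if "://" in uri else ""
        if scheme not in self._by_scheme:
            raise ValueError(
                f"No file systems were found for the scheme: {scheme}"
            )
        return self._by_scheme[scheme]


registry = FilesystemRegistry()
registry.register("", "local")  # only the local filesystem is known at first

try:
    registry.resolve("gs://your-bucket-name/test_log.log")
    error_before = None
except ValueError as exc:
    error_before = str(exc)

# Stands in for activating a stack whose artifact store handles gs:// paths.
registry.register("gs://", "gcs")
resolved_after = registry.resolve("gs://your-bucket-name/test_log.log")

print(error_before)
print(resolved_after)
```

In this model, calling gcsfs directly works regardless of the registry, which would explain why the standalone gcsfs check succeeded while the `StepLogsStorage` call still failed.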

adtygan commented 9 months ago

I see. I tried to set up a GCS artifact store but am running into some errors. I don't understand a few of the steps, so I will first familiarize myself with them. Could you please assign me to this issue?

adtygan commented 9 months ago

I was able to reproduce the issue. The output I get for the initial code is

I'm log line #10

I will now work on solving the issue.

adtygan commented 9 months ago

@strickvl I have fixed the issue locally and I'm getting the expected output, as shown below:

[image: screenshot of the expected output]

However, I'm having trouble following the contribution guidelines. When I run the command mypy --install-types, I get the error: error: Can't determine which types to install with no files to check (and no cache from previous mypy run). Could you please help with this?

Also, while opening a pull request, I read this prerequisite: "I have added tests to cover my changes." To fix the bug, I made a change to src/zenml/logging/step_logging.py, so I think I need to add tests, but I'm not sure how to do this. Could you help with this as well?

strickvl commented 9 months ago

For our cloud integrations, it's enough to demonstrate that you've tested it. We don't currently run integration tests on cloud environments, so basically for something like this it wouldn't be possible to test it locally. Icing on the cake would be to include instructions in the PR on how someone from the core team could reproduce your local test (a code snippet and a reminder of what the stack setup would be), but beyond that I think you're ok.

Also for mypy I think you can ignore that and just make the PR. Any issues will be revealed there.