zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0
3.93k stars · 429 forks

Improve logging logic to improve/fix GPU performance #2252

Open strickvl opened 8 months ago

strickvl commented 8 months ago

Open Source Contributors Welcomed!

Please comment below if you would like to work on this issue!

Contact Details [Optional]

support@zenml.io

What happened?

Users have reported a significant drop in GPU utilization (from 95% to 2%) after upgrading ZenML from version 0.32.1 to 0.44.2. This issue was observed while deploying pipelines on GCP VertexAI. Investigations suggest that the performance bottleneck is due to the logging mechanism, especially when using progress bars like tqdm. It appears that logging, particularly frequent updates from progress bars, is substantially slowing down the processing speed.
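One likely contributor is the sheer volume of per-iteration progress output that gets captured and written out. The principle can be sketched with only the standard library (no ZenML or tqdm APIs involved; `train_loop` and `log_every` are illustrative names): throttling progress writes to every N steps keeps the output volume roughly constant regardless of iteration count.

```python
import io
import sys


def train_loop(steps, log_every=100, stream=sys.stderr):
    """Simulate a training loop that throttles progress output.

    An unthrottled progress bar emits one update per iteration, which
    multiplies write calls into the captured log stream. Batching output
    to every `log_every` steps bounds the number of writes.
    """
    writes = 0
    for i in range(1, steps + 1):
        if i % log_every == 0 or i == steps:
            stream.write(f"step {i}/{steps}\n")
            writes += 1
    return writes
```

For tqdm specifically, its `mininterval` and `miniters` parameters achieve a similar throttling effect on the producer side.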

Task Description

Investigate and optimize the logging logic in ZenML, particularly for scenarios involving high GPU usage. The goal is to ensure that the logging process, including progress bars, does not adversely affect the GPU performance and overall speed of pipeline execution.

Expected Outcome

Steps to Implement

Note that part of the solution might be to expose these global variables / constants better in settings via environment variables.
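A minimal sketch of reading such a tunable from the environment, with a fallback default and graceful handling of malformed values (the variable name and default below are assumptions for illustration, not ZenML's actual setting keys):

```python
def get_interval(env, name="STEP_LOGS_STORAGE_INTERVAL_SECONDS", default=15.0):
    """Read a log-flush interval (seconds) from an environment mapping.

    Falls back to `default` when the variable is unset or not a valid
    float, so a typo in the environment never crashes the pipeline.
    """
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default
```

In production code this would be called with `os.environ`; taking the mapping as a parameter keeps it testable.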

Additional Context

This issue is critical for users leveraging ZenML for GPU-intensive tasks, as efficient GPU utilization is key to performance in these scenarios. The solution should provide a balance between informative logging and optimal resource utilization.

nida-imran173 commented 7 months ago

@strickvl I'm interested in working on this issue. Can I take it up?

strickvl commented 7 months ago

Sure thing, @nida-imran173! I'll assign it to you; let us know if you have any questions. Most basic things should be answered in our CONTRIBUTING.md document.

nida-imran173 commented 7 months ago

Hi @strickvl,

After analyzing the code in the `logging` module, I've identified a few potential areas that could be causing the reported drop in GPU utilization. Here are the key points:

  1. The code performs file I/O operations (fileio.open, fileio.makedirs, fileio.remove) to read, write, and create directories. These operations can be resource-intensive, especially if there are frequent reads or writes to the file system.
  2. The logging frequency and buffer size also matter: if logs are flushed too frequently, the number of file I/O operations grows, which can hurt performance.
  3. The remove_ansi_escape_codes function uses a regular expression (re.compile) to remove ANSI escape codes. If this function is called frequently or processes large amounts of data, it might impact performance.
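Point 3 can be mitigated by compiling the pattern once at module level rather than inside the hot path. This is a sketch under assumptions: the regex below is a common ANSI-stripping pattern, not necessarily the one ZenML uses.

```python
import re

# Compiled once at import time; re-compiling (or relying on re's internal
# cache lookup) inside a per-line logging path adds avoidable overhead.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[a-zA-Z]")


def remove_ansi_escape_codes(text: str) -> str:
    """Strip ANSI color/cursor codes (e.g. from progress-bar output)
    before the text is persisted to log storage."""
    return ANSI_ESCAPE.sub("", text)
```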

I would greatly appreciate your guidance and any specific insights you might have on tackling this issue. If there are additional aspects I should consider or if you have any preferences regarding the approach, please let me know.

strickvl commented 7 months ago

So the first thing I'd say would be to reproduce the issue, i.e. run a step with logging turned on (the default), then either toggle/update the STEP_LOGS_STORAGE_INTERVAL_SECONDS environment variable, or try disabling step logs entirely, and compare.

When someone is running in a GPU-enabled environment, we could potentially have different behaviour. It also isn't yet clear to me why logging is slower in a GPU-enabled environment, beyond the fact that the task itself may generate logs at a high frequency. So in short, we'll need to dive a bit deeper into the problem, I think.

htahir1 commented 7 months ago

@strickvl @nida-imran173 I would just add to this discussion that I think the primary reason for the GPU performance degradation is exactly what Nida already said:

The code performs file I/O operations (fileio.open, fileio.makedirs, fileio.remove) to read, write, and create directories. These operations can be resource-intensive, especially if there are frequent reads or writes to the file system. Depending on the logging frequency and the size of the buffer, it could impact performance. If logging occurs too frequently, it may lead to increased file I/O operations, potentially affecting performance.

I would try to tackle this issue first. Basically, I'd run some tests to see how this affects performance. A very simple test could be to run a pipeline which trains a model using PyTorch or TensorFlow. These libraries produce progress bars that are then logged and cause a slowdown. Once we've verified this, we can work on a fix all together by brainstorming strategies.

But first things first, as @strickvl said, we need a test in place where we can measure things.
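As a starting point for that measurement, here is a minimal CPU-only harness (no ZenML, no GPU; the arithmetic loop is a stand-in for real work) that isolates the cost of per-iteration log writes by running the same loop with and without a log stream:

```python
import io
import time


def timed_run(iters, stream=None):
    """Run a dummy compute loop, optionally writing one log line per
    iteration, and return (elapsed_seconds, result).

    Comparing stream=None against a real file handle (or StringIO)
    separates the logging overhead from the compute itself.
    """
    start = time.perf_counter()
    acc = 0
    for i in range(iters):
        acc += i * i  # stand-in for real GPU work
        if stream is not None:
            stream.write(f"iter {i}: acc={acc}\n")
    return time.perf_counter() - start, acc
```

Running it against an actual file on disk (rather than StringIO) would more closely model ZenML's log storage path.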