zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0
3.93k stars · 429 forks

Improve logging logic to improve/fix GPU performance #2252

Open strickvl opened 8 months ago

strickvl commented 8 months ago

Open Source Contributors Welcomed!

Please comment below if you would like to work on this issue!

Contact Details [Optional]

support@zenml.io

What happened?

Users have reported a significant drop in GPU utilization (from 95% to 2%) after upgrading ZenML from version 0.32.1 to 0.44.2. This issue was observed while deploying pipelines on GCP VertexAI. Investigations suggest that the performance bottleneck is due to the logging mechanism, especially when using progress bars like tqdm. It appears that logging, particularly frequent updates from progress bars, is substantially slowing down the processing speed.
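One likely contributor is the sheer volume of per-iteration progress output that gets captured and written out. The principle can be sketched with only the standard library (no ZenML or tqdm APIs involved; `train_loop` and `log_every` are illustrative names): throttling progress writes to every N steps keeps the output volume roughly constant regardless of iteration count.

```python
import io
import sys


def train_loop(steps, log_every=100, stream=sys.stderr):
    """Simulate a training loop that throttles progress output.

    An unthrottled progress bar emits one update per iteration, which
    multiplies write calls into the captured log stream. Batching output
    to every `log_every` steps bounds the number of writes.
    """
    writes = 0
    for i in range(1, steps + 1):
        if i % log_every == 0 or i == steps:
            stream.write(f"step {i}/{steps}\n")
            writes += 1
    return writes
```

For tqdm specifically, its `mininterval` and `miniters` parameters achieve a similar throttling effect on the producer side.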

Task Description

Investigate and optimize the logging logic in ZenML, particularly for scenarios involving high GPU usage. The goal is to ensure that the logging process, including progress bars, does not adversely affect the GPU performance and overall speed of pipeline execution.

Expected Outcome

Steps to Implement

Note that part of the solution might be to expose these global variables / constants better in settings via environment variables.
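A minimal sketch of reading such a tunable from the environment, with a fallback default and graceful handling of malformed values (the variable name and default below are assumptions for illustration, not ZenML's actual setting keys):

```python
def get_interval(env, name="STEP_LOGS_STORAGE_INTERVAL_SECONDS", default=15.0):
    """Read a log-flush interval (seconds) from an environment mapping.

    Falls back to `default` when the variable is unset or not a valid
    float, so a typo in the environment never crashes the pipeline.
    """
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default
```

In production code this would be called with `os.environ`; taking the mapping as a parameter keeps it testable.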

Additional Context

This issue is critical for users leveraging ZenML for GPU-intensive tasks, as efficient GPU utilization is key to performance in these scenarios. The solution should provide a balance between informative logging and optimal resource utilization.

nida-imran173 commented 7 months ago

@strickvl I'm interested in working on this issue. Can I take it up?

strickvl commented 7 months ago

Sure thing, @nida-imran173! I'll assign it to you; let us know if you have any questions. Most basic things should be answered in our CONTRIBUTING.md document.

nida-imran173 commented 7 months ago

Hi @strickvl,

After analyzing the code in the `logging` module, I've identified a few potential areas that could be causing the reported drop in GPU utilization. Here are the key points:

  1. The code performs file I/O operations (fileio.open, fileio.makedirs, fileio.remove) to read, write, and create directories. These operations can be resource-intensive, especially if there are frequent reads or writes to the file system.
  2. The logging frequency and buffer size also matter: if logs are flushed too frequently, the number of file I/O operations grows, which can hurt performance.
  3. The remove_ansi_escape_codes function uses a regular expression (re.compile) to remove ANSI escape codes. If this function is called frequently or processes large amounts of data, it might impact performance.
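Point 3 can be mitigated by compiling the pattern once at module level rather than inside the hot path. This is a sketch under assumptions: the regex below is a common ANSI-stripping pattern, not necessarily the one ZenML uses.

```python
import re

# Compiled once at import time; re-compiling (or relying on re's internal
# cache lookup) inside a per-line logging path adds avoidable overhead.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[a-zA-Z]")


def remove_ansi_escape_codes(text: str) -> str:
    """Strip ANSI color/cursor codes (e.g. from progress-bar output)
    before the text is persisted to log storage."""
    return ANSI_ESCAPE.sub("", text)
```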

I would greatly appreciate your guidance and any specific insights you might have on tackling this issue. If there are additional aspects I should consider or if you have any preferences regarding the approach, please let me know.

strickvl commented 7 months ago

So the first thing I'd say would be to reproduce the issue, i.e. run a step with logging turned on (the default), then either toggle/update the STEP_LOGS_STORAGE_INTERVAL_SECONDS environment variable, or try disabling step logs entirely, and compare.

When someone is running in a GPU-enabled environment, we could potentially have different behaviour. It also isn't yet clear to me why logging is slower in a GPU-enabled environment, beyond the fact that the task itself may generate logs at a high frequency. So in short, we'll need to dive a bit deeper into the problem, I think.

htahir1 commented 7 months ago

@strickvl @nida-imran173 I would just add to this discussion that I think the primary reason for the GPU performance degradation is exactly what Nida already said:

The code performs file I/O operations (fileio.open, fileio.makedirs, fileio.remove) to read, write, and create directories. These operations can be resource-intensive, especially if there are frequent reads or writes to the file system. Depending on the logging frequency and the size of the buffer, it could impact performance. If logging occurs too frequently, it may lead to increased file I/O operations, potentially affecting performance.

I would try to tackle this issue first. Basically, I'd run some tests to see how this affects performance. A very simple test could be to run a pipeline which trains a model using PyTorch or TensorFlow. These libraries produce progress bars that are then logged and cause a slowdown. Once we've verified this, we can work on a fix all together by brainstorming strategies.

But first things first, as @strickvl said, we need a test in place where we can measure things.
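As a starting point for that measurement, here is a minimal CPU-only harness (no ZenML, no GPU; the arithmetic loop is a stand-in for real work) that isolates the cost of per-iteration log writes by running the same loop with and without a log stream:

```python
import io
import time


def timed_run(iters, stream=None):
    """Run a dummy compute loop, optionally writing one log line per
    iteration, and return (elapsed_seconds, result).

    Comparing stream=None against a real file handle (or StringIO)
    separates the logging overhead from the compute itself.
    """
    start = time.perf_counter()
    acc = 0
    for i in range(iters):
        acc += i * i  # stand-in for real GPU work
        if stream is not None:
            stream.write(f"iter {i}: acc={acc}\n")
    return time.perf_counter() - start, acc
```

Running it against an actual file on disk (rather than StringIO) would more closely model ZenML's log storage path.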