You might want to report this upstream; there seems to have been a lingering issue since at least 2021: https://github.com/fsspec/s3fs/issues/481
Describe changes
This PR fixes a memory leak that was observed while using the ZenML dashboard to view the logs of a pipeline or the visualization of an artifact that was logged through an S3 Artifact Store linked to an AWS Service Connector.
How to reproduce
NOTE: even with this patch applied, memory usage may still go up until it reaches a stable value. This depends largely on the number of threads that the server is allowed to create (default: 40). It is recommended to configure the number of threads to something low, like 2 or 3 (by setting the `ZENML_SERVER_THREAD_POOL_SIZE` environment variable when running the server), to reach that memory usage plateau faster.

Details
The investigation led to the following findings:

`s3fs` caches `S3FileSystem` client instances permanently, keyed on the constructor input arguments. In a more dynamic and long-lived environment like the ZenML server, this means that a new `S3FileSystem` is created (and never de-allocated) for every new set of credentials. Given that the AWS Service Connector generates temporary credentials, the number of cached `S3FileSystem` clients slowly grows with every invocation of the pipeline logs or artifact visualization endpoint until the server runs out of memory.

Solution: `S3FileSystem` instance caching is a class-level setting. This PR disables instance caching by sub-classing `S3FileSystem` and using the sub-class in the ZenML S3 Artifact Store implementation.
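The caching behavior at play can be sketched with a simplified, self-contained model. In fsspec, instance caching is controlled by the class-level `cachable` attribute, which a sub-class can override; the class names and cache implementation below are illustrative stand-ins, not actual `s3fs` or ZenML code:

```python
# Toy model of fsspec-style instance caching. The real mechanism lives
# in fsspec's AbstractFileSystem (which s3fs.S3FileSystem inherits);
# this is a simplified illustration of the same idea.
class CachingFS:
    cachable = True     # class-level setting, as in fsspec
    _cache: dict = {}

    def __new__(cls, **kwargs):
        key = (cls, tuple(sorted(kwargs.items())))
        if cls.cachable and key in cls._cache:
            return cls._cache[key]      # reuse the cached instance
        inst = super().__new__(cls)
        if cls.cachable:
            cls._cache[key] = inst      # otherwise cached forever
        return inst


class UncachedFS(CachingFS):
    # The fix: opt the sub-class out of instance caching, so every new
    # set of (temporary) credentials no longer pins a fresh instance
    # in the class-level cache.
    cachable = False


a = CachingFS(secret="abc")
b = CachingFS(secret="abc")
print(a is b)       # True: same constructor args -> same cached instance

c = UncachedFS(secret="abc")
d = UncachedFS(secret="abc")
print(c is d)       # False: caching disabled for the sub-class
```

In the real fix, the sub-class of `s3fs.S3FileSystem` with caching disabled is what the ZenML S3 Artifact Store instantiates, so short-lived credential sets no longer accumulate in the cache.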
A second finding: `S3FileSystem` client instances were incorrectly de-allocated by the garbage collector, leading to additional memory leaks. The exact root cause of this was not clearly identified even though it was fixed, but a likely explanation is included here. `s3fs` uses asyncio at its core, but also provides support for synchronous calls by maintaining its own asyncio event loop thread and converting sync calls into asyncio tasks. For that same reason, `s3fs` uses `aiobotocore` instead of `botocore`, and the `aiobotocore` clients are implemented using asynchronous context managers (more details on this below).

Normally, an async botocore client is used like this, where the client is created and deallocated in the same code block:
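The short-lived pattern just described, and the long-lived variant discussed next, can be sketched with a dummy async context manager standing in for the real `aiobotocore` client (which would come from `session.create_client("s3")`); everything below is illustrative, not `s3fs` or `aiobotocore` code:

```python
import asyncio
from contextlib import asynccontextmanager


# Dummy stand-in for an aiobotocore client context manager.
@asynccontextmanager
async def create_client():
    client = {"open": True}     # acquire connections on __aenter__
    try:
        yield client
    finally:
        client["open"] = False  # release connections on __aexit__


# Short-lived use: the client is created and cleaned up in one block.
async def one_shot() -> bool:
    async with create_client() as client:
        return client["open"]


# Long-lived use (the S3FileSystem situation): enter the context
# manager during initialization and exit it explicitly during cleanup,
# while an event loop is still running.
class LongLivedFS:
    async def connect(self) -> None:
        self._cm = create_client()
        self.client = await self._cm.__aenter__()

    async def close(self) -> None:
        await self._cm.__aexit__(None, None, None)


async def demo() -> tuple[bool, bool, bool]:
    inside = await one_shot()
    fs = LongLivedFS()
    await fs.connect()
    reusable = fs.client["open"]    # still open between calls
    await fs.close()
    released = fs.client["open"]    # closed only on explicit cleanup
    return inside, reusable, released


result = asyncio.run(demo())
print(result)   # (True, True, False)
```

If `close()` is never awaited while an event loop is still alive, the client's `__aexit__` has no loop to run on, which matches the failure mode described in the surrounding text.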
However, in the case of a class like `S3FileSystem`, where the client needs to be reused across different calls happening at different times, the `aiobotocore` client context manager needs to be entered during initialization and exited during cleanup.

The problem comes from when and how the cleanup is called: it needs to run as an asyncio task, and therefore it must be invoked at a time when an asyncio event loop still exists, i.e. before the garbage collector deallocates the event loop resources. The cleanup might also need to be run using the same event loop as the initialization, which is the assumed source of the second memory leak: the way `s3fs` was doing de-initialization was buggy in that it was not using the same event loop as the one used for initialization.

Side-changes
Some more changes are included that help make the ZenML server leaner and more efficient concerning memory usage:

A `cleanup` method was added to the stack component base class, to be called in the ZenML server to proactively deallocate the client when done working with it instead of waiting for the garbage collector.

Pre-requisites
Please ensure you have done the following:

I have based my new branch on `develop` and the open PR is targeting `develop`. If your branch wasn't based on `develop`, read the contribution guide on rebasing a branch to `develop`.

Types of changes