Closed Yikai-Liao closed 1 month ago
As long as the instance is a global variable or wrapped in a global variable, the warning could be reproduced.
from tokenizers import Tokenizer, models, trainers
from nanobind_example.nanobind_example_ext import Box
bar = Box('') # a global variable of a class bound by nanobind
tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer()
class MyIter:
def __init__(self) -> None:
self.i = 0
def __iter__(self):
return self
def __next__(self):
if self.i >= 10 :
raise StopIteration
self.i += 1
return ""
def my_iter():
for _ in range(10):
yield ""
tokenizer.train_from_iterator(MyIter(), trainer=trainer) # introduce nanobind warnings
tokenizer.train_from_iterator(list(MyIter()), trainer=trainer) # not introduce nanobind warnings
tokenizer.train_from_iterator(my_iter(), trainer=trainer) # not introduce warnings
Dear @Yikai-Liao,
did you read the FAQ entry about these leaks? In general, these are actual genuine leaks, but they are Python's fault rather than being an issue with nanobind itself. They are also benign in the sense that Python fails to free all instances at shutdown time, but it's not a leak that grows over time at runtime. You can disable the leak warnings if they bother you.
I have read the FAQ.
Older Python versions: Very old Python versions (e.g., 3.8) don’t do a good job cleaning up global references when the interpreter shuts down. The following code may leak a reference if it is a top-level statement in a Python file or the REPL. Such a warning is benign and does not indicate an actual leak.
But I'm not using a older python version. I'm testing under python 3.11.
This might be categorized as Interactions with other tools that leak references , but tokenizers
doesn't use the jit, and it is also not mentioned that this situation is benign
And I'll open an issue in tokenizers
In any case, I don't think there is anything that nanobind can do to fix your issue besides disabling the leak warning altogether, which is also undesirable.
Problem description
I found the memory leak warnings in my code long time ago #397, but I can't give a minimum reproducible code at that time.
Now, with the help from one of my lib's user (https://github.com/Yikai-Liao/symusic/issues/49), I can give it, although I have a feeling that the reason for this memory leak warning is different from the previous one.
Reproducible example code
Here is the binding code, just add a simple struct on the top of nanobind_example
To reproduce the warnings, we need huggingface's tokenizers:
Here is the needed python code, we just write an unused function with an instance of
Box
at its default argument, and call thetrain_from_iterator
. Note that only with the iterator defined by a class can lead to the memory leak warnings. Iterator using yield just works fine.Here is the warning we will get.
At first I thought it might have something to do with multi process and bound
__setstate__
and__getstate__
functions toBox
, but that didn't work.