wjakob / nanobind

nanobind: tiny and efficient C++/Python bindings
BSD 3-Clause "New" or "Revised" License
2.25k stars 179 forks source link

[BUG]: Global Variable of class bound by nanobind might cause memory leak warnings #644

Closed Yikai-Liao closed 1 month ago

Yikai-Liao commented 1 month ago

Problem description

I found the memory leak warnings in my code long time ago #397, but I can't give a minimum reproducible code at that time.

Now, with the help from one of my lib's user (https://github.com/Yikai-Liao/symusic/issues/49), I can give it, although I have a feeling that the reason for this memory leak warning is different from the previous one.

Disclaimer: I'm not sure if this is a bug in nanbind, or python or tokenizers. I just wanted to let you know that I've found a way to stably reproduce a memory leak warning issued by nanobind. Due to my personal capacity, I didn't find the root of this problem, so I prioritized submitting the method to the nanobind issue, i.e. here.

Reproducible example code

Here is the binding code, just add a simple struct on the top of nanobind_example

#include <nanobind/nanobind.h>
#include <nanobind/stl/string.h>

namespace nb = nanobind;

using namespace nb::literals;

struct Box {
    std::string data;
    Box(std::string d): data{std::move(d)} {}
};

NB_MODULE(nanobind_example_ext, m) {
    m.doc() = "This is a \"hello world\" example with nanobind";
    m.def("add", [](int a, int b) { return a + b; }, "a"_a, "b"_a);
    nb::class_<Box>(m, "Box")
        .def(nb::init<std::string>())
        .def_rw("data", &Box::data)
    ;
}

To reproduce the warnings, we need huggingface's tokenizers:

pip install tokenizers

Here is the needed python code, we just write an unused function with an instance of Box at its default argument, and call the train_from_iterator. Note that only with the iterator defined by a class can lead to the memory leak warnings. Iterator using yield just works fine.

from tokenizers import Tokenizer, models, trainers
from nanobind_example.nanobind_example_ext import Box

def foo(arg=Box('')):
    pass

tokenizer = Tokenizer(models.Unigram())

trainer = trainers.UnigramTrainer()

class MyIter:
    def __init__(self) -> None:
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= 10 :
            raise StopIteration
        self.i += 1
        return ""

def my_iter():
    for _ in range(10):
        yield ""

tokenizer.train_from_iterator(MyIter(), trainer=trainer)        # introduce nanobind warnings
tokenizer.train_from_iterator(list(MyIter()), trainer=trainer)  # not introduce nanobind warnings
tokenizer.train_from_iterator(my_iter(), trainer=trainer)       # not introduce warnings

Here is the warning we will get.

nanobind: leaked 1 instances!
 - leaked instance 0x7f40aecef948 of type "Box"
nanobind: leaked 1 types!
 - leaked type "nanobind_example.nanobind_example_ext.Box"
nanobind: leaked 3 functions!
 - leaked function ""
 - leaked function "__init__"
 - leaked function ""
nanobind: this is likely caused by a reference counting issue in the binding code.

At first I thought it might have something to do with multi process and bound __setstate__ and __getstate__ functions to Box, but that didn't work.

Yikai-Liao commented 1 month ago

Update

As long as the instance is a global variable or wrapped in a global variable, the warning could be reproduced.

from tokenizers import Tokenizer, models, trainers
from nanobind_example.nanobind_example_ext import Box

bar = Box('')  # a global variable of a class bound by nanobind

tokenizer = Tokenizer(models.Unigram())

trainer = trainers.UnigramTrainer()

class MyIter:
    def __init__(self) -> None:
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= 10 :
            raise StopIteration
        self.i += 1
        return ""

def my_iter():
    for _ in range(10):
        yield ""

tokenizer.train_from_iterator(MyIter(), trainer=trainer)        # introduce nanobind warnings
tokenizer.train_from_iterator(list(MyIter()), trainer=trainer)  # not introduce nanobind warnings
tokenizer.train_from_iterator(my_iter(), trainer=trainer)       # not introduce warnings
wjakob commented 1 month ago

Dear @Yikai-Liao,

did you read the FAQ entry about these leaks? In general, these are actual genuine leaks, but they are Python's fault rather than being an issue with nanobind itself. They are also benign in the sense that Python fails to free all instances at shutdown time, but it's not a leak that grows over time at runtime. You can disable the leak warnings if they bother you.

Yikai-Liao commented 1 month ago

I have read the FAQ.

Older Python versions: Very old Python versions (e.g., 3.8) don’t do a good job cleaning up global references when the interpreter shuts down. The following code may leak a reference if it is a top-level statement in a Python file or the REPL. Such a warning is benign and does not indicate an actual leak.

But I'm not using a older python version. I'm testing under python 3.11.

This might be categorized as Interactions with other tools that leak references , but tokenizers doesn't use the jit, and it is also not mentioned that this situation is benign

And I'll open an issue in tokenizers

wjakob commented 1 month ago

In any case, I don't think there is anything that nanobind can do to fix your issue besides disabling the leak warning altogether, which is also undesirable.