wjakob / nanobind

nanobind: tiny and efficient C++/Python bindings
BSD 3-Clause "New" or "Revised" License
2.14k stars 161 forks source link

Separate const ndarray from const data; remove view and ro. #498

Closed hpkfft closed 3 months ago

hpkfft commented 3 months ago

For the sake of discussion and in the interest of design space exploration, suppose we eliminated the ndarray view() function. Maybe this is OK for a nano-sized binding library. Developers would have to write code like this:

void fill(nb::ndarray<float, nb::ndim<3>, nb::c_contig, nb::device::cpu> a) {
    const size_t shape[3] = {a.shape(0), a.shape(1), a.shape(2)};
    float* data = a.data();
    for (size_t i = 0; i < shape[0]; ++i) {
        for (size_t j = 0; j < shape[1]; ++j) {
            for (size_t k = 0; k < shape[2]; ++k) {
                *data++ = 100.0f * i + 10.0f * j + k;
            }
        }
    }
}

An advantage of the above is that it takes advantage of the known c_contig and ought to be even more efficient than using a view, because the view operator() contains multiplications:

        for (size_t i = 0; i < Dim; ++i)
            offset += indices_i64[i] * m_strides[i];

The fill function above is probably not suitable for OpenMP parallelization, but that is easily fixed:

void fill(nb::ndarray<float, nb::ndim<3>, nb::c_contig, nb::device::cpu> a) {
    const size_t shape[3] = {a.shape(0), a.shape(1), a.shape(2)};
    const int64_t stride0 = a.stride(0);
    float* data = a.data();
    for (size_t i = 0; i < shape[0]; ++i) {
        float* ptr = data + stride0 * i;
        for (size_t j = 0; j < shape[1]; ++j) {
            for (size_t k = 0; k < shape[2]; ++k) {
                *ptr++ = 100.0f * i + 10.0f * j + k;
            }
        }
    }
}

Above can parallelize over slabs (the i index) and avoid multiplications on each node for the j and k loops.

I guess I'm suggesting that developers write their code without view() and customize for their situation.

The advantage to nb::ndarray is that we don't have to consider what happens when template arguments to view() are examined before the template arguments to ndarray.

Another thought I'd like to offer for your consideration is removing nb::ro and asking developers to write ndarray<const void> instead of ndarray<nb::ro>. This eliminates having to think about the right thing to do if both ro and a scalar element type both appear as template arguments (which can happen in different orders).

wjakob commented 3 months ago

I haven't reviewed this in detail yet, but I am wondering: what's the rationale for removing .view()?

hpkfft commented 3 months ago

The alternate PR addresses what I thought could be non-intuitive aspects of .view(), in particular when template arguments to the function conflict with those of the ndarray. Though, I'm not sure I resolved the conflicts the way others might prefer. Also, it seemed to be getting a little awkward to explain and document: the ro is sticky, other parameters can be overwritten using a .view() template argument, and still others cannot. But the ones that can be overridden can still be problematic. It's correct to "flatten" a c_contig 2D array to 1D using, for example, .view<nb::shape<1024>>() [see test fill_view_6 in the other PR], but that won't work correctly in other cases (because of strides).

I was wondering whether you think that view is pulling its own weight. I don't see myself using it, because I have an existing C++ library for which I want to use nanobind to provide a python interface. So, my C++ library itself already deals with indexing into a memory region. I cannot judge its value to others, but I thought I could be helpful in examining the cost side of the feature by showing what code would be eliminated. [What I would appreciate is eventually upstreaming the const-correctness of .data(). I can create a PR that does only that, but I felt guilty suggesting changes for that without doing the same for the ndarray operator() and .view().]

I think that .view() is not meant to have the generality of numpy.reshape(); it's meant for performance. But in some cases, better performance seems possible by directly reading the shape and size from the ndarray and writing one's own C++ code. The offset calculation loop in view's operator() [see snippet in my initial comment] needs to be unrolled and inlined, and I'm not sure that happens in an extension compiled with MINSIZE. And it still has integer multiplications.

I guess, philosophically, I was struggling with how to conceptualize .view(). It seems more like mdspan. But, if so, and given that standard library class now exists, maybe nanobind doesn't need to provide its own abstraction on top of data(), dtype(), shape_ptr(), and stride_ptr().

Anyway, I'm just brainstorming because it's fun, and I'm happy to help, if I can.

wjakob commented 3 months ago

I don't see how this change improves the library. .view() exists to address two real issues:

  1. Operating directly on the DLPack array/DLPack data structures impedes optimization. I have observed huge speedups (> 50%) simply by switching to .view().
  2. It is sometimes useful to combine mixtures of compile-time and runtime specialization. I find ability the to pick between several different views at runtime quite useful.

You are also right that this is essentially an implementation of std::mdspan. However, nanobind is strictly a C++17 library, while std::mdspan is a C++23 feature. So we are two full C++ standards off, and there is no way to depend on this feature.

wjakob commented 3 months ago

PS: I am open to supporting const void * as a type parameter. But you would need to also incorporate support for the view() feature -- half-finished PRs cause a lot of work for me that I unfortunately don't have (this just one of many side-projects of mine).

hpkfft commented 3 months ago

PR #491 fully supports views and const-correctness. This draft can be closed.

I would be happy to follow up with a separate PR to support nb::ndarray<const void> Do you want to remove nb::ro or support both?

wjakob commented 3 months ago

It's annoying to break other people's code. If possible, I would like to support nb::ro and const void * in parallel for some time.