willow-ahrens / Finch.jl

Sparse tensors in Julia and more! Data-structure-driven array programming language.
http://willowahrens.io/Finch.jl/
MIT License

`scansearch` fails for `scipy.sparse` indices arrays #406

Closed: mtsokol closed this issue 7 months ago

mtsokol commented 7 months ago

Hi @willow-ahrens @hameerabbasi,

I'm finishing the `from_scipy` constructors and I stumbled upon an issue. In almost all cases the dtype of the index arrays in `scipy.sparse` is `np.int32`, and it isn't customizable via a parameter (the `dtype=` parameter of the COO, CSR, and CSC SciPy formats applies only to the `data` array's dtype, not to the index arrays `indices` and `indptr`).

As a result we have:

```python
import numpy as np
import scipy.sparse as sp

arr2d = np.array([[0, 0, 3, 2, 0], [1, 0, 0, 1, 0], [0, 5, 0, 0, 0]], dtype=np.int64)
csr = sp.csr_matrix(arr2d, dtype=np.int64)
csr.indices.dtype  # np.int32
csr.indptr.dtype   # np.int32
```

Because of this, Finch fails in the `todense()` method (for CSR/CSC tensors initialized with SciPy contents) with:

```
E       juliacall.JuliaError: MethodError: no method matching scansearch(::Vector{Int32}, ::Int64, ::Int32, ::Int64)
E
E       Closest candidates are:
E         scansearch(::Any, ::Any, ::T, !Matched::T) where T<:Integer
E          @ Finch ~/.julia/packages/Finch/fTxKj/src/util/util.jl:522
E
E       Stacktrace:
E         [1] macro expansion
E           @ ~/.julia/packages/Finch/fTxKj/src/tensors/levels/sparselistlevels.jl:312 [inlined]
```

I see that `scansearch(v, x, lo::T, hi::T)::T where T<:Integer` forces `lo` and `hi` to have the same type, while here these arguments can be `np.int32` and `np.int64` (one coming from `ptr`, the other from the data's dtype).

This matters for the no-copy Finch constructors, as we don't want to cast (and thereby copy!) the SciPy index arrays.
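To make the copy concern concrete, here is a small standalone check (a sketch, not Finch code) showing that normalizing the index dtype with `astype` allocates a fresh buffer, which is exactly what a no-copy constructor must avoid:

```python
import numpy as np
import scipy.sparse as sp

arr2d = np.array([[0, 0, 3, 2, 0], [1, 0, 0, 1, 0], [0, 5, 0, 0, 0]], dtype=np.int64)
csr = sp.csr_matrix(arr2d, dtype=np.int64)

# Casting to a wider dtype always allocates a new buffer, so the
# zero-copy guarantee is lost as soon as index dtypes are normalized.
indices64 = csr.indices.astype(np.int64)
assert indices64.dtype == np.int64
assert not np.shares_memory(indices64, csr.indices)
```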


What would be the solution to this issue?

  1. Ask for/implement an `idx_dtype` parameter for `scipy.sparse` constructors so this dtype can be customized (probably not feasible, and it's presumably done this way for a reason).
  2. Make `scansearch` accept mixed precisions for `lo` and `hi`. Would it cause any local copies for `int32` + `int64`?
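Option 2 can be sketched outside Julia as well. The following is a hypothetical Python analogue of `scansearch` (the real implementation lives in Finch's `util.jl`): only the two scalar bounds are promoted to a common integer type before the binary search, so the index array itself is never copied:

```python
import numpy as np

def scansearch(v, x, lo, hi):
    """Return the first position in [lo, hi] with v[pos] >= x.

    Hypothetical analogue of option 2: only the scalar bounds lo/hi
    are promoted to a common integer type; the index array v is left
    untouched, so mixing int32 and int64 costs no array copy.
    """
    common = np.result_type(lo, hi)     # e.g. int32 + int64 -> int64
    lo, hi = common.type(lo), common.type(hi)
    while lo <= hi:
        mid = lo + (hi - lo) // 2       # standard binary search step
        if v[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo

# Mixed-width bounds, as produced by a scipy int32 indptr plus Int64
# index arithmetic on the Finch side:
v = np.array([1, 3, 5, 7, 9], dtype=np.int32)
print(scansearch(v, 5, np.int32(0), np.int64(4)))  # -> 2
```

Promoting two scalars is O(1); no temporary arrays are created, which addresses the "local copies" question for the scalar arguments at least.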


willow-ahrens commented 7 months ago

Finch should certainly not fail when processing mixed integer widths. We should go for number 2 for sure.

However, we should also do a detailed analysis of the generated code to determine whether such a case induces type instability. One thing to do is to add an `@inferred` check to the Finch code generator and issue a warning if the generated Finch code is not type-inferrable. Mateusz, feel free to make a PR that removes the type restriction on `scansearch` (I'll merge it ASAP), and we'll leave this issue open for a later PR that double-checks that Finch code is always inferrable.
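The instability worry can be illustrated outside Julia too. In this hedged NumPy sketch (not Finch code), arithmetic between an `int32` and an `int64` scalar silently widens to `int64`; the Julia analogue of this width change is what an `@inferred` check would flag:

```python
import numpy as np

# lo comes from an int32 indptr array; hi from Int64 index arithmetic.
lo = np.int32(0)
hi = np.int64(10)

# Mixed-width arithmetic promotes to the wider type, so a loop
# variable seeded from lo would change width on its first update.
mid = lo + (hi - lo) // 2
assert type(mid) is np.int64
```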

willow-ahrens commented 7 months ago

I'm going to instead add a performance-tips section about `@code_warntype`.