willow-ahrens / Finch.jl

Sparse tensors in Julia and more! A data-structure-driven array programming language.
http://willowahrens.io/Finch.jl/
MIT License

Wma/fix 488 #489

Closed willow-ahrens closed 4 months ago

willow-ahrens commented 4 months ago

fixes #488

willow-ahrens commented 4 months ago

It looks like fsprand is erroring due to integer overflow on 32-bit systems with this test case.

willow-ahrens commented 4 months ago

This is due to randsubseq, not Finch. We need to implement our own randsubseq or improve the existing implementation.

mtsokol commented 4 months ago

It looks like it's strictly due to a 32-bit integer being the default on Windows.

It seems that length(A) overflows and returns -727379968. But then isn't it a bug in Julia?

length(CartesianIndices((1000000, 1000000)))

Gives -727379968 as a 32-bit integer on Windows, but I think it should anticipate which size to use and return a 64-bit integer.
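
For reference, the same wraparound can be reproduced on a 64-bit machine by forcing 32-bit arithmetic (and a widening multiply avoids it):

julia> Int32(1_000_000) * Int32(1_000_000)        # wraps around in Int32 arithmetic
-727379968

julia> widemul(Int32(1_000_000), Int32(1_000_000)) # widens the result type instead
1000000000000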

willow-ahrens commented 4 months ago

It's true that there's an overflow here, but it looks like it's on us to implement a version of the algorithm which uses the correct number types and does not overflow. The issue is basically that we want to randomly sample from (1:int_max, 1:int_max), but we cast this as a CartesianIndices object, which gets treated like an array. So we create an array-like object with int_max^2 elements, but we access it with an Int. I think we need to reimplement randsubseq to do this correctly, or check whether someone else has done this already. Unfortunately, I think the algorithms in StatsBase.jl have the same problem, but we can totally copy their code and repurpose it.
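
A minimal sketch of the failing pattern (assuming fsprand ends up routing through the stdlib sampler on a CartesianIndices, as described above):

using Random

# On a 32-bit system (where Int === Int32), length(A) == 1_000_000^2 has already
# overflowed, so the stdlib sampler either errors or silently returns the wrong
# result, even though we only expect a handful of selected indices.
A = CartesianIndices((1_000_000, 1_000_000))
selected = randsubseq(A, 1.0e-10)   # each index kept independently with probability 1e-10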

willow-ahrens commented 4 months ago

The main challenge is that all of these algorithms assume the entire collection we are sampling from is representable in memory.

mtsokol commented 4 months ago

I added a custom implementation to make sure index variables are Int64 for now. Weirdly, it fails with:

BoundsError: attempt to access 1000000×1000000 CartesianIndices{2, Tuple{Base.OneTo{Int32}, Base.OneTo{Int32}}} at index [38056]

which doesn't really make any sense, considering that 38056 fits in Int32.

mtsokol commented 4 months ago

I received feedback on Julia Discourse. The issue here is that internally it still checks the index against the length (causing a 32-bit overflow), so it looks like it's going to be problematic on 32-bit anyway.
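
Roughly, the bounds check ends up comparing the index against that overflowed, negative length, so even a tiny index is rejected. A sketch of the effect:

julia> len = Int32(1_000_000) * Int32(1_000_000)   # what length(A) works out to on 32-bit
-727379968

julia> 1 <= 38056 <= len   # roughly the comparison behind the BoundsError
false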

mtsokol commented 4 months ago

One option is to build the index ourselves. In the line where the error occurs:

push!(S, A[i += ceil(Int64, s)])

we could do something like this instead:

i += ceil(Int64, s)
j = i - 1                                  # switch to a 0-based linear index for the digit expansion
i_tuple = Vector{Int64}(undef, length(dimensions))
for (idx, dim) in enumerate(dimensions)
    i_tuple[idx] = j % dim + 1             # 1-based coordinate along this dimension
    j = j ÷ dim
end
push!(S, A[i_tuple...])

This way we can index A with a tuple instead of a linearized index that can exceed 32 bits.

willow-ahrens commented 4 months ago

We may want to move towards a custom implementation for tuples of ints, like you've sketched here.

willow-ahrens commented 4 months ago

I'm also happy to implement this, if you like.

mtsokol commented 4 months ago

Let me take care of this! It's a small tweak to the existing implementation.

willow-ahrens commented 4 months ago

I'm concerned we need to use a different algorithm. We can't just call randexp and convert to an integer; I'm not even sure that's correct for Int64. I don't think we can expect it to behave correctly after the float mantissa loses precision: beyond $n = 2^{53}$ this algorithm is only going to select $i$ equal to multiples of 2, 4, 8, etc.
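
For instance, the spacing between representable Float64 values in that range makes the problem concrete:

julia> eps(2.0^53), eps(2.0^54), eps(2.0^55)   # only multiples of 2, 4, 8 are representable here
(2.0, 4.0, 8.0)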

mtsokol commented 4 months ago

FYI, the original implementation does:

s = randexp(r) * L
s >= n - i && return S
push!(S, A[i += ceil(Int,s)])

willow-ahrens commented 4 months ago

Yes, I don't think the original algorithm is correct for sparse arrays with large dimensions. The problem is that the function f(x) = ceil(Int, x::Float64) is not actually capable of producing all possible integers.

willow-ahrens commented 4 months ago

We might want to use something like https://github.com/JuliaStats/StatsBase.jl/blob/60fb5cd400c31d75efd5cdb7e4edd5088d4b1229/src/sampling.jl#L256C1-L278C44. This would allow us to customize the simple algorithm for tuples of ints, with correct behavior even for fsprand(typemax(Int64), typemax(Int64), (10.0/typemax(Int64))/typemax(Int64)).

The above algorithm is for sampling k numbers without replacement. However, we still need to accurately sample the number of nonzeros to generate. We need a binomial distribution, but we need to be careful with how we call that too, since the number of trials (n * m) is far larger than anything we can represent directly.
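
One possible workaround (just an idea, not something settled here): for tiny p, a Binomial(n * m, p) count is well approximated by Poisson(n * m * p), which only needs the expected count as a float and never forms n * m as a machine integer. A sketch, assuming the Distributions package is available:

using Distributions

m, n = big(10)^6, big(10)^6
p = (100 / 10^6) / 10^6
λ = Float64(m * n * p)    # expected number of nonzeros, ≈ 100; no Int overflow involved
nnz = rand(Poisson(λ))    # approximate draw of the number of stored entries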

However, it's debatable whether we should implement fsprand(n, m, p) at all when n * m overflows, because even though we expect n * m * p nnzs, we aren't prepared for n * m nnzs, which is a possibility, especially on 32-bit systems.

We may want to instead introduce a variant of fsprand that lets us specify the exact number of nonzeros, then use the sample-$k$-without-replacement algorithm linked above.
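
For the very sparse cases we actually care about, even a rejection-based sketch would work (this is just an illustration, not the linked algorithm; sample_coords is a hypothetical helper):

using Random

# Draw `k` distinct N-dimensional coordinates from `dims` without ever
# materializing the full index space. Rejection sampling is fine as long as
# k is tiny relative to the total number of positions.
function sample_coords(rng::AbstractRNG, dims::NTuple{N,Int}, k::Int) where {N}
    seen = Set{NTuple{N,Int}}()
    while length(seen) < k
        push!(seen, ntuple(d -> rand(rng, 1:dims[d]), Val(N)))
    end
    return collect(seen)
end

sample_coords(Random.default_rng(), (10^6, 10^6, 10^6, 10^6), 4)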

willow-ahrens commented 4 months ago

Here's a concrete example of the issue. Each of these calls should store roughly 100 nonzeros on average, but once the dimensions are large enough for the linearized index space to overflow, we get 0:

julia> using Finch
Precompiling Finch
  1 dependency successfully precompiled in 24 seconds. 13 already precompiled.

julia> t = 1000
1000

julia> countstored(fsprand(t, t, (100/t)/t))
108

julia> countstored(fsprand(t, t, (100/t)/t))
108

julia> countstored(fsprand(t, t, (100/t)/t))
99

julia> t = typemax(Int64)
9223372036854775807

julia> countstored(fsprand(t, t, (100/t)/t))
0

julia> countstored(fsprand(t, t, (100/t)/t))
0

julia> countstored(fsprand(t, t, (100/t)/t))
0

julia> t = 1_000_000
1000000

julia> countstored(fsprand(t, t, t, 100/t/t/t))
132

julia> countstored(fsprand(t, t, t, 100/t/t/t))
112

julia> countstored(fsprand(t, t, t, t, 100/t/t/t/t))
0

julia> countstored(fsprand(t, t, t, t, 100/t/t/t/t))
0

julia> countstored(fsprand(t, t, t, t, 100/t/t/t/t))
0

hameerabbasi commented 4 months ago

We have an implementation here if that helps: https://github.com/pydata/sparse/blob/0b58d4b023d299750c324f8250ab2cdb661aeea9/sparse/pydata_backend/_utils.py#L223

willow-ahrens commented 4 months ago

@hameerabbasi I don't think the algorithm you mentioned can handle this either:

willow@Willows-Mac-mini ~ % poetry show 
joblib        1.4.0  Lightweight pipelining with Python functions
llvmlite      0.42.0 lightweight wrapper around basic LLVM functionality
numba         0.59.1 compiling Python code using LLVM
numpy         1.26.4 Fundamental package for array computing in Python
scikit-learn  1.4.2  A set of python modules for machine learning and data mining
scipy         1.13.0 Fundamental algorithms for scientific computing in Python
sparse        0.15.1 Sparse n-dimensional arrays for the PyData ecosystem
threadpoolctl 3.4.0  threadpoolctl
willow@Willows-Mac-mini ~ % poetry run python3
Python 3.11.8 (main, Feb  6 2024, 21:21:21) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sparse
>>> sparse.random([1000000,1000000,1000000,1000000], nnz=4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/willow/Library/Caches/pypoetry/virtualenvs/willow-ENU-A1do-py3.11/lib/python3.11/site-packages/sparse/_utils.py", line 340, in random
    ).reshape(shape)
      ^^^^^^^^^^^^^^
  File "/Users/willow/Library/Caches/pypoetry/virtualenvs/willow-ENU-A1do-py3.11/lib/python3.11/site-packages/sparse/_coo/core.py", line 1029, in reshape
    raise ValueError(f"cannot reshape array of size {self.size} into shape {shape}")
ValueError: cannot reshape array of size 2003764205206896640 into shape (1000000, 1000000, 1000000, 1000000)

willow-ahrens commented 4 months ago

Just FYI, that number is the overflowed version of 1000000^4:

julia> t = 1000000
1000000

julia> t * t * t * t
2003764205206896640
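
For comparison, the unoverflowed value, computed with BigInt arithmetic:

julia> big(t)^4
1000000000000000000000000
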
willow-ahrens commented 4 months ago

Does this look good to everyone? There's a lot of statistics here, so it would be nice to get a review.

hameerabbasi commented 4 months ago

Just FYI, that number is the overflowed version of 1000000^4:

julia> t = 1000000
1000000

julia> t * t * t * t
2003764205206896640

Yes, in sparse, we currently don't support arrays of size >= 2 ** 64.