willow-ahrens / finch-tensor

Sparse Tensors in Python and more! Datastructure-Driven Array Programming Language
MIT License

Adjustments for `Finch.jl#562` #57

Closed · mtsokol closed this 4 months ago

mtsokol commented 4 months ago

Hi @willow-ahrens,

Here's a PR with adjustments required by https://github.com/willow-ahrens/Finch.jl/pull/562.

One thing I'm concerned about is the reduced performance of code that uses PlusOneVector (the zero-copy SciPy constructors), which is much slower than the copying counterparts.

For example, in SDDMM, switching from a copied to a no-copy CSR matrix introduces a 10x slowdown.
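
For context: PlusOneVector is the wrapper that lets Finch reinterpret SciPy's 0-based CSR/CSC index buffers as 1-based without copying them. The real type lives in Finch.jl; the sketch below only illustrates the idea (the name `PlusOneVectorSketch` and the sample data are made up for this example).

```julia
# Illustrative sketch only, not the Finch.jl implementation of PlusOneVector.
struct PlusOneVectorSketch{T<:Integer, V<:AbstractVector{T}} <: AbstractVector{T}
    data::V  # the original 0-based index buffer shared with SciPy (no copy)
end

Base.size(v::PlusOneVectorSketch) = size(v.data)
# Shift on access: SciPy's 0-based indices appear 1-based to Finch.
Base.getindex(v::PlusOneVectorSketch, i::Int) = v.data[i] + one(eltype(v.data))
Base.setindex!(v::PlusOneVectorSketch, x, i::Int) = (v.data[i] = x - one(eltype(v.data)); v)

indices0 = Int32[0, 2, 1]                 # what SciPy stores
indices1 = PlusOneVectorSketch(indices0)  # zero-copy, 1-based view
@assert indices1[1] == 1
```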

mtsokol commented 4 months ago

Overall I'm happy with the change; I would just like to investigate PlusOneVector a bit further.

To repeat one of my comments: I ran SDDMM with the csr sparse format. With an explicit copy, execution takes 5s:

```julia Executing: :(function var"##compute#316"(prgm) begin V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}} V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} V_3 = (((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} A0 = V::Tensor{DenseLevel{Int64, SparseListLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}} A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} A5 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} A8 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}} @finch mode = :fast begin A8 .= 0.0 for i49 = _ for i47 = _ for i48 = _ A8[i48, i47] << + >>= (*)((*)(A0[i48, i47], A2[i47, i49]), A5[i48, i49]) end end end return A8 end return (swizzle(A8, 2, 1),) end end) ```

And for the no-copy constructor (the generated code differs only in PlusOneVector; compared with diffchecker.com), it takes 100s:

```julia Executing: :(function var"##compute#308"(prgm) begin V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32}, PlusOneVector{Int32}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} V_3 = (((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} A0 = V::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32}, PlusOneVector{Int32}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} A5 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}} A8 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}} @finch mode = :fast begin A8 .= 0.0 for i49 = _ for i47 = _ for i48 = _ A8[i48, i47] << + >>= (*)((*)(A0[i48, i47], A2[i47, i49]), A5[i48, i49]) end end end return A8 end return (swizzle(A8, 2, 1),) end end) ```
hameerabbasi commented 4 months ago

Ouch -- that's a big perf hit. I would wait to merge this as PlusOneVector is used a lot.

mtsokol commented 4 months ago

> Ouch -- that's a big perf hit. I would wait to merge this as PlusOneVector is used a lot.

@hameerabbasi It turns out that PlusOneVector was already slow; it just wasn't really used internally before: https://github.com/willow-ahrens/Finch.jl/pull/562#issuecomment-2121580459

mtsokol commented 4 months ago

I created an issue for it: https://github.com/willow-ahrens/Finch.jl/issues/565

mtsokol commented 4 months ago

@hameerabbasi Would you have a spare moment to review it?

hameerabbasi commented 4 months ago

It's fairly minimal code -- there isn't anything to review besides a dependency version bump. But if that dependency introduces a major regression, then we should wait for that regression to be fixed before bumping the dependency.

mtsokol commented 4 months ago

AFAIU it's not necessarily a regression, as PlusOneVectors used to be dropped internally. The latest release just exposed the slow performance of PlusOneVector, and therefore of the zero-copy SciPy constructors in general.

My current opinion is that it's better to merge this and provide zero-copy/PlusOneVector support with low performance (and keep working on improving it, or drop the zero-copy SciPy constructors if that's not feasible) than to stay blocked on the current version, which drops PlusOneVector anyway.

hameerabbasi commented 4 months ago

A regression doesn't need to be a bug; it can be anything that practically renders the new version inferior to the old one. If such a situation arises, I'd re-evaluate after each release whether the newest version is better for my use case than whatever version I'm using now, and upgrade then.

mtsokol commented 4 months ago

Then I think I can open a PR that drops PlusOneVector from finch-tensor, acknowledging that it's not used internally. That way this PR won't be a regression.

willow-ahrens commented 4 months ago

I think PlusOneVector introduces a type instability, which we can fix pretty easily.
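
(Not the actual Finch.jl code, just to illustrate the kind of issue meant here: the classic source of such an instability is an abstractly typed field, and the usual fix is to parameterize the wrapper so accesses compile down to a load plus an add.)

```julia
# Abstractly typed field: every access dispatches dynamically and boxes.
struct SlowWrapper <: AbstractVector{Int}
    data::AbstractVector{<:Integer}   # abstract field type
end
Base.size(v::SlowWrapper) = size(v.data)
Base.getindex(v::SlowWrapper, i::Int) = v.data[i] + 1

# Concretely parameterized field: fully inferable, no dynamic dispatch.
struct FastWrapper{V<:AbstractVector{<:Integer}} <: AbstractVector{Int}
    data::V
end
Base.size(v::FastWrapper) = size(v.data)
Base.getindex(v::FastWrapper, i::Int) = v.data[i] + 1
```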

willow-ahrens commented 4 months ago

But I think we should merge this change because it strictly makes finch-tensor more robust. This is a bug fix, and we need to fix PlusOneVector in any case.

willow-ahrens commented 4 months ago

Also, changes to the heuristic scheduler should be evaluated on the mean improvement across all benchmarks. There will certainly be performance regressions on individual benchmarks as we develop it, but we need to make these changes to eventually reach better performance across the variety of kernels.
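
One way to make that evaluation concrete (this is an assumption about methodology, not an established Finch policy) is to aggregate per-benchmark speedups with a geometric mean:

```julia
# Hypothetical timings (seconds) before and after a scheduler change.
old_times = [5.0, 1.2, 0.8]
new_times = [100.0, 0.6, 0.7]

speedups = old_times ./ new_times            # >1 means the change helped that benchmark
geomean(xs) = exp(sum(log, xs) / length(xs))
geomean(speedups)                            # <1 overall means a net regression
```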

willow-ahrens commented 4 months ago

There's a one-line fix for the performance bug in https://github.com/willow-ahrens/Finch.jl/pull/567, with a 100x perf improvement.

willow-ahrens commented 4 months ago

I bumped the version in there to 0.6.30