willow-ahrens / Finch.jl

Sparse tensors in Julia and more! A data structure-driven array programming language.
http://willowahrens.io/Finch.jl/
MIT License

add heuristic loop order #562

Closed willow-ahrens closed 1 month ago

willow-ahrens commented 1 month ago

fixes https://github.com/willow-ahrens/Finch.jl/issues/468

willow-ahrens commented 1 month ago

@mtsokol This PR makes the lazy interface much smarter, but it seems to break the Python tests because they don't expect the lazy interface to return swizzles.
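
For context, here is a minimal sketch (not from this PR; the shapes, the elementwise operation, and the materializing copy at the end are illustrative assumptions) of how a caller of the lazy interface might now receive and handle a swizzled result:

```julia
using Finch

# Build a small lazy expression; with this change, compute() may return a
# SwizzleArray (a permuted view over a Tensor) rather than a bare Tensor.
A = lazy(fsprand(100, 100, 0.1))
B = lazy(Tensor(rand(100, 100)))
C = compute(A .* B)

# Callers that dispatch on Tensor can materialize the result into a concrete
# layout (the format chosen here is an arbitrary example).
C isa Finch.SwizzleArray && (C = Tensor(Dense(SparseList(Element(0.0))), C))
```

This mirrors the failure mode described above: callers that assume a bare `Tensor` need a small unwrap/materialize step.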

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 69.69697%, with 70 lines in your changes missing coverage. Please review.

Project coverage is 76.03%. Comparing base (8c107d1) to head (f15b948).

:exclamation: Current head f15b948 differs from pull request most recent head 601c727

Please upload reports for the commit 601c727 to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| src/scheduler/optimize.jl | 86.32% | 16 Missing :warning: |
| src/tensors/combinators/product.jl | 0.00% | 7 Missing :warning: |
| src/tensors/combinators/toeplitz.jl | 0.00% | 7 Missing :warning: |
| src/interface/traits.jl | 0.00% | 6 Missing :warning: |
| src/tensors/combinators/offset.jl | 0.00% | 6 Missing :warning: |
| src/tensors/combinators/permissive.jl | 0.00% | 5 Missing :warning: |
| src/tensors/combinators/protocolized.jl | 0.00% | 5 Missing :warning: |
| src/tensors/combinators/scale.jl | 0.00% | 5 Missing :warning: |
| src/tensors/combinators/swizzle.jl | 0.00% | 4 Missing :warning: |
| src/tensors/combinators/windowed.jl | 0.00% | 4 Missing :warning: |

... and 3 more
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #562      +/-   ##
==========================================
+ Coverage   75.95%   76.03%   +0.07%
==========================================
  Files          97       97
  Lines        8864     9010     +146
==========================================
+ Hits         6733     6851     +118
- Misses       2131     2159      +28
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mtsokol commented 1 month ago

Hi @willow-ahrens,

Right, `def _reduce` wasn't expecting a `SwizzleArray` object from the Julia function call. I prepared a one-line fix for it; I'll open a PR once this is merged and released.

But before that, I reran the examples in pydata/sparse, and it looks like SDDMM got much slower with these changes:

- Finch: 80.18670105934143 s
- Numba: 19.97915291786194 s
- SciPy: 18.3186297416687 s
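
For reference, here is a hedged sketch of the SDDMM kernel being timed, written directly with `@finch` (the sizes, density, and formats below are made-up placeholders, not the benchmark's actual configuration):

```julia
using Finch

# SDDMM: C[i, j] += S[i, j] * A[i, k] * B[j, k], summed over k.
S = fsprand(1000, 1000, 0.001)                          # sparse sampling matrix
A = Tensor(Dense(Dense(Element(0.0))), rand(1000, 32))  # dense factor
B = Tensor(Dense(Dense(Element(0.0))), rand(1000, 32))  # dense factor
C = Tensor(Dense(SparseDict(Element(0.0))))             # output; dims are inferred

@finch begin
    C .= 0.0
    for k = _
        for j = _
            for i = _
                C[i, j] += S[i, j] * A[i, k] * B[j, k]
            end
        end
    end
end
```

The generated plans below implement essentially this loop nest, differing mainly in the chosen formats and in whether an input gets transposed first.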

Here's the new execution plan:

```julia
Executing: :(function var"##compute#308"(prgm)
    begin
        V = (((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{PlusOneVector{Int32}, PlusOneVector{Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}
        V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        V_3 = ((((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A0 = V::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{PlusOneVector{Int32}, PlusOneVector{Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}
        A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A4 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A8 = Tensor(SparseDict(SparseDict(Element{0.0, Float64}())))::Tensor{SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
        @finch mode = :fast begin
            A8 .= 0.0
            for i49 = _
                for i47 = _
                    for i48 = _
                        A8[i48, i47] << + >>= (*)(A0[i48, i47], (*)(A2[i47, i49], A4[i48, i49]))
                    end
                end
            end
            return A8
        end
        return (swizzle(A8, 2, 1),)
    end
end)
```

For comparison, here's the previous execution plan:

```julia
Executing: :(function var"##compute#410"(prgm)
    begin
        V = (((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{PlusOneVector{Int32}, PlusOneVector{Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}
        V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        V_3 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A0 = V::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{PlusOneVector{Int32}, PlusOneVector{Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}
        A0_2 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
        @finch mode = :fast begin
            A0_2 .= 0.0
            for i1 = _
                for i0 = _
                    A0_2[i1, i0] = A0[i0, i1]
                end
            end
            return A0_2
        end
        A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A4 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A8 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
        @finch mode = :fast begin
            A8 .= 0.0
            for i52 = _
                for i51 = _
                    for i50 = _
                        A8[i50, i51] << + >>= (*)(A0_2[i50, i51], (*)(A2[1, i52], A4[1, i52]))
                    end
                end
            end
            return A8
        end
        return (A8,)
    end
end)
```
mtsokol commented 1 month ago

@willow-ahrens OTOH I see that with these changes MTTKRP and elementwise `*` got faster. Can we take a look at the SDDMM regression? Maybe there's just a different way to express this operation (e.g. use different orders on the input arrays to avoid swizzling)?
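
To illustrate that idea, here is a hedged sketch of one option (not something from this PR; the sizes and the k-major storage choice are assumptions): hand the scheduler dense factors whose storage order already matches the access pattern, exposed under the desired logical order via `swizzle`.

```julia
using Finch

# Store the dense SDDMM factors k-major and expose them as (i, k) / (j, k)
# logical views, hoping the chosen loop order can then read every operand in
# storage order without inserting a transposing copy.
At = Tensor(Dense(Dense(Element(0.0))), rand(32, 1000))  # storage order (k, i)
Bt = Tensor(Dense(Dense(Element(0.0))), rand(32, 1000))  # storage order (k, j)
A  = swizzle(At, 2, 1)                                   # logical (i, k) view
B  = swizzle(Bt, 2, 1)                                   # logical (j, k) view
```

Whether this actually avoids the slow path depends on the heuristic's chosen order, so it is only a direction to try.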

mtsokol commented 1 month ago

@willow-ahrens Sorry for the noise. For SDDMM, I think I managed to get back to the original performance on the Python side. By changing the format of the sparse array from COO to CSR, I again get:

- Finch: 4.882369756698608 s
- Numba: 22.508983850479126 s
- SciPy: 20.508883953094482 s

willow-ahrens commented 1 month ago

Oh, I think I see what happened. The COO format is very slow, so permuting that input sped up the computation because the permutation also reformatted it.
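
As a hedged illustration of that effect (the formats and sizes are assumptions, and `fsprand` stands in for the COO tensor coming from Python): converting the COO operand to a list-based format up front does explicitly what the old plan's transpose did as a side effect.

```julia
using Finch

# Copy a sparse input into a CSC-like Dense(SparseList) layout once, so the
# kernel iterates a list-based level instead of the slower COO level.
S_coo  = fsprand(1000, 1000, 0.001)                      # stand-in for the COO input
S_list = Tensor(Dense(SparseList(Element(0.0))), S_coo)  # one-time reformat
```

That one-time copy is essentially what switching to CSR on the Python side achieves as well.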

mtsokol commented 1 month ago

I also suspect that `PlusOneVector` is slow. I ran SDDMM with the CSR sparse format. With an explicit copy I get a 5 s execution:

```julia
Executing: :(function var"##compute#316"(prgm)
    begin
        V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
        V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        V_3 = (((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A0 = V::Tensor{DenseLevel{Int64, SparseListLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
        A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A5 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A8 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
        @finch mode = :fast begin
            A8 .= 0.0
            for i49 = _
                for i47 = _
                    for i48 = _
                        A8[i48, i47] << + >>= (*)((*)(A0[i48, i47], A2[i47, i49]), A5[i48, i49])
                    end
                end
            end
            return A8
        end
        return (swizzle(A8, 2, 1),)
    end
end)
```

And with the no-copy constructor (the plan differs only in `PlusOneVector`; I compared them with diffchecker.com) it's 100 s:

```julia
Executing: :(function var"##compute#308"(prgm)
    begin
        V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32}, PlusOneVector{Int32}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        V_3 = (((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A0 = V::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32}, PlusOneVector{Int32}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A5 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
        A8 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
        @finch mode = :fast begin
            A8 .= 0.0
            for i49 = _
                for i47 = _
                    for i48 = _
                        A8[i48, i47] << + >>= (*)((*)(A0[i48, i47], A2[i47, i49]), A5[i48, i49])
                    end
                end
            end
            return A8
        end
        return (swizzle(A8, 2, 1),)
    end
end)
```
mtsokol commented 1 month ago

OK, I think that the original performance regression I described here is caused by the `PlusOneVector` issue. With a copy, COO is also faster.

Therefore, I think the `PlusOneVector` performance regression is introduced here in the wma/heuristic branch. On the main branch, SDDMM runs in 10 s, but here it's more than 10x slower.

willow-ahrens commented 1 month ago

I think that when we reformat, the `PlusOneVector`s are dropped, so they have probably always been slow; this PR just makes it more obvious because it tries to avoid reformatting/transposition.
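
To make that cost concrete, here is a toy stand-in (not Finch's actual `PlusOneVector` implementation) contrasting a zero-based index wrapper that pays a conversion on every read with a one-time eager copy:

```julia
# Toy AbstractVector wrapper: each access converts and shifts the 0-based
# Int32 index, and that work lands in the innermost loop of any kernel that
# walks the index array.
struct PlusOneView <: AbstractVector{Int64}
    data::Vector{Int32}
end
Base.size(v::PlusOneView) = size(v.data)
Base.getindex(v::PlusOneView, i::Int) = Int64(v.data[i]) + 1

idx0    = Int32[0, 2, 5, 9]           # 0-based indices as handed over from Python
wrapped = PlusOneView(idx0)           # no copy; every read pays the conversion
copied  = Vector{Int64}(idx0) .+ 1    # eager copy; reads are plain vector loads
```

Whether the wrapper or something else dominates would need profiling; the sketch only illustrates why a per-access shift in the innermost loop can matter.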