Closed: willow-ahrens closed this 1 month ago
@mtsokol This PR makes the lazy interface much smarter, but it seems to break the python tests because they don't expect the lazy interface to return swizzles.
Attention: Patch coverage is 69.69697%, with 70 lines in your changes missing coverage. Please review.
Project coverage is 76.03%. Comparing base (8c107d1) to head (f15b948).
:exclamation: Current head f15b948 differs from the pull request's most recent head 601c727. Please upload reports for commit 601c727 to get more accurate results.
Hi @willow-ahrens,
Right, `def _reduce` wasn't expecting a `SwizzleArray` object from the Julia function call. I prepared a one-line fix for it; I will open a PR once this is merged and released.
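For context, a minimal sketch of what tolerating a swizzled return value could look like on the Python side. Note that `SwizzleArray` and `unwrap` here are hypothetical stand-ins for illustration, not the actual finch-tensor bindings or the one-line fix itself:

```python
import numpy as np

# Hypothetical sketch -- SwizzleArray and unwrap are stand-ins,
# not the actual finch-tensor / Finch.jl bindings.

class SwizzleArray:
    """Stand-in for a lazily permuted array returned by the scheduler."""
    def __init__(self, body, perm):
        self.body = body   # the underlying array
        self.perm = perm   # axis permutation

def unwrap(result):
    """Accept either a plain array or a SwizzleArray from the Julia call."""
    if isinstance(result, SwizzleArray):
        # Materialize the permutation so callers see an ordinary array.
        return result.body.transpose(result.perm)
    return result
```

The point is only that `_reduce` must handle both return types rather than assuming a plain tensor.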
But before that, I reran the examples in pydata/sparse, and it looks like SDDMM got much slower due to these changes:

- Finch took 80.18670105934143 s
- Numba took 19.97915291786194 s
- SciPy took 18.3186297416687 s
Here's the new execution plan:
For comparison, here's the previous execution plan:
@willow-ahrens OTOH, I see that with these changes MTTKRP and elementwise `*` got faster. Can we take a look at the SDDMM regression? Maybe there's a different way to express this operation (e.g. using different orders on the input arrays to avoid swizzling)?
@willow-ahrens Sorry for the noise. For SDDMM, I think I managed to get back to the original performance on the Python side. By changing the format of the sparse array from COO to CSR, I again get:

- Finch took 4.882369756698608 s
- Numba took 22.508983850479126 s
- SciPy took 20.508883953094482 s
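The format change itself is a one-liner; shown here with SciPy for illustration (the pydata/sparse API differs, and this is not the benchmark code):

```python
import numpy as np
import scipy.sparse as sp

# Build a small COO matrix, then convert to CSR before running the kernel.
# CSR stores each row's entries contiguously behind a row-pointer array,
# which suits SDDMM's row-wise access pattern far better than COO.
coo = sp.coo_matrix(
    (np.array([1.0, 2.0, 3.0]),      # values
     (np.array([0, 1, 2]),           # row indices
      np.array([2, 0, 1]))),         # column indices
    shape=(3, 3),
)
csr = coo.tocsr()  # sorts entries by row and compresses the row index
```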
Oh, I think I see what happened. The COO format is very slow, so permuting that input sped up the computation because it also reformatted it.
I also have a suspicion that `PlusOneVector` is slow. I ran SDDMM with the CSR sparse format. With an explicit copy I get a 5 s execution:

And with the no-copy constructor (which differs only in `PlusOneVector`; compared with diffchecker.com) it's 100 s:
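If `PlusOneVector` is a lazy 0-based-to-1-based index adapter (an assumption on my part; the class below is a toy, not the real implementation), the 5 s vs 100 s gap is plausible: a view that adds 1 on every element access puts arithmetic and indirection in the inner loop, while an explicit copy pays one vectorized pass up front.

```python
import numpy as np

# Toy illustration of the suspected trade-off. PlusOneView is hypothetical,
# mimicking an adapter that converts 0-based indices to 1-based on the fly.

class PlusOneView:
    """Lazy adapter: no copy, but adds 1 on every single-element access."""
    def __init__(self, data):
        self.data = data
    def __getitem__(self, i):
        return self.data[i] + 1  # per-access arithmetic in the hot loop

idx = np.arange(1_000_000, dtype=np.int64)
copied = idx + 1            # one vectorized pass, paid once up front
lazy = PlusOneView(idx)     # zero-copy, but slow element-by-element

assert copied[10] == lazy[10]
```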
OK, I think the original performance regression I described here is caused by the `PlusOneVector` issue. With a copy, COO is also faster.
Therefore, I think the `PlusOneVector` performance regression is introduced here in the `wma/heuristic` branch. In the `main` branch, SDDMM runs in 10 s, but here it's more than 10x slower.
I think that when we reformat, the `PlusOneVector`s are dropped, so they are always slow; this PR just makes it more obvious because it attempts to avoid reformatting/transposition.
fixes https://github.com/willow-ahrens/Finch.jl/issues/468