xarray-contrib / xarray-regrid

Regridding utility for xarray
https://xarray-regrid.readthedocs.org/
Apache License 2.0

Sparse weights in conservative method #49

Closed by slevang 2 months ago

slevang commented 2 months ago

Potential improvement for #42.

This package's focus on rectilinear grids, and the factorization of regridding along dimensions, make generating and using dense weights feasible. However, the weight matrix is still extremely sparse for any reasonably sized grid. I did some experiments converting the weights to a sparse matrix after creation, and am seeing nice improvements in both compute time and memory footprint.

On the example in https://github.com/xarray-contrib/xarray-regrid/issues/42#issuecomment-2363771715 I get close to a 4x speedup (and better than xesmf):

```
CPU times: user 42.5 s, sys: 6.01 s, total: 48.5 s
Wall time: 11.6 s
CPU times: user 6min 9s, sys: 41.6 s, total: 6min 51s
Wall time: 59.2 s
```
BSchilperoort commented 2 months ago

> get close to a 4x speedup (and better than xesmf)

Awesome! I did not bother with this originally for the reasons you mentioned. But it's great to see that it's a (relatively easy) way to gain a lot of performance.

Edit: I don't see any significant performance gain compared to the benchmark I ran from #42...

slevang commented 2 months ago

That was a VM with a pretty old CPU architecture, and I guess it's just really slow on these particular calculations. I'm also seeing much faster results on my 8-core M1 Mac: about 12s for skipna=False on current main, dropping to 5s with sparse weights. The improvement with skipna=True is even bigger, I think because the sparsity limits the size of the weight array as we track NaNs over each dim.
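To make the skipna=True case concrete, here is a toy sketch (dense numpy, illustrative names only) of the usual renormalization trick for conservative regridding with missing data: zero-fill the NaNs, apply the weights, then divide by the regridded valid-data fraction. Both products run over the weight matrix, so storing it sparsely shrinks exactly the arrays that skipna has to carry along.

```python
import numpy as np

# Hypothetical 2x4 weight matrix: each output cell averages two inputs.
weights = np.array([
    [0.5, 0.5, 0.0, 0.0],  # output cell 0 <- input cells 0, 1
    [0.0, 0.0, 0.5, 0.5],  # output cell 1 <- input cells 2, 3
])

data = np.array([1.0, np.nan, 3.0, 5.0])
valid = ~np.isnan(data)

num = weights @ np.where(valid, data, 0.0)  # weighted sum of valid data
den = weights @ valid.astype(float)         # total weight of valid data

# Cells with no valid input stay NaN; others are renormalized so the
# missing value doesn't drag the average down.
out = np.where(den > 0, num / np.where(den > 0, den, 1.0), np.nan)
print(out)  # [1. 4.]
```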

Mixed results switching between the threaded and distributed schedulers: sometimes a bit faster, sometimes a bit slower.
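For anyone wanting to reproduce this kind of comparison, switching schedulers in dask is a one-liner each way; the workload below is a small stand-in (not the regridding benchmark itself) and the cluster sizes are arbitrary:

```python
import time

import dask.array as da
from dask.distributed import Client


def timed(compute):
    """Return the wall-clock seconds a compute call takes."""
    t0 = time.perf_counter()
    compute()
    return time.perf_counter() - t0


# Illustrative chunked workload standing in for a regrid operation.
x = da.random.random((1000, 1000), chunks=(100, 100))
work = (x @ x.T).mean()

# Threaded scheduler: the default for dask.array, shared memory, no setup.
t_threads = timed(lambda: work.compute(scheduler="threads"))

# Distributed scheduler: a local cluster of worker processes; compute()
# routes through it automatically while the client is active.
with Client(n_workers=2, threads_per_worker=2) as client:
    t_dist = timed(lambda: work.compute())

print(f"threads: {t_threads:.2f}s  distributed: {t_dist:.2f}s")
```

The distributed scheduler adds serialization and inter-process transfer costs, which plausibly explains why it only sometimes wins here.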

slevang commented 2 months ago

I ran the benchmarking test in #42 across several configurations on my 8-core i7 Linux desktop. Runtimes to the nearest second:

**chunks={"time": 1}, ~4MB**

| skipna=False | threads | distributed |
|--------------|---------|-------------|
| sparse       | 8       | 14          |
| dense        | 28      | 17          |
| xesmf        | 30      | 37          |

| skipna=True | threads | distributed |
|-------------|---------|-------------|
| sparse      | 67      | 82          |
| dense       | 327     | 335         |
| xesmf       | 55      | 71          |

**chunks={"time": 10}, ~40MB**

| skipna=False | threads | distributed |
|--------------|---------|-------------|
| sparse       | 6       | 7           |
| dense        | 13      | 12          |
| xesmf        | 7       | 6           |

| skipna=True | threads | distributed |
|-------------|---------|-------------|
| sparse      | 59      | 72          |
| dense       | OOM     | OOM         |
| xesmf       | 10      | 12          |

Lots of interesting variation. My takeaways: