Thanks for reporting this and providing such detailed explanations!
TL;DR: Try setting `rtol` to a lower value, such as `1e-5`, in the `EOFRotator` class.
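For example (a minimal sketch; the mode count is a placeholder, and `eof_model` stands for your already-fitted EOF model):

```python
from xeofs.models import EOFRotator

# Relax the convergence tolerance of the rotation (the default is rtol=1e-8)
rotator = EOFRotator(n_modes=20, rtol=1e-5)
rotator.fit(eof_model)  # eof_model: a previously fitted xeofs EOF model
```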
It seems your intuition about the dataset shapes is correct. The `EOF` class has a computational complexity of approximately O(n_samples × n_features × log(n_modes)). While I'm not sure of the exact complexity of the `EOFRotator` class, two key points are relevant here:

1. In your case, the larger dataset has about 500,000 features, while the smaller one has around 4.5 million. So although the `EOF` model might fit faster on the smaller dataset, the rotation step will likely take longer (assuming you're rotating the same number of modes for both datasets).
2. I did a quick profiling of the `_promax` function, which handles the core computation in the `EOFRotator`, and its runtime appears to follow a power law in dataset size (linear on a log-log plot) with exponent $k = 0.92$, i.e. $t(n) \propto n^{0.92}$.** A sketch of this kind of fit is shown below the list.
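For reference, here's a minimal sketch of that kind of log-log fit (the timing values below are placeholders, not my actual measurements):

```python
import numpy as np

# Hypothetical measurements: wall-clock time of the rotation step for
# increasing numbers of features (placeholder values, for illustration only)
n_features = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
seconds = np.array([0.7, 2.0, 6.1, 16.8, 50.3])

# A power law t = c * n^k is a straight line on a log-log plot:
# log(t) = k*log(n) + log(c), so fit a degree-1 polynomial in log space.
k, log_c = np.polyfit(np.log(n_features), np.log(seconds), 1)
print(f"estimated exponent k ≈ {k:.2f}")

# Extrapolate to a larger dataset (e.g. 4.5 million features)
t_pred = np.exp(log_c) * 4.5e6**k
print(f"predicted runtime: {t_pred / 60:.1f} minutes")
```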
If this relationship holds for larger datasets, a dataset of your size should take about 10-15 minutes on my (standard) laptop. I also set the relative tolerance (`rtol`) for the rotation process to `1e-5` instead of the default `1e-8`, which might be too strict and unnecessarily increase computation time for large datasets like yours.
Let me know if that helps!
** Actually, it looks like quite a linear relationship ...
Thank you so much! It works much faster with a lower `rtol` argument.
While I have your attention, I have a few other questions:
Thank you!
Happy that it worked! I really like both ideas and would love to see them come to life someday. Doing something similar to the ROCK-PCA paper sounds great, but I won’t have the time to tackle it on my own anytime soon. My experience with various methods is pretty uneven -- some I use regularly, some I’ve only implemented, and others, like DMD, I’ve just heard of but would love to see integrated down the road. That said, this could be a cool project to collaborate on!
Thanks for your answer! If I had more DMD experience I would gladly help, but I'm still learning how to use the method. Maybe I will feel confident enough at some point to help with this!
GOAL
I want to apply the REOF method to a DataArray stack of SAR images. I have already applied it successfully to another DataArray stack of surface velocity images.
Reproducible Minimal Working Example
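A sketch of the kind of toy setup I mean (the array shape, mode count, and random data are illustrative stand-ins for the real stacks):

```python
import numpy as np
import xarray as xr
from xeofs.models import EOF, EOFRotator

# Toy stand-in for the SAR stack: (time, y, x), float32
data = xr.DataArray(
    np.random.rand(60, 1000, 800).astype("float32"),
    dims=("time", "y", "x"),
)

# Fit the EOF model along the time dimension
model = EOF(n_modes=20)
model.fit(data, dim="time")

# Rotate the modes (REOF); this is the step that hangs on the SAR stack
rotator = EOFRotator(n_modes=20)
rotator.fit(model)
```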
PROBLEM
The REOF for the Velocity stack takes 219.71 seconds to compute. The REOF for the SAR stack is still running after 30 minutes, and the REOF with the toy example (above) is also still running after 30 minutes.
Why is there such a difference in computation time between the two datasets, with the smallest dataset taking much longer? Is it related to the shape of the dataset?
Declaration
Desktop: `xeofs` version 2.4.0
Additional context
The 'Velocity' stack has a few NaNs, while the 'SAR' stack has none. They are both 3D numpy arrays of float32. The 'SAR' stack has 47,163,090 non-NaN values; the 'Velocity' stack has 331,342,488 (7 times more). Here is a screenshot of both:
SAR stack: [screenshot]
Velocity stack: [screenshot]
PS: Thank you for your work on this package; it is one of the most helpful I've had to use!