xCDAT / xcdat

An extension of xarray for climate data analysis on structured grids.
https://xcdat.readthedocs.io/en/latest/
Apache License 2.0
113 stars 12 forks source link

[Enhancement]: Consider how to test for correct weights generation in xCDAT #699

Open tomvothecoder opened 5 hours ago

tomvothecoder commented 5 hours ago

Is your feature request related to a problem?

In PR #689, in the TemporalAccessor._get_weights() method, I removed validation that checks the sums of weights for each time group adds up to 1 (related code).

This was done because:

  1. Assertion code should only be used in testing and debugging (link) -- TLDR: They can be turned off by the user (-O flag), not good practice. I copied this validation code from an Xarray notebook (link) without realizing the implications.
  2. I added unit tests to cover the implementation logic using the same validation code (here)
  3. np.testing.assert_allclose() degrades performance -- I expect the performance degradation to scale up in relation to the size to the data, maybe it loads data into memory with .values?

Describe the solution you'd like

What is an assertion and when to use it

assert is a common idiom in many languages. The purpose is to check conditions that are expected to always be true. A failed assertion indicates a coding bug. Therefore, the method returns nothing normally, and throws an exception if the condition fails. ... Assertions are supposed to pass 100% of the time. An exception is only thrown when the condition unexpectedly fails.

-- https://stackoverflow.com/questions/42837054/are-assert-methods-acceptable-in-production

Option 1: Don't keep this assertion

I think the sum of weights for each group should always be 1.0 (100%) based on our implementation logic. Otherwise, the assertion would indicate a coding error on our behalf rather than bad data (although we can raise an exception if it is indeed actually due to bad time bounds). As far as I can tell, our implementation logic is right and nobody has ran into _get_weights() throwing an AssertionError from the validation code.

The _get_weights() method works by:

  1. Get time lengths by taking the difference between bounds (time_lengths)
  2. Group time lengths (grouped_time_lengths)
  3. Divide grouped time lengths by the sum of each group (grouped_time_length / grouped_time_lengths.sum())

https://github.com/xCDAT/xcdat/blob/e9e73dd98e20b77e690fd3b4cf6df947a9f58e60/xcdat/temporal.py#L1213-L1262

Option 2: Keep this assertion

Maybe the time bounds might not be correct for some reason and it produces incorrect weight for certain groups (e.g., missing data)? Not sure here. I will try experimenting with bad time bounds.

If we want to keep this assertion:

  1. Figure out how to safely use np.testing.assert_allclose() in production since it raises an AssertionError that can be turned off by the user
  2. Find a way to optimize performance for np.testing.assert_allclose()
  3. Raise a RuntimeError if numpy assertion error is raised

Describe alternatives you've considered

No response

Additional context

After this ticket is addressed, we can proceed with releasing v0.7.2.

tomvothecoder commented 5 hours ago

@xCDAT/core-developers Any input would be appreciated here. Thanks.

pochedls commented 5 hours ago

I think you can go with option 1 (drop the assertion). Even though I raised concern about removing this, when I review this code carefully, I don't think it is needed. Do you remember why it was put there in the first place?

I don't think I helped write this section of code, but I sometimes add these kind of tests to make sure NaNs aren't messing things up. I don't think that should be an issue in this instance.

tomvothecoder commented 4 hours ago

Do you remember why it was put there in the first place?

I based my initial implementation of _get_weights() on the Xarray notebook here, which includes that validation code. It was good to have while implementing the logic in xCDAT, but I don't think it is necessary in production.