martijnende commented 8 months ago

Outline

This PR adds a framework for composing complex data processing pipelines by chaining elementary operations. The motivation for composability is that experienced DAS analysts have already developed their preferred data analysis workflows and are unlikely to adopt new end-to-end workflows over which they have no control. So, instead of providing complex operations with no room for customisation, xdas.Sequence offers a framework for chaining basic operations (xdas.Atom objects) in a user-specified order, each with its own function arguments. This allows for optimisation at the level of individual atoms as well as of the entire pipeline, while users retain the same flexibility as when writing the pipeline themselves.

The new Sequence object aims to replace the old ProcessingChain.

Usage

In:

import xdas.signal as xp
from xdas import Atom, Sequence

sequence = Sequence(
    [
        Atom(xp.taper, dim="time"),
        Atom(xp.taper, dim="distance"),
    ]
)
print(sequence)
sequence(db)

Out:

Sequence:
  0: taper(..., dim=time)
  1: taper(..., dim=distance)
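
To make the chaining mechanism concrete, here is a minimal stand-in sketch in plain Python. It is not the xdas implementation: MiniAtom, MiniSequence, scale and offset are hypothetical names that only mimic the call-and-print behaviour shown above.

```python
class MiniAtom:
    """Hypothetical stand-in for an atom: a function plus fixed keyword arguments."""

    def __init__(self, func, **kwargs):
        self.func = func
        self.kwargs = kwargs

    def __call__(self, x):
        # Forward the data together with the stored keyword arguments.
        return self.func(x, **self.kwargs)

    def __repr__(self):
        args = ", ".join(f"{k}={v}" for k, v in self.kwargs.items())
        name = self.func.__name__
        return f"{name}(..., {args})" if args else f"{name}(...)"


class MiniSequence(list):
    """Hypothetical stand-in for a sequence: atoms applied in list order."""

    def __call__(self, x):
        for atom in self:
            x = atom(x)
        return x

    def __repr__(self):
        lines = ["Sequence:"]
        lines += [f"  {i}: {atom!r}" for i, atom in enumerate(self)]
        return "\n".join(lines)


# Toy operations standing in for real signal processing functions.
def scale(x, factor=1.0):
    return [v * factor for v in x]


def offset(x, value=0.0):
    return [v + value for v in x]


pipeline = MiniSequence([MiniAtom(scale, factor=2.0), MiniAtom(offset, value=1.0)])
result = pipeline([1.0, 2.0, 3.0])  # scale first, then offset
print(pipeline)
```

Because the order and the arguments live in the sequence rather than in the operations, the same atoms can be reordered or reconfigured without touching the functions themselves.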

TODO

[x] Complete implementation of stateful operations (compose.StateAtom)
[x] Fully integrate compose.Sequence.execute with processing.ProcessingChain.process, including chunked processing and stateful operations
[x] Test sequences of numpy and user-defined operations
[x] Test sequences of xdas.signal built-in operations
[ ] Expand documentation and add examples
[ ] Create recipes (FK-analysis, STA/LTA)

martijnende commented 8 months ago

Why not subclass `Sequence` from `list` instead of `dict`?

The main difference would be the loss of descriptive keywords, and with them the ability to perform operations on atoms selected by keyword. Since these keywords do not necessarily depend on the position of a given atom in the sequence, they let you define modifications of a sequence in a more reusable way. On the other hand, using indices rather than keywords makes the bookkeeping a lot simpler (no need for unique naming and duplicate checking). So we could reconsider this trade-off between "selectability" and code complexity.
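
The trade-off can be illustrated with plain Python containers. This is a sketch only, not the xdas code: scale, shift, run and add_unique are hypothetical helpers, and functools.partial stands in for an atom.

```python
from functools import partial


def scale(x, factor):
    return x * factor


def shift(x, value):
    return x + value


def run(steps, x):
    """Apply each step in order."""
    for step in steps:
        x = step(x)
    return x


def add_unique(steps, name, func):
    """Bookkeeping that named steps require: reject duplicate names."""
    if name in steps:
        raise KeyError(f"step name already in use: {name}")
    steps[name] = func


# Keyword-based: a dict keeps insertion order (Python 3.7+).
by_name = {}
add_unique(by_name, "gain", partial(scale, factor=2))
add_unique(by_name, "bias", partial(shift, value=1))
# The modification refers to "gain" wherever it sits in the sequence.
by_name["gain"] = partial(scale, factor=10)
named_result = run(by_name.values(), 3)

# Index-based: no naming rules, but the edit is tied to position 0.
by_index = [partial(scale, factor=2), partial(shift, value=1)]
by_index[0] = partial(scale, factor=10)
indexed_result = run(by_index, 3)
```

Both variants compute the same result; the difference is whether a reusable modification addresses a step by name or by position.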

The Keras sequences are not meant to be stored as recipes, which was one of the initial motivations for the xdas composability (from xdas.recipes import fk giving you a predefined sequence). Users might want to modify a pre-defined recipe to suit their needs, which is where the sequence manipulations come in. Defining an order only at declaration time prevents user modifications.

Would it make sense that `Atom` and `StateAtom` subclass `partial`?

What would we gain from subclassing partial?
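
For reference, functools.partial already stores the function and its keyword arguments (as .func and .keywords) and forwards calls, so a subclass would mostly add presentation on top. A minimal sketch, assuming a hypothetical PartialAtom class and a toy taper function (neither is xdas code):

```python
from functools import partial


class PartialAtom(partial):
    """Hypothetical atom that reuses partial's argument storage and call forwarding."""

    def __repr__(self):
        kw = ", ".join(f"{k}={v}" for k, v in self.keywords.items())
        return f"{self.func.__name__}(..., {kw})"


def taper(x, dim=None):
    # Toy placeholder for a real tapering function.
    return f"tapered {x} along {dim}"


atom = PartialAtom(taper, dim="time")
out = atom("data")  # equivalent to taper("data", dim="time")
```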

Would it make sense to have nested Sequences?

Maybe this would make sense for output handling: one Sequence generates one output, so if you want intermediate outputs you'd need to define multiple sequences, each of which is placed in a higher-level sequence. If not for the output, it would make no difference whether sequences are nested or concatenated.
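
A small sketch of that point, using plain lists as stand-ins for sequences (the run helper and the toy functions are hypothetical, not part of xdas): nesting and concatenation give the same final result, and the only observable difference is the intermediate output captured at the end of each nested sequence.

```python
def double(x):
    return 2 * x


def increment(x):
    return x + 1


def square(x):
    return x * x


def run(steps, x, outputs=None):
    """Apply steps in order; a nested list acts as a sub-sequence.

    If `outputs` is given, the result of each sub-sequence is recorded.
    """
    for step in steps:
        if isinstance(step, list):
            x = run(step, x, outputs)
            if outputs is not None:
                outputs.append(x)
        else:
            x = step(x)
    return x


flat = [double, increment, square]
nested = [[double, increment], [square]]

outputs = []
flat_result = run(flat, 3)
nested_result = run(nested, 3, outputs)
```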