xKDR / TSFrames.jl

Timeseries in Julia
MIT License

Consider storing index column name rather than fixing to Index #201

Open ancapdev opened 4 months ago

ancapdev commented 4 months ago

Is there appetite to change the TSFrame API so that it stores the name of the index column and preserves the source dataframe, rather than replacing the column with a new one named Index even when the user specifies an index column?

For context, I'm building a time series system with streaming and batch APIs. In my system the user defines schemas for their time series. These schemas include the time field/column, and consistently preserving field/column names throughout is important for my use case. The current TSFrame API makes that awkward, and I don't want the TSFrames column name override to govern downstream design and naming decisions.

At a more fundamental level, I would expect TSFrame to be a pure semantic layer that verifies the time ordering of rows in a dataframe, guaranteeing that invariant to functions operating on time series, without changing the underlying data the way it currently does.
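
To make the "pure semantic layer" idea concrete, here is a minimal sketch of an ordering check that leaves the dataframe untouched. The function name `check_time_ordered` is hypothetical and not part of TSFrames; it only assumes DataFrames.jl.

```julia
using DataFrames
using Dates

# Hypothetical helper (not a TSFrames function): verify that `col` is sorted
# in ascending order and return the same dataframe, unmodified.
function check_time_ordered(df::AbstractDataFrame, col::Union{Symbol,AbstractString})
    issorted(df[!, col]) ||
        throw(ArgumentError("column $col is not sorted in ascending order"))
    return df
end

df = DataFrame(timestamp = [DateTime(2024, 1, 1), DateTime(2024, 1, 2)],
               price = [10.0, 10.5])

# The column keeps its user-chosen name; nothing is renamed or copied.
checked = check_time_ordered(df, :timestamp)
```

A wrapper type built on a check like this could carry the ordering guarantee in its type while the column names stay whatever the schema says.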

Now that the design is burned in, I appreciate it may not be possible to change it without breaking assumptions in dependent code, but I thought it was worth asking.

chiraganand commented 4 months ago

I do appreciate the design choice of having the user define the date-time (sorting/matching) column, but having Index as the index column is one of those assumptions that provide certainty and make the TSFrames functions somewhat easier to maintain.

One can have:

struct TSFrame
    coredata::DataFrame  # the underlying data
    Index::String        # name of the index column in coredata
end

The constructors can default to the name Index in absence of a provided index column (the current behaviour).
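
As a rough illustration of the struct and the defaulting constructor described above, the sketch below uses a hypothetical type name, `IndexedFrame`, so it does not clash with the real `TSFrames.TSFrame`. The validation steps are assumptions for illustration, not existing TSFrames code.

```julia
using DataFrames

# Hypothetical type mirroring the proposed layout: the dataframe is stored
# as-is and the index column is referred to by name.
struct IndexedFrame
    coredata::DataFrame
    Index::String  # name of the index column in `coredata`

    # When no name is given, fall back to "Index".
    function IndexedFrame(coredata::DataFrame, Index::AbstractString = "Index")
        Index in names(coredata) ||
            throw(ArgumentError("no column named $Index in coredata"))
        issorted(coredata[!, Index]) ||
            throw(ArgumentError("column $Index is not sorted"))
        return new(coredata, String(Index))
    end
end

# User-specified index column: the name is preserved.
named = IndexedFrame(DataFrame(timestamp = [1, 2, 3], price = [10.0, 10.5, 11.0]),
                     "timestamp")

# No name given: the constructor looks for a column called "Index".
legacy = IndexedFrame(DataFrame(Index = 1:3, x = [1.0, 2.0, 3.0]))
```

Functions that currently hard-code the Index column would instead look it up through the stored name, which is where most of the code changes mentioned below would come from.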

Having said that, a lot of code will need to change, and, yes, many other assumptions will also need to be thought about again.

Meanwhile, would it be possible for your package to compose a TSFrame with an index string in the package struct? Would that solve your immediate problem?
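
The composition suggested here might look like the sketch below. `NamedSeries` and `original_frame` are hypothetical names for the downstream package. The sketch assumes the `TSFrame(df, index)` constructor and the `coredata` field behave as discussed in this thread, that is, the index column is stored under the name Index.

```julia
using DataFrames
using Dates
using TSFrames

# Hypothetical downstream struct: a TSFrame plus the original index column name.
struct NamedSeries
    ts::TSFrame
    indexname::String
end

# Build the TSFrame and remember the column name it was built from.
NamedSeries(df::DataFrame, indexname::AbstractString) =
    NamedSeries(TSFrame(df, indexname), String(indexname))

# Return a copy of the data with the Index column renamed back.
# Note: the column order may differ from the source dataframe.
original_frame(s::NamedSeries) =
    rename(s.ts.coredata, :Index => Symbol(s.indexname))

df = DataFrame(timestamp = [Date(2024, 1, 1), Date(2024, 1, 2)],
               price = [10.0, 10.5])
series = NamedSeries(df, "timestamp")
restored = original_frame(series)
```

This keeps the name available, but code that works on `coredata` directly still sees the Index column unless it goes through a helper like `original_frame`.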

ancapdev commented 3 months ago

Hi, thanks for replying.

In my use case a lot of the end processing happens on the underlying data frame (coredata) directly, so that's the crux of the issue. I need to preserve the column names in these. For now I'm going with plain DataFrame objects, and in the future we'll either develop our own time series wrapper, or see if TSFrames can move towards an API that doesn't touch the underlying data.

chiraganand commented 3 months ago

I understand. As I said, it would be useful to have this flexibility in the package. I will keep this issue open for now, so that someone can pick it up and submit a PR.