wildlife-dynamics / ecoscope-workflows

An extensible task specification and compiler for local and distributed workflows.
BSD 3-Clause "New" or "Revised" License

Task: Transformations - Assign from column attribute #51

Open cisaacstern opened 5 days ago

cisaacstern commented 5 days ago

Towards #39

cisaacstern commented 5 days ago

This PR follows suggestions from ChatGPT (what a world we live in), pasted in https://github.com/wildlife-dynamics/ecoscope-workflows/issues/39#issuecomment-2189954068, which I thought were quite reasonable: use pd.DataFrame.assign in a limited way to add new columns based on attributes of other columns in the dataframe.

Opening this now because I believe I will need it to address #28 / #45. To explain why: in #45, I want to group a dataset by month, but the dataset does not already have a "month" column. AFAICT, our existing transformations do not support adding a "month" column derived from an existing column of pd.Timestamp values. This small PR adds that ability.
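For context, the plain pandas operation this builds on can be sketched as follows. This is a minimal illustration using only pd.DataFrame.assign and the .dt accessor, not the task added in this PR; the sample data and the names with_month and monthly_totals are made up for the example.

```python
import pandas as pd

# Illustrative data only; not taken from the issue or PR
df = pd.DataFrame(
    {
        "recorded_at": pd.to_datetime(["2021-01-15", "2021-02-03", "2021-02-20"]),
        "value": [1, 2, 3],
    }
)

# assign returns a new dataframe with the extra column; df itself is unchanged
with_month = df.assign(month=df["recorded_at"].dt.month)

# The derived column can then be used as an ordinary grouping key
monthly_totals = with_month.groupby("month")["value"].sum()
```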

walljcg commented 5 days ago

Thanks Charles!

Have you seen the temporal and period indexers we have here?

My assumption has been that we use these functions to create the indexes and then use DataFrame.groupby([groupers]) to create a grouped object, which is then passed into the split-apply-combine step?
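A rough sketch of that pattern in plain pandas, assuming the grouper is a standalone Series built from the timestamp column. The ecoscope indexers are not shown in this thread, so dt.to_period("M") stands in for them here, and the names month_index and counts are illustrative.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame(
    {
        "recorded_at": pd.to_datetime(
            ["2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01", "2021-03-01"]
        ),
        "value": [5, 6, 7, 8, 9],
    }
)

# Stand-in for an indexer: a Series of monthly periods aligned with df's index
month_index = df["recorded_at"].dt.to_period("M").rename("month")

# groupby accepts an aligned Series as the key, so no new column is required
grouped = df.groupby(month_index)
counts = grouped["value"].count()
```

With this approach the dataframe is never modified; the grouping key lives outside it.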

cisaacstern commented 4 days ago

A-ha, I had not been aware of those utils.

So using the linked indexers in ecoscope core, how would I group this dataframe by month?

import pandas as pd

df = pd.DataFrame(
    {
        "recorded_at": [
            pd.Timestamp("2021-01"),
            pd.Timestamp("2021-01"),
            pd.Timestamp("2021-02"),
            pd.Timestamp("2021-02"),
            pd.Timestamp("2021-03"),
        ],
        "value": [5, 6, 7, 8, 9],
    }
)

With this PR, it would be:

from ecoscope_workflows.tasks.transformation import assign_from_column_attribute

df_new = assign_from_column_attribute(
    df, column_name="month", dotted_attribute_name="recorded_at.dt.month"
)
df_new.groupby("month")
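For readers without the diff at hand, here is one way such a helper could work. This is a hypothetical sketch, not the implementation in this PR: the function name assign_from_dotted_attribute is invented for illustration, and it assumes the dotted name starts with the source column, followed by a chain of attributes to resolve on it.

```python
from functools import reduce

import pandas as pd


def assign_from_dotted_attribute(df, column_name, dotted_attribute_name):
    """Hypothetical helper: add a column resolved from e.g. 'recorded_at.dt.month'."""
    # The first segment names the source column; the rest is an attribute chain
    source_column, *attribute_path = dotted_attribute_name.split(".")
    # Walk the chain: df["recorded_at"] -> .dt -> .month
    values = reduce(getattr, attribute_path, df[source_column])
    # assign returns a new dataframe and leaves the input untouched
    return df.assign(**{column_name: values})


df = pd.DataFrame(
    {
        "recorded_at": pd.to_datetime(
            ["2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01", "2021-03-01"]
        ),
        "value": [5, 6, 7, 8, 9],
    }
)

df_new = assign_from_dotted_attribute(
    df, column_name="month", dotted_attribute_name="recorded_at.dt.month"
)
totals = df_new.groupby("month")["value"].sum()
```

Resolving attributes with getattr only reads properties; it does not call methods, which keeps the helper limited in the way the PR description suggests.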