Use only one variable as the target but supply many features to models.

rcyost commented 9 months ago

I'm only interested in forecasting one feature. I'd like to speed up the computation by not waiting on the other series to be forecast. How do users forecast only one series? Do I submit a df with only the series I want to forecast and then submit every other feature as external data? Thanks in advance, this is the best auto time series package I've seen. Super impressive!

winedarksea commented 9 months ago

As always, compliments are appreciated, thanks!

What I would do, and this only works on wide-style data, is this:

regr_train, regr_fcst = create_regressor(
    df,
    forecast_length=forecast_length,
    frequency=frequency,
    drop_most_recent=drop_most_recent,
    scale=True,
    summarize="auto",
    backfill="bfill",
    fill_na="spline",
    holiday_countries={"US": None},  # requires holidays package
    encode_holiday_type=True,
    # datepart_method="simple_2",
)

# remove the first forecast_length rows (because those are lost in regressor)
df = df.iloc[forecast_length:]
regr_train = regr_train.iloc[forecast_length:]

# instantiate the model here, e.g. model = AutoTS(...), then

model = model.fit(
    df["the_series_I_want"].to_frame(),
    future_regressor=regr_train,
)

# and likewise pass regr_fcst to .predict, e.g. model.predict(future_regressor=regr_fcst)

Note that this uses summarize='auto'. If you don't want the other features summarized, pass summarize=None instead.
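
To see why the first forecast_length rows are trimmed from both df and regr_train, here is a minimal plain-Python sketch. It assumes the regressor is built from features lagged by forecast_length steps; that is an illustration of the idea, not the actual create_regressor implementation, and the names lag, target, and feature are made up for the example.

```python
# Illustrative only: a plain-Python stand-in for a lagged regressor.
# Assumption: the regressor value at time t is the feature value from
# forecast_length steps earlier, so the first rows have no real history.

forecast_length = 2
target = [10, 11, 12, 13, 14, 15]
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


def lag(values, steps):
    """Shift values forward by `steps`, padding the start with None."""
    return [None] * steps + values[: len(values) - steps]


# Regressor aligned with the training period: the first rows are padding.
regr_train = lag(feature, forecast_length)

# The most recent feature values become the regressor for the future steps.
regr_fcst = feature[-forecast_length:]

# Trim the padded rows from both the target and the regressor so they align.
target_trimmed = target[forecast_length:]
regr_trimmed = regr_train[forecast_length:]

print(regr_train)      # padded at the start
print(target_trimmed)  # same length as regr_trimmed
print(regr_trimmed)    # no padding left
print(regr_fcst)       # one value per forecast step
```

The point is only that the target and the training regressor must cover the same rows, and that the forecast regressor must have exactly forecast_length rows.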

rcyost commented 8 months ago

Describe the bug

Oddly enough after getting to model 10/10, the code fails at the last minute!

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

To Reproduce

Steps to reproduce the behavior:

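import pandas as pd
from autots import AutoTS, create_regressor

raw_data = pd.read_csv('https://files.stlouisfed.org/files/htdocs/fred-md/monthly/current.csv')
raw_data = raw_data.drop(0, axis=0)  # drop the row with index label 0
raw_data = raw_data[raw_data['sasdate'].notna()]
data = raw_data.dropna(axis=1)  # note: `data` is not used below; df is built from raw_data
raw_data = raw_data.set_index('sasdate')

raw_data.index = pd.to_datetime(raw_data.index)
df = raw_data

df.tail()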

#%%
# https://github.com/winedarksea/AutoTS/issues/201

forecast_length = 6
frequency = 'infer'
drop_most_recent = True
target_series = 'CPIAUCSL'

regr_train, regr_fcst = create_regressor(
    df,
    forecast_length=forecast_length,
    frequency=frequency,
    drop_most_recent=drop_most_recent,
    scale=True,
    summarize="auto",
    backfill="bfill",
    fill_na="spline",
    holiday_countries={"US": None},  # requires holidays package
    encode_holiday_type=True,
    # datepart_method="simple_2",
)

# remove the first forecast_length rows (because those are lost in regressor)
df = df.iloc[forecast_length:]
regr_train = regr_train.iloc[forecast_length:]

#%%

model = AutoTS(
    forecast_length=forecast_length,
    frequency=frequency,

    # 0-1, uncertainty range for upper and lower forecasts. Adjust range, but rarely matches actual containment.
    prediction_interval=0.9,

    # 'auto', 'simple', 'distance', 'horizontal', 'horizontal-min', 'horizontal-max', "mosaic", "subsample"
    ensemble='auto',
    # autots -> models -> model_lists.py
    model_list="default",  # "superfast", "default", "fast_parallel"
    transformer_list="superfast",  # "superfast",

    # number of rows to drop
    drop_most_recent=drop_most_recent,

    # Each generation tries new models, taking additional time but improving the accuracy.
    # The nature of genetic algorithms, however, means there is no consistent improvement for each generation,
    # and large number of generations will often only result in minimal performance gains.
    max_generations=10,

    # num_validations is the number of cross validations to be done in addition.
    # In general, the safest approach is to have as many validations as possible,
    # as long as there is sufficient data for it.
    num_validations=5,

    # Backwards cross validation is the safest method and
    # works backwards from the most recent data.
    # First the most recent forecast_length samples are taken,
    # then the next most recent forecast_length samples, and so on.
    # This makes it more ideal for smaller or fast-changing datasets.
    validation_method="backwards",

    current_model_file='autots_run2',
)
# fit on the target series only; the other features enter through future_regressor

model = model.fit(
    df[target_series].to_frame(),
    future_regressor=regr_train,
)

prediction = model.predict()
# plot a sample
prediction.plot(model.df_wide_numeric,
                series=model.df_wide_numeric.columns[0],
                start_date="2019-01-01")
# Print the details of the best model
print(model)

# point forecasts dataframe
forecasts_df = prediction.forecast
# upper and lower forecasts
forecasts_up, forecasts_low = prediction.upper_forecast, prediction.lower_forecast

# accuracy of all tried model results
model_results = model.results()
# and aggregated from cross validation
validation_results = model.results("validation")


Screenshots

I see no NaNs in the data I use as well.

Additional context

Could you please help me understand why it fails at the end? Thanks!

winedarksea commented 8 months ago

Thanks for the detailed bug details.

My best guess is that the chosen best_model works during cross validation but fails on the actual prediction over the full data. In that case .fit() would complete but .predict() would then fail.

The NaN values are probably being introduced by some of the preprocessing transformers.
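
As a rough illustration of how a preprocessing step can introduce NaNs into data that has none, here is a small plain-Python sketch. The pct_change function is a made-up stand-in and not one of AutoTS's transformers; which transformer (if any) caused the error above is not known from this thread.

```python
import math


def pct_change(values):
    """Hypothetical transformer: percent change from the previous value.

    The first element has no predecessor, and a zero predecessor makes the
    ratio undefined, so both cases yield NaN.
    """
    out = [math.nan]
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev if prev != 0 else math.nan)
    return out


def count_nan(values):
    """Count the NaN entries in a list of floats."""
    return sum(1 for v in values if math.isnan(v))


clean = [100.0, 101.0, 0.0, 103.0]
transformed = pct_change(clean)

print(count_nan(clean))        # NaNs in the raw input
print(count_nan(transformed))  # NaNs after the transformation
```

Checking the NaN count before and after each transformation step is one way to narrow down where missing values enter.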

https://fred.stlouisfed.org/series/CPIAUCSL looks like a pretty clean series so it's unlikely the issue is with the data.

Could you print these out?

model.best_model_name
model.best_model_params
model.best_model_transformation_params

Or paste the 'autots_run2.json' file here, which it looks like you are saving.

If you want a short-term fix, try changing the random seed or the model_list; the bug can likely be avoided that way. It would still be useful if you could send me more details.

rcyost commented 8 months ago

Can you try running the code to reproduce the error? In the meantime I'll try what you suggest and report back with what I find. Thanks!

rcyost commented 8 months ago

drop_most_recent= True, maybe it's this? Curious why it didn't throw an error, since there is type hinting.

winedarksea commented 8 months ago

I haven't had time to fully look at it yet, but that is an interesting possibility. I'll add better checking on that variable to raise an error for bad input (type hinting is just hinting, it won't raise errors). But True evaluates as 1 (test this with int(True)) and appears to work just fine as an input, equivalent to passing 1.
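
A quick plain-Python sketch of both points: type hints are not enforced at runtime, and True behaves like 1. The functions drop_recent and validate_drop_most_recent are hypothetical helpers written for this example, not AutoTS code; the stricter check is just one way such input validation could look.

```python
def drop_recent(rows, drop_most_recent: int = 0):
    """Drop the last `drop_most_recent` rows. The int hint is not enforced."""
    if drop_most_recent > 0:
        return rows[:-drop_most_recent]
    return rows


def validate_drop_most_recent(value):
    """Hypothetical stricter check: accept only non-negative ints, not bools."""
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError(
            f"drop_most_recent must be a non-negative int, got {value!r}"
        )
    return value


rows = ["2023-01", "2023-02", "2023-03"]

# bool is a subclass of int, so True is accepted and acts as 1.
print(isinstance(True, int))
print(int(True))
print(drop_recent(rows, True) == drop_recent(rows, 1))

# The explicit check rejects the bool instead of silently treating it as 1.
try:
    validate_drop_most_recent(True)
except ValueError as err:
    print(err)
```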

winedarksea commented 8 months ago

View the processed data at model.df_wide_numeric and you'll see it is dropping 1 row of data and nothing more.

winedarksea commented 8 months ago

Your code ran fine for me. I think the error came from the particular model chosen.

rcyost commented 8 months ago

Thanks, I re-ran it and it worked fine. Interesting!
