There is no detail about train data and test data, and no information about [forecast_length] in the documentation?

winedarksea / AutoTS

Automated Time Series Forecasting

MIT License


There is no detail about train data and test data, and no information about [forecast_length] in the documentation? #120

Closed faridelya closed 2 years ago

faridelya commented 2 years ago

Hi, I hope you are all doing well. I want to ask some basic questions. How do I provide the training data? How do I provide the test data? How do I specify the target column? I read the docs, but I never understood what forecast_length is used for. My last question: as I read in the docs, categorical data must be label encoded before being fed in. Must all columns be of int dtype, or can we use float dtype columns as well? Please help me with these questions; I would really appreciate it. Thanks

winedarksea commented 2 years ago

It seems that you are expecting a general purpose AutoML solution like TPOT or auto-sklearn. AutoTS is only for time series forecasting. It requires a slightly different data input than you seem to be expecting.

You pass only one dataframe, every column of which is a target, to .fit(df=), including all of your historical data. To adjust the test period, which is based on time, you select the validation_method. A custom validation method can be used to specify exact dates of the time period. Then forecast_length is what is often called the 'horizon', or how many periods ahead to forecast into the future. Any data which is not itself a target can be handled as parallel series or as regressors; see the extended_tutorial.md for that. The extended_tutorial has plenty of examples using sample datasets. Yes, ideally data is already of datatype float, although categorical data can be encoded beforehand, or passed in and the built-in ordinal encoding used. Forecasting categorical data is uncommon, as most time series are numeric.
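As a rough illustration of the input shape described above, here is a minimal sketch using pandas with made-up data. The column names, dates, and the `future_index` helper are assumptions for illustration only and are not part of AutoTS; the helper merely shows which timestamps a `forecast_length` of 3 would cover.

```python
import pandas as pd

# Hypothetical wide-format history: a DatetimeIndex and one column per series.
# Every column is treated as a target to forecast.
history = pd.DataFrame(
    {
        "store_a_sales": [10.0, 12.0, 11.0, 13.0, 14.0, 15.0],
        "store_b_sales": [5.0, 5.5, 6.0, 5.8, 6.2, 6.5],
    },
    index=pd.date_range("2022-01-01", periods=6, freq="D"),
)


def future_index(df: pd.DataFrame, forecast_length: int) -> pd.DatetimeIndex:
    """Return the timestamps a forecast of `forecast_length` periods would cover.

    Illustrative helper (not an AutoTS function): the horizon starts one
    period after the last observed timestamp.
    """
    freq = pd.infer_freq(df.index)
    # Generate forecast_length + 1 stamps, then drop the last observed one.
    return pd.date_range(df.index[-1], periods=forecast_length + 1, freq=freq)[1:]


horizon = future_index(history, forecast_length=3)
print(horizon)
```

There is no separate X, y, or test frame here: the whole history goes in as one frame, and the horizon is simply the next `forecast_length` periods after the last row.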

faridelya commented 2 years ago

Sorry for the interruption, but I am new to the time series domain and also to AutoTS. I am using time series data in a DataFrame, and here is a snapshot; there are around 2 lakh (200,000) rows in total:

[Screenshot from 2022-05-19 17-56-20]

Target is the dependent variable, so how can I fit it into the example below? Suppose my X = [features] and y = df['Target']; how do I adapt this to my data?

from autots import AutoTS
from autots.datasets import load_hourly

df_wide = load_hourly(long=False)

# here we care most about traffic volume, all other series assumed to be weight of 1
weights_hourly = {'traffic_volume': 20}

model_list = [
    'LastValueNaive',
    'GLS',
    'ETS',
    'AverageValueNaive',
    'TensorflowSTS',
    'TFPRegression',
]

model = AutoTS(
    forecast_length=49,
    frequency='infer',
    prediction_interval=0.95,
    ensemble=['simple', 'horizontal-min'],
    max_generations=5,
    num_validations=2,
    validation_method='seasonal 168',
    model_list=model_list,
    transformer_list='all',
    models_to_validate=0.2,
    drop_most_recent=1,
    n_jobs='auto',
)

model = model.fit(
    df_wide,
    weights=weights_hourly,
)

prediction = model.predict()
forecasts_df = prediction.forecast
prediction.long_form_results()

if model.best_model_ensemble == 2:
    model.plot_horizontal()

Any solution would be highly appreciated. Thanks @winedarksea

winedarksea commented 2 years ago

To start with, I would not use TensorflowSTS and TFPRegression for production. It looks like you are looking for neural nets. 'WindowRegression' (and the other Regressions) will automatically use Tensorflow models if it is installed locally. GluonTS is also a good library to check out if you like neural nets.

For simplicity's sake, I would recommend you pass model_list='fast' and not tune any other parameters until you've got it working in the simplest form.

It looks like you have multiple series being input in a long style format.

from autots import AutoTS

model = AutoTS(
    forecast_length=49,
    model_list='fast'
)
model = model.fit(df, date_col='Date', value_col='Target', id_col='SecuritiesCode')
prediction = model.predict()
forecasts_df = prediction.forecast

# Print the best model
print(model)

This isn't using any of the other columns, nor does it need those. However, if you are able to get this working, then including the other columns as parallel series or as regressors is possible.
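To make the long-versus-wide distinction concrete, the sketch below reshapes a small made-up long-format table into the wide layout using plain pandas. The column names mirror the ones in the call above, the values are invented, and this only illustrates the reshaping; it is not necessarily how AutoTS does it internally.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, series id) pair.
long_df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2022-05-17", "2022-05-17", "2022-05-18", "2022-05-18"]
        ),
        "SecuritiesCode": [1301, 1332, 1301, 1332],
        "Target": [0.01, -0.02, 0.03, 0.00],
    }
)

# Wide format: dates as the index, one column per series id,
# with the value column filling the cells.
wide_df = long_df.pivot(index="Date", columns="SecuritiesCode", values="Target")
print(wide_df)
```

Each SecuritiesCode becomes its own column, which is the "every column is a target" shape; the three column arguments just tell the library which long-format column plays which role.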

adanebehr commented 2 years ago

First off, thank you so much for this awesome package! It's so cool that you've integrated so many different models into a single place! I've been blabbing on and on to my wife about how cool it is. What was your motivation for developing this? Did it come out of a master's or PhD project?

Do you have any recommendation for making forecasts that are of "un-recommended" lengths? I would like to forecast out by 2-4X my historical data length. I tried decreasing the min_allowed_trainpercent parameter to zero (and tried very small numbers) as recommended by the error "ValueError: forecast_length is too large for training data, alter min_allowed_trainpercent to override". But this did not help it to run.

winedarksea commented 2 years ago

Thanks! This project comes partly out of work on a master's and out of work at two different companies plus a sabbatical. My motivation was simply that no existing forecasting packages worked like I wanted them to work, and so here we are.

It is simply impossible as written to forecast out beyond the length of training data using the AutoTS class. You should be able to use model_forecast for that (untested but should work for most models), but the AutoTS object requires at least one train/test sample, and it needs a full length of history for that. Another solution might be to duplicate your data. For example, if you only had data from 2020, you could make two copies of it, and then label them as 2019 and 2018, giving you more history. Not perfect, but it would probably work.
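The duplication idea can be sketched with pandas roughly as follows. The data and the `pad_history_with_copies` helper are hypothetical, and the sketch only shows relabeling copies of one year with earlier dates. Note that copies like this repeat the same pattern every year, so a model fitted on them will see that repetition as if it were real.

```python
import pandas as pd

# Hypothetical single year of monthly observations.
df_2020 = pd.DataFrame(
    {"cashflow": [float(v) for v in range(1, 13)]},
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)


def pad_history_with_copies(df: pd.DataFrame, n_copies: int) -> pd.DataFrame:
    """Prepend `n_copies` copies of `df`, each relabeled one more year back."""
    pieces = []
    # Oldest copy first so the result stays in chronological order.
    for years_back in range(n_copies, 0, -1):
        copy = df.copy()
        copy.index = df.index - pd.DateOffset(years=years_back)
        pieces.append(copy)
    pieces.append(df)
    return pd.concat(pieces)


padded = pad_history_with_copies(df_2020, n_copies=2)
print(padded.index.min(), padded.index.max(), len(padded))
```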

adanebehr commented 2 years ago

I'm trying to use your package to forecast companies' cash flows for use in present value assessments. The pattern of how companies grow their earnings is super unique to the company/industry, so assuming linear cash flow growth or fitting exponentials universally, for all companies, doesn't do a good job. Each projection of a company's earnings needs a tailored model, which makes your tool a perfect solution.

I bet having this project in your toolbox makes you SUPER valuable to employers. Do you know of any similar products that automatically train, calibrate, validate, and test for applications other than time series forecasting (like classification, for example)?

Duplicating the data sounds like an easy and viable solution for me. Thanks!

Also, just out of curiosity, what other real world applications have you found this AutoTS package especially useful for?

winedarksea commented 2 years ago

There are countless AutoML solutions that handle more generic classification and regression. Data Robot is a good paid version. All the cloud companies, AWS, Azure, and Google have flavors of it. H2O.ai has a free version. TPOT, auto-sklearn, pycaret, and more are out there in the open source world. But none of them handle time series data particularly well. And some of them are rather limited in their performance. But I don't think there are any well established automl libraries that do a good job of time series classification (like speech recognition).

I actually did some work on internal cashflow forecasting for a previous company but never got very far with the project. For them, the work was more of a goaling exercise ('how much should we spend') and forecasting can't really replace goaling (although it can help guide the goaling decisions).

AutoTS was built around a particular use case: product sales for a distributor that was selling three very different categories of products with very different patterns of sales. Some were a nice sine wave type line of seasonality, others were intermittent, some were linear. Forecasting product sales and inventory level is probably one of the most common uses in business still. A lot of hobbyists seem to like trying to forecast stock prices with it - although it would be very hard to do that well.

I would say in general resource forecasting is the main use: web traffic and server compute demand, call center call volumes, hourly shoppers at a store or attendees at an amusement park. All with the goal of better managing equipment or human employees to handle appropriate demand without overspending.

adanebehr commented 2 years ago

I appreciate you taking the time to answer all my questions!

And just to follow up on the issue of forecasting beyond the historical data length, in case anyone else needs to do that in the future: Copying the data to lengthen the time series did not work because it introduced artificial periodicity/seasonality into the training record. However, I was able to use the AutoTS class to solve for the best model and find its parameters and transformations, and then use model_forecast (as you suggested) to forecast out around 50% of the historical data length. I only forecasted out 50% of the historical data length because I found that there were limitations on forecasting beyond the "data window size" with model_forecast (I guess maybe this depends on whether the model used is autoregressive or not?). Then, I was able to take the results of the model_forecast and concatenate them with the historical data to lengthen the training data. I then used the lengthened timeseries as training input for model_forecast and was able to forecast further into the future. I repeated this process until the training dataset was long enough to forecast to my desired horizon.
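The iterative procedure described above can be sketched roughly as below. `naive_forecast` is a hypothetical stand-in for whatever produces each forecast (in the comment above, model_forecast with the best model found by AutoTS). The 50% step ratio follows the comment; the function names and everything else are assumptions for illustration.

```python
from typing import Callable, List

# A forecaster takes (history, steps) and returns `steps` predicted values.
Forecaster = Callable[[List[float], int], List[float]]


def naive_forecast(history: List[float], steps: int) -> List[float]:
    """Placeholder forecaster: repeat the last observed value."""
    return [history[-1]] * steps


def extend_iteratively(
    history: List[float],
    target_horizon: int,
    forecaster: Forecaster,
    max_ratio: float = 0.5,
) -> List[float]:
    """Forecast `target_horizon` steps in chunks.

    Each chunk is at most `max_ratio` times the length of the current
    (already extended) history, and is appended to that history before
    the next round.
    """
    extended = list(history)
    forecast: List[float] = []
    while len(forecast) < target_horizon:
        step = max(1, int(len(extended) * max_ratio))
        step = min(step, target_horizon - len(forecast))
        chunk = forecaster(extended, step)
        forecast.extend(chunk)
        extended.extend(chunk)  # forecasts become pseudo-history
    return forecast


result = extend_iteratively(
    [1.0, 2.0, 3.0, 4.0], target_horizon=10, forecaster=naive_forecast
)
print(len(result))
```

Because later rounds are fitted partly on earlier forecasts rather than on observations, errors can compound with each round, so results far past the original history should be treated with caution.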

My day job is as a water resources/environmental engineer. I think this package will come in handy for a lot of applications, for me personally and my colleagues, and I will be sure to give you credit and spread the word about your project wherever I can.

winedarksea commented 2 years ago

Thanks! I opened #121 which is for making it easier to forecast beyond the length of the training data. No idea when I will get to it, but should happen eventually.