Outline

wkdavis commented 2 years ago

Introduction
- Business problem: Forecast bike demand up to t+28 hours. A 28-hour horizon lets us generate a forecast at 8pm that covers the rest of the current day and the entire next day. 8pm is the last period of sustained demand in a day before demand declines until midnight. The managers of the bikeshare could therefore receive the forecast and begin repositioning bikes after 8pm to prepare for the next day's demand without affecting the current day's peak demand. This is based on the "Average Bike Demand by Hour of Day" plot.
- Review of current literature:
  - https://www.tandfonline.com/doi/pdf/10.1080/22797254.2020.1725789
  - https://www.sciencedirect.com/science/article/abs/pii/S0140366419318997
  - https://link.springer.com/content/pdf/10.1007/978-3-030-94751-4_25.pdf
  - http://homepages.warwick.ac.uk/staff/D.Barkley/Teaching/MA124/Machine_Learning.html
  - None of the papers seem to discuss assessing model assumptions, especially for linear regression. Is this because they are only interested in prediction and not inference? They don't even check for heteroskedasticity of the residuals.
- Basic dataset description: names of variables and variable definitions, number of observations, time period covered.
Exploratory Data Analysis
- Basic line plot(s) by variable
- QQ plot, histogram, GOF distribution with MLEs for Bike Demand
- Bike usage by...
  - hour of day, hour of day by season
  - box plot, line plot
- Corrplots, pairplots, feature plots
- Missing data (non-functional days)
- Time-series analysis: Autocorrelation, stationarity, seasonality, etc
Feature Engineering
- Missing data interpolation
- PCA, correlation filtering, etc
Modeling
- Time-series methods
  - time-series regression
  - regression with ARIMA errors
  - Facebook Prophet
  - NNETAR
  - fasster
- Machine Learning
  - Boosted Tree
  - LSTM
  - RNN
Evaluation
- time-series cross-validation.
  - Per Taylor & Letham, want to have a large step to avoid correlation in forecast errors.
  - We should use the t+28 horizon mentioned above, calculated every day at 8pm for at least 50 days. Show accuracy by horizon.
  - Mention limitation: the accuracy calculation assumes that the next day's weather values are known, which is unlikely in practice. Errors will therefore likely be higher, because the weather forecasts that are inputs to the model carry their own error.
- Compare performance to baseline models - (s)naive, RW, etc.
- Compare and contrast with accuracy achieved in paper
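
A minimal sketch of the seasonal naive baseline and of "accuracy by horizon" from the Evaluation bullets above. It is written in plain Python for illustration, not in the R tooling the project uses. The function names (`seasonal_naive`, `mae_by_horizon`), the choice of MAE as the metric, and the toy series are assumptions, not part of the outline.

```python
from typing import List, Sequence


def seasonal_naive(history: Sequence[float], horizon: int, period: int = 24) -> List[float]:
    """Forecast each future hour with the last observed value from the same hour of day."""
    if len(history) < period:
        raise ValueError("need at least one full seasonal period of history")
    last_season = history[-period:]
    # Step h ahead (0-based) reuses the value observed `period` hours earlier,
    # wrapping around when the horizon is longer than one period.
    return [last_season[h % period] for h in range(horizon)]


def mae_by_horizon(forecasts: Sequence[Sequence[float]],
                   actuals: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute error at each horizon step, averaged over CV folds."""
    n_folds = len(forecasts)
    horizon = len(forecasts[0])
    return [
        sum(abs(f[h] - a[h]) for f, a in zip(forecasts, actuals)) / n_folds
        for h in range(horizon)
    ]


# Toy data (assumption): six days of a perfectly periodic hourly series.
series = [float(hour % 24) for hour in range(6 * 24)]
origin = 5 * 24 - 4                      # 8pm on day 5
forecast = seasonal_naive(series[:origin], horizon=28)
actual = series[origin:origin + 28]
errors = mae_by_horizon([forecast], [actual])
```

On a perfectly periodic series the seasonal naive forecast is exact, so any model worth keeping should beat this baseline on the real data at most horizon steps.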
Conclusion
- Recommendations to city of Seoul
- Recommendations for additional research

maxwkut commented 2 years ago

Literature: Not sure why they didn’t check assumptions, but when I was reading through some of the papers I got the sense that they weren’t very high quality.

Feature Engineering: Deal with outliers such as Humidity = 0 values that aren’t realistic
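
One possible way to handle this, sketched in plain Python: recode Humidity = 0 as missing and fill interior gaps by linear interpolation. This assumes the zeros are judged to be invalid readings, which is still an open question. The helper names and the sample values are hypothetical, and linear interpolation is only one of several imputation options.

```python
from typing import List, Optional


def zeros_to_missing(values: List[float]) -> List[Optional[float]]:
    """Recode exact zeros as missing (None). Assumes zero is not a valid reading."""
    return [None if v == 0 else v for v in values]


def interpolate_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps linearly; leading and trailing gaps are left as None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            # Move proportionally from the left known value toward the right one.
            filled[i] = values[left] + (values[right] - values[left]) * (i - left) / gap
    return filled


# Hypothetical hourly humidity readings (%), with two suspicious zeros.
humidity = [40.0, 0.0, 0.0, 46.0, 50.0]
cleaned = interpolate_linear(zeros_to_missing(humidity))
```

Recoding to missing first keeps the decision about which values are invalid separate from the choice of imputation method, so the interpolation step can later be swapped for another method.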

Modeling: I’d like to get some more experience with deep learning methods and given the nature of the dataset I agree that some kind of RNN would be a good choice.

Evaluation: Just to make sure I understand the time series cv that you propose, does this look correct?

- Pick an initial training set length, say day 1 to day 30 8pm (30 x 24 – 4 obs)
- Pick a step size, say 5 days (5x24 obs)
- Make predictions for the 28 hours following day 35 8pm --- record error metric
- Add observations until day 31 8pm to train set (24 obs)
- Make predictions for the 28 hours following day 36 8pm --- record error metric
- Add observations until day 32 8pm to train set (24 obs)
- etc

Thanks!

wkdavis commented 2 years ago

Literature: agreed. I think we can spend some time in the introduction poking holes in the existing literature. Then in the modeling piece we can talk about how/why our chosen methods are better.

Modeling: Awesome! Feel free to take the ML piece and run with it.

Evaluation: that's correct. I think in our case we can use a step size of 24 observations (hours), so basically every day at 8pm we forecast the next 28 hours. Once we have some modeling ideas/code I'll generate the training and test sets for us and we'll run it just like this example: https://otexts.com/fpp3/tscv.html
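
A rough sketch of the fold layout described here, in plain Python rather than the R helper script mentioned in the thread. It assumes gap-free hourly data starting at 00:00 on day 1, an expanding training window, and a 90-day toy series. The function name and its defaults are illustrative only.

```python
from typing import Iterator, Tuple

HOURS_PER_DAY = 24
FORECAST_HOUR = 20   # forecasts are issued at 8pm
HORIZON = 28         # 4 hours left in the current day + 24 hours of the next day


def rolling_origin_folds(
    n_obs: int,
    initial_days: int = 30,
    step: int = HOURS_PER_DAY,
    horizon: int = HORIZON,
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) index ranges for expanding-window CV.

    Assumes index 0 is the 00:00 hour of day 1 and the series has no gaps.
    """
    # Training ends at 8pm on `initial_days`, e.g. 30 * 24 - 4 = 716 observations.
    train_end = initial_days * HOURS_PER_DAY - (HOURS_PER_DAY - FORECAST_HOUR)
    while train_end + horizon <= n_obs:
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += step  # move the forecast origin to 8pm on the next day


# Toy example (assumption): 90 days of hourly observations.
folds = list(rolling_origin_folds(n_obs=90 * HOURS_PER_DAY))
first_train, first_test = folds[0]
print(len(folds), len(first_train), len(first_test))
```

Because the step (24 hours) is shorter than the horizon (28 hours), consecutive test windows overlap by 4 hours. That overlap is worth keeping in mind when averaging errors across folds.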

PeiYinY commented 2 years ago

Literature: We should assess the model assumptions for sure.

Feature Engineering: Missing data imputation: https://cran.r-project.org/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf Humidity = 0 is not realistic for sure. However, when I searched for similar projects, I found that some people used data from Washington D.C., where the minimum humidity value is also 0. We can talk more during the meeting.

Modeling: I'd like to add the SARIMA model to the time-series methods. It adds seasonal components to the ARIMA (Autoregressive Integrated Moving Average) model. I can do ML (LSTM and other algorithms), too.

wkdavis commented 2 years ago

I like the imputeTS approach!

The R implementation of ARIMA in the fable package automatically checks for seasonality/seasonal ARIMA.

I will work on some code to create the CV folds for model training.

wkdavis commented 2 years ago

@PeiYinY Boruta feature selection: https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a

maxwkut commented 2 years ago

Modeling:

Added XGBoost
I use Caret's train function for most ML algorithms and it looks like it has an option to do time series cv ... I think I was able to recreate the same folds you created in your helper script if you want to take a look at that. For reference: https://stackoverflow.com/questions/24758218/time-series-data-splitting-and-model-evaluation

Imputation:

Did some preliminary imputation for humidity values using KNN impute
After going through my 656 notes again, I don't think it makes sense to impute the values of the supervisor, i.e. the response variable (BikeCount)

Other:

Not sure how to work with the bikets dataset because I wasn't able to select any features without also selecting the time, so I used the regular bike dataset where I extracted some additional date features from bikets.

wkdavis / STAT685

Outline #5