wkdavis / STAT685

TAMU STAT 685 Summer 2022 Project

Outline #5

Open wkdavis opened 2 years ago

wkdavis commented 2 years ago
  1. Introduction
  2. Exploratory Data Analysis
    • Basic line plot(s) by variable
    • QQ plot, histogram, GOF distribution with MLEs for Bike Demand
    • Bike usage by...
      • hour of day, hour of day by season
      • box plot, line plot
    • Corrplots, pairplots, feature plots
    • Missing data (non-functional days)
    • Time-series analysis: Autocorrelation, stationarity, seasonality, etc
  3. Feature Engineering
    • Missing data interpolation
    • PCA, correlation filtering, etc
  4. Modeling
  5. Evaluation
    • time-series cross-validation.
      • Per Taylor & Letham, we want a large step between forecast origins to avoid correlation in forecast errors.
      • We should use the t+28 horizon mentioned above, calculated every day at 8pm for at least 50 days. Show accuracy by horizon.
      • Mention limitation: the accuracy calculation assumes that next-day weather values are known, which is unlikely in practice. Real-world error will therefore likely be higher, because the weather forecasts that feed the model carry their own error.
    • Compare performance to baseline models - (s)naive, RW, etc.
    • Compare and contrast with accuracy achieved in paper
  6. Conclusion
    • Recommendations to city of Seoul
    • Recommendations for additional research
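
A minimal sketch of the evaluation plan in section 5: forecasts issued once a day at 8pm, accuracy reported by horizon, and a comparison between naive and seasonal naive baselines. The series is synthetic and perfectly periodic, so the numbers are illustrative only and say nothing about the real bike data. The function names, the 90-day length, and the 24-hour season are assumptions made for this example.

```python
import math
from statistics import mean

HORIZON = 28  # t+28 horizon from the outline
SEASON = 24   # assumed daily seasonality for hourly data


def naive_forecast(history, horizon):
    """Repeat the last observed value for every lead time."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season=SEASON):
    """Repeat the value from the same hour in the last observed season."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]


def mae_by_horizon(series, origins, horizon, forecaster):
    """Mean absolute error at each lead time, averaged over forecast origins."""
    errors = [[] for _ in range(horizon)]
    for origin in origins:
        forecast = forecaster(series[:origin], horizon)
        actual = series[origin:origin + horizon]
        for h, (f, a) in enumerate(zip(forecast, actual)):
            errors[h].append(abs(f - a))
    return [mean(e) for e in errors]


# Synthetic hourly "demand" for 90 days (hypothetical data, not the Seoul set).
series = [100 + 50 * math.sin(2 * math.pi * (t % 24) / 24) for t in range(24 * 90)]

# First origin: day 30 at 8pm (index 30*24 - 4), then one origin per day.
first_origin = 30 * 24 - 4
origins = range(first_origin, len(series) - HORIZON + 1, 24)

naive_mae = mae_by_horizon(series, origins, HORIZON, naive_forecast)
snaive_mae = mae_by_horizon(series, origins, HORIZON, seasonal_naive_forecast)

for h in (1, 12, 28):
    print(f"h={h:2d}  naive MAE={naive_mae[h - 1]:6.2f}  snaive MAE={snaive_mae[h - 1]:6.2f}")
```

Any candidate model can be passed in place of the baseline forecasters as long as it takes a history and a horizon and returns one value per lead time. This sketch does not address the weather-input limitation noted in the outline.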
maxwkut commented 2 years ago

Literature: Not sure why they didn’t check assumptions, but when I was reading through some of the papers I got the sense that they weren’t very high quality.

Feature Engineering: Deal with outliers such as Humidity = 0 values that aren’t realistic

Modeling: I’d like to get some more experience with deep learning methods and given the nature of the dataset I agree that some kind of RNN would be a good choice.

Evaluation: Just to make sure I understand the time series cv that you propose, does this look correct?

  1. Pick an initial training set length, say day 1 to day 30 8pm (30 x 24 – 4 obs)
  2. Pick a step size, say 5 days (5x24 obs)
  3. Make predictions for the 28 hours following day 35 8pm --- record error metric
  4. Add observations until day 31 8pm to train set (24 obs)
  5. Make predictions for the 28 hours following day 36 8pm --- record error metric
  6. Add observations until day 32 8pm to train set (24 obs)
  7. etc

Thanks!

wkdavis commented 2 years ago

Literature: agreed. I think we can spend some time in the introduction poking holes in the existing literature. Then in the modeling piece we can talk about how/why our chosen methods are better.

Modeling: Awesome! Feel free to take the ML piece and run with it.

Evaluation: That's correct. I think in our case we can use a step size of 24 observations (hours), so basically every day at 8pm we forecast the next 27 hours. Once we have some modeling ideas/code I'll generate the training and test sets for us and we'll run it just like this example: https://otexts.com/fpp3/tscv.html
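
A rough sketch of how the folds described here could be enumerated, assuming hourly observations indexed from 0 at midnight of day 1 and an expanding training window. `Fold`, `rolling_origin_folds`, and the 90-day series length are hypothetical names and values for illustration; the horizon is left as a parameter because this comment mentions 27 hours while the outline mentions t+28.

```python
from typing import Iterator, NamedTuple


class Fold(NamedTuple):
    train_end: int   # training uses observations [0, train_end)
    test_start: int  # first forecast hour (inclusive)
    test_end: int    # last forecast hour (exclusive)


def rolling_origin_folds(n_obs: int, initial_train: int,
                         step: int = 24, horizon: int = 27) -> Iterator[Fold]:
    """Yield expanding-window folds, moving the forecast origin by `step` hours."""
    origin = initial_train
    while origin + horizon <= n_obs:
        yield Fold(train_end=origin, test_start=origin, test_end=origin + horizon)
        origin += step


# Initial training set: day 1 through day 30 at 8pm, i.e. 30*24 - 4 observations.
folds = list(rolling_origin_folds(n_obs=24 * 90, initial_train=30 * 24 - 4))

print(len(folds), "folds")
print("first:", folds[0])
print("last: ", folds[-1])
```

Each fold adds 24 observations to the training set, so every origin falls at 8pm and the test windows of consecutive folds overlap by `horizon - step` hours.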

PeiYinY commented 2 years ago

Literature: We should assess the model assumptions for sure.

Feature Engineering: Missing data imputation: https://cran.r-project.org/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf Humidity = 0 is not realistic for sure. However, when I searched for similar projects, I found that some people use data from Washington D.C., where the minimum humidity value is also 0. We can talk more during the meeting.
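
As a starting point for the meeting, here is a minimal sketch of one option: treat Humidity = 0 as missing and fill the gaps by linear interpolation. It is written in plain Python to show the idea only; it is not the imputeTS implementation, and both function names are hypothetical. Whether zeros should count as missing at all is still an open question in this thread.

```python
from typing import List, Optional


def mark_zero_as_missing(values: List[float]) -> List[Optional[float]]:
    """Replace exact zeros with None (assumption: a zero reading is a sensor gap)."""
    return [None if v == 0 else v for v in values]


def interpolate_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior None gaps linearly between the nearest observed neighbours.

    Leading and trailing gaps are left as None because they have no
    observed value on one side.
    """
    filled = list(values)
    last_seen = None  # index of the most recent observed value
    for i, v in enumerate(values):
        if v is None:
            continue
        if last_seen is not None and i - last_seen > 1:
            left, gap = values[last_seen], i - last_seen
            for j in range(last_seen + 1, i):
                filled[j] = left + (v - left) * (j - last_seen) / gap
        last_seen = i
    return filled


# Hypothetical hourly humidity readings with two suspicious zeros.
humidity = [40.0, 0.0, 0.0, 55.0, 60.0]
cleaned = interpolate_missing(mark_zero_as_missing(humidity))
print(cleaned)
```

The same gap-filling function could be applied to bike counts on non-functional days, although long gaps may call for a seasonal method rather than a straight line.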

Modeling: I'd like to add the SARIMA model to the time series methods. It adds seasonal components to the ARIMA (Autoregressive Integrated Moving Average) model. I can do ML (LSTM and other algorithms), too.
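
For reference, a seasonal ARIMA model is commonly written ARIMA(p,d,q)(P,D,Q)_m using the backshift operator B. The form below follows the usual textbook convention; taking m = 24 for hourly data with a daily cycle is an assumption for this dataset, not something settled in this thread.

```latex
\phi_p(B)\,\Phi_P(B^{m})\,(1 - B)^{d}\,(1 - B^{m})^{D}\, y_t
  = c + \theta_q(B)\,\Theta_Q(B^{m})\,\varepsilon_t,
\qquad
\begin{aligned}
\phi_p(B)     &= 1 - \phi_1 B - \dots - \phi_p B^{p}, &
\Phi_P(B^{m}) &= 1 - \Phi_1 B^{m} - \dots - \Phi_P B^{Pm},\\
\theta_q(B)   &= 1 + \theta_1 B + \dots + \theta_q B^{q}, &
\Theta_Q(B^{m}) &= 1 + \Theta_1 B^{m} + \dots + \Theta_Q B^{Qm}.
\end{aligned}
```

Setting P = D = Q = 0 recovers the non-seasonal ARIMA(p,d,q) model.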

wkdavis commented 2 years ago

I like the imputeTS approach!

The R implementation of ARIMA in the fable package automatically checks for seasonality/seasonal ARIMA.

I will work on some code to create the CV folds for model training.

wkdavis commented 2 years ago

@PeiYinY Boruta feature selection: https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a

maxwkut commented 2 years ago

Modeling:

Imputation:

Other: