The overall goal of this project is to build a hotel-room rate prediction system that helps customers evaluate prices and determine the best time to book a room for a trip. Several questions we would like to answer include:
We use the Personalize Expedia Hotel Searches – ICDM 2013 dataset from a Kaggle competition (> 4 GB), which includes a wide variety of data on users, properties, time, competitors, etc. It contains nearly 10 million historical hotel search results, representing approximately 400 thousand unique search queries on the popular travel booking website Expedia.com.
First, we compute the skewness of each numeric variable. We define variables with skewness > 0.75 as "highly skewed", and we log-transform those variables to make their distributions closer to normal.
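The skew-then-transform step can be sketched as follows. This is a minimal illustration on a toy frame, not the project's actual code; the column names and values are made up, and `log1p` (rather than plain `log`) is an assumed choice to handle zeros safely.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric Expedia columns (illustrative values).
df = pd.DataFrame({
    "price_usd": [50, 60, 70, 80, 5000],          # heavily right-skewed
    "prop_review_score": [3.0, 3.5, 4.0, 4.5, 5.0],  # roughly symmetric
})

SKEW_THRESHOLD = 0.75
numeric_cols = df.select_dtypes(include=np.number).columns
skewed = [c for c in numeric_cols if df[c].skew() > SKEW_THRESHOLD]

# log1p assumes the skewed columns are non-negative, which holds for
# counts and prices.
for col in skewed:
    df[col] = np.log1p(df[col])
```

After the transform, the extreme `price_usd` values are pulled much closer to the bulk of the distribution.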
Hotel room rates range from as low as $0.20/night to more than $5 million/night; we remove the outliers that deviate significantly from the rest of the room-rate distribution.
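One simple way to drop such extreme rates is percentile clipping; the 1st/99th percentile bounds below are an illustrative assumption, not the project's actual thresholds.

```python
import pandas as pd

# Toy room-rate series with extreme outliers at both ends (illustrative).
prices = pd.Series([0.2, 95, 100, 105, 110, 120, 5_000_000])

# Keep only values inside the 1st-99th percentile band (assumed bounds).
lo, hi = prices.quantile([0.01, 0.99])
cleaned = prices[(prices >= lo) & (prices <= hi)]
```

The $0.20 and $5M records fall outside the band and are dropped, while ordinary nightly rates survive.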
For categorical variables with more than 100 distinct values, e.g., country_id, destination_id, property_id, etc., it would not make sense to one-hot encode them all. Instead, we compute the popularity of each category value, i.e., how many times it appears in the dataset, and use that count to represent the value itself. For example, for property_id = 116942, we count how many records with property_id = 116942 exist in the dataset, and use that continuous number to represent property_id = 116942. The same transformation is applied to country_id, destination_id, and the other high-cardinality categorical variables.
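This count (popularity) encoding is a two-liner in pandas. A minimal sketch on made-up IDs, assuming the data lives in a DataFrame:

```python
import pandas as pd

# Toy search log; prop_id stands in for the high-cardinality property_id.
df = pd.DataFrame({"prop_id": [116942, 116942, 116942, 7, 7, 42]})

# Count encoding: replace each category value by how often it appears.
counts = df["prop_id"].value_counts()
df["prop_id_popularity"] = df["prop_id"].map(counts)
```

Here property 116942 appears three times, so every one of its rows gets popularity 3; the encoding turns an un-one-hot-encodable ID into a single continuous feature.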
Our ultimate goal is to predict the room rate of one property listing on one single day. However, the Expedia dataset lists data per user search, potentially at multiple timestamps within a day, so we need to aggregate the data by day.
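The per-day aggregation can be sketched with a groupby; taking the mean price per property per day is an assumed choice of aggregate, and the column names are illustrative.

```python
import pandas as pd

# Toy per-search records: multiple searches for one property within a day.
df = pd.DataFrame({
    "prop_id": [1, 1, 1, 1],
    "date_time": pd.to_datetime([
        "2013-06-01 08:00", "2013-06-01 21:30",
        "2013-06-02 09:15", "2013-06-02 12:45",
    ]),
    "price_usd": [100.0, 110.0, 120.0, 130.0],
})

# Collapse to one row per property per day.
daily = (
    df.groupby(["prop_id", df["date_time"].dt.date])["price_usd"]
      .mean()
      .reset_index(name="avg_price_usd")
)
```

The four search records collapse into two daily rows, one per calendar day.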
Sort the data by time, and split it into training, validation, and test sets.
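Because this is time-series data, the split must be chronological rather than shuffled. A minimal sketch; the 70/15/15 proportions are an illustrative assumption:

```python
import pandas as pd

# Toy daily series, one row per day.
df = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=10, freq="D"),
    "avg_price_usd": range(100, 110),
})

# Chronological split: earlier rows train, later rows validate/test,
# so the model never sees the future during training.
df = df.sort_values("date").reset_index(drop=True)
n = len(df)
train = df.iloc[: int(n * 0.7)]
val = df.iloc[int(n * 0.7) : int(n * 0.85)]
test = df.iloc[int(n * 0.85) :]
```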
To understand the importance of each feature, we use XGBoost to compute feature importances:
From this we can tell that prop_country_id, prop_log_historical_price, and prop_review_score are the three most important features. The diagram gives us an understanding of which features matter when building the model in the next stage.
We applied a multi-layer modeling approach to manage the complexity of the problem, dividing it into several subproblems that are easier to tackle. First, we divide the features by their nature into several feature groups: User, Property, Time, and Competitors. We then build a model for each feature group (referred to as "first-layer modeling"). After model selection for each feature group, including hyperparameter tuning and cross-validation, we obtain the best predictions based on each feature group. We then concatenate the predictions from each feature-group model and use them as input to fit a second-layer model.
How do we implement such a modeling pipeline in Python? After we get the prediction from each model, we first need to pay attention to its format. If it is an ndarray, we reshape it into shape (-1, 1) and stack the formatted predictions side by side as columns. The illustration below further explains the entire process.
Python code snippet _(for the complete code, see: ts_modelingv2.py)_
import numpy as np

# Reshape each model's predictions into column vectors.
regression_y_pred_val = self.regression_y_pred_val.reshape(-1, 1)
regression_y_pred_test = self.regression_y_pred_test.reshape(-1, 1)
ARIMA_val_predictions = np.array(self.ARIMA_val_predictions).reshape(-1, 1)
ARIMA_test_predictions = np.array(self.ARIMA_test_predictions).reshape(-1, 1)

# Stack the columns side by side to build the second-layer inputs:
# the validation-set predictions train the second-layer model,
# and the test-set predictions evaluate it.
X_train = np.concatenate((regression_y_pred_val, ARIMA_val_predictions), axis=1)
X_test = np.concatenate((regression_y_pred_test, ARIMA_test_predictions), axis=1)
If you want to know more about this project, please check our poster: