Open wrzdylan opened 2 years ago
For outliers, look at the quantiles of the data and run value_counts() on the column.
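A minimal sketch of that inspection, assuming a pandas DataFrame with a numeric `baths` column. The name `toy_df` and its values are made up for illustration and do not come from the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the housing data (illustrative values only)
toy_df = pd.DataFrame({"baths": [1, 2, 2, 3, 2, 1, 0, 12, 2, 3]})

# Quantiles show where the bulk of the data sits and how far the tails reach
quantiles = toy_df["baths"].quantile([0.01, 0.25, 0.5, 0.75, 0.99])
print(quantiles)

# value_counts() shows how often each value occurs; rare extreme values stand out
counts = toy_df["baths"].value_counts().sort_index()
print(counts)
```

Values that occur only once or twice and sit far beyond the upper quantiles (here 12) are the candidates to examine before choosing bounds.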
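def min_max(dataframe, col, lower, upper):
    """Drop rows whose value in `col` falls outside [lower, upper]."""
    # keep rows at or above the lower boundary
    lower_bound = dataframe[dataframe[col] >= lower]
    # keep rows at or below the upper boundary
    upper_bound = lower_bound[lower_bound[col] <= upper]
    # return the filtered rows
    return upper_bound

# Reassign the whole frame: assigning only the filtered column back would
# realign on the index and leave NaN in the out-of-range rows instead of dropping them.
housing_df = min_max(housing_df, 'baths', lower=1, upper=6)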
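The same range filter could also be sketched with pandas' `Series.between`, which is inclusive on both ends by default. This is an alternative for illustration, not the method used above; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the real housing_df is not shown in these notes
toy_df = pd.DataFrame({
    "baths": [1, 2, 0, 12, 3, 6],
    "price": [10, 20, 5, 90, 30, 60],
})

# between() keeps values with lower <= x <= upper, so 1 and 6 are retained
mask = toy_df["baths"].between(1, 6)
filtered = toy_df[mask]
print(filtered)
```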
For regression problems, the go-to evaluation metric is either the Root Mean Squared Error (RMSE) or the Root Mean Squared Logarithmic Error (RMSLE).
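RMSE: the square root of the mean squared difference between the prediction from our model and the actual value.
Root Mean Squared Error formula, where y': predicted value, y: actual value, n: number of samples:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y'_i - y_i\right)^2}$$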
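RMSLE: the square root of the mean squared difference between the log of the prediction from our model and the log of the actual value. In practice, 1 is usually added to both values before taking the log so that zeros are handled.
Root Mean Squared Logarithmic Error formula, where y': predicted value, y: actual value, n: number of samples:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(y'_i + 1) - \log(y_i + 1)\right)^2}$$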
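The two metrics described above can be sketched in plain Python as follows. The function names and the sample prices are hypothetical, and the RMSLE here uses the common log(1 + x) convention, which is an assumption rather than something stated in these notes.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    n = len(y_true)
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )

# Hypothetical prices; the last pair has a large absolute error
actual = [100.0, 200.0, 300.0, 10000.0]
predicted = [110.0, 190.0, 310.0, 8000.0]

print("RMSE :", rmse(actual, predicted))
print("RMSLE:", rmsle(actual, predicted))
```

The RMSE is dominated by the single large absolute error, while the RMSLE works on relative (log-scale) differences and so is much less affected by it.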
The RMSLE might be the better evaluation metric because (1) it is robust to outliers, which we saw are present in our dataset, and (2) it penalizes underestimating the actual value more heavily than overestimating it. From the seller's perspective, we do not want to underestimate the price, since that would result in losses. For this project, however, we shall not take anyone's side and shall use the RMSE as the evaluation metric, as we will be using a Random Forest model, which is relatively robust to outliers.
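The asymmetry in point (2) can be checked with a small sketch: for the same absolute error, an underestimate yields a larger RMSLE than an overestimate. The `rmsle` helper and the prices are hypothetical, and the log(1 + x) convention is assumed.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    n = len(y_true)
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )

actual = [1000.0]

# Both predictions are off by 400 in absolute terms
under = rmsle(actual, [600.0])    # underestimate
over = rmsle(actual, [1400.0])    # overestimate

print("underestimate:", under)
print("overestimate: ", over)
```

The RMSE would score both predictions identically, since it only looks at the absolute difference.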
https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/
Why computing the skew matters (linear models + outliers)
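A minimal sketch of measuring skewness with pandas, assuming a right-skewed price column. The values are hypothetical, and the log1p transform is shown only as one common way to reduce skew before fitting a linear model, not as the method chosen in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices with one extreme value
prices = pd.Series([100, 120, 130, 150, 160, 180, 200, 250, 400, 2000], dtype=float)

# Positive skew means a long right tail, typically caused by a few large outliers
skew_raw = prices.skew()

# A log transform compresses the right tail and lowers the skew
skew_log = np.log1p(prices).skew()

print("skew (raw):", skew_raw)
print("skew (log):", skew_log)
```

A strongly skewed target or feature lets a few extreme points pull a linear fit, which is why checking the skew goes hand in hand with the outlier inspection.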
In XGBoost, you can check the importance of each feature; see https://medium.com/@mahithas/example-for-xgboost-2f3abceeeaf7