wrzdylan commented 2 years ago

In XGBoost, you can check the importance of each feature; see https://medium.com/@mahithas/example-for-xgboost-2f3abceeeaf7

wrzdylan commented 2 years ago

For outliers, look at the quantiles of the data and run value_counts().
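A minimal sketch of that check, assuming a pandas DataFrame with a `baths` column. The sample values are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Hypothetical sample; the real values would come from the housing dataset.
housing_df = pd.DataFrame({"baths": [1, 2, 2, 3, 2, 1, 2, 3, 2, 40]})

# Quantiles show where the bulk of the values lies.
quantiles = housing_df["baths"].quantile([0.01, 0.25, 0.5, 0.75, 0.99])
print(quantiles)

# value_counts() lists how often each value occurs; rare extreme values stand out.
counts = housing_df["baths"].value_counts().sort_index()
print(counts)
```

A value that appears only once and sits far beyond the upper quantiles (here the made-up 40) is a candidate outlier to inspect before filtering.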

wrzdylan commented 2 years ago

def min_max(dataframe, col, lower, upper):
    """Drop outlier rows that may contain misinformation."""
    # keep rows at or above the lower boundary
    lower_bound = dataframe[dataframe[col] >= lower]
    # keep rows at or below the upper boundary
    upper_bound = lower_bound[lower_bound[col] <= upper]
    # return the filtered DataFrame so the outlier rows are actually dropped
    return upper_bound

# set the lower boundary to one bathroom and the upper boundary to 6
# (assigning only the filtered column back would leave NaN in the outlier rows)
housing_df = min_max(housing_df, 'baths', lower=1, upper=6)

wrzdylan commented 2 years ago

For regression problems, the go-to evaluation metric is either Root Mean Squared Error (RMSE) or Root Mean Squared Logarithmic Error (RMSLE).

RMSE: It is the square root of the mean squared difference between the prediction from our model and the actual value.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y'_i - y_i\right)^2}$$

where y': predicted value, y: actual value, n: number of observations

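RMSLE: It is the square root of the mean squared difference between the log of the prediction from our model and the log of the actual value. In practice the log is usually taken as log(1 + value) so that zero values are defined.

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(y'_i + 1) - \log(y_i + 1)\right)^2}$$

where y': predicted value, y: actual value, n: number of observations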

The RMSLE might be a better evaluation metric because (1) it is more robust to outliers, which we saw are present in our dataset, and (2) it incurs a larger penalty for underestimating the actual value than for overestimating it. From the seller's perspective, we do not want to underestimate the price, as that would result in losses. However, for this project we will not take anyone's side and will use RMSE as the evaluation metric, since we will be using a Random Forest model, which is relatively robust to outliers.
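To make the asymmetry concrete, here is a small standard-library sketch of both metrics. The function names `rmse` and `rmsle` and the prices are hypothetical, and the log(1 + x) form is assumed for RMSLE.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    n = len(y_true)
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )

# Hypothetical prices: the same absolute error, once below and once above the actual value.
actual = [200_000.0]
under = [150_000.0]
over = [250_000.0]

print(rmse(actual, under), rmse(actual, over))    # same absolute error, same RMSE
print(rmsle(actual, under), rmsle(actual, over))  # the underestimate scores worse
```

RMSE treats the two predictions identically, while RMSLE compares ratios rather than differences, so the underestimate is penalized more.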

wrzdylan commented 2 years ago

https://medium.com/@eswansonwebdev/predicting-housing-prices-from-the-kaggle-ames-iowa-data-set-for-general-assembly-dsi-fall-2021-529dd920fe44 For the write-up

wrzdylan commented 2 years ago

https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/

Why computing skewness matters (linear models + outliers)
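A rough illustration of the idea, using only the standard library: a right-skewed variable has a long tail of large values that behave like outliers for a linear model, and a log transform is one common way to reduce that skew. The `skewness` helper and the prices are hypothetical, chosen only to show the effect.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed prices: each one is 1.5 times the previous one.
prices = [100_000 * 1.5 ** k for k in range(8)]

raw_skew = skewness(prices)
log_skew = skewness([math.log1p(p) for p in prices])
print(raw_skew, log_skew)
```

The raw prices have clearly positive skewness, while the log-transformed values are close to symmetric, which is why skewed targets or features are often log-transformed before fitting a linear model.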

wrzdylan / HousingPrice

XGBClassifier #1
