Problem: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard
Referencing https://www.kaggle.com/apapiu/regularized-linear-models. I'm still in the imitation stage, so I follow other people's approaches as much as possible; only later will there be room to talk about innovation. Standing on the shoulders of giants works well.
Improved the blend hyperparameters to preds = 0.85 * lasso_preds + 0.15 * xgb_preds. The score improved (0.12086 -> 0.12049), but the leaderboard rank is still poor.
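For reference, the blend is just a weighted average of the two models' predictions after mapping back out of log space. A minimal sketch with stand-in data; the model settings are my recollection of the referenced kernel, so treat them as illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from xgboost import XGBRegressor

# Stand-in data so the sketch runs; in the real pipeline X_train/X_test come
# from the preprocessed competition data and y is log1p(SalePrice).
rng = np.random.default_rng(0)
X_train, y = rng.normal(size=(100, 10)), rng.normal(size=100)
X_test = rng.normal(size=(20, 10))

lasso = LassoCV(alphas=[1, 0.1, 0.001, 0.0005]).fit(X_train, y)
model_xgb = XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1)
model_xgb.fit(X_train, y)

# Undo the log1p transform, then take the weighted average.
lasso_preds = np.expm1(lasso.predict(X_test))
xgb_preds = np.expm1(model_xgb.predict(X_test))
preds = 0.85 * lasso_preds + 0.15 * xgb_preds
```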
Next reference: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
Improvement plan:
# Removing outliers (read the documentation that comes with the dataset carefully)
There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students.
However, the documentation also says:
A second issue closely related to the intended use of the model, is the handling of outliers and unusual observations. In general, I instruct my students to never throw away data points simply because they do not match a priori expectations (or other data points). I strongly make this point in the situation where data are being analyzed for research purposes that will be shared with a larger audience. Alternatively, if the purpose is to once again create a common use model to estimate a “typical” sale, it is in the modeler’s best interest to remove any observations that do not seem typical (such as foreclosures or family sales).
It's a bit of a dilemma. The approach in https://www.kaggle.com/humananalog/xgboost-lasso/code is to filter out every row with GrLivArea > 4000 and adjust the blend to y_pred = 0.4 * y_pred_xgb + 0.6 * y_pred_lasso. The score improved substantially; below I study how it does feature engineering in detail.
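A minimal sketch of that filter plus the re-weighted blend (assuming the competition's train.csv is in the working directory; the prediction arrays are placeholders, since the only lines that change are the filter and the weights):

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

# Drop the five unusual observations the documentation flags: any house
# whose above-grade living area exceeds 4000 sq ft.
train = train[train["GrLivArea"] <= 4000].reset_index(drop=True)

# ... refit the lasso and xgboost models on the filtered data as before ...
y_pred_lasso = np.zeros(1459)  # placeholder for the real lasso test predictions
y_pred_xgb = np.zeros(1459)    # placeholder for the real xgboost test predictions

# Re-weighted blend from the kernel: xgboost now carries more weight.
y_pred = 0.4 * y_pred_xgb + 0.6 * y_pred_lasso
```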
# Feature Engineering
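The step I recall most clearly from the referenced kernels is combining the area columns into one total-square-footage feature; a minimal sketch (column names are from the competition data, but the step itself is my recollection, so treat it as an approximation):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Total living space tends to correlate with SalePrice more strongly than
# any single area column, so combine basement + first + second floor.
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]
```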
Then I re-ran the program on another machine and the score got worse (0.11481).
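My guess (not something the kernels state) is that the cross-machine difference comes from unseeded randomness in xgboost and in any CV shuffling; pinning the seeds makes runs comparable, although different library versions can still shift scores slightly. A sketch with illustrative parameters:

```python
import numpy as np
from xgboost import XGBRegressor

SEED = 42
np.random.seed(SEED)  # covers numpy-based shuffles and CV splits

# xgboost keeps its own RNG; row/column subsampling makes training stochastic,
# so fix random_state as well. Parameter values here are illustrative.
model_xgb = XGBRegressor(
    n_estimators=360,
    max_depth=2,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=SEED,
)
```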
# Imputation
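Planned next. For the record, my recollection of how the referenced stacked-regressions kernel imputes this dataset, distinguishing "NA means the feature is absent" from genuinely missing values (treat as an approximation):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# In the data dictionary, NA in these columns means "no such feature",
# not a missing measurement, so fill with an explicit category / zero.
for col in ["PoolQC", "Fence", "MiscFeature", "Alley", "FireplaceQu"]:
    train[col] = train[col].fillna("None")
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(0)

# LotFrontage is genuinely missing; borrow the median of the neighborhood.
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```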
# Stacked Regression
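Also planned. The simplest version in the referenced kernel is a plain averaging ensemble; this sketch follows its AveragingModels idea from memory, with illustrative base models:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso

class AveragingModels(BaseEstimator, RegressorMixin):
    """Fit clones of several base models and average their predictions."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        self.models_ = [clone(m).fit(X, y) for m in self.models]
        return self

    def predict(self, X):
        preds = np.column_stack([m.predict(X) for m in self.models_])
        return preds.mean(axis=1)

# Behaves like any sklearn regressor, so it plugs into cross_val_score.
averaged = AveragingModels([
    Lasso(alpha=0.0005),
    KernelRidge(alpha=0.6, kernel="polynomial", degree=2, coef0=2.5),
])
```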