Sunday, September 24, 2017

[Python] mlxtend


mlxtend: http://rasbt.github.io/mlxtend/

StackingCVRegressor: http://rasbt.github.io/mlxtend/user_guide/regressor/StackingCVRegressor/



Example (Improvement of https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard):


1. Add a new feature: age of the house
  • all_data['HouseAge'] = 2012 - all_data['YearBuilt']
  • all_data['HouseAge'] = 2011 - all_data['YearBuilt']

Based on the publication date of this paper (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf), we choose 2011 (or 2012) as the base year.
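As a sketch of the feature above (assuming all_data is a pandas DataFrame with a YearBuilt column, as in the referenced kernel; the sample values here are made up):

```python
import pandas as pd

# Hypothetical sample; in the kernel, all_data combines train and test.
all_data = pd.DataFrame({"YearBuilt": [1961, 2005, 1995]})

# Age of the house relative to the dataset's base year (2011 per the paper).
BASE_YEAR = 2011
all_data["HouseAge"] = BASE_YEAR - all_data["YearBuilt"]
print(all_data["HouseAge"].tolist())  # [50, 6, 16]
```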


2. Replace the implementation of StackingAveragedModels
  • from mlxtend.regressor import StackingCVRegressor
  • stacking_regressor = StackingCVRegressor(regressors=(ENet, GBoost, KRR), meta_regressor=lasso)

Results: 0.11300



Wednesday, September 6, 2017

[Python] Install lightgbm on macOS

If pip install lightgbm fails, try this guide:
https://github.com/Microsoft/LightGBM/wiki/Installation-Guide

If cmake .. fails while building lightgbm, try: brew install gcc@7


[Python] Install xgboost on macOS

Running pip install xgboost on macOS may fail with the following error message:

List of candidates:
/private/var/folders/.../xgboost/libxgboost.so
/private/var/folders/.../xgboost/../../lib/libxgboost.so
/private/var/folders/.../xgboost/./lib/libxgboost.so
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/.../

Sunday, September 3, 2017

[Kaggle] House Prices: Advanced Regression Techniques


Problem: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard

Referencing https://www.kaggle.com/apapiu/regularized-linear-models. This is still the imitation stage, so I follow other people's approaches as much as possible; only after that is there a chance to innovate. The shoulders of giants are good to stand on.

Improved the blending weights to preds = 0.85 * lasso_preds + 0.15 * xgb_preds. The score improved (0.12086 -> 0.12049), but the ranking is still poor.
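The blend above is just a convex combination of the two models' predictions. A minimal sketch, with made-up prediction arrays standing in for the real model outputs (which are on the log1p(SalePrice) scale in the referenced kernel):

```python
import numpy as np

# Hypothetical predictions; in the kernel these come from the
# fitted lasso and xgboost models.
lasso_preds = np.array([12.1, 11.8, 12.5])
xgb_preds = np.array([12.0, 11.9, 12.4])

# Weighted average favoring the lasso model.
preds = 0.85 * lasso_preds + 0.15 * xgb_preds
print(preds)  # [12.085 11.815 12.485]
```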




https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard


Improvement plan:

# Remove outliers (read the documentation provided with the problem carefully):

There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students.

However, the documentation also says:

A second issue closely related to the intended use of the model, is the handling of outliers and unusual observations. In general, I instruct my students to never throw away data points simply because they do not match a priori expectations (or other data points). I strongly make this point in the situation where data are being analyzed for research purposes that will be shared with a larger audience. Alternatively, if the purpose is to once again create a common use model to estimate a “typical” sale, it is in the modeler’s best interest to remove any observations that do not seem typical (such as foreclosures or family sales).

This is a bit of a dilemma. The approach in https://www.kaggle.com/humananalog/xgboost-lasso/code is to filter out every house with GrLivArea > 4000, then adjust the blend to y_pred = 0.4 * y_pred_xgb + 0.6 * y_pred_lasso. The score improves substantially. Below, I study in detail how to do the feature engineering.
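A minimal sketch of that filtering step, assuming the training data is a pandas DataFrame with a GrLivArea column (the sample rows here are made up; the real data comes from the competition's train.csv):

```python
import pandas as pd

# Hypothetical training rows; the real frame is loaded from train.csv.
train = pd.DataFrame({"GrLivArea": [1500, 4500, 2000, 5600],
                      "SalePrice": [200000, 180000, 250000, 160000]})

# Drop the unusual observations the documentation mentions
# (houses with more than 4000 square feet of living area).
train = train[train["GrLivArea"] <= 4000]
print(len(train))  # 2

# Later, the two models' predictions are blended as:
# y_pred = 0.4 * y_pred_xgb + 0.6 * y_pred_lasso
```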



# Feature Engineering


Then, rerunning the program on another computer, the score gets worse (0.11481).

# Imputation

# Stacked Regression