Sunday, September 24, 2017

[Python] mlxtend


mlxtend: http://rasbt.github.io/mlxtend/

StackingCVRegressor: http://rasbt.github.io/mlxtend/user_guide/regressor/StackingCVRegressor/



Example (Improvement of https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard):


1. Add a new feature: age of the house
  • all_data['HouseAge'] = 2012 - all_data['YearBuilt']
  • all_data['HouseAge'] = 2011 - all_data['YearBuilt']

Based on the publication date of this paper (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf), we choose 2011 (or 2012) as the base year.
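As a sketch of the feature above (assuming all_data is a pandas DataFrame with a YearBuilt column, as in the referenced kernel; the sample values here are made up):

```python
import pandas as pd

# Hypothetical sample; in the kernel, all_data combines train and test.
all_data = pd.DataFrame({"YearBuilt": [1961, 2005, 1995]})

# Age of the house relative to the dataset's base year (2011 per the paper).
BASE_YEAR = 2011
all_data["HouseAge"] = BASE_YEAR - all_data["YearBuilt"]
print(all_data["HouseAge"].tolist())  # [50, 6, 16]
```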


2. Replace the implementation of StackingAveragedModels
  • from mlxtend.regressor import StackingCVRegressor
  • stacking_regressor = StackingCVRegressor(regressors=(ENet, GBoost, KRR), meta_regressor=lasso)

Results: 0.11300



Wednesday, September 6, 2017

[Python] Install lightgbm on macOS

If pip install lightgbm fails, try this guide:
https://github.com/Microsoft/LightGBM/wiki/Installation-Guide

If cmake .. fails while building lightgbm, try: brew install gcc@7


[Python] Install xgboost on macOS

Running pip install xgboost on macOS may fail with the following error message:

List of candidates:
/private/var/folders/.../xgboost/libxgboost.so
/private/var/folders/.../xgboost/../../lib/libxgboost.so
/private/var/folders/.../xgboost/./lib/libxgboost.so
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/.../

Sunday, September 3, 2017

[Kaggle] House Prices: Advanced Regression Techniques


Problem: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard

Referencing https://www.kaggle.com/apapiu/regularized-linear-models. This is still the imitation stage, so I follow other people's approaches as much as possible; only after that is there a chance to innovate. The shoulders of giants are good to stand on.

Improved the blending weights to preds = 0.85 * lasso_preds + 0.15 * xgb_preds. The score improved (0.12086 -> 0.12049), but the ranking is still poor.
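The blend above is just a convex combination of the two models' predictions. A minimal sketch, with made-up prediction arrays standing in for the real model outputs (which are on the log1p(SalePrice) scale in the referenced kernel):

```python
import numpy as np

# Hypothetical predictions; in the kernel these come from the
# fitted lasso and xgboost models.
lasso_preds = np.array([12.1, 11.8, 12.5])
xgb_preds = np.array([12.0, 11.9, 12.4])

# Weighted average favoring the lasso model.
preds = 0.85 * lasso_preds + 0.15 * xgb_preds
print(preds)  # [12.085 11.815 12.485]
```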




https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard


Improvement plan:

# Remove outliers (read the documentation provided with the problem carefully):

There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students.

However, the documentation also says:

A second issue closely related to the intended use of the model, is the handling of outliers and unusual observations. In general, I instruct my students to never throw away data points simply because they do not match a priori expectations (or other data points). I strongly make this point in the situation where data are being analyzed for research purposes that will be shared with a larger audience. Alternatively, if the purpose is to once again create a common use model to estimate a “typical” sale, it is in the modeler’s best interest to remove any observations that do not seem typical (such as foreclosures or family sales).

This is a bit of a dilemma. The approach in https://www.kaggle.com/humananalog/xgboost-lasso/code is to filter out every house with GrLivArea > 4000, then adjust the blend to y_pred = 0.4 * y_pred_xgb + 0.6 * y_pred_lasso. The score improves substantially. Below, I study in detail how to do the feature engineering.
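A minimal sketch of that filtering step, assuming the training data is a pandas DataFrame with a GrLivArea column (the sample rows here are made up; the real data comes from the competition's train.csv):

```python
import pandas as pd

# Hypothetical training rows; the real frame is loaded from train.csv.
train = pd.DataFrame({"GrLivArea": [1500, 4500, 2000, 5600],
                      "SalePrice": [200000, 180000, 250000, 160000]})

# Drop the unusual observations the documentation mentions
# (houses with more than 4000 square feet of living area).
train = train[train["GrLivArea"] <= 4000]
print(len(train))  # 2

# Later, the two models' predictions are blended as:
# y_pred = 0.4 * y_pred_xgb + 0.6 * y_pred_lasso
```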



# Feature Engineering


Then, rerunning the program on another computer, the score gets worse (0.11481).

# Imputation

# Stacked Regression