Sunday, September 24, 2017

[Python] mlxtend


mlxtend: http://rasbt.github.io/mlxtend/

StackingCVRegressor: http://rasbt.github.io/mlxtend/user_guide/regressor/StackingCVRegressor/



Example (improving on https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard):


1. Add a new feature: age of the house
  • all_data['HouseAge'] = 2012 - all_data['YearBuilt']
  • all_data['HouseAge'] = 2011 - all_data['YearBuilt']

Based on the publication date of this paper: https://ww2.amstat.org/publications/jse/v19n3/decock.pdf, we choose 2011 (or 2012) as the base year.
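As a minimal sketch of the feature with pandas (the sample YearBuilt values here are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the Ames all_data with a YearBuilt column.
all_data = pd.DataFrame({"YearBuilt": [1961, 2005, 1995]})

# Base year 2011, per the paper's publication date.
all_data["HouseAge"] = 2011 - all_data["YearBuilt"]
```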


2. Replace the implementation of StackingAveragedModels
  • from mlxtend.regressor import StackingCVRegressor
  • stacking_regressor = StackingCVRegressor(regressors=(ENet, GBoost, KRR), meta_regressor=lasso)

Results: 0.11300
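StackingCVRegressor fits each base regressor with cross-validation and trains the meta-regressor on the out-of-fold predictions, which avoids leaking training targets into the meta-level. A rough sketch of that idea using only scikit-learn and synthetic data; the model choices below are placeholders, not the notebook's tuned ENet/GBoost/KRR/lasso:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the processed house-price features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Placeholder base models.
base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=3, random_state=0)]

# Out-of-fold predictions of each base model become the meta-features.
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models]
)

# The meta-regressor learns how to combine the base models.
meta_model = Lasso(alpha=0.1).fit(meta_features, y)

# For prediction, refit the base models on the full training set.
for model in base_models:
    model.fit(X, y)
stacked_preds = meta_model.predict(
    np.column_stack([model.predict(X) for model in base_models])
)
```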



Wednesday, September 6, 2017

[Python] Install lightgbm on macOS

If pip install lightgbm fails, follow the installation guide:
https://github.com/Microsoft/LightGBM/wiki/Installation-Guide

If cmake .. fails while building LightGBM, try: brew install gcc@7


[Python] Install xgboost on macOS

Running pip install xgboost on macOS may fail with an error message like the following:
List of candidates:
/private/var/folders/.../xgboost/libxgboost.so /private/var/folders/.../xgboost/../../lib/libxgboost.so /private/var/folders/.../xgboost/./lib/libxgboost.so
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/.../

Sunday, September 3, 2017

[Kaggle] House Prices: Advanced Regression Techniques


Problem: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard

Referring to https://www.kaggle.com/apapiu/regularized-linear-models. This is still the imitation stage, so I follow other people's approaches as much as possible; only after that is there room to talk about innovation. Standing on the shoulders of giants works well.

Improved the hyperparameters to preds = 0.85 * lasso_preds + 0.15 * xgb_preds. The score improved (0.12086 -> 0.12049), but the ranking is still poor.
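The blend is just an element-wise weighted average of the two models' predictions; a toy sketch (the prediction values below are invented):

```python
import numpy as np

# Toy stand-ins for the two models' log-price predictions.
lasso_preds = np.array([11.8, 12.1, 12.4])
xgb_preds = np.array([11.6, 12.3, 12.2])

# The 0.85 / 0.15 weights were tuned by hand against the leaderboard.
preds = 0.85 * lasso_preds + 0.15 * xgb_preds
```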




https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard


Improvement plan:

# Remove outliers (read the documentation provided with the competition carefully)

There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students.

However, the document also says:

A second issue closely related to the intended use of the model, is the handling of outliers and unusual observations. In general, I instruct my students to never throw away data points simply because they do not match a priori expectations (or other data points). I strongly make this point in the situation where data are being analyzed for research purposes that will be shared with a larger audience. Alternatively, if the purpose is to once again create a common use model to estimate a “typical” sale, it is in the modeler’s best interest to remove any observations that do not seem typical (such as foreclosures or family sales).

This is a bit of a dilemma. The approach in https://www.kaggle.com/humananalog/xgboost-lasso/code is to filter out everything with GrLivArea > 4000 and then adjust the blend to y_pred = 0.4 * y_pred_xgb + 0.6 * y_pred_lasso; the score improves substantially. Below, I study how to do the feature engineering in detail.
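The filtering step is a one-line pandas mask; a sketch with toy rows (the values are invented, chosen so that two rows exceed the 4000 sq ft cutoff the paper recommends):

```python
import pandas as pd

# Toy training rows; GrLivArea > 4000 marks the unusual sales.
train = pd.DataFrame({
    "GrLivArea": [1500, 2400, 4676, 5642],
    "SalePrice": [180000, 250000, 184750, 160000],
})

# Drop the oversized houses before fitting, as the kernel above does.
train = train[train["GrLivArea"] <= 4000].reset_index(drop=True)
```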



# Feature Engineering


Then, rerunning the program on another computer, the score got worse (0.11481).
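One common cause of score drift across reruns or machines is unpinned randomness; this is only my guess at the cause, not something the notebook confirms. A sketch of pinning the seeds:

```python
import random

import numpy as np

SEED = 42

# Pin the global RNGs so reruns draw the same random numbers.
random.seed(SEED)
np.random.seed(SEED)

# Model-level seeds matter too, e.g. the random_state parameter on
# scikit-learn estimators and xgboost.XGBRegressor (hypothetical usage).
```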

# Imputation

# Stacked Regression


Wednesday, August 23, 2017

[FWD] Data Science Tutorial


https://www.kaggle.com/helgejo/an-interactive-data-science-tutorial



The key is the first step, business understanding. It reminds me of the book How to Solve It, so I asked my wife to buy the Chinese translation for me to study. In fact, similar situations come up at work all the time: before solving a problem, first understand the problem.


Example: https://www.kaggle.com/c/titanic

The goal is to predict which kinds of people survived. Anyone who has seen the movie Titanic probably has the impression that children and women were more likely to survive, but the film also has scenes of elderly men sneaking onto the lifeboats; in machine learning terms, those are outliers. In other words, a 100% accurate prediction on the leaderboard is not necessarily produced by machine learning methods.

From this source: https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic, we know that children in second class had a 100% survival rate. This is based on historical records, not made up, so when a row of test data meets these conditions, we can pretty much say survived = 1.

The above uses historical aggregate statistics.
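The rule above can be applied as a simple override on top of any model's predictions; a sketch with invented test rows (the age cutoff of 16 is my choice, not something the historical record specifies):

```python
import pandas as pd

# Toy test rows with a model's initial predictions in Survived.
test = pd.DataFrame({
    "Pclass": [2, 2, 3],
    "Age": [8.0, 40.0, 8.0],
    "Survived": [0, 0, 0],
})

# Historical record: every second-class child survived, so override
# the model's prediction for those rows.
is_second_class_child = (test["Pclass"] == 2) & (test["Age"] < 16)
test.loc[is_second_class_child, "Survived"] = 1
```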


What about historical passenger data? If we know every passenger's name and other details, then although the Name field in the train/test data looks useless, it is actually the most important feature. (It has other uses too: for example, identifying families through surnames. Given a father, a mother, and a child, the child's survival rate is higher while the father's or mother's is lower; that is parental love.)

Just throw a passenger's name into Google search, open the first few result pages, and cross-check gender, age, and other details; you can almost always confirm whether that person survived or died. But since the Google search API costs money and the Bing search API is free during the trial period (though less accurate than Google), I queried everyone with Bing first and then looked up the uncertain cases by hand. Details below:

https://github.com/Meng-Gen/KaggleStudy/blob/master/c/titanic/study.ipynb


That's it.

Nothing from machine learning was used at all. So the perfectly predicted results on https://www.kaggle.com/c/digit-recognizer/leaderboard may likewise be some kind of data leakage.