Sunday, May 27, 2018
[competitive-data-science] Week 4
Hyperparameter tuning
Understand broadly the problem
Tips on collaboration
- Tuning the hyper-parameters of an estimator (sklearn)
- Optimizing hyperparameters with hyperopt
- Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python
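Following the links above, here is a minimal sketch of what tuning with hyperopt can look like; X and y are placeholder arrays, and the search space is illustrative, not a recommendation:
    from hyperopt import fmin, tpe, hp, STATUS_OK
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    def objective(params):
        # hyperopt minimizes, so return the negated CV score
        model = XGBClassifier(max_depth=int(params['max_depth']),
                              learning_rate=params['learning_rate'],
                              n_estimators=200)
        score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
        return {'loss': -score, 'status': STATUS_OK}

    space = {'max_depth': hp.quniform('max_depth', 3, 10, 1),
             'learning_rate': hp.loguniform('learning_rate', -5, -1)}
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50)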
KazAnova's Competition Pipeline (https://github.com/kaz-Anova)
- Understand the problem
- Exploratory analysis
- Define CV strategy
- Feature engineering
- Modelling
- Ensembling
Understand broadly the problem
- Type of problem
- How BIG is the data
- Hardware needed (CPUs, GPUs, RAM, Disk space)
- Software needed (Tensorflow, Keras, sklearn, LightGBM, xgboost)
- What is the metric being tested on?
- Is previous code relevant?
Do some EDA
- Plot histograms of variables. Check that features look similar between train and test (see the sketch after this list).
- Plot features vs the target variable and vs time.
- Consider univariate predictability metrics (IR, R, AUC).
- Bin numerical features and compute correlation metrics.
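A minimal sketch of the train-vs-test histogram check, assuming pandas DataFrames named df_train and df_test and a placeholder column name 'feature':
    import matplotlib.pyplot as plt

    plt.hist(df_train['feature'].dropna(), bins=50, alpha=0.5, density=True, label='train')
    plt.hist(df_test['feature'].dropna(), bins=50, alpha=0.5, density=True, label='test')
    plt.legend()
    plt.title('feature: train vs test')
    plt.show()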
Decide CV strategy
- This step is critical. Getting it right is a good indication of how the rest of the competition will go.
- People have won by just selecting the right way to validate.
- Is time important? Split by time; use time-based validation.
- Does the test set contain different entities than the train set? Use stratified validation.
- Is it completely random? Use random validation (random K-fold).
- Or a combination of all the above (see the split sketch after this list).
- Use the test LB to verify that local validation behaves consistently.
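A minimal sketch of these validation schemes with sklearn; X and y are placeholder arrays:
    from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

    random_cv = KFold(n_splits=5, shuffle=True, random_state=0)              # completely random
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class ratios
    time_cv = TimeSeriesSplit(n_splits=5)  # rows sorted by time: train on past, validate on future

    for train_idx, valid_idx in stratified_cv.split(X, y):
        pass  # fit on X[train_idx], score on X[valid_idx]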
Feature engineering
- Image classification: Scaling, shifting, rotations, CNNs. (Data Science Bowls)
- Sound classification: Fourier, MFCC, spectrograms, scaling. (TensorFlow speech recognition)
- Text classification: TF-IDF, SVD, stemming, spell checking, stop-word removal, n-grams. (StumbleUpon Evergreen Classification)
- Time series: Lags, weighted averaging, exponential smoothing. (Walmart recruitment)
- Categorical: Target encoding (sketched after this list), frequency, one-hot, ordinal, label encoding. (Amazon employee)
- Numerical: Scaling, binning, derivatives, outlier removal, dimensionality reduction. (Africa soil)
- Interactions: Multiplications, divisions, group-by features, concatenations. (Homesite)
- Recommenders: Features on transactional history, item popularity, frequency of purchase. (Acquire Valued Shoppers)
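As a minimal sketch of the target encoding mentioned above, computed out-of-fold to limit leakage; df is an assumed pandas DataFrame, and 'cat' and 'target' are placeholder column names:
    from sklearn.model_selection import KFold

    df['cat_te'] = float('nan')
    for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
        means = df.iloc[tr_idx].groupby('cat')['target'].mean()   # per-category mean target
        df.iloc[val_idx, df.columns.get_loc('cat_te')] = df['cat'].iloc[val_idx].map(means).values
    df['cat_te'] = df['cat_te'].fillna(df['target'].mean())       # unseen categories get the global mean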
Modeling
- Image classification: CNNs (Resnet, VGG, DenseNet, ...)
- Sound classification: CNNs (CRNN), LSTM
- Text classification: GBMs, Linear, DL, Naive Bayes, kNNs, LibFM, LIBFFM
- Time series: Autoregressive models, ARIMA, Linear, GBMs, DL, LSTM
- Categorical: GBMs, Linear, DL, LibFM, LIBFFM (a GBM baseline is sketched after this list)
- Numerical: GBMs, Linear, DL, SVMs
- Interactions: GBMs, Linear, DL
- Recommenders: CF, DL, LibFM, LIBFFM, GBMs
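For the tabular cases (categorical, numerical, interactions), a minimal GBM baseline might look like the following; X_train, y_train, and X_valid are placeholders:
    import lightgbm as lgb

    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)                      # fit on the training fold
    valid_pred = model.predict_proba(X_valid)[:, 1]  # save these predictions for ensembling later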
Ensembling
- Throughout modeling, predictions on internal validation and test are saved.
- There are different ways to combine predictions, from simple averaging to multilayer stacking (see the sketch after this list).
- Small data requires simpler ensemble techniques.
- It helps to average a few low-correlated predictions with good scores.
- The stacking process repeats the modeling process, using the saved predictions as features.
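A minimal sketch of two combining schemes, assuming the saved out-of-fold validation predictions of two models (pred_a, pred_b) and the validation target y_valid as placeholders:
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    blend = (pred_a + pred_b) / 2.0                       # simple averaging

    level1 = np.column_stack([pred_a, pred_b])            # saved predictions become features
    stacker = LogisticRegression().fit(level1, y_valid)   # meta-model for stacking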
Tips on collaboration
- More fun.
- Learn more.
- Better score.
- You gain in at least two ways:
- You can cover more ground.
- Every person sees the problem from a different angle, leading to more thorough solutions.
- Start collaborating after getting some experience to understand the dynamics.
- Start with people around your "rank".
- Look for people that are likely to do different things well or that specialize in certain areas.
Selecting final submissions
- Normally, select the best submission on local validation and the best on the LB.
- It is good to monitor correlations between candidate submissions. If submissions exist with high scores but significantly lower correlation, they could be considered too.
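A minimal sketch of that correlation check, with sub_a and sub_b as placeholder prediction vectors from two candidate submissions:
    import numpy as np

    corr = np.corrcoef(sub_a, sub_b)[0, 1]
    print('correlation: %.3f' % corr)  # high score + low correlation makes a good second pick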
Final tips
- You never lose. You may not win prize money, but you always gain knowledge, experience, chances to meet and collaborate with talented people in the field, and a boost to your CV.
- The Kaggle community may be the most kind, helpful community I have ever experienced in any social context.
- After the competition, look for people sharing their approaches.
- Create a notebook of useful methods and keep updating it.
Additional Links:
[competitive-data-science] Week 3
Ranking
- Learning to Rank using Gradient Descent -- the original paper on the pairwise method for AUC optimization (its loss is sketched after this list)
- Overview of further developments of RankNet
- RankLib (implementations for the two papers above)
- Learning to Rank Overview
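As a minimal sketch (not the papers' full method), the pairwise RankNet-style loss from the first link above: for a pair where item i should rank above item j, the loss on the score difference is log(1 + exp(-(s_i - s_j))):
    import numpy as np

    def pairwise_loss(s_i, s_j):
        # penalizes orderings where the score s_i does not exceed s_j
        return np.log1p(np.exp(-(s_i - s_j)))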
Clustering
[competitive-data-science] Week 2
Exploratory Data Analysis (EDA): what and why?
- Better understand the data
- Build an intuition about the data
- Generate hypotheses
- Find insights
- Please, do not start with stacking...
- Dato Winners' Interview: 1st place, Mad Professors: http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/
- With EDA we can:
- Get comfortable with the data
- Find magic features
- Do EDA first. Do not immediately dig into modeling.
Thursday, May 24, 2018
[competitive-data-science] Week 1
Real World ML Pipeline:
- Understanding of business problem
- Problem formalization
- Data collecting
- Data preprocessing
- Modelling
- Way to evaluate model in real life
- Way to deploy model
Real World Aspect:
- Problem formalization
- Choice of target metric
- Deployment issues
- Inference speed
- Data collecting
- Model complexity
- Target metric value
Competition Aspect:
- Problem formalization (N)
- Choice of target metric (N)
- Deployment issues (N)
- Inference speed (N)
- Data collecting (Y/N)
- Model complexity (Y/N)
- Target metric value (Y)
Recap of main ML algorithms:
Overview of ML methods:
- Scikit-Learn (or sklearn) library
- Overview of k-NN (sklearn's documentation)
- Overview of Linear Models (sklearn's documentation)
- Overview of Decision Trees (sklearn's documentation)
- Overview of algorithms and parameters in H2O documentation
Additional Tools:
- Vowpal Wabbit repository
- XGBoost repository
- LightGBM repository
- Interactive demo of simple feed-forward Neural Net
- Frameworks for Neural Nets: Keras, PyTorch, TensorFlow, MXNet, Lasagne
- Example from sklearn with different decision surfaces
- Arbitrary order factorization machines
- Basic SciPy stack (ipython, numpy, pandas, matplotlib)
- Jupyter Notebook
- Stand-alone python tSNE package
- Libraries to work with sparse CTR-like data: LibFM, LibFFM
- Another tree-based method: RGF (implementation, paper)
- Python distribution with all-included packages: Anaconda
- Blog "datas-frame" (contains posts about effective Pandas usage)
Feature preprocessing:
- Preprocessing in Sklearn
- Andrew NG about gradient descent and feature scaling
- Feature Scaling and the effect of standardization for machine learning algorithms
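A minimal sketch of the scaling options those links discuss; X is a placeholder numeric feature matrix:
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X_std = StandardScaler().fit_transform(X)      # zero mean, unit variance
    X_minmax = MinMaxScaler().fit_transform(X)     # rescale each feature to [0, 1]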
Feature generation:
- Discover Feature Engineering, How to Engineer Features and How to Get Good at It
- Discussion of feature engineering on Quora
- Bag of words
- Word2vec
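A minimal bag-of-words sketch with sklearn; texts is a placeholder list of documents:
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    bow = CountVectorizer().fit_transform(texts)    # sparse matrix of raw term counts
    tfidf = TfidfVectorizer().fit_transform(texts)  # TF-IDF reweighted counts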
NLP Libraries:
Feature extraction from images:
Wednesday, May 16, 2018
My recent job hunt: process and reflections
Staying at the old job is one choice; looking for a new one is another.
I'll start with the parts that have reached a conclusion; the parts without conclusions can't be written up even if I wanted to (two companies are still pending, and there are a few others I'd like to look at).
1. Google, Microsoft
Conclusion: they ignored me completely, which is entirely reasonable; thinking it over, my resume has no highlights to speak of.
I'm trying to find ways to improve and to tie that to my own interests. Naturally I thought of taking some courses on Coursera to keep learning; beyond that, I also want to raise my visibility. The world is pragmatic: you need some results people can see before they take an interest in you. With that in mind:
- How to Win a Data Science Competition: Learn from Top Kagglers (easy to find on your own)
- Kaggle: https://www.kaggle.com/
All I can do is keep fighting against myself.
2. Amazon
Conclusion: a thank-you (rejection) letter for Software Development Engineer, and a paused interview process for Cloud Support Engineer.
I signed an NDA and cannot disclose details, so I can only discuss the public job descriptions. The Cloud Support Engineer role differs greatly from my current work; I applied for the wrong position. Still, I really like what Amazon's leadership principles say:
- Leadership Principles: https://www.amazon.jobs/principles
I also tried to understand the difficulties I had met at work; you have to stop for a moment to see past mistakes clearly. Skimming a lot of DevOps material, I found that Google's understanding of DevOps stands apart:
- Site Reliability Engineering: https://landing.google.com/sre/
For example, on the evolution of automation, it describes how the progression goes:
- No automation
- Externally maintained system-specific automation
- Externally maintained generic automation
- Internally maintained system-specific automation
- Systems that don’t need any automation
It covers much more; with my shallow knowledge, I found the reading especially rewarding. What Google does may serve as a reference for us, but it is ultimately Google's experience, and when I wanted to show off a few lines of it in interviews I still couldn't do so with ease.
3. One DevOps Startup
Conclusion: got an offer.
Coming up with the idea and turning it into a money-making machine is genuinely impressive; Taiwanese startups are remarkable. Through the interviews I came to understand DevOps anew: the things the startup treasures (Jenkins, Ansible, Kubernetes) do not exist as a matter of course; I simply had never dug deeply into what DevOps entails.
(I have an urge to set up these tech stacks and play with them, once I have time.)
- Jenkins: https://jenkins.io/
- Ansible: https://www.ansible.com/
- Kubernetes: https://kubernetes.io/
If there is a shortcoming, it may be the merely passable salary, although pay in this industry is already good compared with traditional Taiwanese companies. The interviews had no tricky technical questions, which says a lot. Since entering the workforce I have been 安安's daddy; even on a merely passable salary, I don't have the ability to start a company, be the boss, and pay employees that much myself, so I should still thank the company for being willing to pay me this much (I'm not a shill for management). If a future employer is willing to pay more, I will be very thankful to them.
If the essence of capitalism is greed, then that is exactly how greedy we should be.
(A truth without much warmth.)
(I fantasize about finishing labor-law courses for credit, about studying seriously for the bar exam, about opening the civil-procedure lecture notes without getting sleepy.)
(But none of it withstands the test of the truth of greed.)
Every moment has its own confusions, but I always want to make myself a little happier.
I hope my dear 沿莉 can relax a little, and that I can too.
Thursday, May 10, 2018
[Note] Competitive Data Science
Coursera: How to Win a Data Science Competition: Learn from Top Kagglers
Wednesday, May 2, 2018
Notes on Firebase
Notes:
Firebase Database REST API: https://firebase.google.com/docs/reference/rest/database/
- access_token: https://firebase.google.com/docs/database/rest/auth
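A minimal sketch of reading a path through the Realtime Database REST API; the database URL, path, and token value are placeholders (see the auth link above for how to obtain an access_token):
    import requests

    DB_URL = 'https://<your-project>.firebaseio.com'
    resp = requests.get(DB_URL + '/some/path.json',
                        params={'access_token': '<oauth2-access-token>'})
    resp.raise_for_status()
    print(resp.json())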