Sunday, May 27, 2018

[competitive-data-science] Week 5

[competitive-data-science] Week 4

Hyperparameter tuning

KazAnova's Competition Pipeline (https://github.com/kaz-Anova)
  • Understand the problem
  • Exploratory analysis
  • Define CV strategy
  • Feature engineering 
  • Modelling 
  • Ensembling

Understand the problem broadly
  • Type of problem
  • How BIG is the data
  • Hardware needed (CPUs, GPUs, RAM, Disk space)
  • Software needed (Tensorflow, Keras, sklearn, LightGBM, xgboost)
  • What is the metric being tested on?
  • Is previous code relevant?

Do some EDA
  • Plot histograms of variables; check that each feature's distribution looks similar between train and test.
  • Plot features against the target variable and against time.
  • Consider univariate predictability metrics (IR, R, AUC).
  • Bin numerical features and inspect correlation matrices.
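The train-vs-test histogram check in the first bullet can be automated. A minimal NumPy sketch on synthetic data (the two arrays are hypothetical stand-ins for one feature's train and test columns): it compares normalized histograms on shared bins and reports a total-variation-style distance that is near 0 when the distributions match and approaches 1 under severe train/test shift.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical train/test values of a single feature.
train_feat = rng.normal(0.0, 1.0, size=5000)
test_feat = rng.normal(0.0, 1.0, size=2000)

# Shared bin edges so the two histograms are directly comparable.
bins = np.histogram_bin_edges(np.concatenate([train_feat, test_feat]), bins=20)
train_hist, _ = np.histogram(train_feat, bins=bins, density=True)
test_hist, _ = np.histogram(test_feat, bins=bins, density=True)

# Total-variation distance between the two normalized histograms:
# 0 means identical distributions, 1 means fully disjoint support.
bin_widths = np.diff(bins)
shift_score = 0.5 * np.sum(np.abs(train_hist - test_hist) * bin_widths)
print(f"distribution shift score: {shift_score:.3f}")
```

Running this for every feature and sorting by the score is a quick way to spot the columns most affected by train/test shift.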

Decide CV strategy
  • This step is critical. Its success is a good indicator of how the rest of the competition will go.
  • People have won competitions just by selecting the right way to validate.
  • Is time important? Split by time: time-based validation.
  • Does the test set contain entities not seen in train? Use validation stratified by entity.
  • Is it completely random? Use random validation (random K-fold).
  • Or a combination of all of the above.
  • Use the test LB to verify your CV strategy.
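The two most common schemes above, random K-fold and time-based validation, can be sketched in a few lines of NumPy. This is a toy illustration, not a full CV framework; in practice scikit-learn's `KFold` and `TimeSeriesSplit` cover these cases.

```python
import numpy as np

def random_kfold_indices(n, k, seed=0):
    """Random K-fold: shuffle all indices, then split into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def time_based_splits(n, k):
    """Time-based validation: train on a growing prefix, validate on the next chunk.

    Assumes rows are already sorted by time.
    """
    fold_edges = np.linspace(0, n, k + 1, dtype=int)
    splits = []
    for i in range(1, k):
        train_idx = np.arange(0, fold_edges[i])
        valid_idx = np.arange(fold_edges[i], fold_edges[i + 1])
        splits.append((train_idx, valid_idx))
    return splits

folds = random_kfold_indices(10, 3)
splits = time_based_splits(10, 3)
for tr, va in splits:
    assert tr.max() < va.min()  # never peek into the future
```

The key difference: random K-fold assumes rows are exchangeable, while the time-based scheme guarantees validation rows always come after every training row.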

Feature engineering
  • Image classification: Scaling, shifting, rotations, CNNs. (Data Science Bowls)
  • Sound classification: Fourier transforms, MFCCs, spectrograms, scaling. (TensorFlow speech recognition)
  • Text classification: TF-IDF, SVD, stemming, spell checking, stop-word removal, n-grams. (StumbleUpon Evergreen Classification)
  • Time series: Lags, weighted averaging, exponential smoothing. (Walmart recruitment)
  • Categorical: Target encoding, frequency, one-hot, ordinal, label encoding. (Amazon employee)
  • Numerical: Scaling, binning, derivatives, outlier removal, dimensionality reduction. (Africa soil)
  • Interactions: Multiplications, divisions, group-by features, concatenations. (Homesite)
  • Recommenders: Features on transactional history, item popularity, frequency of purchase. (Acquire Valued Shoppers)
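Two of the categorical encodings listed above, frequency encoding and smoothed target (mean) encoding, can be sketched with pandas on a hypothetical toy frame. The column names and the smoothing constant are made up for illustration, and in a real competition target encoding should be computed out-of-fold to avoid leakage.

```python
import pandas as pd

# Hypothetical training frame: one categorical column, one binary target.
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Frequency encoding: replace each category with its relative frequency.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Target (mean) encoding, smoothed toward the global mean so that rare
# categories do not get extreme values. alpha is an arbitrary hyperparameter.
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])
alpha = 5.0
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
df["city_te"] = df["city"].map(smoothed)
```

Categories with few rows (like "c" here) end up pulled toward the global mean, which is the point of the smoothing.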

Modeling
  • Image classification: CNNs (Resnet, VGG, DenseNet, ...)
  • Sound classification: CNNs (CRNN), LSTM
  • Text classification: GBMs, Linear, DL, Naive Bayes, kNNs, LibFM, LIBFFM
  • Time series: Autoregressive models, ARIMA, Linear, GBMs, DL, LSTM
  • Categorical: GBMs, Linear, DL, LibFM, LIBFFM
  • Numerical: GBMs, Linear, DL, SVMs
  • Interactions: GBMs, Linear, DL
  • Recommenders: CF, DL, LibFM, LIBFFM, GBMs

Ensembling
  • Throughout the process, save your predictions on both the internal validation and test sets.
  • There are different ways to combine predictions, from simple averaging to multilayer stacking.
  • Small datasets require simpler ensemble techniques.
  • It helps to average a few low-correlated predictions that have good scores.
  • The stacking process repeats the modeling process on top of the previous predictions.
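A minimal NumPy sketch of the two simplest combination schemes mentioned above, plain averaging and rank averaging, using hypothetical predictions from three models:

```python
import numpy as np

# Hypothetical validation predictions from three diverse models.
p1 = np.array([0.9, 0.2, 0.8, 0.4])
p2 = np.array([0.7, 0.1, 0.9, 0.5])
p3 = np.array([0.8, 0.3, 0.6, 0.2])
preds = np.vstack([p1, p2, p3])

# Simple average: the most robust ensemble on small data.
avg = preds.mean(axis=0)

# Rank average: replace each score with its rank before averaging,
# useful when models output differently scaled scores.
ranks = np.vstack([pred.argsort().argsort() for pred in preds])
rank_avg = ranks.mean(axis=0) / (len(p1) - 1)
```

Multilayer stacking generalizes this: instead of a fixed average, a second-level model is trained on the saved out-of-fold predictions.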

Tips on collaboration
  • More fun.
  • Learn more.
  • Better score.
  • You gain in at least two ways:
    • You can cover more ground.
    • Each person approaches the problem from a different angle, leading to more thorough solutions.
  • Start collaborating after getting some experience to understand the dynamics.
  • Start with people around your "rank".
  • Look for people that are likely to do different things well or that specialize in certain areas. 

Selecting final submissions
  • Normally, select the best submission locally and the best on the LB.
  • It is good to monitor correlations: if there are submissions with high scores but significantly lower correlation to your picks, consider them too.
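Monitoring submission correlation is essentially a one-liner with NumPy. A sketch on hypothetical prediction vectors: two near-identical submissions are redundant picks, while a strong but low-correlated one hedges against an overfit local CV.

```python
import numpy as np

# Hypothetical prediction vectors from three candidate final submissions.
sub_a = np.array([0.91, 0.15, 0.78, 0.42, 0.66])
sub_b = np.array([0.88, 0.20, 0.80, 0.40, 0.70])
sub_c = np.array([0.30, 0.85, 0.10, 0.95, 0.50])

# Pearson correlation between candidate submissions.
corr_ab = np.corrcoef(sub_a, sub_b)[0, 1]  # near 1: redundant picks
corr_ac = np.corrcoef(sub_a, sub_c)[0, 1]  # low/negative: diverse picks
```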

Final tips
  • You never lose. You may not win prize money, but you always gain knowledge, experience, opportunities to meet and collaborate with talented people in the field, and a boost to your CV.
  • The Kaggle community may be the kindest, most helpful community I have ever experienced in any social context.
  • After the competition, look for people sharing their approaches.
  • Create a notebook of useful methods and keep it updated.

Additional Links:

[competitive-data-science] Week 3



Ranking

Clustering

[competitive-data-science] Week 2

Exploratory Data Analysis (EDA): what and why?
  • Better understand the data
  • Build an intuition about the data
  • Generate hypotheses
  • Find insights
  • Please, do not start with stacking...

Thursday, May 24, 2018

[competitive-data-science] Week 1


As in any competitive field, you need to work very hard to get a prize in a competition.


Real World ML Pipeline:
  • Understanding of business problem
  • Problem formalization
  • Data collecting
  • Data preprocessing
  • Modelling
  • Way to evaluate model in real life
  • Way to deploy model

Real World Aspect:
  • Competition Problem formalization
  • Choice of target metric
  • Deployment issues
  • Inference speed
  • Data collecting
  • Model complexity
  • Target metric value

Competition Aspect:
  • Competition Problem formalization (N)
  • Choice of target metric (N)
  • Deployment issues (N)
  • Inference speed (N)
  • Data collecting (Y/N)
  • Model complexity (Y/N)
  • Target metric value (Y)

Recap of main ML algorithms:

Overview of ML methods:

Additional Tools:

Stack and packages:

Feature preprocessing:

Feature generation:

Feature extraction from text:

NLP Libraries:

Feature extraction from images:

Wednesday, May 16, 2018

My recent job hunt: process and reflections


Staying at my old job is one choice; finding a new job is another.

I will first write about the parts that have reached a conclusion; the undecided parts cannot be written up yet anyway (two companies, plus a few others I would still like to look into).


1. Google, Microsoft

Conclusion: no response at all. Fair enough; on reflection, my résumé has no real highlights to speak of.

I am trying to improve myself and to tie that improvement to my interests. Naturally, I turned to Coursera to find courses to strengthen my skills. Beyond that, I also want to increase my visibility: the world is pragmatic, and people only become interested in you once you have results they can see. With that in mind:

All I can do is keep fighting myself.



2. Amazon

Conclusion: a rejection letter for Software Development Engineer, and a paused interview process for Cloud Support Engineer.

I signed an NDA, so I cannot disclose details and can only discuss the publicly posted job descriptions. The Cloud Support Engineer role differs greatly from my current work, so I applied to the wrong position. That said, I really like what Amazon's leadership principles have to say:

I also tried to understand the difficulties I had run into at work; you have to pause before you can see your past mistakes clearly. Skimming through a lot of DevOps material, I found that Google's understanding of DevOps stands apart:

For example, when discussing the evolution of automation, they describe how the process unfolds:
  1. No automation
  2. Externally maintained system-specific automation
  3. Externally maintained generic automation
  4. Internally maintained system-specific automation
  5. Systems that don’t need any automation

They cover much more besides; given my own limited knowledge, I got a great deal out of reading it. Google's practices may serve as a reference for us, but in the end they are Google's experience, and when I try to show them off in interviews I still cannot do so fluently.



3. One DevOps Startup

Conclusion: offer received.

Coming up with this idea and turning it into a money-making machine is truly impressive; Taiwanese startups are remarkable. The interviews gave me a fresh understanding of DevOps: the things this startup treasures (Jenkins, Ansible, Kubernetes) are not to be taken for granted; I simply had never dug deeply into what DevOps involves.

(I have an urge to set up these tech stacks and play with them, once I have some free time.)

If there is anything lacking, it is probably the merely passable salary, although pay in this industry is already good compared with traditional Taiwanese companies. There were no tricky technical questions during the interviews, which says a lot. Since entering the workforce I have been An-an's dad; even on a merely passable salary, I do not have what it takes to start my own company and pay employees that much, so I should still thank the company for being willing to pay me this much (I am not shilling for employers). If a future employer is willing to pay me more, I should be very grateful to them as well.

If the essence of capitalism is greed, then we should be exactly that greedy.

(A truth without much warmth.)

(Fantasizing about finishing the labor-law courses for credit, about studying hard enough to pass the bar exam, about opening the civil-procedure lecture notes without feeling sleepy.)

(But none of it survives the test of the truth of greed.)



Every moment has its own confusion, but I always want to make myself a little happier.



I hope my dear Yanli can relax a little, and that I can too.

Thursday, May 10, 2018