Sunday, May 27, 2018
[competitive-data-science] Week 4
Hyperparameter tuning
Understand broadly the problem
Tips on collaboration
- Tuning the hyper-parameters of an estimator (sklearn)
- Optimizing hyperparameters with hyperopt
- Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python
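Following the links above, here is a minimal sketch of what tuning with hyperopt can look like; X and y are placeholder arrays, and the search space is illustrative, not a recommendation:
    from hyperopt import fmin, tpe, hp, STATUS_OK
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    def objective(params):
        # hyperopt minimizes, so return the negated CV score
        model = XGBClassifier(max_depth=int(params['max_depth']),
                              learning_rate=params['learning_rate'],
                              n_estimators=200)
        score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
        return {'loss': -score, 'status': STATUS_OK}

    space = {'max_depth': hp.quniform('max_depth', 3, 10, 1),
             'learning_rate': hp.loguniform('learning_rate', -5, -1)}
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50)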
KazAnova's Competition Pipeline (https://github.com/kaz-Anova)
- Understand the problem
- Exploratory analysis
- Define CV strategy
- Feature engineering
- Modelling
- Ensembling
Understand broadly the problem
- Type of problem
- How BIG is the data
- Hardware needed (CPUs, GPUs, RAM, Disk space)
- Software needed (Tensorflow, Keras, sklearn, LightGBM, xgboost)
- What is the metric being tested on?
- Is previous code relevant?
Do some EDA
- Plot histograms of variables. Check that features look similar between train and test (see the sketch after this list).
- Plot features vs the target variable and vs time.
- Consider univariate predictability metrics (IR, R, AUC).
- Bin numerical features and compute correlation metrics.
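A minimal sketch of the train-vs-test histogram check, assuming pandas DataFrames named df_train and df_test and a placeholder column name 'feature':
    import matplotlib.pyplot as plt

    plt.hist(df_train['feature'].dropna(), bins=50, alpha=0.5, density=True, label='train')
    plt.hist(df_test['feature'].dropna(), bins=50, alpha=0.5, density=True, label='test')
    plt.legend()
    plt.title('feature: train vs test')
    plt.show()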
Decide CV strategy
- This step is critical. Getting it right is a good indication of how the rest of the competition will go.
- People have won by just selecting the right way to validate.
- Is time important? Split by time; use time-based validation.
- Does the test set contain different entities than the train set? Use stratified validation.
- Is it completely random? Use random validation (random K-fold).
- Or a combination of all the above (see the split sketch after this list).
- Use the test LB to verify that local validation behaves consistently.
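A minimal sketch of these validation schemes with sklearn; X and y are placeholder arrays:
    from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

    random_cv = KFold(n_splits=5, shuffle=True, random_state=0)              # completely random
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class ratios
    time_cv = TimeSeriesSplit(n_splits=5)  # rows sorted by time: train on past, validate on future

    for train_idx, valid_idx in stratified_cv.split(X, y):
        pass  # fit on X[train_idx], score on X[valid_idx]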
Feature engineering
- Image classification: Scaling, shifting, rotations, CNNs. (Data Science Bowls)
- Sound classification: Fourier, MFCC, spectrograms, scaling. (TensorFlow speech recognition)
- Text classification: TF-IDF, SVD, stemming, spell checking, stop-word removal, n-grams. (StumbleUpon Evergreen Classification)
- Time series: Lags, weighted averaging, exponential smoothing. (Walmart recruitment)
- Categorical: Target encoding (sketched after this list), frequency, one-hot, ordinal, label encoding. (Amazon employee)
- Numerical: Scaling, binning, derivatives, outlier removal, dimensionality reduction. (Africa soil)
- Interactions: Multiplications, divisions, group-by features, concatenations. (Homesite)
- Recommenders: Features on transactional history, item popularity, frequency of purchase. (Acquire Valued Shoppers)
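As a minimal sketch of the target encoding mentioned above, computed out-of-fold to limit leakage; df is an assumed pandas DataFrame, and 'cat' and 'target' are placeholder column names:
    from sklearn.model_selection import KFold

    df['cat_te'] = float('nan')
    for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
        means = df.iloc[tr_idx].groupby('cat')['target'].mean()   # per-category mean target
        df.iloc[val_idx, df.columns.get_loc('cat_te')] = df['cat'].iloc[val_idx].map(means).values
    df['cat_te'] = df['cat_te'].fillna(df['target'].mean())       # unseen categories get the global mean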
Modeling
- Image classification: CNNs (Resnet, VGG, DenseNet, ...)
- Sound classification: CNNs (CRNN), LSTM
- Text classification: GBMs, Linear, DL, Naive Bayes, kNNs, LibFM, LIBFFM
- Time series: Autoregressive models, ARIMA, Linear, GBMs, DL, LSTM
- Categorical: GBMs, Linear, DL, LibFM, LIBFFM (a GBM baseline is sketched after this list)
- Numerical: GBMs, Linear, DL, SVMs
- Interactions: GBMs, Linear, DL
- Recommenders: CF, DL, LibFM, LIBFFM, GBMs
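For the tabular cases (categorical, numerical, interactions), a minimal GBM baseline might look like the following; X_train, y_train, and X_valid are placeholders:
    import lightgbm as lgb

    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)                      # fit on the training fold
    valid_pred = model.predict_proba(X_valid)[:, 1]  # save these predictions for ensembling later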
Ensembling
- Throughout modeling, predictions on internal validation and test are saved.
- There are different ways to combine predictions, from simple averaging to multilayer stacking (see the sketch after this list).
- Small data requires simpler ensemble techniques.
- It helps to average a few low-correlated predictions with good scores.
- The stacking process repeats the modeling process, using the saved predictions as features.
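A minimal sketch of two combining schemes, assuming the saved out-of-fold validation predictions of two models (pred_a, pred_b) and the validation target y_valid as placeholders:
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    blend = (pred_a + pred_b) / 2.0                       # simple averaging

    level1 = np.column_stack([pred_a, pred_b])            # saved predictions become features
    stacker = LogisticRegression().fit(level1, y_valid)   # meta-model for stacking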
Tips on collaboration
- More fun.
- Learn more.
- Better score.
- You gain in at least two ways:
- You can cover more ground.
- Every person sees the problem from a different angle, leading to more thorough solutions.
- Start collaborating after getting some experience to understand the dynamics.
- Start with people around your "rank".
- Look for people that are likely to do different things well or that specialize in certain areas.
Selecting final submissions
- Normally, select the best submission on local validation and the best on the LB.
- It is good to monitor correlations between candidate submissions. If submissions exist with high scores but significantly lower correlation, they could be considered too.
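A minimal sketch of that correlation check, with sub_a and sub_b as placeholder prediction vectors from two candidate submissions:
    import numpy as np

    corr = np.corrcoef(sub_a, sub_b)[0, 1]
    print('correlation: %.3f' % corr)  # high score + low correlation makes a good second pick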
Final tips
- You never lose. You may not win prize money, but you always gain knowledge, experience, chances to meet and collaborate with talented people in the field, and a boost to your CV.
- The Kaggle community may be the most kind, helpful community I have ever experienced in any social context.
- After the competition, look for people sharing their approaches.
- Create a notebook of useful methods and keep updating it.
Additional Links:
[competitive-data-science] Week 3
Ranking
- Learning to Rank using Gradient Descent -- the original paper on the pairwise method for AUC optimization (its loss is sketched after this list)
- Overview of further developments of RankNet
- RankLib (implementations for the two papers above)
- Learning to Rank Overview
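As a minimal sketch (not the papers' full method), the pairwise RankNet-style loss from the first link above: for a pair where item i should rank above item j, the loss on the score difference is log(1 + exp(-(s_i - s_j))):
    import numpy as np

    def pairwise_loss(s_i, s_j):
        # penalizes orderings where the score s_i does not exceed s_j
        return np.log1p(np.exp(-(s_i - s_j)))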
Clustering
[competitive-data-science] Week 2
Exploratory Data Analysis (EDA): what and why?
- Better understand the data
- Build an intuition about the data
- Generate hypotheses
- Find insights
- Please, do not start with stacking...
- Dato Winners' Interview: 1st place, Mad Professors: http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/
- With EDA we can:
- Get comfortable with the data
- Find magic features
- Do EDA first. Do not immediately dig into modeling.
Thursday, May 24, 2018
[competitive-data-science] Week 1
Real World ML Pipeline:
- Understanding of business problem
- Problem formalization
- Data collecting
- Data preprocessing
- Modelling
- Way to evaluate model in real life
- Way to deploy model
Real World Aspect:
- Problem formalization
- Choice of target metric
- Deployment issues
- Inference speed
- Data collecting
- Model complexity
- Target metric value
Competition Aspect:
- Problem formalization (N)
- Choice of target metric (N)
- Deployment issues (N)
- Inference speed (N)
- Data collecting (Y/N)
- Model complexity (Y/N)
- Target metric value (Y)
Recap of main ML algorithms:
Overview of ML methods:
- Scikit-Learn (or sklearn) library
- Overview of k-NN (sklearn's documentation)
- Overview of Linear Models (sklearn's documentation)
- Overview of Decision Trees (sklearn's documentation)
- Overview of algorithms and parameters in H2O documentation
Additional Tools:
- Vowpal Wabbit repository
- XGBoost repository
- LightGBM repository
- Interactive demo of simple feed-forward Neural Net
- Frameworks for Neural Nets: Keras, PyTorch, TensorFlow, MXNet, Lasagne
- Example from sklearn with different decision surfaces
- Arbitrary order factorization machines
- Basic SciPy stack (ipython, numpy, pandas, matplotlib)
- Jupyter Notebook
- Stand-alone python tSNE package
- Libraries to work with sparse CTR-like data: LibFM, LibFFM
- Another tree-based method: RGF (implementation, paper)
- Python distribution with all-included packages: Anaconda
- Blog "datas-frame" (contains posts about effective Pandas usage)
Feature preprocessing:
- Preprocessing in Sklearn
- Andrew NG about gradient descent and feature scaling
- Feature Scaling and the effect of standardization for machine learning algorithms
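A minimal sketch of the scaling options those links discuss; X is a placeholder numeric feature matrix:
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X_std = StandardScaler().fit_transform(X)      # zero mean, unit variance
    X_minmax = MinMaxScaler().fit_transform(X)     # rescale each feature to [0, 1]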
Feature generation:
- Discover Feature Engineering, How to Engineer Features and How to Get Good at It
- Discussion of feature engineering on Quora
- Bag of words
- Word2vec
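A minimal bag-of-words sketch with sklearn; texts is a placeholder list of documents:
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    bow = CountVectorizer().fit_transform(texts)    # sparse matrix of raw term counts
    tfidf = TfidfVectorizer().fit_transform(texts)  # TF-IDF reweighted counts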
NLP Libraries:
Feature extraction from images:
Wednesday, May 16, 2018
My recent job hunt: process and reflections
Staying at the old job is one choice; looking for a new one is another.
I'll start with the parts that have reached a conclusion; the parts without conclusions can't be written up even if I wanted to (two companies are still pending, and there are a few others I'd like to look at).
1. Google, Microsoft
Conclusion: they ignored me completely, which is entirely reasonable; thinking it over, my resume has no highlights to speak of.
I'm trying to find ways to improve and to tie that to my own interests. Naturally I thought of taking some courses on Coursera to keep learning; beyond that, I also want to raise my visibility. The world is pragmatic: you need some results people can see before they take an interest in you. With that in mind:
- How to Win a Data Science Competition: Learn from Top Kagglers (easy to find on your own)
- Kaggle: https://www.kaggle.com/
All I can do is keep fighting against myself.
2. Amazon
Conclusion: a thank-you (rejection) letter for Software Development Engineer, and a paused interview process for Cloud Support Engineer.
I signed an NDA and cannot disclose details, so I can only discuss the public job descriptions. The Cloud Support Engineer role differs greatly from my current work; I applied for the wrong position. Still, I really like what Amazon's leadership principles say:
- Leadership Principles: https://www.amazon.jobs/principles
I also tried to understand the difficulties I had met at work; you have to stop for a moment to see past mistakes clearly. Skimming a lot of DevOps material, I found that Google's understanding of DevOps stands apart:
- Site Reliability Engineering: https://landing.google.com/sre/
For example, on the evolution of automation, it describes how the progression goes:
- No automation
- Externally maintained system-specific automation
- Externally maintained generic automation
- Internally maintained system-specific automation
- Systems that don’t need any automation
It covers much more; with my shallow knowledge, I found the reading especially rewarding. What Google does may serve as a reference for us, but it is ultimately Google's experience, and when I wanted to show off a few lines of it in interviews I still couldn't do so with ease.
3. One DevOps Startup
Conclusion: got an offer.
Coming up with the idea and turning it into a money-making machine is genuinely impressive; Taiwanese startups are remarkable. Through the interviews I came to understand DevOps anew: the things the startup treasures (Jenkins, Ansible, Kubernetes) do not exist as a matter of course; I simply had never dug deeply into what DevOps entails.
(I have an urge to set up these tech stacks and play with them, once I have time.)
- Jenkins: https://jenkins.io/
- Ansible: https://www.ansible.com/
- Kubernetes: https://kubernetes.io/
If there is a shortcoming, it may be the merely passable salary, although pay in this industry is already good compared with traditional Taiwanese companies. The interviews had no tricky technical questions, which says a lot. Since entering the workforce I have been 安安's daddy; even on a merely passable salary, I don't have the ability to start a company, be the boss, and pay employees that much myself, so I should still thank the company for being willing to pay me this much (I'm not a shill for management). If a future employer is willing to pay more, I will be very thankful to them.
If the essence of capitalism is greed, then that is exactly how greedy we should be.
(A truth without much warmth.)
(I fantasize about finishing labor-law courses for credit, about studying seriously for the bar exam, about opening the civil-procedure lecture notes without getting sleepy.)
(But none of it withstands the test of the truth of greed.)
Every moment has its own confusions, but I always want to make myself a little happier.
I hope my dear 沿莉 can relax a little, and that I can too.
Thursday, May 10, 2018
[Note] Competitive Data Science
Coursera: How to Win a Data Science Competition: Learn from Top Kagglers
Wednesday, May 2, 2018
Notes on Firebase
Notes:
Firebase Database REST API: https://firebase.google.com/docs/reference/rest/database/
- access_token: https://firebase.google.com/docs/database/rest/auth
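A minimal sketch of reading a path through the Realtime Database REST API; the database URL, path, and token value are placeholders (see the auth link above for how to obtain an access_token):
    import requests

    DB_URL = 'https://<your-project>.firebaseio.com'
    resp = requests.get(DB_URL + '/some/path.json',
                        params={'access_token': '<oauth2-access-token>'})
    resp.raise_for_status()
    print(resp.json())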